1
Jin X, He W, Liu M, Wang L, Zhang Y, Xu Y, Ma L, Huang Y, Xie M. Mining functional gene modules by multi-view NMF of phenome-genome association. BMC Genomics 2025; 23:868. [PMID: 39789452 PMCID: PMC11720361 DOI: 10.1186/s12864-024-11120-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2021] [Accepted: 12/02/2024] [Indexed: 01/12/2025]
Abstract
BACKGROUND Mining functional gene modules from genomic data is an important step in detecting gene members of pathways or other relations such as protein-protein interactions. This work explores the plausibility of detecting functional gene modules by factorizing the gene-phenotype association matrix derived from phenotype ontology data rather than the conventionally used gene expression data. To date, the hierarchical structure of phenotype ontologies has not been sufficiently utilized in gene clustering, even though functionally related genes are consistently associated with phenotypes on the same path in phenotype ontologies. RESULTS This work demonstrates a hierarchical Nonnegative Matrix Factorization (NMF) framework, called Consistent Multi-view Nonnegative Matrix Factorization (CMNMF), which factorizes the genome-phenome association matrices at consecutive levels of the hierarchical structure in a phenotype ontology to mine functional gene modules. CMNMF constrains the gene clusters from the association matrices at two consecutive levels to be consistent, since the genes are annotated with both the child-level and the parent-level phenotypes. CMNMF also restricts the identified gene clusters to be densely connected in the phenotype ontology hierarchy. In experiments on mining functionally related genes from the mouse and human phenotype ontologies, CMNMF improves clustering performance over the baseline methods. Gene ontology enrichment analysis is also conducted to verify its practical effectiveness in revealing meaningful gene modules. CONCLUSIONS By utilizing the information in the hierarchical structure of the phenotype ontology, CMNMF can identify functional gene modules with more biological significance than conventional methods. CMNMF can also be a better tool for predicting members of gene pathways and protein-protein interactions.
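The idea of forcing gene clusters to agree across two ontology levels can be illustrated with a much simpler stand-in than CMNMF itself. The sketch below is not the authors' algorithm: it assumes the consistency constraint is replaced by a single shared gene factor `G`, omits the ontology-connectivity restriction entirely, and uses standard multiplicative NMF updates. The function names, the toy matrices and the child-to-parent mapping are all hypothetical.

```python
import numpy as np

def shared_factor_nmf(views, k, n_iter=200, eps=1e-9, seed=0):
    """Factorize each view X_v (genes x phenotypes) as G @ H_v with one shared G.

    Simplifying assumption: sharing G across views stands in for the
    consistency constraint described in the abstract.
    """
    rng = np.random.default_rng(seed)
    n_genes = views[0].shape[0]
    G = rng.random((n_genes, k))
    Hs = [rng.random((k, X.shape[1])) for X in views]
    for _ in range(n_iter):
        # Multiplicative update of each view-specific phenotype factor.
        for i, X in enumerate(views):
            Hs[i] *= (G.T @ X) / (G.T @ G @ Hs[i] + eps)
        # Multiplicative update of the shared gene factor using all views.
        numer = sum(X @ H.T for X, H in zip(views, Hs))
        denom = G @ sum(H @ H.T for H in Hs) + eps
        G *= numer / denom
    return G, Hs

def reconstruction_loss(views, G, Hs):
    """Sum of squared Frobenius reconstruction errors over all views."""
    return sum(np.linalg.norm(X - G @ H) ** 2 for X, H in zip(views, Hs))

# Toy data: 30 genes, 12 child-level phenotypes grouped under 4 parents.
rng = np.random.default_rng(1)
X_child = (rng.random((30, 12)) < 0.2).astype(float)
child_to_parent = np.zeros((12, 4))
child_to_parent[np.arange(12), np.arange(12) // 3] = 1.0
# A gene annotated with a child phenotype is also annotated with its parent.
X_parent = (X_child @ child_to_parent > 0).astype(float)

views = [X_child, X_parent]
G, Hs = shared_factor_nmf(views, k=3)
modules = G.argmax(axis=1)  # hard module assignment per gene
```

Each gene is assigned to the module with the largest loading in `G`; a soft assignment could be read from the rows of `G` directly.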
Affiliation(s)
- Xu Jin
  - College of Software, Nankai University, TianJin, China
- WenQian He
  - College of Software, Nankai University, TianJin, China
- MingMing Liu
  - College of Software, Nankai University, TianJin, China
- Lin Wang
  - College of Software, Nankai University, TianJin, China
- YaoGong Zhang
  - College of Software, Nankai University, TianJin, China
- YingJie Xu
  - College of Software, Nankai University, TianJin, China
- Ling Ma
  - College of Software, Nankai University, TianJin, China
- YaLou Huang
  - TianJin International Joint Academy of Biomedicine, TianJin, China
- MaoQiang Xie
  - College of Software, Nankai University, TianJin, China
2
Broadbent C, Song T, Kuang R. Deciphering high-order structures in spatial transcriptomes with graph-guided Tucker decomposition. Bioinformatics 2024; 40:i529-i538. [PMID: 38940176 PMCID: PMC11256919 DOI: 10.1093/bioinformatics/btae245] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024]
Abstract
Spatial transcriptome (ST) profiling can reveal cells' structural organizations and functional roles in tissues. However, deciphering the spatial context of gene expression in ST data is a challenge: the high-order structure hidden in the whole transcriptome space over 2D/3D spatial coordinates requires modeling and detection of interpretable high-order elements and components for further functional analysis and interpretation. This paper presents a new method, GraphTucker, a graph-regularized Tucker tensor decomposition for learning high-order factorization in ST data. GraphTucker is based on a nonnegative Tucker decomposition algorithm regularized by a high-order graph that captures spatial relations among spots and functional relations among genes. In experiments on several Visium and Stereo-seq datasets, the novelty and advantage of modeling multiway multilinear relationships among the components in Tucker decomposition, as opposed to the Canonical Polyadic Decomposition and conventional matrix factorization models, are demonstrated by evaluating the detection of spatial components of gene modules, the clustering of spatial coefficients for tissue segmentation, and the imputation of complete spatial transcriptomes. The visualization results show strong evidence that GraphTucker detects more interpretable spatial components in the context of the spatial domains in the tissues. AVAILABILITY AND IMPLEMENTATION https://github.com/kuanglab/GraphTucker.
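The contrast between Tucker and Canonical Polyadic (CP) decomposition mentioned above comes down to the core tensor. In a Tucker model any component of one mode can interact with any component of the others, while CP is the special case with a superdiagonal core. The sketch below only shows this reconstruction step with NumPy. It is not GraphTucker, involves no fitting or graph regularization, and the mode sizes and variable names are illustrative assumptions.

```python
import numpy as np

def tucker_reconstruct(core, A, B, C):
    """X[i, j, k] = sum over a, b, c of core[a, b, c] * A[i, a] * B[j, b] * C[k, c]."""
    return np.einsum("abc,ia,jb,kc->ijk", core, A, B, C)

def cp_reconstruct(A, B, C):
    """X[i, j, k] = sum over r of A[i, r] * B[j, r] * C[k, r]."""
    return np.einsum("ir,jr,kr->ijk", A, B, C)

rng = np.random.default_rng(0)
r = 3
A = rng.random((5, r))  # hypothetical first spatial mode
B = rng.random((4, r))  # hypothetical second spatial mode
C = rng.random((6, r))  # hypothetical gene mode

# Superdiagonal core: component r of each mode interacts only with itself.
diag_core = np.zeros((r, r, r))
diag_core[np.arange(r), np.arange(r), np.arange(r)] = 1.0

# Dense nonnegative core: every triple of components may interact.
full_core = rng.random((r, r, r))

X_cp = cp_reconstruct(A, B, C)
X_tucker_diag = tucker_reconstruct(diag_core, A, B, C)
X_tucker_full = tucker_reconstruct(full_core, A, B, C)
```

With the superdiagonal core the two reconstructions coincide; the dense core adds cross-component terms that CP cannot represent with the same factor matrices.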
Affiliation(s)
- Charles Broadbent
  - Department of Computer Science and Engineering, University of Minnesota Twin Cities, Minneapolis, MN, 55455, United States
- Tianci Song
  - Department of Computer Science and Engineering, University of Minnesota Twin Cities, Minneapolis, MN, 55455, United States
- Rui Kuang
  - Department of Computer Science and Engineering, University of Minnesota Twin Cities, Minneapolis, MN, 55455, United States
3
Mihajlović K, Ceddia G, Malod-Dognin N, Novak G, Kyriakis D, Skupin A, Pržulj N. Multi-omics integration of scRNA-seq time series data predicts new intervention points for Parkinson's disease. Sci Rep 2024; 14:10983. [PMID: 38744869 PMCID: PMC11094121 DOI: 10.1038/s41598-024-61844-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2024] [Accepted: 05/10/2024] [Indexed: 05/16/2024]
Abstract
Parkinson's disease (PD) is a complex neurodegenerative disorder without a cure. The onset of PD symptoms corresponds to 50% loss of midbrain dopaminergic (mDA) neurons, limiting early-stage understanding of PD. To shed light on early PD development, we study time series scRNA-seq datasets of mDA neurons obtained from patient-derived induced pluripotent stem cell differentiation. We develop a new data integration method based on Non-negative Matrix Tri-Factorization that integrates these datasets with molecular interaction networks, producing condition-specific "gene embeddings". By mining these embeddings, we predict 193 PD-related genes that are largely supported (49.7%) in the literature and are specific to the investigated PINK1 mutation. Enrichment analysis in Kyoto Encyclopedia of Genes and Genomes pathways highlights 10 PD-related molecular mechanisms perturbed during early PD development. Finally, investigating the top 20 prioritized genes reveals 12 previously unrecognized genes associated with PD that represent interesting drug targets.
Affiliation(s)
- Gaia Ceddia
  - Barcelona Supercomputing Center (BSC), 08034, Barcelona, Spain
- Gabriela Novak
  - The Integrative Cell Signalling Group, Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
  - Luxembourg Institute of Health (LIH), Esch-sur-Alzette, Luxembourg
- Dimitrios Kyriakis
  - The Integrative Cell Signalling Group, Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Alexander Skupin
  - The Integrative Cell Signalling Group, Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
  - Luxembourg Institute of Health (LIH), Esch-sur-Alzette, Luxembourg
  - University of California San Diego, La Jolla, CA, 92093, USA
- Nataša Pržulj
  - Barcelona Supercomputing Center (BSC), 08034, Barcelona, Spain
  - Department of Computer Science, University College London, WC1E 6BT, London, UK
  - ICREA, Pg. Lluís Companys 23, 08010, Barcelona, Spain
4
Tan H, Guo M, Chen J, Wang J, Yu G. HetFCM: functional co-module discovery by heterogeneous network co-clustering. Nucleic Acids Res 2024; 52:e16. [PMID: 38088228 PMCID: PMC10853805 DOI: 10.1093/nar/gkad1174] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2023] [Revised: 10/31/2023] [Accepted: 11/23/2023] [Indexed: 02/10/2024]
Abstract
Functional molecular module (i.e., gene-miRNA co-modules and gene-miRNA-lncRNA triple-layer modules) analysis can dissect complex regulations underlying etiology or phenotypes. However, current module detection methods neither use multi-omics data appropriately nor model the cross-layer regulations of heterogeneous molecules effectively, which loses critical genetic information and degrades detection performance. In this study, we propose a heterogeneous network co-clustering framework (HetFCM) to detect functional co-modules. HetFCM introduces an attributed heterogeneous network to jointly model the interplays and multi-type attributes of different molecules, applies multiple variational graph autoencoders on the network to generate cross-layer association matrices, and then performs adaptive weighted co-clustering on the association matrices and attribute data to identify co-modules of heterogeneous molecules. An empirical study on Human and Maize datasets reveals that HetFCM can find co-modules characterized by denser topology and more significant functions, which are associated with human breast cancer (subtypes) and maize phenotypes (i.e., lipid storage, drought tolerance and oil content). HetFCM is a useful tool to detect co-modules and can be applied to multi-layer functional modules, yielding novel insights for analyzing molecular mechanisms. We also developed a user-friendly module detection and analysis tool and shared it at http://www.sdu-idea.cn/FMDTool.
Affiliation(s)
- Haojiang Tan
  - School of Software, Shandong University, Jinan 250101, Shandong, China
  - Joint SDU-NTU Centre for Artificial Intelligence Research, Shandong University, Jinan 250101, Shandong, China
- Maozu Guo
  - College of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
- Jian Chen
  - College of Agronomy & Biotechnology, China Agricultural University, Beijing 100193, China
- Jun Wang
  - Joint SDU-NTU Centre for Artificial Intelligence Research, Shandong University, Jinan 250101, Shandong, China
- Guoxian Yu
  - School of Software, Shandong University, Jinan 250101, Shandong, China
  - Joint SDU-NTU Centre for Artificial Intelligence Research, Shandong University, Jinan 250101, Shandong, China
5
Wang YC, Wu Y, Choi J, Allington G, Zhao S, Khanfar M, Yang K, Fu PY, Wrubel M, Yu X, Mekbib KY, Ocken J, Smith H, Shohfi J, Kahle KT, Lu Q, Jin SC. Computational Genomics in the Era of Precision Medicine: Applications to Variant Analysis and Gene Therapy. J Pers Med 2022; 12:175. [PMID: 35207663 PMCID: PMC8878256 DOI: 10.3390/jpm12020175] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/30/2021] [Revised: 01/18/2022] [Accepted: 01/24/2022] [Indexed: 02/04/2023]
Abstract
Rapid methodological advances in statistical and computational genomics have enabled researchers to better identify and interpret both rare and common variants responsible for complex human diseases. As we continue to see an expansion of these advances in the field, it is now imperative for researchers to understand the resources and methodologies available for various data types and study designs. In this review, we provide an overview of recent methods for identifying rare and common variants and understanding their roles in disease etiology. Additionally, we discuss the strategy, challenge, and promise of gene therapy. As computational and statistical approaches continue to improve, we will have an opportunity to translate human genetic findings into personalized health care.
Affiliation(s)
- Yung-Chun Wang
  - Department of Genetics, School of Medicine, Washington University, St. Louis, MO 63110, USA
- Yuchang Wu
  - Department of Biostatistics & Medical Informatics, University of Wisconsin-Madison, Madison, WI 53706, USA
- Julie Choi
  - Department of Genetics, School of Medicine, Washington University, St. Louis, MO 63110, USA
- Garrett Allington
  - Department of Pathology, Yale School of Medicine, New Haven, CT 06510, USA
  - Department of Neurosurgery, Massachusetts General Hospital, Boston, MA 02114, USA
- Shujuan Zhao
  - Department of Genetics, School of Medicine, Washington University, St. Louis, MO 63110, USA
- Mariam Khanfar
  - Department of Genetics, School of Medicine, Washington University, St. Louis, MO 63110, USA
- Kuangying Yang
  - Department of Genetics, School of Medicine, Washington University, St. Louis, MO 63110, USA
- Po-Ying Fu
  - Department of Genetics, School of Medicine, Washington University, St. Louis, MO 63110, USA
- Max Wrubel
  - Department of Genetics, School of Medicine, Washington University, St. Louis, MO 63110, USA
- Xiaobing Yu
  - Department of Genetics, School of Medicine, Washington University, St. Louis, MO 63110, USA
  - Department of Computer Science & Engineering, Washington University, St. Louis, MO 63130, USA
- Kedous Y. Mekbib
  - Department of Neurosurgery, Yale University School of Medicine, New Haven, CT 06510, USA
- Jack Ocken
  - Department of Neurosurgery, Yale University School of Medicine, New Haven, CT 06510, USA
- Hannah Smith
  - Department of Neurosurgery, Massachusetts General Hospital, Boston, MA 02114, USA
  - Department of Neurosurgery, Yale University School of Medicine, New Haven, CT 06510, USA
- John Shohfi
  - Department of Neurosurgery, Yale University School of Medicine, New Haven, CT 06510, USA
- Kristopher T. Kahle
  - Department of Neurosurgery, Massachusetts General Hospital, Boston, MA 02114, USA
  - Division of Genetics and Genomics, Boston Children’s Hospital, Boston, MA 02115, USA
  - Departments of Pediatrics and Neurology, Harvard Medical School, Boston, MA 02115, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Qiongshi Lu
  - Department of Biostatistics & Medical Informatics, University of Wisconsin-Madison, Madison, WI 53706, USA
- Sheng Chih Jin
  - Department of Genetics, School of Medicine, Washington University, St. Louis, MO 63110, USA
  - Department of Pediatrics, School of Medicine, Washington University, St. Louis, MO 63110, USA
6
Zambrana C, Xenos A, Böttcher R, Malod-Dognin N, Pržulj N. Network neighbors of viral targets and differentially expressed genes in COVID-19 are drug target candidates. Sci Rep 2021; 11:18985. [PMID: 34556735 PMCID: PMC8460804 DOI: 10.1038/s41598-021-98289-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/05/2021] [Accepted: 08/23/2021] [Indexed: 12/12/2022]
Abstract
The COVID-19 pandemic is raging. It revealed the importance of rapid scientific advancement towards understanding and treating new diseases. To address this challenge, we adapt an explainable artificial intelligence algorithm for data fusion and utilize it on new omics data on viral-host interactions, human protein interactions, and drugs to better understand SARS-CoV-2 infection mechanisms and predict new drug-target interactions for COVID-19. We discover that in the human interactome, the human proteins targeted by SARS-CoV-2 proteins and the genes that are differentially expressed after the infection have common neighbors central in the interactome that may be key to the disease mechanisms. We uncover 185 new drug-target interactions targeting 49 of these key genes and suggest re-purposing of 149 FDA-approved drugs, including drugs targeting VEGF and nitric oxide signaling, whose pathways coincide with the observed COVID-19 symptoms. Our integrative methodology is universal and can enable insight into this and other serious diseases.
Affiliation(s)
- Noël Malod-Dognin
  - Barcelona Supercomputing Center, Barcelona, Spain
  - Department of Computer Science, University College London, London, WC1E 6BT, UK
- Nataša Pržulj
  - Barcelona Supercomputing Center, Barcelona, Spain
  - Department of Computer Science, University College London, London, WC1E 6BT, UK
  - ICREA, Pg. Lluís Companys 23, Barcelona, Spain
7
Ding P, Ouyang W, Luo J, Kwoh CK. Heterogeneous information network and its application to human health and disease. Brief Bioinform 2021; 21:1327-1346. [PMID: 31566212 DOI: 10.1093/bib/bbz091] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 05/06/2019] [Revised: 06/29/2019] [Accepted: 06/30/2019] [Indexed: 12/11/2022]
Abstract
The molecular components of the human cell, with their functional interdependencies, form a complicated biological network. Diseases are mostly caused by perturbations of a combination of interacting biomolecules, rather than by an abnormality of a single biomolecule. Furthermore, new biological functions and processes could be revealed by discovering novel relationships between biological entities. Hence, more and more biologists focus on studying the complex biological system instead of individual biological components. The emergence of the heterogeneous information network (HIN) offers a promising way to systematically explore complicated and heterogeneous relationships between various molecules for apparently distinct phenotypes. In this review, we first present the basic definition of HIN and the biological system considered as a complex HIN. Then, we discuss the topological properties of HIN and how these can be applied to detect network motifs and functional modules. Afterwards, methodologies for discovering relationships between diseases and biomolecules are presented. Useful insights on how HIN aids drug development and exploration of the human interactome are provided. Finally, we analyze the challenges and opportunities for uncovering combinatorial patterns in pharmacogenomics and for cell-type detection based on single-cell genomic data.
Affiliation(s)
- Pingjian Ding
  - School of Computer Science, University of South China, Hengyang, China
- Wenjue Ouyang
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
- Jiawei Luo
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
- Chee-Keong Kwoh
  - School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
8
Díaz-Santiago E, Claros MG, Yahyaoui R, de Diego-Otero Y, Calvo R, Hoenicka J, Palau F, Ranea JAG, Perkins JR. Decoding Neuromuscular Disorders Using Phenotypic Clusters Obtained From Co-Occurrence Networks. Front Mol Biosci 2021; 8:635074. [PMID: 34046427 PMCID: PMC8147726 DOI: 10.3389/fmolb.2021.635074] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 11/29/2020] [Accepted: 02/15/2021] [Indexed: 12/19/2022]
Abstract
Neuromuscular disorders (NMDs) represent an important subset of rare diseases associated with elevated morbidity and mortality whose diagnosis can take years. Here we present a novel approach using systems biology to produce functionally-coherent phenotype clusters that provide insight into the cellular functions and phenotypic patterns underlying NMDs, using the Human Phenotype Ontology as a common framework. Gene and phenotype information was obtained for 424 NMDs in OMIM and 126 NMDs in Orphanet, and 335 and 216 phenotypes were identified as typical for NMDs, respectively. ‘Elevated serum creatine kinase’ was the most specific to NMDs, in agreement with the clinical test for elevated serum creatine kinase that is conducted on NMD patients. The approach to obtain co-occurring NMD phenotypes was validated based on co-mention in PubMed abstracts. A total of 231 (OMIM) and 150 (Orphanet) clusters of highly connected co-occurrent NMD phenotypes were obtained. In parallel, a tripartite network based on phenotypes, diseases and genes was used to associate NMD phenotypes with functions, an approach also validated by literature co-mention, with KEGG pathways showing proportionally higher overlap than Gene Ontology and Reactome. Phenotype-function pairs were crossed with the co-occurrent NMD phenotype clusters to obtain 40 (OMIM) and 72 (Orphanet) functionally coherent phenotype clusters. As expected, many of these overlapped with known diseases and confirmed existing knowledge. Other clusters revealed interesting new findings, indicating informative phenotypes for differential diagnosis, providing deeper knowledge of NMDs, and pointing towards specific cell dysfunction caused by pleiotropic genes.
This work is an example of reproducible research that i) can help better understand NMDs and support their diagnosis by providing a new tool that exploits existing information to obtain novel clusters of functionally-related phenotypes, and ii) takes us another step towards personalised medicine for NMDs.
Affiliation(s)
- Elena Díaz-Santiago
  - Department of Molecular Biology and Biochemistry, Universidad de Málaga, Málaga, Spain
- M Gonzalo Claros
  - Department of Molecular Biology and Biochemistry, Universidad de Málaga, Málaga, Spain
  - CIBER de Enfermedades Raras (CIBERER), Madrid, Spain
  - Institute of Biomedical Research in Malaga (IBIMA), IBIMA-RARE, Málaga, Spain
  - Institute for Mediterranean and Subtropical Horticulture "La Mayora" (IHSM-UMA-CSIC), Málaga, Spain
- Raquel Yahyaoui
  - Institute of Biomedical Research in Malaga (IBIMA), IBIMA-RARE, Málaga, Spain
  - Laboratory of Metabolopathies and Neonatal Screening, Málaga Regional University Hospital, Málaga, Spain
- Rocío Calvo
  - Institute of Biomedical Research in Malaga (IBIMA), IBIMA-RARE, Málaga, Spain
  - Laboratory of Metabolopathies and Neonatal Screening, Málaga Regional University Hospital, Málaga, Spain
- Janet Hoenicka
  - CIBER de Enfermedades Raras (CIBERER), Madrid, Spain
  - Sant Joan de Déu Hospital and Research Institute, Barcelona, Spain
- Francesc Palau
  - CIBER de Enfermedades Raras (CIBERER), Madrid, Spain
  - Sant Joan de Déu Hospital and Research Institute, Barcelona, Spain
  - Hospital Clínic and University of Barcelona School of Medicine and Health Sciences, Barcelona, Spain
- Juan A G Ranea
  - Department of Molecular Biology and Biochemistry, Universidad de Málaga, Málaga, Spain
  - CIBER de Enfermedades Raras (CIBERER), Madrid, Spain
  - Institute of Biomedical Research in Malaga (IBIMA), IBIMA-RARE, Málaga, Spain
- James R Perkins
  - Department of Molecular Biology and Biochemistry, Universidad de Málaga, Málaga, Spain
  - CIBER de Enfermedades Raras (CIBERER), Madrid, Spain
  - Institute of Biomedical Research in Malaga (IBIMA), IBIMA-RARE, Málaga, Spain
9
Banerjee A, Chen S, Fatemifar G, Zeina M, Lumbers RT, Mielke J, Gill S, Kotecha D, Freitag DF, Denaxas S, Hemingway H. Machine learning for subtype definition and risk prediction in heart failure, acute coronary syndromes and atrial fibrillation: systematic review of validity and clinical utility. BMC Med 2021; 19:85. [PMID: 33820530 PMCID: PMC8022365 DOI: 10.1186/s12916-021-01940-7] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 10/01/2020] [Accepted: 02/12/2021] [Indexed: 02/08/2023]
Abstract
BACKGROUND Machine learning (ML) is increasingly used in research for subtype definition and risk prediction, particularly in cardiovascular diseases. No existing ML models are routinely used for cardiovascular disease management, and their phase of clinical utility is unknown, partly due to a lack of clear criteria. We evaluated ML for subtype definition and risk prediction in heart failure (HF), acute coronary syndromes (ACS) and atrial fibrillation (AF). METHODS For ML studies of subtype definition and risk prediction, we conducted a systematic review in HF, ACS and AF, using PubMed, MEDLINE and Web of Science from January 2000 until December 2019. By adapting published criteria for diagnostic and prognostic studies, we developed a seven-domain, ML-specific checklist. RESULTS Of 5918 studies identified, 97 were included. Across studies for subtype definition (n = 40) and risk prediction (n = 57), there was variation in data source, population size (median 606 and median 6769), clinical setting (outpatient, inpatient, different departments), number of covariates (median 19 and median 48) and ML methods. All studies were single disease, most were North American (n = 61/97) and only 14 studies combined definition and risk prediction. Subtype definition and risk prediction studies respectively had limitations in development (e.g. 15.0% and 78.9% of studies related to patient benefit; 15.0% and 15.8% had low patient selection bias), validation (12.5% and 5.3% externally validated) and impact (32.5% and 91.2% improved outcome prediction; no effectiveness or cost-effectiveness evaluations). CONCLUSIONS Studies of ML in HF, ACS and AF are limited by number and type of included covariates, ML methods, population size, country, clinical setting and focus on single diseases, not overlap or multimorbidity. Clinical utility and implementation rely on improvements in development, validation and impact, facilitated by simple checklists. 
We provide clear steps prior to safe implementation of machine learning in clinical practice for cardiovascular diseases and other disease areas.
Affiliation(s)
- Amitava Banerjee
  - Institute of Health Informatics, University College London, 222 Euston Road, London, NW1 2DA, UK
  - Health Data Research UK, University College London, London, UK
  - University College London Hospitals NHS Trust, 235 Euston Road, London, UK
  - Barts Health NHS Trust, The Royal London Hospital, Whitechapel Rd, London, UK
- Suliang Chen
  - Institute of Health Informatics, University College London, 222 Euston Road, London, NW1 2DA, UK
  - Health Data Research UK, University College London, London, UK
- Ghazaleh Fatemifar
  - Institute of Health Informatics, University College London, 222 Euston Road, London, NW1 2DA, UK
  - Health Data Research UK, University College London, London, UK
- R Thomas Lumbers
  - Institute of Health Informatics, University College London, 222 Euston Road, London, NW1 2DA, UK
  - Health Data Research UK, University College London, London, UK
  - University College London Hospitals NHS Trust, 235 Euston Road, London, UK
- Johanna Mielke
  - Bayer AG, Division Pharmaceuticals, Open Innovation & Digital Technologies, Wuppertal, Germany
- Simrat Gill
  - University of Birmingham Institute of Cardiovascular Sciences and University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- Dipak Kotecha
  - University of Birmingham Institute of Cardiovascular Sciences and University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
  - Department of Cardiology, University Medical Centre Utrecht, Utrecht, the Netherlands
- Daniel F Freitag
  - Bayer AG, Division Pharmaceuticals, Open Innovation & Digital Technologies, Wuppertal, Germany
- Spiros Denaxas
  - Institute of Health Informatics, University College London, 222 Euston Road, London, NW1 2DA, UK
  - Health Data Research UK, University College London, London, UK
  - The Alan Turing Institute, London, UK
- Harry Hemingway
  - Institute of Health Informatics, University College London, 222 Euston Road, London, NW1 2DA, UK
  - Health Data Research UK, University College London, London, UK
  - University College London Hospitals Biomedical Research Centre (UCLH BRC), London, UK
10
Lim H, Xie L. A New Weighted Imputed Neighborhood-Regularized Tri-Factorization One-Class Collaborative Filtering Algorithm: Application to Target Gene Prediction of Transcription Factors. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:126-137. [PMID: 31995498 PMCID: PMC7382975 DOI: 10.1109/tcbb.2020.2968442] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/10/2023]
Abstract
Identifying target genes of transcription factors (TFs) is crucial to understand transcriptional regulation. However, our understanding of genome-wide TF targeting profile is limited due to the cost of large-scale experiments and intrinsic complexity of gene regulation. Thus, computational prediction methods are useful to predict unobserved TF-gene associations. Here, we develop a new Weighted Imputed Neighborhood-regularized Tri-Factorization one-class collaborative filtering algorithm, WINTF. It predicts unobserved target genes for TFs using known but noisy, incomplete, and biased TF-gene associations and protein-protein interaction networks. Our benchmark study shows that WINTF significantly outperforms its counterpart matrix factorization-based algorithms and tri-factorization methods that do not include weight, imputation, and neighbor-regularization, for TF-gene association prediction. When evaluated by independent datasets, accuracy is 37.8 percent on the top 495 predicted associations, an enrichment factor of 4.19 compared with random guess. Furthermore, many predicted novel associations are supported by literature evidence. Although we only use canonical TF-gene interaction data, WINTF can directly be applied to tissue-specific data when available. Thus, WINTF provides a potentially useful framework to integrate multiple omics data for further improvement of TF-gene prediction and applications to other sparse and noisy biological data. The benchmark dataset and source code are freely available at https://github.com/XieResearchGroup/WINTF.
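The reported figures can be related with a short calculation. This sketch assumes the enrichment factor is defined as the precision on the top predictions divided by the precision expected from random guessing; that definition is an assumption and is not stated in the abstract. The derived hit count and baseline are therefore back-of-the-envelope values, not numbers taken from the paper.

```python
# Values reported in the abstract.
top_n = 495
precision_at_top_n = 0.378
enrichment_factor = 4.19

# Approximate number of correct predictions among the top 495.
approx_hits = round(precision_at_top_n * top_n)

# Random-guess precision implied by the assumed definition
# enrichment_factor = precision_at_top_n / random_precision.
implied_random_precision = precision_at_top_n / enrichment_factor

print(f"approx. correct predictions: {approx_hits}")
print(f"implied random-guess precision: {implied_random_precision:.4f}")
```

Under that assumed definition, about 187 of the 495 top-ranked associations would be correct, against a random baseline of roughly 9 percent.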
11
Ren X, Wang S, Huang T. Decipher the connections between proteins and phenotypes. Biochimica et Biophysica Acta - Proteins and Proteomics 2020; 1868:140503. [PMID: 32707349 DOI: 10.1016/j.bbapap.2020.140503] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 05/31/2020] [Revised: 06/30/2020] [Accepted: 07/16/2020] [Indexed: 10/23/2022]
Abstract
As the outermost representation of life, phenotype is the fundamental basis on which humans understand life and disease. With the advent of molecular and sequencing techniques, however, a growing portion of scientific research focuses primarily on the molecular level of life. Our understanding of molecular variations and mechanisms can only be fully utilized when it is translated to the phenotypic level. In this study, we constructed a similarity network for the phenotype ontology and then applied network analysis methods to discover phenotype/disease clusters. Then, we used machine learning models to predict protein-phenotype associations. Each protein was characterized by the functional profiles of its interaction neighbors in the protein-protein interaction network. Our methods can not only predict protein-phenotype associations but also reveal the underlying mechanisms from protein to phenotype.
Affiliation(s)
- Xiaohui Ren
  - Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Steven Wang
  - Department of Molecular Biology, Columbia University, New York, USA
- Tao Huang
  - Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
12
Johnson EO, Hung DT. A Point of Inflection and Reflection on Systems Chemical Biology. ACS Chem Biol 2019; 14:2497-2511. [PMID: 31613592 DOI: 10.1021/acschembio.9b00714] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 12/17/2022]
Abstract
For the past several decades, chemical biologists have been leveraging chemical principles for understanding biology, tackling disease, and biomanufacturing, while systems biologists have holistically applied computation and genome-scale experimental tools to the same problems. About a decade ago, the benefit of combining the philosophies of chemical biology with systems biology into systems chemical biology was advocated, with the potential to systematically understand the way small molecules affect biological systems. Recently, there has been an explosion in new technologies that permit massive expansion in the scale of biological experimentation, increase access to more diverse chemical space, and enable powerful computational interpretation of large datasets. Fueled by these rapidly increasing capabilities, systems chemical biology is now at an inflection point, poised to enter a new era of more holistic and integrated scientific discovery. Systems chemical biology is primed to reveal an integrated understanding of fundamental biology and to discover new chemical probes to comprehensively dissect and systematically understand that biology, thereby providing a path to novel strategies for discovering therapeutics, designing drug combinations, avoiding toxicity, and harnessing beneficial polypharmacology. In this Review, we examine the emergence of new capabilities driving us to this inflection point in systems chemical biology, and highlight holistic approaches and opportunities that are arising from integrating chemical biology with a systems-level understanding of the intersection of biology and chemistry.
Affiliation(s)
- Eachan O. Johnson
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, United States
  - Department of Molecular Biology and Center for Computational and Integrative Biology, Massachusetts General Hospital, Boston, Massachusetts 02114, United States
  - Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115, United States
- Deborah T. Hung
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, United States
  - Department of Molecular Biology and Center for Computational and Integrative Biology, Massachusetts General Hospital, Boston, Massachusetts 02114, United States
  - Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115, United States
13
Leal LG, David A, Jarvelin MR, Sebert S, Männikkö M, Karhunen V, Seaby E, Hoggart C, Sternberg MJE. Identification of disease-associated loci using machine learning for genotype and network data integration. Bioinformatics 2019; 35:5182-5190. [PMID: 31070705 PMCID: PMC6954643 DOI: 10.1093/bioinformatics/btz310] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 12/22/2018] [Revised: 03/28/2019] [Accepted: 04/25/2019] [Indexed: 01/19/2023] Open
Abstract
MOTIVATION Integration of different omics data could markedly help to identify biological signatures, understand the missing heritability of complex diseases and ultimately achieve personalized medicine. Standard regression models used in Genome-Wide Association Studies (GWAS) identify loci with a strong effect size, whereas GWAS meta-analyses are often needed to capture weak loci contributing to the missing heritability. Development of novel machine learning algorithms for merging genotype data with other omics data is highly needed as it could enhance the prioritization of weak loci. RESULTS We developed cNMTF (corrected non-negative matrix tri-factorization), an integrative algorithm based on clustering techniques of biological data. This method assesses the inter-relatedness between genotypes, phenotypes, the damaging effect of the variants and gene networks in order to identify loci-trait associations. cNMTF was used to prioritize genes associated with lipid traits in two population cohorts. We replicated 129 genes reported in GWAS world-wide and provided evidence that supports 85% of our findings (226 out of 265 genes), including recent associations in literature (NLGN1), regulators of lipid metabolism (DAB1) and pleiotropic genes for lipid traits (CARM1). Moreover, cNMTF performed efficiently against strong population structures by accounting for the individuals' ancestry. As the method is flexible in the incorporation of diverse omics data sources, it can be easily adapted to the user's research needs. AVAILABILITY AND IMPLEMENTATION An R package (cnmtf) is available at https://lgl15.github.io/cnmtf_web/index.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Luis G Leal
  - Department of Life Sciences, Centre for Integrative Systems Biology and Bioinformatics, Imperial College London, London SW7 2AZ, UK
- Alessia David
  - Department of Life Sciences, Centre for Integrative Systems Biology and Bioinformatics, Imperial College London, London SW7 2AZ, UK
- Marjo-Riita Jarvelin
  - Center for Life Course Health Research, Faculty of Medicine, University of Oulu, Oulu FI-90014, Finland
  - Biocenter Oulu, University of Oulu, Oulu 90220, Finland
  - Unit of Primary Health Care, Oulu University Hospital, Oulu 90220, Finland
  - Department of Epidemiology and Biostatistics, MRC-PHE Centre for Environment and Health, School of Public Health, Imperial College London, London W2 1PG, UK
  - Department of Life Sciences, College of Health and Life Sciences, Brunel University London, Middlesex UB8 3PH, UK
- Sylvain Sebert
  - Center for Life Course Health Research, Faculty of Medicine, University of Oulu, Oulu FI-90014, Finland
  - Biocenter Oulu, University of Oulu, Oulu 90220, Finland
- Minna Männikkö
  - Center for Life Course Health Research, Faculty of Medicine, University of Oulu, Oulu FI-90014, Finland
- Ville Karhunen
  - Center for Life Course Health Research, Faculty of Medicine, University of Oulu, Oulu FI-90014, Finland
  - Biocenter Oulu, University of Oulu, Oulu 90220, Finland
  - Unit of Primary Health Care, Oulu University Hospital, Oulu 90220, Finland
  - Department of Epidemiology and Biostatistics, MRC-PHE Centre for Environment and Health, School of Public Health, Imperial College London, London W2 1PG, UK
  - Department of Life Sciences, College of Health and Life Sciences, Brunel University London, Middlesex UB8 3PH, UK
- Eleanor Seaby
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Clive Hoggart
  - Department of Medicine, Imperial College London, London W2 1PG, UK
- Michael J E Sternberg
  - Department of Life Sciences, Centre for Integrative Systems Biology and Bioinformatics, Imperial College London, London SW7 2AZ, UK
14
Liu D, Davila-Velderrain J, Zhang Z, Kellis M. Integrative construction of regulatory region networks in 127 human reference epigenomes by matrix factorization. Nucleic Acids Res 2019; 47:7235-7246. [PMID: 31265076 PMCID: PMC6698807 DOI: 10.1093/nar/gkz538] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/27/2018] [Revised: 04/19/2019] [Accepted: 06/09/2019] [Indexed: 01/14/2023] Open
Abstract
Despite large experimental and computational efforts aiming to dissect the mechanisms underlying disease risk, mapping cis-regulatory elements to target genes remains a challenge. Here, we introduce a matrix factorization framework to integrate physical and functional interaction data of genomic segments. The framework was used to predict a regulatory network of chromatin interaction edges linking more than 20 000 promoters and 1.8 million enhancers across 127 human reference epigenomes, including edges that are present in any of the input datasets. Our network integrates functional evidence of correlated activity patterns from epigenomic data and physical evidence of chromatin interactions. An important contribution of this work is the representation of heterogeneous data with different qualities as networks. We show that the unbiased integration of independent data sources suggestive of regulatory interactions produces meaningful associations supported by existing functional and physical evidence, correlating with expected independent biological features.
Affiliation(s)
- Dianbo Liu
  - MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
  - Division of Computational Biology, School of Life Sciences, University of Dundee, Dundee, DD1 5HL, Scotland, UK
- Jose Davila-Velderrain
  - MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Zhizhuo Zhang
  - MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Manolis Kellis
  - MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
15
Hyung D, Mallon AM, Kyung DS, Cho SY, Seong JK. TarGo: network based target gene selection system for human disease related mouse models. Lab Anim Res 2019; 35:23. [PMID: 32257911 PMCID: PMC7081697 DOI: 10.1186/s42826-019-0023-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2019] [Accepted: 10/21/2019] [Indexed: 11/25/2022] Open
Abstract
Genetically engineered mouse models are used in high-throughput phenotyping screens to understand genotype-phenotype associations and their relevance to human diseases. However, not all mutant mouse lines with detectable phenotypes are associated with human diseases. Here, we propose the “Target gene selection system for Genetically engineered mouse models” (TarGo). Using a combination of human disease descriptions, network topology, and genotype-phenotype correlations, novel genes that are potentially related to human diseases are suggested. We constructed a gene interaction network using protein-protein interactions, molecular pathways, and co-expression data. Several repositories for human disease signatures were used to obtain information on human disease-related genes. We calculated disease- or phenotype-specific gene ranks using network topology and disease signatures. In conclusion, TarGo provides many novel features for gene function prediction.
Affiliation(s)
- Daejin Hyung
  - National Cancer Center, 323 Ilsan-ro, Goyang-si, Kyeonggi-do 10408 Republic of Korea
- Ann-Marie Mallon
  - MRC Harwell Institute, Mammalian Genetics Unit, Oxfordshire, OX11 0RD UK
- Dong Soo Kyung
  - Laboratory of Developmental Biology and Genomics, Research Institute for Veterinary Science, and BK21 Plus Program for Creative Veterinary Science, College of Veterinary Medicine, Seoul National University, Seoul, 08826 Republic of Korea
  - Korea Mouse Phenotyping Center (KMPC), Seoul National University, Seoul, 08826 Republic of Korea
  - Interdisciplinary Program for Bioinformatics, Program for Cancer Biology and BIO-MAX institute, Seoul National University, Seoul, 08826 Republic of Korea
- Soo Young Cho
  - National Cancer Center, 323 Ilsan-ro, Goyang-si, Kyeonggi-do 10408 Republic of Korea
  - Korea Mouse Phenotyping Center (KMPC), Seoul National University, Seoul, 08826 Republic of Korea
- Je Kyung Seong
  - Laboratory of Developmental Biology and Genomics, Research Institute for Veterinary Science, and BK21 Plus Program for Creative Veterinary Science, College of Veterinary Medicine, Seoul National University, Seoul, 08826 Republic of Korea
  - Korea Mouse Phenotyping Center (KMPC), Seoul National University, Seoul, 08826 Republic of Korea
  - Interdisciplinary Program for Bioinformatics, Program for Cancer Biology and BIO-MAX institute, Seoul National University, Seoul, 08826 Republic of Korea
16
Lim H, Xie L. Target Gene Prediction of Transcription Factor Using a New Neighborhood-regularized Tri-factorization One-class Collaborative Filtering Algorithm. ACM-BCB ... ... : THE ... ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE. ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE 2019; 2018:1-10. [PMID: 31061989 DOI: 10.1145/3233547.3233551] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/28/2022]
Abstract
Identifying the target genes of transcription factors (TFs) is one of the key factors to understand transcriptional regulation. However, our understanding of genome-wide TF targeting profile is limited due to the cost of large scale experiments and intrinsic complexity. Thus, computational prediction methods are useful to predict the unobserved associations. Here, we developed a new one-class collaborative filtering algorithm tREMAP that is based on regularized, weighted nonnegative matrix tri-factorization. The algorithm predicts unobserved target genes for TFs using known gene-TF associations and protein-protein interaction network. Our benchmark study shows that tREMAP significantly outperforms its counterpart REMAP, a bi-factorization-based algorithm, for transcription factor target gene prediction in all four performance metrics AUC, MAP, MPR, and HLU. When evaluated by independent data sets, the prediction accuracy is 37.8% on the top 495 predicted associations, an enrichment factor of 4.19 compared with the random guess. Furthermore, many of the predicted novel associations by tREMAP are supported by evidence from literature. Although we only use canonical TF-target gene interaction data in this study, tREMAP can be directly applied to tissue-specific data sets. tREMAP provides a framework to integrate multiple omics data for the further improvement of TF target gene prediction. Thus, tREMAP is a potentially useful tool in studying gene regulatory networks. The benchmark data set and the source code of tREMAP are freely available at https://github.com/hansaimlim/REMAP/tree/master/TriFacREMAP.
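The accuracy and enrichment figures quoted in this abstract can be related with simple arithmetic. The sketch below assumes that "enrichment factor" means the ratio of the observed accuracy to the accuracy expected from random guessing; that definition, the derived baseline, and the derived hit count are inferences made for illustration and are not stated in the abstract.

```python
# Figures quoted in the abstract.
top_n = 495          # number of top-ranked predicted associations evaluated
precision = 0.378    # reported accuracy on those predictions
enrichment = 4.19    # reported enrichment over random guessing

# Assumption (not stated in the abstract):
# enrichment = observed precision / random-guess precision.
implied_random_precision = precision / enrichment

# Approximate number of supported predictions implied by the accuracy.
implied_hits = round(precision * top_n)

print(f"implied random-guess precision: {implied_random_precision:.4f}")
print(f"implied supported predictions: {implied_hits} of {top_n}")
```

Under that assumed definition, the random-guess baseline would be roughly 9 percent.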
Affiliation(s)
- Hansaim Lim
  - PhD program in Biochemistry, Graduate Center of the City University of New York NY 10016 United States
- Lei Xie
  - Department of Computer Science, Hunter College and Graduate Center, the City University of New York NY 10065 United States
17
Predicting disease-genes based on network information loss and protein complexes in heterogeneous network. Inf Sci (N Y) 2019. [DOI: 10.1016/j.ins.2018.12.008] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Indexed: 01/12/2023]
18
Iranzo J, Martincorena I, Koonin EV. Cancer-mutation network and the number and specificity of driver mutations. Proc Natl Acad Sci U S A 2018; 115:E6010-E6019. [PMID: 29895694 PMCID: PMC6042135 DOI: 10.1073/pnas.1803155115] [Citation(s) in RCA: 80] [Impact Index Per Article: 11.4] [Indexed: 12/22/2022] Open
Abstract
Cancer genomics has produced extensive information on cancer-associated genes, but the number and specificity of cancer-driver mutations remains a matter of debate. We constructed a bipartite network in which 7,665 tumors from 30 cancer types are connected via shared mutations in 198 previously identified cancer genes. We show that about 27% of the tumors can be assigned to statistically supported modules, most of which encompass one or two cancer types. The rest of the tumors belong to a diffuse network component suggesting lower gene specificity of driver mutations. Linear regression of the mutational loads in cancer genes was used to estimate the number of drivers required for the onset of different cancers. The mean number of drivers in known cancer genes is approximately two, with a range of one to five. Cancers that are associated with modules had more drivers than those from the diffuse network component, suggesting that unidentified and/or interchangeable drivers exist in the latter.
Affiliation(s)
- Jaime Iranzo
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894
- Iñigo Martincorena
  - Wellcome Trust Sanger Institute, CB10 1SA Hinxton, Cambridgeshire, United Kingdom
- Eugene V Koonin
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894
19
Abstract
Background Matrix factorization is a well-established pattern discovery tool that has seen numerous applications in biomedical data analytics, such as gene expression co-clustering, patient stratification, and gene-disease association mining. Matrix factorization learns a latent data model that takes a data matrix and transforms it into a latent feature space enabling generalization, noise removal and feature discovery. However, factorization algorithms are numerically intensive, and hence there is a pressing challenge to scale current algorithms to work with large datasets. Our focus in this paper is matrix tri-factorization, a popular method that is not limited by the assumption of standard matrix factorization about data residing in one latent space. Matrix tri-factorization solves this by inferring a separate latent space for each dimension in a data matrix, and a latent mapping of interactions between the inferred spaces, making the approach particularly suitable for biomedical data mining. Results We developed a block-wise approach for latent factor learning in matrix tri-factorization. The approach partitions a data matrix into disjoint submatrices that are treated independently and fed into a parallel factorization system. An appealing property of the proposed approach is its mathematical equivalence with serial matrix tri-factorization. In a study on large biomedical datasets we show that our approach scales well on multi-processor and multi-GPU architectures. On a four-GPU system we demonstrate that our approach can be more than 100 times faster than its single-processor counterpart. Conclusions A general approach for scaling non-negative matrix tri-factorization is proposed. The approach is especially useful for parallel matrix factorization implemented in a multi-GPU environment. We expect the new approach will be useful in emerging procedures for latent factor analysis, notably for data integration, where many large data matrices need to be collectively factorized.
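As a rough illustration of the idea described in this abstract, the following NumPy sketch factorizes a non-negative matrix as X ≈ U S Vᵀ with standard multiplicative updates for the Frobenius loss, and then checks that one of the matrix products used in those updates can be accumulated block by block with the same result. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `nmtf` and `blockwise_UtXV`, the update rules without orthogonality constraints, the 2 x 2 partition, and the toy data are all hypothetical choices.

```python
import numpy as np

def nmtf(X, k1, k2, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative X (n x m) as U @ S @ V.T.

    Uses plain multiplicative updates for the squared Frobenius loss
    (an assumed, unconstrained variant of tri-factorization).
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = rng.random((n, k1))   # row-side latent factors
    S = rng.random((k1, k2))  # interaction between the two latent spaces
    V = rng.random((m, k2))   # column-side latent factors
    for _ in range(n_iter):
        U *= (X @ V @ S.T) / (U @ S @ V.T @ V @ S.T + eps)
        V *= (X.T @ U @ S) / (V @ S.T @ U.T @ U @ S + eps)
        S *= (U.T @ X @ V) / (U.T @ U @ S @ V.T @ V + eps)
    return U, S, V

def blockwise_UtXV(X, U, V, row_parts=2, col_parts=2):
    """Accumulate U.T @ X @ V over disjoint blocks of X.

    Each block contribution is independent, so in principle it could be
    computed on a separate worker; here the blocks are simply looped over.
    """
    rows = np.array_split(np.arange(X.shape[0]), row_parts)
    cols = np.array_split(np.arange(X.shape[1]), col_parts)
    total = np.zeros((U.shape[1], V.shape[1]))
    for r in rows:
        for c in cols:
            total += U[r].T @ X[np.ix_(r, c)] @ V[c]
    return total

# Toy data (hypothetical): a small random non-negative matrix.
rng = np.random.default_rng(1)
X = rng.random((30, 20))

U0, S0, V0 = nmtf(X, 4, 3, n_iter=0)  # random initialization only
U, S, V = nmtf(X, 4, 3)               # after 200 update rounds

err_init = np.linalg.norm(X - U0 @ S0 @ V0.T)
err_final = np.linalg.norm(X - U @ S @ V.T)
same_product = np.allclose(blockwise_UtXV(X, U, V), U.T @ X @ V)
```

The block-wise sum equals the full product because matrix multiplication distributes over a disjoint partition of rows and columns, which is the kind of equivalence with the serial computation that the abstract refers to.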
20
Doostparast Torshizi A, Petzold LR. Graph-based semi-supervised learning with genomic data integration using condition-responsive genes applied to phenotype classification. J Am Med Inform Assoc 2018; 25:99-108. [PMID: 28505320 PMCID: PMC7647127 DOI: 10.1093/jamia/ocx032] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 11/16/2016] [Revised: 02/08/2017] [Accepted: 03/14/2017] [Indexed: 11/14/2022] Open
Abstract
Objective Data integration methods that combine data from different molecular levels such as genome, epigenome, transcriptome, etc., have received a great deal of interest in the past few years. It has been demonstrated that the synergistic effects of different biological data types can boost learning capabilities and lead to a better understanding of the underlying interactions among molecular levels. Methods In this paper we present a graph-based semi-supervised classification algorithm that incorporates latent biological knowledge in the form of biological pathways with gene expression and DNA methylation data. The process of graph construction from biological pathways is based on detecting condition-responsive genes, where 3 sets of genes are finally extracted: all condition responsive genes, high-frequency condition-responsive genes, and P-value-filtered genes. Results The proposed approach is applied to ovarian cancer data downloaded from the Human Genome Atlas. Extensive numerical experiments demonstrate superior performance of the proposed approach compared to other state-of-the-art algorithms, including the latest graph-based classification techniques. Conclusions Simulation results demonstrate that integrating various data types enhances classification performance and leads to a better understanding of interrelations between diverse omics data types. The proposed approach outperforms many of the state-of-the-art data integration algorithms.
Affiliation(s)
- Linda R Petzold
  - Department of Computer Science, University of California, Santa Barbara, CA, USA
21
Chung K, Richards T, Nicot R, Vieira AR, Cruz CV, Raoul G, Ferri J, Sciote JJ. ENPP1 and ESR1 genotypes associated with subclassifications of craniofacial asymmetry and severity of temporomandibular disorders. Am J Orthod Dentofacial Orthop 2017; 152:631-645. [PMID: 29103441 DOI: 10.1016/j.ajodo.2017.03.024] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 09/01/2016] [Revised: 03/01/2017] [Accepted: 03/01/2017] [Indexed: 12/13/2022]
Abstract
INTRODUCTION We investigated whether ACTN3, ENPP1, ESR1, PITX1, and PITX2 genes which contribute to sagittal and vertical malocclusions also contribute to facial asymmetries and temporomandibular disorders (TMD) before and after orthodontic and orthognathic surgery treatment. METHODS One hundred seventy-four patients with a dentofacial deformity were diagnosed as symmetric or subdivided into 4 asymmetric groups according to posteroanterior cephalometric measurements. TMD examination diagnosis and jaw pain and function (JPF) questionnaires assessed the presence and severity of TMD. RESULTS Fifty-two percent of the patients were symmetric, and 48% were asymmetric. The asymmetry classification demonstrated significant cephalometric differences between the symmetric and asymmetric groups, and across the 4 asymmetric subtypes: group 1, mandibular body asymmetry; group 2, ramus asymmetry; group 3, atypical asymmetry; and group 4, C-shaped asymmetry. ENPP1 SNP-rs6569759 was associated with group 1 (P = 0.004), and rs858339 was associated with group 3 (P = 0.002). ESR1 SNP-rs164321 was associated with group 4 (P = 0.019). These results were confirmed by principal component analysis that showed 3 principal components explaining almost 80% of the variations in the studied groups. Principal components 1 and 2 were associated with ESR1 SNP-rs3020318 (P <0.05). Diagnoses of disc displacement with reduction, masticatory muscle myalgia, and arthralgia were highly prevalent in the asymmetry groups, and all had strong statistical associations with ENPP1 rs858339. The average JPF scores for asymmetric subjects before surgery (JPF, 7) were significantly higher than for symmetric subjects (JPF, 2). Patients in group 3 had the highest preoperative JPF scores, and groups 2 and 3 were most likely to be cured of TMD 1 year after treatment. CONCLUSIONS Posteroanterior cephalometrics can classify asymmetry into distinct groups and identify the probability of TMD and genotype associations. 
Orthodontic and orthognathic treatments of facial asymmetry are effective at eliminating TMD in most patients.
Affiliation(s)
- Kay Chung
  - Department of Orthodontics, Temple University, Philadelphia, Pa
- Romain Nicot
  - Department of Oral and Maxillofacial Surgery, Roger Salengro Hospital, Université Lille Nord de France, Lille, France
- Alexandre R Vieira
  - Department of Oral Biology, School of Dental Medicine, University of Pittsburgh, Pittsburgh, Pa
- Christiane V Cruz
  - Department of Pediatric Dentistry and Orthodontics, School of Dentistry, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil
- Gwénaël Raoul
  - Department of Oral and Maxillofacial Surgery, Roger Salengro Hospital, Université Lille Nord de France, Lille, France
- Joel Ferri
  - Department of Oral and Maxillofacial Surgery, Roger Salengro Hospital, Université Lille Nord de France, Lille, France
- James J Sciote
  - Department of Orthodontics, Temple University, Philadelphia, Pa
22
Zhang W, Chien J, Yong J, Kuang R. Network-based machine learning and graph theory algorithms for precision oncology. NPJ Precis Oncol 2017; 1:25. [PMID: 29872707 PMCID: PMC5871915 DOI: 10.1038/s41698-017-0029-7] [Citation(s) in RCA: 49] [Impact Index Per Article: 6.1] [Received: 02/02/2017] [Revised: 06/28/2017] [Accepted: 06/29/2017] [Indexed: 01/07/2023] Open
Abstract
Network-based analytics plays an increasingly important role in precision oncology. Growing evidence in recent studies suggests that cancer can be better understood through mutated or dysregulated pathways or networks rather than individual mutations and that the efficacy of repositioned drugs can be inferred from disease modules in molecular networks. This article reviews network-based machine learning and graph theory algorithms for integrative analysis of personal genomic data and biomedical knowledge bases to identify tumor-specific molecular mechanisms, candidate targets and repositioned drugs for personalized treatment. The review focuses on the algorithmic design and mathematical formulation of these methods to facilitate applications and implementations of network-based analysis in the practice of precision oncology. We review the methods applied in three scenarios to integrate genomic data and network models in different analysis pipelines, and we examine three categories of network-based approaches for repositioning drugs in drug-disease-gene networks. In addition, we perform a comprehensive subnetwork/pathway analysis of mutations in 31 cancer genome projects in the Cancer Genome Atlas and present a detailed case study on ovarian cancer. Finally, we discuss interesting observations, potential pitfalls and future directions in network-based precision oncology.
Affiliation(s)
- Wei Zhang
  - Department of Computer Science and Engineering, University of Minnesota Twin Cities, Minneapolis, MN USA
- Jeremy Chien
  - Department of Cancer Biology, University of Kansas Medical Center, Kansas City, KS USA
- Jeongsik Yong
  - Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota Twin Cities, Minneapolis, MN USA
- Rui Kuang
  - Department of Computer Science and Engineering, University of Minnesota Twin Cities, Minneapolis, MN USA
23
Disease genes prioritizing mechanisms: a comprehensive and systematic literature review. ACTA ACUST UNITED AC 2017. [DOI: 10.1007/s13721-017-0154-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 10/19/2022]
24
Chen Y, Xu R. Context-sensitive network-based disease genetics prediction and its implications in drug discovery. Bioinformatics 2017; 33:1031-1039. [PMID: 28062449 DOI: 10.1093/bioinformatics/btw737] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 07/13/2016] [Accepted: 11/19/2016] [Indexed: 01/05/2023] Open
Abstract
Motivation Disease phenotype networks play an important role in computational approaches to identifying new disease-gene associations. Current disease phenotype networks often model disease relationships based on pairwise similarities, and therefore ignore the specific context in which two diseases are connected. In this study, we propose a new strategy to model disease associations using context-sensitive networks (CSNs). We developed a CSN-based phenome-driven approach for disease genetics prediction, and investigated the translational potential of the predicted genes in drug discovery. Results We constructed CSNs by directly connecting diseases with associated phenotypes. Here, we constructed two CSNs using different data sources; the two networks contain 26 790 and 13 822 nodes respectively. We integrated the CSNs with a genetic functional relationship network and predicted disease genes using a network-based ranking algorithm. For comparison, we built Similarity-Based disease Networks (SBN) using the same disease phenotype data. In a de novo cross validation for 3324 diseases, the CSN-based approach significantly improved the average rank from the top 12.6% to the top 8.8% for all tested genes compared with the SBN-based approach (p < e-22). The area under the receiver operating characteristic curve for the CSN approach was also significantly higher than for the SBN approach (0.91 versus 0.87, p < e-3). In addition, we predicted genes for Parkinson's disease (PD) using CSNs, and demonstrated that the top-ranked genes are highly relevant to PD pathogenesis. We pinpointed a top-ranked drug target gene for PD, and found its association with neurodegeneration supported by literature. In summary, CSNs significantly improve disease genetics prediction compared with SBNs and provide leads for potential drug targets. Availability and Implementation nlp.case.edu/public/data/. Contact rxx@case.edu.
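The abstract does not name the network-based ranking algorithm it uses. As one common choice for this kind of task, the sketch below ranks nodes by random walk with restart from a seed node; the choice of algorithm, the function name `random_walk_with_restart`, the restart probability, and the four-node toy graph are all assumptions made for illustration.

```python
import numpy as np

def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Score every node by its steady-state visiting probability.

    adj   : symmetric adjacency matrix (hypothetical toy network)
    seeds : indices of the nodes the walker restarts from
    """
    adj = np.asarray(adj, dtype=float)
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0      # avoid division by zero for isolated nodes
    W = adj / col_sums                 # column-normalized transition matrix
    p0 = np.zeros(adj.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds) # restart distribution over the seeds
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * (W @ p) + restart * p0
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p_next

# Toy path graph 0 - 1 - 2 - 3, with node 0 as the seed.
adj = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
scores = random_walk_with_restart(adj, seeds=[0])
ranking = np.argsort(-scores)  # node indices from highest to lowest score
```

In this toy graph the scores fall off with distance from the seed, which is the intuition behind using network proximity to prioritize candidate genes.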
25
Abstract
Paralleling the increasing availability of protein-protein interaction (PPI) network data, several network alignment methods have been proposed. Network alignments have been used to uncover functionally conserved network parts and to transfer annotations. However, due to the computational intractability of the network alignment problem, aligners are heuristics providing divergent solutions and no consensus exists on a gold standard, or which scoring scheme should be used to evaluate them. We comprehensively evaluate the alignment scoring schemes and global network aligners on large scale PPI data and observe that three methods, HUBALIGN, L-GRAAL and NATALIE, regularly produce the most topologically and biologically coherent alignments. We study the collective behaviour of network aligners and observe that PPI networks are almost entirely aligned with a handful of aligners that we unify into a new tool, Ulign. Ulign enables complete alignment of two networks, which traditional global and local aligners fail to do. Also, multiple mappings of Ulign define biologically relevant soft clusterings of proteins in PPI networks, which may be used for refining the transfer of annotations across networks. Hence, PPI networks are already well investigated by current aligners, so to gain additional biological insights, a paradigm shift is needed. We propose such a shift come from aligning all available data types collectively rather than any particular data type in isolation from others.
|
26
|
Xu Y, Guo M, Liu X, Wang C, Liu Y, Liu G. Identify bilayer modules via pseudo-3D clustering: applications to miRNA-gene bilayer networks. Nucleic Acids Res 2016; 44:e152. [PMID: 27484480 PMCID: PMC5741208 DOI: 10.1093/nar/gkw679] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 03/18/2015] [Revised: 06/30/2016] [Accepted: 07/18/2016] [Indexed: 12/11/2022] Open
Abstract
Module identification is a frequently used approach for mining local structures of particular significance in global networks. Recently, a wide variety of bilayer networks have emerged to characterize more complex biological processes. Because of the special topological properties of bilayer networks and the accompanying challenges, there is not yet an effective method for bilayer module identification that can probe the modular organization of these networks. To this end, we proposed the pseudo-3D clustering algorithm, which starts by extracting initial non-hierarchically organized modules and then iteratively deciphers the hierarchical organization of modules according to a bottom-up strategy. Specifically, a modularity function for bilayer modules was proposed so that the algorithm can report the optimal partition, the one that gives the most accurate characterization of the bilayer network. Simulation studies demonstrated its robustness and its superior performance over competing methods. Specific applications to both the soybean and human miRNA-gene bilayer networks demonstrated that the pseudo-3D clustering algorithm successfully identified overlapping, hierarchically organized and highly cohesive bilayer modules. Analyses of topology, functional and human disease enrichment, and the bilayer subnetwork involved in soybean fat biosynthesis provided both theoretical and biological evidence supporting the effectiveness and robustness of the pseudo-3D clustering algorithm.
Affiliation(s)
- Yungang Xu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- School of Life Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- Maozu Guo
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- Xiaoyan Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- Chunyu Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- Yang Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- Guojun Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
|
27
|
Pouladi N, Achour I, Li H, Berghout J, Kenost C, Gonzalez-Garay ML, Lussier YA. Biomechanisms of Comorbidity: Reviewing Integrative Analyses of Multi-omics Datasets and Electronic Health Records. Yearb Med Inform 2016; 25:194-206. [PMID: 27830251 PMCID: PMC5171562 DOI: 10.15265/iy-2016-040] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 12/21/2022] Open
Abstract
OBJECTIVES Disease comorbidity is a pervasive phenomenon impacting patients' health outcomes, disease management, and clinical decisions. This review presents past, current and future research directions leveraging both phenotypic and molecular information to uncover disease similarity underpinning the biology and etiology of disease comorbidity. METHODS We retrieved ~130 publications and retained 59, ranging from 2006 to 2015, that comprise a minimum of five diseases and at least one type of biomolecule. We surveyed their methods, disease similarity metrics, and calculation of comorbidities in the electronic health records, if present. RESULTS Among the surveyed studies, 44% generated or validated disease similarity metrics in the context of comorbidity, with 60% being published in the last two years. As inputs, 87% of studies utilized intragenic loci and proteins while 13% employed RNA (mRNA, LncRNA or miRNA). Network modeling was predominantly used (35%) followed by statistics (28%) to impute similarity between these biomolecules and diseases. Studies with large numbers of biomolecules and diseases used network models or naïve overlap of disease-molecule associations, while machine learning, statistics, and information retrieval were utilized in small and moderate-sized studies. Multiscale computations comprising shared function, network topology, and phenotypes were performed exclusively on proteins. CONCLUSION This review highlighted the growing number of methods for identifying the molecular mechanisms underpinning comorbidities that leverage multiscale molecular information and patterns from electronic health records. The survey revealed that intergenic polymorphisms have been overlooked for similarity imputation compared to their intragenic counterparts, offering new opportunities to bridge the mechanistic and similarity gaps of comorbidity.
Affiliation(s)
- Y A Lussier
- Dr. Yves A. Lussier, The University of Arizona, Bio5 Building, 1657 East Helen Street, Tucson, AZ 85721, USA, Fax: +1 520 626 4824, E-Mail:
|
28
|
Jeong CS, Kim D. Inferring Crohn's disease association from exome sequences by integrating biological knowledge. BMC Med Genomics 2016; 9 Suppl 1:35. [PMID: 27535358 PMCID: PMC4989895 DOI: 10.1186/s12920-016-0189-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 12/17/2022] Open
Abstract
Background Exome sequencing has emerged as a primary method to identify detailed sequence variants associated with complex diseases, including Crohn’s disease, in the protein-coding regions of the human genome. However, constructing an interpretable model for exome sequencing data is challenging because of the huge diversity of genomic variation. In addition, it is known that utilizing biologically relevant information in a rigorous manner is essential for effectively extracting disease-associated information. Results In this paper, we incorporate three types of biological knowledge (predicted pathogenicity, disease gene annotation, and a functional interaction network of human genes) and integrate them with exome sequence data in a non-negative matrix tri-factorization framework. Using the proposed method, we successfully identified Crohn’s disease patients from exome sequencing data and achieved an area under the receiver operating characteristic curve (AUC) of 0.816, while other clustering methods not using biological information achieved an AUC of 0.786. Moreover, the disease association score derived from our method showed higher correlation with Crohn’s disease genes than with other, unrelated genes. Conclusions By integrating biological information across multiple levels (variant, gene, and systems), our method could be useful for identifying disease susceptibility and its associated genes from exome sequencing data.
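As a rough illustration of the factorization family this abstract refers to, the sketch below runs plain non-negative matrix tri-factorization, X ≈ F S Gᵀ, with standard multiplicative updates. It is not the authors' model: the abstract does not give their objective, and the biological-knowledge terms (pathogenicity, gene annotation, interaction network) are omitted here. The function name `nmtf`, the toy matrix, and all parameter values are assumptions made for the example.

```python
import numpy as np

def nmtf(X, k1, k2, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative matrix X (n x m) as F @ S @ G.T.

    F (n x k1) groups rows, G (m x k2) groups columns, and S (k1 x k2)
    links row groups to column groups. Plain multiplicative updates for
    the Frobenius-norm objective; no extra regularization terms.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    F = rng.random((n, k1)) + eps
    S = rng.random((k1, k2)) + eps
    G = rng.random((m, k2)) + eps
    errors = []
    for _ in range(n_iter):
        # Each factor is rescaled element-wise, so it stays non-negative.
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
        errors.append(float(np.linalg.norm(X - F @ S @ G.T)))
    return F, S, G, errors

# Made-up patient-by-gene matrix with two planted blocks (not real data).
X = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 4],
    [0, 0, 3, 3],
], dtype=float)

F, S, G, errors = nmtf(X, k1=2, k2=2)
# One simple way to read off a row grouping: the largest entry in each row of F.
row_groups = F.argmax(axis=1)
```

In a setting like the one described, rows would correspond to patients and columns to genes, and prior knowledge would typically enter as additional penalty or constraint terms on the factors; how the paper does this is not stated in the abstract.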
Affiliation(s)
- Chan-Seok Jeong
- Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291 Daehak-ro, Yuseong-gu, 34141 Daejeon, Republic of Korea
- Dongsup Kim
- Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291 Daehak-ro, Yuseong-gu, 34141 Daejeon, Republic of Korea.
|
29
|
Thomas J, Seo D, Sael L. Review on Graph Clustering and Subgraph Similarity Based Analysis of Neurological Disorders. Int J Mol Sci 2016; 17:ijms17060862. [PMID: 27258269 PMCID: PMC4926396 DOI: 10.3390/ijms17060862] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 03/18/2016] [Revised: 05/10/2016] [Accepted: 05/24/2016] [Indexed: 01/03/2023] Open
Abstract
How can complex relationships among molecular or clinico-pathological entities of neurological disorders be represented and analyzed? Graphs seem to be the current answer to the question no matter the type of information: molecular data, brain images or neural signals. We review a wide spectrum of graph representation and graph analysis methods and their application in the study of both the genomic level and the phenotypic level of the neurological disorder. We find numerous research works that create, process and analyze graphs formed from one or a few data types to gain an understanding of specific aspects of the neurological disorders. Furthermore, with the increasing number of data of various types becoming available for neurological disorders, we find that integrative analysis approaches that combine several types of data are being recognized as a way to gain a global understanding of the diseases. Although there are still not many integrative analyses of graphs due to the complexity in analysis, multi-layer graph analysis is a promising framework that can incorporate various data types. We describe and discuss the benefits of the multi-layer graph framework for studies of neurological disease.
Affiliation(s)
- Jaya Thomas
- Department of Computer Science, Stony Brook University, Stony Brook, NY 11794, USA.
- Department of Computer Science, State University New York Korea, Incheon 406-840, Korea.
- Dongmin Seo
- Korea Institute of Science and Technology Information, 245 Daehak-ro, Yuseong-gu, Daejeon 34141, Korea.
- Lee Sael
- Department of Computer Science, Stony Brook University, Stony Brook, NY 11794, USA.
- Department of Computer Science, State University New York Korea, Incheon 406-840, Korea.
|
30
|
Neighborhood Regularized Logistic Matrix Factorization for Drug-Target Interaction Prediction. PLoS Comput Biol 2016; 12:e1004760. [PMID: 26872142 PMCID: PMC4752318 DOI: 10.1371/journal.pcbi.1004760] [Citation(s) in RCA: 205] [Impact Index Per Article: 22.8] [Received: 06/16/2015] [Accepted: 01/14/2016] [Indexed: 12/19/2022] Open
Abstract
In pharmaceutical sciences, a crucial step of the drug discovery process is the identification of drug-target interactions. However, only a small portion of the drug-target interactions have been experimentally validated, as the experimental validation is laborious and costly. To improve the drug discovery efficiency, there is a great need for the development of accurate computational approaches that can predict potential drug-target interactions to direct the experimental verification. In this paper, we propose a novel drug-target interaction prediction algorithm, namely neighborhood regularized logistic matrix factorization (NRLMF). Specifically, the proposed NRLMF method focuses on modeling the probability that a drug would interact with a target by logistic matrix factorization, where the properties of drugs and targets are represented by drug-specific and target-specific latent vectors, respectively. Moreover, NRLMF assigns higher importance levels to positive observations (i.e., the observed interacting drug-target pairs) than negative observations (i.e., the unknown pairs). Because the positive observations are already experimentally verified, they are usually more trustworthy. Furthermore, the local structure of the drug-target interaction data has also been exploited via neighborhood regularization to achieve better prediction accuracy. We conducted extensive experiments over four benchmark datasets, and NRLMF demonstrated its effectiveness compared with five state-of-the-art approaches. This work introduces a computational approach, namely neighborhood regularized logistic matrix factorization (NRLMF), to predicting potential interactions between drugs and targets. The novelty of NRLMF lies in integrating logistic matrix factorization with neighborhood regularization for drug-target interaction prediction. In NRLMF, we model the interaction probability for each drug-target pair using logistic matrix factorization. 
As the observed interacting drug-target pairs are experimentally verified, they are more trustworthy than the unknown pairs. We propose to assign higher importance levels to interaction pairs and lower importance levels to unknown pairs. In addition, we further improve the prediction accuracy by neighborhood regularization, which considers the neighborhood influences from most similar drugs and most similar targets. To evaluate the performance of NRLMF, we conducted extensive experiments on four benchmark datasets. The experimental results demonstrated that NRLMF usually outperformed five state-of-the-art methods under three different cross-validation settings, in terms of the area under the ROC curve (AUC) and the area under the precision-recall curve (AUPR). In addition, we confirmed the practical prediction ability of NRLMF by mapping with the latest version of four online biological databases, including ChEMBL, DrugBank, KEGG, and Matador.
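The core idea described above (interaction probability from the inner product of drug and target latent vectors, with positive pairs weighted more heavily than unknown pairs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the published NRLMF implementation: the neighborhood regularization is replaced by a plain L2 penalty, the optimizer is basic gradient descent, and the function name, toy matrix, and hyperparameter values (`c`, `lam`, `lr`) are invented for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def weighted_logistic_mf(Y, rank=2, c=5.0, lam=0.1, lr=0.02, n_iter=1000, seed=0):
    """Fit latent vectors U (drugs) and V (targets) so that
    sigmoid(U @ V.T) approximates the binary interaction matrix Y.

    Observed interactions (Y == 1) get weight c in the log-likelihood,
    unknown pairs (Y == 0) get weight 1. Neighborhood regularization is
    NOT implemented; a simple L2 penalty stands in for it.
    """
    rng = np.random.default_rng(seed)
    n_drugs, n_targets = Y.shape
    U = 0.1 * rng.standard_normal((n_drugs, rank))
    V = 0.1 * rng.standard_normal((n_targets, rank))
    W = 1.0 + (c - 1.0) * Y  # c for positives, 1 for unknown pairs
    for _ in range(n_iter):
        P = sigmoid(U @ V.T)
        G = W * (P - Y)  # gradient of the weighted negative log-likelihood w.r.t. the logits
        grad_U = G @ V + lam * U
        grad_V = G.T @ U + lam * V
        U -= lr * grad_U
        V -= lr * grad_V
    return U, V

# Made-up 4 drugs x 4 targets interaction matrix (1 = known interaction).
Y = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

U, V = weighted_logistic_mf(Y)
P = sigmoid(U @ V.T)  # predicted interaction probabilities for every pair
```

Ranking the unknown pairs by their entries in `P` is the usual way such a model is turned into predictions; the weight `c` controls how strongly the verified interactions dominate the fit.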
|
31
|
Computational Methods for Integration of Biological Data. Per Med 2016. [DOI: 10.1007/978-3-319-39349-0_8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
|
32
|
Abstract
MOTIVATION Discerning genetic contributions to diseases not only enhances our understanding of disease mechanisms, but also leads to translational opportunities for drug discovery. Recent computational approaches incorporate disease phenotypic similarities to improve the prediction power of disease gene discovery. However, most current studies used only one data source of human disease phenotype. We present an innovative and generic strategy for combining multiple different data sources of human disease phenotype and predicting disease-associated genes from integrated phenotypic and genomic data. RESULTS To demonstrate our approach, we explored a new phenotype database from biomedical ontologies and constructed the Disease Manifestation Network (DMN). We combined DMN with mimMiner, a widely used phenotype database in disease gene prediction studies. Our approach achieved significantly improved performance over a baseline method that used only one phenotype data source. In the leave-one-out cross-validation and de novo gene prediction analysis, our approach achieved areas under the curve of 90.7% and 90.3%, which are significantly higher than 84.2% (P < e-4) and 81.3% (P < e-12) for the baseline approach. We further demonstrated that our predicted genes have translational potential in drug discovery. We used Crohn's disease as an example and ranked the candidate drugs based on the rank of drug targets. Our gene prediction approach prioritized druggable genes that are likely to be associated with Crohn's disease pathogenesis, and our rank of candidate drugs successfully prioritized the Food and Drug Administration-approved drugs for Crohn's disease. We also found literature evidence to support a number of drugs among the top 200 candidates. In summary, we demonstrated that a novel strategy combining unique disease phenotype data with system approaches can lead to rapid drug discovery. AVAILABILITY AND IMPLEMENTATION nlp.case.edu/public/data/DMN
Affiliation(s)
- Yang Chen
- Department of Electrical Engineering and Computer Science, Department of Epidemiology and Biostatistics and Department of Family Medicine and Community Health, Case Western Reserve University, Cleveland, OH 44106, USA
- Li Li
- Department of Electrical Engineering and Computer Science, Department of Epidemiology and Biostatistics and Department of Family Medicine and Community Health, Case Western Reserve University, Cleveland, OH 44106, USA
- Guo-Qiang Zhang
- Department of Electrical Engineering and Computer Science, Department of Epidemiology and Biostatistics and Department of Family Medicine and Community Health, Case Western Reserve University, Cleveland, OH 44106, USA
- Rong Xu
- Department of Electrical Engineering and Computer Science, Department of Epidemiology and Biostatistics and Department of Family Medicine and Community Health, Case Western Reserve University, Cleveland, OH 44106, USA
|
33
|
Park S, Kim SJ, Yu D, Peña-Llopis S, Gao J, Park JS, Chen B, Norris J, Wang X, Chen M, Kim M, Yong J, Wardak Z, Choe K, Story M, Starr T, Cheong JH, Hwang TH. An integrative somatic mutation analysis to identify pathways linked with survival outcomes across 19 cancer types. Bioinformatics 2015; 32:1643-51. [PMID: 26635139 DOI: 10.1093/bioinformatics/btv692] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Received: 04/19/2015] [Accepted: 11/09/2015] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION Identification of altered pathways that are clinically relevant across human cancers is a key challenge in cancer genomics. Precise identification and understanding of these altered pathways may provide novel insights into patient stratification, therapeutic strategies and the development of new drugs. However, a challenge remains in accurately identifying pathways altered by somatic mutations across human cancers, due to the diverse mutation spectrum. We developed an innovative approach to integrate somatic mutation data with gene networks and pathways, in order to identify pathways altered by somatic mutations across cancers. RESULTS We applied our approach to The Cancer Genome Atlas (TCGA) dataset of somatic mutations in 4790 cancer patients with 19 different types of tumors. Our analysis identified cancer-type-specific altered pathways enriched with known cancer-relevant genes and targets of currently available drugs. To investigate the clinical significance of these altered pathways, we performed consensus clustering for patient stratification using member genes in the altered pathways coupled with gene expression datasets from 4870 patients from TCGA, and multiple independent cohorts confirmed that the altered pathways could be used to stratify patients into subgroups with significantly different clinical outcomes. Of particular significance, certain patient subpopulations with poor prognosis were identified because they had specific altered pathways for which there are available targeted therapies. These findings could be used to tailor and intensify therapy in these patients, for whom current therapy is suboptimal. AVAILABILITY AND IMPLEMENTATION The code is available at: http://www.taehyunlab.org CONTACT jhcheong@yuhs.ac or taehyun.hwang@utsouthwestern.edu or taehyun.cs@gmail.com SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sunho Park
- Department of Clinical Sciences, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Seung-Jun Kim
- Department of Computer Science and Electrical Engineering, University of Maryland at Baltimore County, Baltimore, MD, USA
- Donghyeon Yu
- Department of Statistics, Keimyung University, Daegu, South Korea
- Samuel Peña-Llopis
- Internal Medicine and Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Jianjiong Gao
- Center for Molecular Biology, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Jin Suk Park
- Department of Clinical Sciences, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Beibei Chen
- Department of Clinical Sciences, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Jessie Norris
- Department of Clinical Sciences, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Xinlei Wang
- Department of Statistical Science, Southern Methodist University, Dallas, TX, USA
- Min Chen
- Department of Mathematical Sciences, University of Texas at Dallas, Dallas, TX, USA
- Minsoo Kim
- Department of Clinical Sciences, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Jeongsik Yong
- Department of Biochemistry, Molecular Biology and Biophysics, Obstetrics, Gynecology & Women's Health, University of Minnesota Twin Cities, Minneapolis, MN, USA
- Zabi Wardak
- Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA, Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Kevin Choe
- Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA, Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Michael Story
- Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA, Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Timothy Starr
- Genetics, Cell Biology, University of Minnesota Twin Cities, Minneapolis, MN, USA, Masonic Cancer Center, University of Minnesota Twin Cities, Minneapolis, MN, USA
- Jae-Ho Cheong
- Department of Surgery, Yonsei University College of Medicine, Seoul, South Korea and Open NBI Convergence Technology Research Laboratory, Yonsei University College of Medicine, Seoul, South Korea
- Tae Hyun Hwang
- Department of Clinical Sciences, University of Texas Southwestern Medical Center, Dallas, TX, USA, Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA
|
34
|
Gligorijević V, Pržulj N. Methods for biological data integration: perspectives and challenges. J R Soc Interface 2015; 12:20150571. [PMID: 26490630 PMCID: PMC4685837 DOI: 10.1098/rsif.2015.0571] [Citation(s) in RCA: 136] [Impact Index Per Article: 13.6] [Received: 06/26/2015] [Accepted: 09/25/2015] [Indexed: 12/17/2022] Open
Abstract
Rapid technological advances have led to the production of different types of biological data and enabled construction of complex networks with various types of interactions between diverse biological entities. Standard network data analysis methods were shown to be limited in dealing with such heterogeneous networked data and consequently, new methods for integrative data analyses have been proposed. The integrative methods can collectively mine multiple types of biological data and produce more holistic, systems-level biological insights. We survey recent methods for collective mining (integration) of various types of networked biological data. We compare different state-of-the-art methods for data integration and highlight their advantages and disadvantages in addressing important biological problems. We identify the important computational challenges of these methods and provide a general guideline for which methods are suited for specific biological problems, or specific data types. Moreover, we propose that recent non-negative matrix factorization-based approaches may become the integration methodology of choice, as they are well suited and accurate in dealing with heterogeneous data and have many opportunities for further development.
Affiliation(s)
- Nataša Pržulj
- Department of Computing, Imperial College London, London SW7 2AZ, UK
|
35
|
Liu R, Cheng W, Tong H, Wang W, Zhang X. Robust Multi-Network Clustering via Joint Cross-Domain Cluster Alignment. PROCEEDINGS. IEEE INTERNATIONAL CONFERENCE ON DATA MINING 2015; 2015:291-300. [PMID: 27239167 PMCID: PMC4880426 DOI: 10.1109/icdm.2015.13] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Indexed: 12/26/2022]
Abstract
Network clustering is an important problem that has recently drawn a lot of attention. Most existing work focuses on clustering nodes within a single network. In many applications, however, there exist multiple related networks, in which each network may be constructed from a different domain and instances in one domain may be related to instances in other domains. In this paper, we propose a robust algorithm, MCA, for multi-network clustering that takes into account cross-domain relationships between instances. MCA has several advantages over the existing single network clustering methods. First, it is able to detect associations between clusters from different domains, which is not addressed by any existing method. Second, it achieves more consistent clustering results on multiple networks by leveraging the duality between clustering individual networks and inferring cross-network cluster alignment. Finally, it provides a multi-network clustering solution that is more robust to noise and errors. We perform extensive experiments on a variety of real and synthetic networks to demonstrate the effectiveness and efficiency of MCA.
Affiliation(s)
- Rui Liu
- Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106
- Wei Cheng
- Department of Computer Science, University of North Carolina at Chapel Hill, NC 27599
- Hanghang Tong
- School of Computing, Informatics, Decision Systems Engineering, Arizona State University, Tempe, AZ 85281
- Wei Wang
- Department of Computer Science, University of California, Los Angeles, CA 90095
- Xiang Zhang
- Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106
|
36
|
Prioritizing Clinically Relevant Copy Number Variation from Genetic Interactions and Gene Function Data. PLoS One 2015; 10:e0139656. [PMID: 26437450 PMCID: PMC4593641 DOI: 10.1371/journal.pone.0139656] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 06/11/2014] [Accepted: 09/16/2015] [Indexed: 11/19/2022] Open
Abstract
It is becoming increasingly necessary to develop computerized methods for identifying the few disease-causing variants from hundreds discovered in each individual patient. This problem is especially relevant for Copy Number Variants (CNVs), which can be cheaply interrogated via low-cost hybridization arrays commonly used in clinical practice. We present a method to predict the disease relevance of CNVs that combines functional context and clinical phenotype to discover clinically harmful CNVs (and likely causative genes) in patients with a variety of phenotypes. We compare several feature and gene weighing systems for classifying both genes and CNVs. We combined the best performing methodologies and parameters on over 2,500 Agilent CGH 180k Microarray CNVs derived from 140 patients. Our method achieved an F-score of 91.59%, with 87.08% precision and 97.00% recall. Our methods are freely available at https://github.com/compbio-UofT/cnv-prioritization. Our dataset is included with the supplementary information.
|
37
|
Development and mining of a volatile organic compound database. BIOMED RESEARCH INTERNATIONAL 2015; 2015:139254. [PMID: 26495281 PMCID: PMC4606137 DOI: 10.1155/2015/139254] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 04/14/2015] [Accepted: 06/14/2015] [Indexed: 12/16/2022]
Abstract
Volatile organic compounds (VOCs) are small molecules that exhibit high vapor pressure under ambient conditions and have low boiling points. Although VOCs contribute only a small proportion of the total metabolites produced by living organisms, they play an important role in chemical ecology, specifically in the biological interactions between organisms and ecosystems. VOCs are also important in the health care field, as they are presently used as biomarkers to detect various human diseases. Information on VOCs has so far been scattered across the literature, and no database describing VOCs and their biological activities has been available. To address this gap, we have developed the KNApSAcK Metabolite Ecology Database, which contains information on the relationships between VOCs and their emitting organisms. The KNApSAcK Metabolite Ecology Database is also linked with the KNApSAcK Core and KNApSAcK Metabolite Activity Database to provide further information on the metabolites and their biological activities. The VOC database can be accessed online.
|
38
|
Pendergrass SA, Verma A, Okula A, Hall MA, Crawford DC, Ritchie MD. Phenome-Wide Association Studies: Embracing Complexity for Discovery. Hum Hered 2015. [PMID: 26201697 DOI: 10.1159/000381851] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Indexed: 11/19/2022] Open
Abstract
The inherent complexity of biological systems can be leveraged for a greater understanding of the impact of genetic architecture on outcomes, traits, and pharmacological response. The genome-wide association study (GWAS) approach has well-developed methods and relatively straight-forward methodologies; however, the bigger picture of the impact of genetic architecture on phenotypic outcome still remains to be elucidated even with an ever-growing number of GWAS performed. Greater consideration of the complexity of biological processes, using more data from the phenome, exposome, and diverse -omic resources, including considering the interplay of pleiotropy and genetic interactions, may provide additional leverage for making the most of the incredible wealth of information available for study. Here, we describe how incorporating greater complexity into analyses through the use of additional phenotypic data and widespread deployment of phenome-wide association studies may provide new insights into genetic factors influencing diseases, traits, and pharmacological response.
Affiliation(s)
- Sarah A Pendergrass
- Biomedical and Translational Informatics Program, Geisinger Health System, Danville, Pa., USA
|
39
|
Ullah MZ, Aono M, Seddiqui MH. Estimating a ranked list of human hereditary diseases for clinical phenotypes by using weighted bipartite network. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2015; 2013:3475-8. [PMID: 24110477 DOI: 10.1109/embc.2013.6610290] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 11/09/2022]
Abstract
With the availability of large amounts of medical knowledge data on the Internet, such as the human disease network, the protein-protein interaction (PPI) network, and the phenotype-gene and gene-disease bipartite networks, it becomes practical to help doctors by suggesting plausible hereditary diseases for a set of clinical phenotypes. However, identifying candidate diseases that best explain a set of clinical phenotypes by considering various heterogeneous networks is still a challenging task. In this paper, we propose a new method for estimating a ranked list of plausible diseases by associating phenotype-gene with gene-disease bipartite networks. Our approach is to count the frequency of all the paths from a phenotype to a disease through their associated causative genes, and to link the phenotype to the disease with the path frequency in a new phenotype-disease bipartite (PDB) network. After that, we generate the candidate weights for the edges between phenotypes and diseases in the PDB network. We evaluate our proposed method in terms of Normalized Discounted Cumulative Gain (NDCG), and demonstrate that it outperforms the previously known disease ranking method called Phenomizer.
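The path-counting step described above can be sketched with a matrix product: if one 0/1 matrix records phenotype-gene links and another records gene-disease links, their product gives the number of phenotype-to-disease paths through shared genes. The data, the names, and the final ranking rule (summing path counts over the query phenotypes) are assumptions for illustration; the abstract does not specify how the paper derives its edge weights from the path frequencies.

```python
import numpy as np

phenotypes = ["ph1", "ph2", "ph3"]
genes = ["g1", "g2", "g3", "g4"]
diseases = ["d1", "d2"]

# Made-up phenotype-gene bipartite network (rows: phenotypes, columns: genes).
PG = np.array([
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
])

# Made-up gene-disease bipartite network (rows: genes, columns: diseases).
GD = np.array([
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
])

# Entry (i, j) counts the phenotype -> gene -> disease paths from phenotype i to disease j.
path_counts = PG @ GD

def rank_diseases(query):
    """Rank diseases for a set of query phenotypes by summed path counts.

    The summation is a simple stand-in for a weighting scheme, not the
    published one.
    """
    rows = [phenotypes.index(p) for p in query]
    scores = path_counts[rows].sum(axis=0)
    order = np.argsort(-scores, kind="stable")
    return [(diseases[j], int(scores[j])) for j in order]

ranking = rank_diseases(["ph1", "ph2"])  # -> [("d1", 3), ("d2", 1)]
```

For real networks the two matrices would be large and sparse, so a sparse matrix type would be the natural choice; the counting logic is unchanged.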
|
40
|
Oren Y, Nachshon A, Frishberg A, Wilentzik R, Gat-Viks I. Linking traits based on their shared molecular mechanisms. eLife 2015; 4. [PMID: 25781485 PMCID: PMC4362207 DOI: 10.7554/elife.04346] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 09/03/2014] [Accepted: 02/20/2015] [Indexed: 12/29/2022] Open
Abstract
There is growing recognition that co-morbidity and co-occurrence of disease traits are often determined by shared genetic and molecular mechanisms. In most cases, however, the specific mechanisms that lead to such trait-trait relationships are yet unknown. Here we present an analysis of a broad spectrum of behavioral and physiological traits together with gene-expression measurements across genetically diverse mouse strains. We develop an unbiased methodology that constructs potentially overlapping groups of traits and resolves their underlying combination of genetic loci and molecular mechanisms. For example, our method predicts that genetic variation in the Klf7 gene may influence gene transcripts in bone marrow-derived myeloid cells, which in turn affect 17 behavioral traits following morphine injection; this predicted effect of Klf7 is consistent with an in vitro perturbation of Klf7 in bone marrow cells. Our analysis demonstrates the utility of studying hidden causative mechanisms that lead to relationships between complex traits.
Affiliation(s)
- Yael Oren
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
- Aharon Nachshon
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
- Amit Frishberg
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
- Roni Wilentzik
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
- Irit Gat-Viks
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
|
41
|
A general framework for a reliable multivariate analysis and pattern recognition in high-dimensional epidemiological data, based on cluster robustness: a tutorial to enrich the epidemiologists' toolkit. Rev Epidemiol Sante Publique 2015; 63:9-19. [PMID: 25604830 DOI: 10.1016/j.respe.2014.12.017] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 04/28/2014] [Accepted: 12/01/2014] [Indexed: 11/21/2022] Open
Abstract
BACKGROUND In an epidemiologist's toolbox, three main types of statistical tools can be found: means and proportions comparisons, linear or logistic regression models and Cox-type regression models. All these techniques have their own multivariate formulations, so that biases can be accounted for. Nonetheless, there is an entire set of natively massive multivariate techniques, which are based on weaker assumptions than classical statistical techniques are, and which seem to be underestimated or remain unknown to most epidemiologists. These techniques are used for pattern recognition or clustering – that is, for retrieving homogeneous groups in data without any a priori knowledge about these groups. They are widely used in related domains such as genetics or biomolecular studies. METHODS Most clustering techniques require tuning specific parameters so that groups can be identified in data. A critical parameter to set is the number of groups the technique needs to discover. Different approaches to find the optimal number of groups are available, such as the silhouette approach and the robustness approach. This article presents the key aspects of clustering techniques (how proximity between observations is defined and how to find the number of groups), two archetypal techniques (namely the k-means and PAM algorithms) and how they relate to more classical statistical approaches. RESULTS Through a theoretical, simple example and a real data application, we provide a complete framework within which classical epidemiological concerns can be reconsidered. We show how to (i) identify whether distinct groups exist in data, (ii) identify the optimal number of groups in data, (iii) label each observation according to its own group and (iv) analyze the groups identified according to separate and explicative data. In addition, how to achieve consistent results while removing sensitivity to initial conditions is explained. 
CONCLUSIONS Clustering techniques, in conjunction with methods for parameter tuning, provide the epidemiologist with substantial additional tools. They differ from the usual approaches based on hypothesis-testing because no assumptions are made on the data and these clustering techniques are natively multivariate.
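The silhouette approach to choosing the number of groups, mentioned in the METHODS above, can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the tutorial's own code: the toy points, the function names, and the deterministic farthest-point initialization (used here to sidestep sensitivity to initial conditions) are all chosen for the example. The robustness approach and PAM are not shown.

```python
import math


def kmeans(points, k, iters=50):
    """Lloyd's k-means with a deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members)
                                         for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def mean_silhouette(points, labels):
    """Average silhouette width; points in singleton clusters contribute 0."""
    total = 0.0
    for i, p in enumerate(points):
        same = [math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:
            continue
        a = sum(same) / len(same)  # mean distance within the point's own cluster
        b = min(                   # mean distance to the nearest other cluster
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical toy data: three well-separated groups of four points each.
POINTS = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (0, 10), (1, 10), (0, 11), (1, 11)]

scores = {k: mean_silhouette(POINTS, kmeans(POINTS, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

The number of groups with the highest mean silhouette width is retained; on real data the curve is usually flatter, which is one reason the abstract also mentions a robustness-based criterion.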
|
42
|
Abstract
Motivation: Recently, a shift was made from using Gene Ontology (GO) to evaluate molecular network data to using these data to construct and evaluate GO. Dutkowski et al. provide the first evidence that a large part of GO can be reconstructed solely from topologies of molecular networks. Motivated by this work, we develop a novel data integration framework that integrates multiple types of molecular network data to reconstruct and update GO. We ask how much of GO can be recovered by integrating various molecular interaction data. Results: We introduce a computational framework for integration of various biological networks using penalized non-negative matrix tri-factorization (PNMTF). It takes all network data in a matrix form and performs simultaneous clustering of genes and GO terms, inducing new relations between genes and GO terms (annotations) and between GO terms themselves. To improve the accuracy of our predicted relations, we extend the integration methodology to include additional topological information represented as the similarity in wiring around non-interacting genes. Surprisingly, by integrating topologies of baker’s yeast protein–protein interaction, genetic interaction (GI) and co-expression networks, our method reports as related 96% of GO terms that are directly related in GO. The inclusion of the wiring similarity of non-interacting genes contributes 6% to this large GO term association capture. Furthermore, we use our method to infer new relationships between GO terms solely from the topologies of these networks and validate 44% of our predictions in the literature. In addition, our integration method reproduces 48% of cellular component, 41% of molecular function and 41% of biological process GO terms, outperforming the previous method in the former two domains of GO. Finally, we predict new GO annotations of yeast genes and validate our predictions through GIs profiling. 
Availability and implementation: Supplementary Tables of new GO term associations and predicted gene annotations are available at http://bio-nets.doc.ic.ac.uk/GO-Reconstruction/. Contact: natasha@imperial.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
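To make the tri-factorization in this abstract concrete, the sketch below factorizes a small non-negative gene-by-term matrix X as F S Gᵀ using standard multiplicative updates. It is a plain, unpenalized NMTF and deliberately omits the network penalty terms that define PNMTF; the toy matrix, the function names, and the parameter choices are assumptions for illustration only.

```python
import random


def matmul(a, b):
    """Matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def mult_update(m, num, den, eps=1e-9):
    """Element-wise multiplicative update m * num / den; keeps entries non-negative."""
    return [[m[i][j] * num[i][j] / (den[i][j] + eps) for j in range(len(m[0]))]
            for i in range(len(m))]


def frob_error(x, f, s, g):
    """Squared Frobenius norm of x - f s g^T."""
    approx = matmul(matmul(f, s), transpose(g))
    return sum((x[i][j] - approx[i][j]) ** 2
               for i in range(len(x)) for j in range(len(x[0])))


def nmtf(x, k1, k2, iters=200, seed=0):
    """Unpenalized NMTF: x (n x m) ~ f (n x k1) s (k1 x k2) g^T (k2 x m)."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    f = [[rng.random() + 0.1 for _ in range(k1)] for _ in range(n)]
    s = [[rng.random() + 0.1 for _ in range(k2)] for _ in range(k1)]
    g = [[rng.random() + 0.1 for _ in range(k2)] for _ in range(m)]
    xt = transpose(x)
    for _ in range(iters):
        sgt = matmul(s, transpose(g))  # k1 x m
        f = mult_update(f, matmul(x, transpose(sgt)),
                        matmul(f, matmul(sgt, transpose(sgt))))
        fs = matmul(f, s)              # n x k2
        g = mult_update(g, matmul(xt, fs),
                        matmul(g, matmul(transpose(fs), fs)))
        ft = transpose(f)
        s = mult_update(s, matmul(matmul(ft, x), g),
                        matmul(matmul(matmul(ft, f), s), matmul(transpose(g), g)))
    return f, s, g


# Hypothetical 4 genes x 4 terms association matrix with two blocks.
X = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]

f0, s0, g0 = nmtf(X, 2, 2, iters=0)  # random initialization only
f, s, g = nmtf(X, 2, 2, iters=200)
print(frob_error(X, f0, s0, g0), frob_error(X, f, s, g))
```

Rows of F can be read as soft gene cluster memberships and rows of G as soft term cluster memberships, which is the sense in which the factorization clusters both sides simultaneously; a penalized variant would add terms to the objective and to these update rules.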
Affiliation(s)
- Vuk Janjić
- Department of Computing, Imperial College London SW7 2AZ, UK
- Nataša Pržulj
- Department of Computing, Imperial College London SW7 2AZ, UK
|
43
|
Applying multivariate clustering techniques to health data: the 4 types of healthcare utilization in the Paris metropolitan area. PLoS One 2014; 9:e115064. [PMID: 25506916 PMCID: PMC4266672 DOI: 10.1371/journal.pone.0115064] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Received: 05/15/2014] [Accepted: 11/11/2014] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND Cost containment policies and the need to satisfy patients' health needs and care expectations provide major challenges to healthcare systems. Identification of homogeneous groups in terms of healthcare utilisation could lead to a better understanding of how to adjust healthcare provision to society and patient needs. METHODS This study used data from the third wave of the SIRS cohort study, a representative, population-based, socio-epidemiological study set up in 2005 in the Paris metropolitan area, France. The data were analysed using a cross-sectional design. In 2010, 3000 individuals were interviewed in their homes. Non-conventional multivariate clustering techniques were used to determine homogeneous user groups in data. Multinomial models assessed a wide range of potential associations between user characteristics and their pattern of healthcare utilisation. RESULTS We identified four distinct patterns of healthcare use. Patterns of consumption and the socio-demographic characteristics of users differed qualitatively and quantitatively between these four profiles. Extensive and intensive use by older, wealthier and unhealthier people contrasted with narrow and parsimonious use by younger, socially deprived people and immigrants. Rare, intermittent use by young healthy men contrasted with regular targeted use by healthy and wealthy women. CONCLUSION The use of an original technique of massive multivariate analysis allowed us to characterise different types of healthcare users, both in terms of resource utilisation and socio-demographic variables. This method would merit replication in different populations and healthcare systems.
|
44
|
Chen Y, Zhang X, Zhang GQ, Xu R. Comparative analysis of a novel disease phenotype network based on clinical manifestations. J Biomed Inform 2014; 53:113-20. [PMID: 25277758 DOI: 10.1016/j.jbi.2014.09.007] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Received: 06/16/2014] [Revised: 08/18/2014] [Accepted: 09/21/2014] [Indexed: 12/21/2022]
Abstract
Systems approaches to analyzing disease phenotype networks in combination with protein functional interaction networks have great potential in illuminating disease pathophysiological mechanisms. While many genetic networks are readily available, disease phenotype networks remain largely incomplete. In this study, we built a large-scale Disease Manifestation Network (DMN) from 50,543 highly accurate disease-manifestation semantic relationships in the Unified Medical Language System (UMLS). Our new phenotype network contains 2305 nodes and 373,527 weighted edges to represent the disease phenotypic similarities. We first compared DMN with the networks representing genetic relationships among diseases, and demonstrated that the phenotype clustering in DMN reflects common disease genetics. Then we compared DMN with a widely-used disease phenotype network in previous gene discovery studies, called mimMiner, which was extracted from the textual descriptions in Online Mendelian Inheritance in Man (OMIM). We demonstrated that DMN contains different knowledge from the existing phenotype data source. Finally, a case study on Marfan syndrome further proved that DMN contains useful information and can provide leads to discover unknown disease causes. Integrating DMN in systems approaches with mimMiner and other data offers the opportunities to predict novel disease genetics. We made DMN publicly available at nlp/case.edu/public/data/DMN.
Affiliation(s)
- Yang Chen
- Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106, United States; Division of Medical Informatics, School of Medicine, Case Western Reserve University, Cleveland, OH 44106, United States
- Xiang Zhang
- Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106, United States
- Guo-Qiang Zhang
- Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106, United States; Division of Medical Informatics, School of Medicine, Case Western Reserve University, Cleveland, OH 44106, United States
- Rong Xu
- Division of Medical Informatics, School of Medicine, Case Western Reserve University, Cleveland, OH 44106, United States.
|
45
|
Zemojtel T, Köhler S, Mackenroth L, Jäger M, Hecht J, Krawitz P, Graul-Neumann L, Doelken S, Ehmke N, Spielmann M, Oien NC, Schweiger MR, Krüger U, Frommer G, Fischer B, Kornak U, Flöttmann R, Ardeshirdavani A, Moreau Y, Lewis SE, Haendel M, Smedley D, Horn D, Mundlos S, Robinson PN. Effective diagnosis of genetic disease by computational phenotype analysis of the disease-associated genome. Sci Transl Med 2014; 6:252ra123. [PMID: 25186178 PMCID: PMC4512639 DOI: 10.1126/scitranslmed.3009262] [Citation(s) in RCA: 189] [Impact Index Per Article: 17.2] [Indexed: 12/14/2022]
Abstract
Less than half of patients with suspected genetic disease receive a molecular diagnosis. We have therefore integrated next-generation sequencing (NGS), bioinformatics, and clinical data into an effective diagnostic workflow. We used variants in the 2741 established Mendelian disease genes [the disease-associated genome (DAG)] to develop a targeted enrichment DAG panel (7.1 Mb), which achieves a coverage of 20-fold or better for 98% of bases. Furthermore, we established a computational method [Phenotypic Interpretation of eXomes (PhenIX)] that evaluated and ranked variants based on pathogenicity and semantic similarity of patients' phenotype described by Human Phenotype Ontology (HPO) terms to those of 3991 Mendelian diseases. In computer simulations, ranking genes based on the variant score put the true gene in first place less than 5% of the time; PhenIX placed the correct gene in first place more than 86% of the time. In a retrospective test of PhenIX on 52 patients with previously identified mutations and known diagnoses, the correct gene achieved a mean rank of 2.1. In a prospective study on 40 individuals without a diagnosis, PhenIX analysis enabled a diagnosis in 11 cases (28%, at a mean rank of 2.4). Thus, the NGS of the DAG followed by phenotype-driven bioinformatic analysis allows quick and effective differential diagnostics in medical genetics.
Affiliation(s)
- Tomasz Zemojtel
- Institute for Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany. Institute of Bioorganic Chemistry, Polish Academy of Sciences, 61-704 Poznan, Poland. Labor Berlin-Charité Vivantes GmbH, Humangenetik, Föhrer Straße 15, 13353 Berlin, Germany
- Sebastian Köhler
- Institute for Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
- Luisa Mackenroth
- Institute for Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
- Marten Jäger
- Institute for Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
- Jochen Hecht
- Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, 14195 Berlin, Germany. Berlin-Brandenburg Center for Regenerative Therapies, Charité Universitätsmedizin Berlin, 13353 Berlin, Germany
- Peter Krawitz
- Institute for Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany. Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, 14195 Berlin, Germany
- Luitgard Graul-Neumann
- Institute for Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
- Sandra Doelken
- Institute for Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
- Nadja Ehmke
- Institute for Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
- Malte Spielmann
- Institute for Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany. Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, 14195 Berlin, Germany
- Nancy Christine Oien
- Institute for Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany. Max Delbrück Center for Molecular Medicine, Robert-Rössle-Str. 10, 13125 Berlin, Germany
- Michal R Schweiger
- Institute for Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany. Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, 14195 Berlin, Germany. Cologne Center for Genomics, University of Cologne, D-50931 Cologne, Germany
- Ulrike Krüger
- Institute for Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
- Götz Frommer
- Agilent Technologies, Hewlett-Packard-Straße 8, 76337 Waldbronn, Germany
- Björn Fischer
- Institute for Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany. Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, 14195 Berlin, Germany
- Uwe Kornak
- Institute for Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany. Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, 14195 Berlin, Germany
- Ricarda Flöttmann
- Institute for Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
- Amin Ardeshirdavani
- Department of Electrical Engineering, STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven, 3001 Leuven, Belgium
- Yves Moreau
- Department of Electrical Engineering, STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven, 3001 Leuven, Belgium
- Suzanna E Lewis
- Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Melissa Haendel
- University Library and Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Sciences University, Portland, OR 97327, USA
- Damian Smedley
- Mouse Informatics Group, Wellcome Trust Sanger Institute, CB10 1SA Hinxton, UK
- Denise Horn
- Institute for Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
- Stefan Mundlos
- Institute for Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany. Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, 14195 Berlin, Germany. Berlin-Brandenburg Center for Regenerative Therapies, Charité Universitätsmedizin Berlin, 13353 Berlin, Germany
- Peter N Robinson
- Institute for Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany. Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, 14195 Berlin, Germany. Berlin-Brandenburg Center for Regenerative Therapies, Charité Universitätsmedizin Berlin, 13353 Berlin, Germany. Institute for Bioinformatics, Department of Mathematics and Computer Science, Freie Universität Berlin, Takustr. 9, 14195 Berlin, Germany.
|
46
|
Itin PH. Etiology and pathogenesis of ectodermal dysplasias. Am J Med Genet A 2014; 164A:2472-7. [PMID: 24715647 DOI: 10.1002/ajmg.a.36550] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Received: 01/10/2014] [Accepted: 02/28/2014] [Indexed: 02/04/2023]
Abstract
Ectodermal dysplasias are a large group of heterogeneous heritable conditions characterized by congenital defects of one or more ectodermal structures and their appendages. The skin and its appendages are mainly composed of ectodermal components, but the initiation of appendage development is orchestrated by signals from the mesoderm with the help of placodes. A complex network of signaling pathways coordinates the formation and function of ectodermal structures. In recent years much has been discovered regarding the molecular mechanisms of ectodermal embryogenesis, and this provides a rational basis for the classification of ectodermal dysplasias. Interestingly, not only complex ectodermal syndromes but also mono- or oligosymptomatic ectodermal malformations may result from a mutation in a gene that is critical for ectodermal development. Mesodermal and, occasionally, endodermal malformations may coexist. Embryogenesis occurs in distinct tissue organizational fields, and specific interactions among the germ layers exist that may lead to a wide range of ectodermal dysplasias. Of the approximately 200 different ectodermal dysplasias, about 80 have been characterized at the molecular level with identification of the genes that are mutated in these disorders. Modern molecular genetics will increasingly elucidate the basic defects of these distinct syndromes and shed more light on the regulatory mechanisms of embryology. The upcoming classification of ectodermal dysplasias will combine detailed clinical and molecular knowledge.
Affiliation(s)
- Peter H Itin
- Department of Dermatology, University Hospital Basel, Basel, Switzerland; Research Group of Dermatology, Department of Biomedicine, University Hospital Basel, Basel, Switzerland
|
47
|
Liu Y, Gu Q, Hou JP, Han J, Ma J. A network-assisted co-clustering algorithm to discover cancer subtypes based on gene expression. BMC Bioinformatics 2014; 15:37. [PMID: 24491042 PMCID: PMC3916445 DOI: 10.1186/1471-2105-15-37] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.9] [Received: 06/10/2013] [Accepted: 01/31/2014] [Indexed: 01/07/2023] Open
Abstract
BACKGROUND Cancer subtype information is critically important for understanding tumor heterogeneity. Existing methods to identify cancer subtypes have primarily focused on utilizing generic clustering algorithms (such as hierarchical clustering) to identify subtypes based on gene expression data. The network-level interaction among genes, which is key to understanding the molecular perturbations in cancer, has been rarely considered during the clustering process. The motivation of our work is to develop a method that effectively incorporates molecular interaction networks into the clustering process to improve cancer subtype identification. RESULTS We have developed a new clustering algorithm for cancer subtype identification, called "network-assisted co-clustering for the identification of cancer subtypes" (NCIS). NCIS combines gene network information to simultaneously group samples and genes into biologically meaningful clusters. Prior to clustering, we assign weights to genes based on their impact in the network. Then a new weighted co-clustering algorithm based on a semi-nonnegative matrix tri-factorization is applied. We evaluated the effectiveness of NCIS on simulated datasets as well as large-scale Breast Cancer and Glioblastoma Multiforme patient samples from The Cancer Genome Atlas (TCGA) project. NCIS was shown to better separate the patient samples into clinically distinct subtypes and achieve higher accuracy on the simulated datasets to tolerate noise, as compared to consensus hierarchical clustering. CONCLUSIONS The weighted co-clustering approach in NCIS provides a unique solution to incorporate gene network information into the clustering process. Our tool will be useful to comprehensively identify cancer subtypes that would otherwise be obscured by cancer heterogeneity, using high-throughput and high-dimensional gene expression data.
Affiliation(s)
- Yiyi Liu
- Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
- Quanquan Gu
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
- Jack P Hou
- Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
- Medical Scholars Program, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
- Jiawei Han
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
- Jian Ma
- Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
- Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
|
48
|
Network Analysis of Human Disease Comorbidity Patterns Based on Large-Scale Data Mining. Bioinformatics Research and Applications 2014. [DOI: 10.1007/978-3-319-08171-7_22] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 12/21/2022]
|
49
|
Imboden M, Probst-Hensch NM. Biobanking across the phenome - at the center of chronic disease research. BMC Public Health 2013; 13:1094. [PMID: 24274136 PMCID: PMC4222669 DOI: 10.1186/1471-2458-13-1094] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 07/27/2012] [Accepted: 09/25/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Recognized public health relevant risk factors such as obesity, physical inactivity, smoking or air pollution are common to many non-communicable diseases (NCDs). NCDs cluster and co-morbidities increase in parallel to age. Pleiotropic genes and genetic variants have been identified by genome-wide association studies (GWAS) linking NCD entities hitherto thought to be distant in etiology. These different lines of evidence suggest that NCD disease mechanisms are in part shared. DISCUSSION Identification of common exogenous and endogenous risk patterns may promote efficient prevention, an urgent need in the light of the global NCD epidemic. The prerequisite to investigate causal risk patterns including biologic, genetic and environmental factors across different NCDs are well characterized cohorts with associated biobanks. Prospectively collected data and biospecimen from subjects of various age, sociodemographic, and cultural groups, both healthy and affected by one or more NCD, are essential for exploring biologic mechanisms and susceptibilities interlinking different environmental and lifestyle exposures, co-morbidities, as well as cellular senescence and aging. A paradigm shift in the research activities can currently be observed, moving from focused investigations on the effect of a single risk factor on an isolated health outcome to a more comprehensive assessment of risk patterns and a broader phenome approach. Though important methodological and analytical challenges need to be resolved, the ongoing international efforts to establish large-scale population-based biobank cohorts are a critical basis for moving NCD disease etiology forward. SUMMARY Future epidemiologic and public health research should aim at sustaining a comprehensive systems view on health and disease. The political and public discussions about the utilitarian aspect of investing in and contributing to cohort and biobank research are essential and are indirectly linked to the achievement of public health programs effectively addressing the global NCD epidemic.
Affiliation(s)
- Medea Imboden
- Swiss Tropical and Public Health Institute, Basel, Switzerland.
|
50
|
Wang P, Lai WF, Li MJ, Xu F, Yalamanchili HK, Lovell-Badge R, Wang J. Inference of gene-phenotype associations via protein-protein interaction and orthology. PLoS One 2013; 8:e77478. [PMID: 24194887 PMCID: PMC3806783 DOI: 10.1371/journal.pone.0077478] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 06/27/2013] [Accepted: 08/30/2013] [Indexed: 01/23/2023] Open
Abstract
One of the fundamental goals of genetics is to understand gene functions and their associated phenotypes. To achieve this goal, in this study we developed a computational algorithm that uses orthology and protein-protein interaction information to infer gene-phenotype associations for multiple species. Furthermore, we developed a web server that provides genome-wide phenotype inference for six species: fly, human, mouse, worm, yeast, and zebrafish. We evaluated our inference method by comparing the inferred results with known gene-phenotype associations. The high Area Under the Curve values suggest that our method performs well. By applying our method to two representative human diseases, Type 2 Diabetes and Breast Cancer, we demonstrated that our method is able to identify related Gene Ontology terms and Kyoto Encyclopedia of Genes and Genomes pathways. The web server can be used to infer functions and putative phenotypes of a gene along with the candidate genes of a phenotype, and thus aids in disease candidate gene discovery. Our web server is available at http://jjwanglab.org/PhenoPPIOrth.
Affiliation(s)
- Panwen Wang
- Department of Biochemistry, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Shenzhen Institute of Research and Innovation, The University of Hong Kong, Shenzhen, China
- Wing-Fu Lai
- Department of Biochemistry, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Mulin Jun Li
- Department of Biochemistry, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Shenzhen Institute of Research and Innovation, The University of Hong Kong, Shenzhen, China
- Feng Xu
- Department of Biochemistry, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Shenzhen Institute of Research and Innovation, The University of Hong Kong, Shenzhen, China
- Hari Krishna Yalamanchili
- Department of Biochemistry, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Shenzhen Institute of Research and Innovation, The University of Hong Kong, Shenzhen, China
- Robin Lovell-Badge
- Division of Developmental Genetics, MRC National Institute for Medical Research, The Ridgeway, Mill Hill, London, United Kingdom
- Junwen Wang
- Department of Biochemistry, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Shenzhen Institute of Research and Innovation, The University of Hong Kong, Shenzhen, China
- Centre for Genomic Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
|