1
Zong N, Chowdhury S, Zhou S, Rajaganapathy S, Yu Y, Wang L, Dai Q, Li P, Liu X, Bielinski SJ, Chen J, Chen Y, Cerhan JR. Advancing Efficacy Prediction for EHR-based Emulated Trials in Repurposing Heart Failure Therapies. medRxiv 2024:2023.05.25.23290531. [PMID: 37398384 PMCID: PMC10312819 DOI: 10.1101/2023.05.25.23290531] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2023]
Abstract
Introduction: High mortality rates associated with heart failure (HF) have propelled drug repurposing, a strategy that seeks new therapeutic uses for existing, approved drugs, as a way to improve the management of HF. An emerging trend uses real-world data, such as electronic health records (EHRs), to mimic randomized controlled trials (RCTs) and evaluate treatment outcomes through what are known as emulated trials (ETs). Nonetheless, the intricacies of EHR data (detailed patient histories held in databases, missing biomarkers or diagnostic tests, and partial records of symptoms) introduce notable discrepancies between EHR data and the stringent standards of RCTs. This gap makes it difficult for an ET to predict treatment efficacy accurately.
Objective: To predict the efficacy of drugs repurposed for HF in randomized trials by leveraging EHRs in ETs.
Methods: We proposed an ET framework to predict drug efficacy that integrates target prediction based on biomedical databases with statistical analysis of EHR data. Specifically, we developed a novel target prediction model that learns low-dimensional representations of drug molecules, protein sequences, and diverse biomedical associations from a knowledge graph. Additionally, we devised strategies to improve the prediction by considering the interactions between HF drugs and biological factors in the context of HF prognostic markers.
Results: Validated against the BETA benchmark, our drug-target prediction model achieved an average AUCROC of 97.7%, PRAUC of 97.4%, F1 score of 93.1%, and General Score of 96.1%, surpassing existing baseline algorithms. Further analysis of the ET framework on identifying 17 repurposed drugs, derived from 266 phase 3 HF RCTs, using data from 59,000 patients at the Mayo Clinic highlighted the framework's predictive accuracy. This analysis took into account biological variables (e.g., gender, age, ethnicity), HF medications (e.g., ACE inhibitors, beta-blockers, ARBs, loop diuretics), types of HF (HFpEF and HFrEF), confounders, and prognostic markers (e.g., NT-proBNP, BUN, creatinine, and hemoglobin). The ET framework significantly improved accuracy over the baseline efficacy analysis that used EHR data: the best results rose from 75.71% to 93.57% in AUC-ROC and from 78.66% to 90.34% in PRAUC.
Conclusion: Our study presents an ET framework that significantly enhances drug efficacy emulation by integrating EHR-based analysis with target prediction. We demonstrated substantial success in predicting the efficacy of 17 HF drugs repurposed for phase 3 RCTs, showcasing the framework's potential in advancing HF treatment strategies.
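For readers unfamiliar with the two headline metrics, the following is a minimal sketch of how AUC-ROC and a step-wise estimate of PRAUC (average precision) can be computed from binary labels and model scores. It is not the authors' evaluation code; the function names, the toy labels, and the toy scores are illustrative assumptions.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of precision values at each positive in the ranking."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for position, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / position
    if hits == 0:
        raise ValueError("need at least one positive label")
    return total / hits


# Toy data (hypothetical): 1 = drug judged effective, 0 = not effective.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
roc = auc_roc(labels, scores)
ap = average_precision(labels, scores)
print(f"AUC-ROC = {roc:.4f}, average precision = {ap:.4f}")
```

The pairwise form of AUC-ROC is quadratic in the number of samples, which is fine for a set of 17 drugs but would normally be replaced by a rank-based computation on larger data.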
Affiliation(s)
- Nansu Zong
  - Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN, USA
- Shaika Chowdhury
  - Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN, USA
- Shibo Zhou
  - Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN, USA
- Sivaraman Rajaganapathy
  - Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN, USA
- Yue Yu
  - Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, USA
- Liewei Wang
  - Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, Rochester, MN, USA
- Qiying Dai
  - Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN, USA
- Pengyang Li
  - Division of Cardiology, Pauley Heart Center, Virginia Commonwealth University, Richmond, VA, USA
- Xiaoke Liu
  - Division of Community Cardiology, Department of Cardiovascular Medicine, La Crosse, WI, USA
- Jun Chen
  - Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, USA
- Yongbin Chen
  - Department of Biochemistry and Molecular Biology, Mayo Clinic, Rochester, MN, USA
- James R. Cerhan
  - Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, USA
2
Zhao Y, Bollegala D, Hirose S, Jin Y, Kozu T. Community knowledge graph abstraction for enhanced link prediction: A study on PubMed knowledge graph. J Biomed Inform 2024; 158:104725. [PMID: 39265815 DOI: 10.1016/j.jbi.2024.104725] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2024] [Revised: 09/05/2024] [Accepted: 09/07/2024] [Indexed: 09/14/2024]
Abstract
OBJECTIVE: As new knowledge is produced at a rapid pace in the biomedical field, existing biomedical Knowledge Graphs (KGs) cannot be manually updated in a timely manner. Previous work in Natural Language Processing (NLP) has leveraged link prediction to infer missing knowledge in general-purpose KGs. Inspired by this, we propose applying link prediction to existing biomedical KGs to infer missing knowledge. Although Knowledge Graph Embedding (KGE) methods are effective in link prediction tasks, they are less capable of capturing relations between communities of entities with specific attributes (Fanourakis et al., 2023).
METHODS: To address this challenge, we proposed an entity distance-based method for abstracting a Community Knowledge Graph (CKG) from a simplified version of the pre-existing PubMed Knowledge Graph (PKG) (Xu et al., 2020). For link prediction on the abstracted CKG, we proposed an extension to existing KGE models that links the information in the PKG to the abstracted CKG. The applicability of this extension was demonstrated with six well-known KGE models: TransE, TransH, DistMult, ComplEx, SimplE, and RotatE. Mean Rank (MR), Mean Reciprocal Rank (MRR), and Hits@k were used to assess link prediction performance. In addition, we presented a backtracking process that traces the results of CKG link prediction back to the PKG scale for further comparison.
RESULTS: Six different CKGs were abstracted from the PKG using the embeddings of the six KGE methods. The link prediction results on these abstracted CKGs indicate that the proposed extension can improve the existing KGE methods, achieving a top-10 accuracy of 0.69 compared to 0.5 for TransE, 0.7 compared to 0.54 for TransH, 0.67 compared to 0.6 for DistMult, 0.73 compared to 0.57 for ComplEx, 0.73 compared to 0.63 for SimplE, and 0.85 compared to 0.76 for RotatE on their respective CKGs. These improvements across all six models also highlight the wide applicability of the extension approach.
CONCLUSION: This study offers novel insights into abstracting CKGs from the PKG. The extension approach enhanced the performance of the existing KGE methods and is broadly applicable. As a future extension, we plan to conduct link prediction for entities that are newly introduced to the PKG.
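The three evaluation metrics named in the abstract follow standard definitions and can be sketched briefly. This is an illustrative example, not the authors' evaluation code: the function names and the toy ranks are assumptions, and each rank is taken to be the 1-based position of the true entity among all scored candidates.

```python
def rank_of_true(candidate_scores, true_index):
    """1-based rank of the true candidate, where a higher score is better."""
    true_score = candidate_scores[true_index]
    better = sum(1 for s in candidate_scores if s > true_score)
    return better + 1


def ranking_metrics(ranks, ks=(1, 3, 10)):
    """Return Mean Rank, Mean Reciprocal Rank, and Hits@k for a list of ranks."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    n = len(ranks)
    mean_rank = sum(ranks) / n
    mrr = sum(1.0 / r for r in ranks) / n
    hits = {k: sum(1 for r in ranks if r <= k) / n for k in ks}
    return mean_rank, mrr, hits


# Toy example (hypothetical): ranks of the correct tail entity for four test triples.
ranks = [1, 2, 5, 12]
mean_rank, mrr, hits = ranking_metrics(ranks)
print(mean_rank, round(mrr, 4), hits)
```

Lower is better for MR, while higher is better for MRR and Hits@k; the "top-10 accuracy" figures quoted in the abstract correspond to Hits@10.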
Affiliation(s)
- Yang Zhao
  - Deloitte Analytics R&D, Deloitte Touche Tohmatsu LLC, 3-2-3 Marunouchi, Chiyoda-ku, Tokyo, 100-8360, Japan
- Danushka Bollegala
  - Department of Computer Science, University of Liverpool, Liverpool, L69 3BX, UK
- Shunsuke Hirose
  - Deloitte Analytics R&D, Deloitte Touche Tohmatsu LLC, 3-2-3 Marunouchi, Chiyoda-ku, Tokyo, 100-8360, Japan
- Yingzi Jin
  - Deloitte Analytics R&D, Deloitte Touche Tohmatsu LLC, 3-2-3 Marunouchi, Chiyoda-ku, Tokyo, 100-8360, Japan
- Tomotake Kozu
  - Deloitte Analytics R&D, Deloitte Touche Tohmatsu LLC, 3-2-3 Marunouchi, Chiyoda-ku, Tokyo, 100-8360, Japan
3
Alvarez-Mamani E, Dechant R, Beltran-Castañón CA, Ibáñez AJ. Graph embedding on mass spectrometry- and sequencing-based biomedical data. BMC Bioinformatics 2024; 25:1. [PMID: 38166530 PMCID: PMC10763173 DOI: 10.1186/s12859-023-05612-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2023] [Accepted: 12/11/2023] [Indexed: 01/04/2024]
Abstract
Graph embedding techniques use deep learning algorithms in data analysis to solve problems such as node classification, link prediction, community detection, and visualization. Although they are typically used for tasks such as predicting friendships in social media, several applications of graph embedding techniques in biomedical data analysis have emerged. While these approaches remain computationally demanding, several developments in recent years have facilitated their application to biomedical data and may thus help advance biological discoveries. Therefore, in this review, we discuss the principles of graph embedding techniques and explore their usefulness for understanding biological network data derived from mass spectrometry and sequencing experiments, the current workhorses of systems biology studies. In particular, we focus on recent examples of characterizing protein-protein interaction networks and predicting novel drug functions.
Affiliation(s)
- Edwin Alvarez-Mamani
  - Engineering Department, Pontificia Universidad Católica del Perú, San Miguel, Lima, Peru
  - Institute for Omics Sciences and Applied Biotechnology (ICOBA PUCP), Pontificia Universidad Católica del Perú, San Miguel, Lima, Peru
- Reinhard Dechant
  - Institute for Omics Sciences and Applied Biotechnology (ICOBA PUCP), Pontificia Universidad Católica del Perú, San Miguel, Lima, Peru
  - Calico Life Sciences, 1170 Veterans Blvd, San Francisco, CA, 94080, USA
- Alfredo J Ibáñez
  - Institute for Omics Sciences and Applied Biotechnology (ICOBA PUCP), Pontificia Universidad Católica del Perú, San Miguel, Lima, Peru
  - Science Department, Pontificia Universidad Católica del Perú, San Miguel, Lima, Peru
4
Veleiro U, de la Fuente J, Serrano G, Pizurica M, Casals M, Pineda-Lucena A, Vicent S, Ochoa I, Gevaert O, Hernaez M. GeNNius: an ultrafast drug-target interaction inference method based on graph neural networks. Bioinformatics 2024; 40:btad774. [PMID: 38134424 PMCID: PMC10766589 DOI: 10.1093/bioinformatics/btad774] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Revised: 11/20/2023] [Accepted: 12/21/2023] [Indexed: 12/24/2023]
Abstract
MOTIVATION: Drug-target interaction (DTI) prediction is a relevant but challenging task in the drug repurposing field. In-silico approaches have drawn particular attention because they can reduce the costs and time commitment of traditional methodologies. Yet current state-of-the-art methods present several limitations. First, existing DTI prediction approaches are computationally expensive, which hinders the use of large networks and the exploitation of available datasets. Second, the generalization of DTI prediction methods to unseen datasets remains unexplored, although studying it could improve the development of DTI inference approaches in terms of accuracy and robustness.
RESULTS: In this work, we introduce GeNNius (Graph Embedding Neural Network Interaction Uncovering System), a Graph Neural Network (GNN)-based method that outperforms state-of-the-art models in terms of both accuracy and time efficiency across a variety of datasets. We also demonstrated its power to uncover new interactions by evaluating previously unknown DTIs for each dataset. We further assessed the generalization capability of GeNNius by training and testing it on different datasets, showing that this framework can potentially improve the DTI prediction task by training on large datasets and testing on smaller ones. Finally, we qualitatively investigated the embeddings generated by GeNNius, revealing that the GNN encoder maintains biological information after the graph convolutions while diffusing this information through nodes, eventually distinguishing protein families in the node embedding space.
AVAILABILITY AND IMPLEMENTATION: GeNNius code is available at https://github.com/ubioinformat/GeNNius.
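To illustrate what "diffusing information through nodes" means in a graph convolution, here is a minimal sketch of one message-passing round using plain mean aggregation. It is a generic illustration and not the GeNNius architecture: there are no learned weights, and the node names, feature vectors, and the `mean_aggregate` function are all hypothetical.

```python
def mean_aggregate(features, edges):
    """One message-passing round: average each node's vector with its neighbors' vectors."""
    neighbors = {node: [] for node in features}
    for u, v in edges:  # treat edges as undirected
        neighbors[u].append(v)
        neighbors[v].append(u)
    updated = {}
    for node, vector in features.items():
        group = [vector] + [features[n] for n in neighbors[node]]
        # Column-wise mean over the node itself and its neighbors.
        updated[node] = [sum(column) / len(group) for column in zip(*group)]
    return updated


# Toy drug-protein graph (hypothetical): one known interaction, one isolated protein.
features = {
    "drug_A": [1.0, 0.0],
    "protein_X": [0.0, 1.0],
    "protein_Y": [0.0, 0.0],
}
edges = [("drug_A", "protein_X")]
out = mean_aggregate(features, edges)
print(out)
```

After one round the connected drug and protein share information from each other, while the isolated node is unchanged; a trained GNN layer would additionally apply learned weight matrices and a nonlinearity.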
Affiliation(s)
- Uxía Veleiro
  - CIMA University of Navarra, IdiSNA, 31008 Pamplona, Spain
- Jesús de la Fuente
  - TECNUN, University of Navarra, 20016 San Sebastian, Spain
  - Center for Data Science, New York University, New York, NY 10012, United States
- Guillermo Serrano
  - CIMA University of Navarra, IdiSNA, 31008 Pamplona, Spain
  - TECNUN, University of Navarra, 20016 San Sebastian, Spain
- Marija Pizurica
  - Stanford Center for Biomedical Informatics Research, Department of Medicine and Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, United States
  - Internet Technology and Data Science LAB (IDLab), Ghent University, Gent 9052, Belgium
- Mikel Casals
  - TECNUN, University of Navarra, 20016 San Sebastian, Spain
- Silve Vicent
  - CIMA University of Navarra, IdiSNA, 31008 Pamplona, Spain
- Idoia Ochoa
  - TECNUN, University of Navarra, 20016 San Sebastian, Spain
  - Instituto de Ciencia de los Datos e Inteligencia Artificial (DATAI), University of Navarra, 31008 Pamplona, Spain
- Olivier Gevaert
  - Stanford Center for Biomedical Informatics Research, Department of Medicine and Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, United States
- Mikel Hernaez
  - CIMA University of Navarra, IdiSNA, 31008 Pamplona, Spain
  - Instituto de Ciencia de los Datos e Inteligencia Artificial (DATAI), University of Navarra, 31008 Pamplona, Spain
5
Zhang Y, Wu M, Wang S, Chen W. EFMSDTI: Drug-target interaction prediction based on an efficient fusion of multi-source data. Front Pharmacol 2022; 13:1009996. [PMID: 36210804 PMCID: PMC9538487 DOI: 10.3389/fphar.2022.1009996] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2022] [Accepted: 08/29/2022] [Indexed: 11/13/2022]
Abstract
Accurate identification of drug-target interactions (DTIs) is of great significance for understanding the mechanism of drug treatment and discovering new drugs for disease treatment. Computational DTI prediction methods that combine multi-source drug and target data can effectively reduce the cost and time of drug development. However, multi-source data processing often does not consider how much each data source contributes to DTI prediction. Fusing the sources efficiently, in a way that makes full use of each source's contribution, is therefore key to improving DTI prediction accuracy. In this paper, we propose EFMSDTI, a DTI prediction approach based on an effective fusion of multi-source drug and target data that accounts for the contribution of each source. EFMSDTI first builds 15 similarity networks from multi-source information networks, classified as topological and semantic graphs of drugs and targets according to their biological characteristics. Then, the networks are fused by selective and entropy weighting based on similarity network fusion (SNF), according to their contribution to DTI prediction. Next, a deep neural network model learns low-dimensional embedding vectors of drugs and targets. Finally, the LightGBM algorithm, based on Gradient Boosting Decision Tree (GBDT), is used to complete DTI prediction. Experimental results show that EFMSDTI performs better (AUROC and AUPR of 0.982) than several state-of-the-art algorithms. In an analysis of the top 1000 predictions, 990 of the 1000 DTIs were confirmed. Code and data are available at https://github.com/meng-jie/EFMSDTI.
Affiliation(s)
- Yuanyuan Zhang
  - School of Information and Control Engineering, Qingdao University of Technology, Qingdao, Shandong, China
  - College of Computer Science and Technology, China University of Petroleum (East China), Qingdao, Shandong, China
  - *Correspondence: Yuanyuan Zhang,
- Mengjie Wu
  - School of Information and Control Engineering, Qingdao University of Technology, Qingdao, Shandong, China
- Shudong Wang
  - College of Computer Science and Technology, China University of Petroleum (East China), Qingdao, Shandong, China
- Wei Chen
  - School of Information and Control Engineering, Qingdao University of Technology, Qingdao, Shandong, China
6
Zong N, Wen A, Moon S, Fu S, Wang L, Zhao Y, Yu Y, Huang M, Wang Y, Zheng G, Mielke MM, Cerhan JR, Liu H. Computational drug repurposing based on electronic health records: a scoping review. NPJ Digit Med 2022; 5:77. [PMID: 35701544 PMCID: PMC9198008 DOI: 10.1038/s41746-022-00617-6] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 10/12/2021] [Accepted: 05/19/2022] [Indexed: 11/30/2022]
Abstract
Computational drug repurposing methods adapt Artificial intelligence (AI) algorithms for the discovery of new applications of approved or investigational drugs. Among the heterogeneous datasets, electronic health records (EHRs) datasets provide rich longitudinal and pathophysiological data that facilitate the generation and validation of drug repurposing. Here, we present an appraisal of recently published research on computational drug repurposing utilizing the EHR. Thirty-three research articles, retrieved from Embase, Medline, Scopus, and Web of Science between January 2000 and January 2022, were included in the final review. Four themes, (1) publication venue, (2) data types and sources, (3) method for data processing and prediction, and (4) targeted disease, validation, and released tools were presented. The review summarized the contribution of EHR used in drug repurposing as well as revealed that the utilization is hindered by the validation, accessibility, and understanding of EHRs. These findings can support researchers in the utilization of medical data resources and the development of computational methods for drug repurposing.
Affiliation(s)
- Nansu Zong
  - Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN, USA
- Andrew Wen
  - Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN, USA
- Sungrim Moon
  - Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN, USA
- Sunyang Fu
  - Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN, USA
- Liwei Wang
  - Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN, USA
- Yiqing Zhao
  - Department of Preventive Medicine, Northwestern Medicine, Northwestern University, Chicago, IL, USA
- Yue Yu
  - Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN, USA
- Ming Huang
  - Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN, USA
- Yanshan Wang
  - Department of Health Information Management, School of Health and Rehabilitation Sciences, University of Pittsburgh, Pittsburgh, PA, USA
- Gang Zheng
  - Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, USA
- James R Cerhan
  - Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN, USA
- Hongfang Liu
  - Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN, USA
7
Zong N, Li N, Wen A, Ngo V, Yu Y, Huang M, Chowdhury S, Jiang C, Fu S, Weinshilboum R, Jiang G, Hunter L, Liu H. BETA: a comprehensive benchmark for computational drug-target prediction. Brief Bioinform 2022; 23:6596989. [PMID: 35649342 PMCID: PMC9294420 DOI: 10.1093/bib/bbac199] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/14/2022] [Revised: 04/10/2022] [Accepted: 04/29/2022] [Indexed: 11/14/2022]
Abstract
Internal validation is the most popular evaluation strategy used for drug-target predictive models. The simple random shuffling in the cross-validation, however, is not always ideal to handle large, diverse and copious datasets as it could potentially introduce bias. Hence, these predictive models cannot be comprehensively evaluated to provide insight into their general performance on a variety of use-cases (e.g. permutations of different levels of connectiveness and categories in drug and target space, as well as validations based on different data sources). In this work, we introduce a benchmark, BETA, that aims to address this gap by (i) providing an extensive multipartite network consisting of 0.97 million biomedical concepts and 8.5 million associations, in addition to 62 million drug-drug and protein-protein similarities and (ii) presenting evaluation strategies that reflect seven cases (i.e. general, screening with different connectivity, target and drug screening based on categories, searching for specific drugs and targets and drug repurposing for specific diseases), a total of seven Tests (consisting of 344 Tasks in total) across multiple sampling and validation strategies. Six state-of-the-art methods covering two broad input data types (chemical structure- and gene sequence-based and network-based) were tested across all the developed Tasks. The best-worst performing cases have been analyzed to demonstrate the ability of the proposed benchmark to identify limitations of the tested methods for running over the benchmark tasks. The results highlight BETA as a benchmark in the selection of computational strategies for drug repurposing and target discovery.
Affiliation(s)
- Nansu Zong
  - Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN
- Ning Li
  - Center for Structure Biology, Center for Cancer Research, National Cancer Institute, Frederick, MD
- Andrew Wen
  - Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN
- Victoria Ngo
  - Betty Irene Moore School of Nursing, University of California Davis Health, Sacramento, CA
  - Stanford Health Policy, Stanford School of Medicine and Freeman Spogli Institute for International Studies, Palo Alto, CA
- Yue Yu
  - Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN
- Ming Huang
  - Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN
- Shaika Chowdhury
  - Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN
- Chao Jiang
  - Department of Computer Science and Software Engineering, Auburn University, Auburn, AL
- Sunyang Fu
  - Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN
- Richard Weinshilboum
  - Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, Rochester, MN
- Guoqian Jiang
  - Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN
- Lawrence Hunter
  - Department of Pharmacology, University of Colorado Denver, Aurora, CO
- Hongfang Liu
  - Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN
8
Jiang C, Ngo V, Chapman R, Yu Y, Liu H, Jiang G, Zong N. Deep Denoising of Raw Biomedical Knowledge Graph from COVID-19 Literature, LitCovid and Pubtator. J Med Internet Res 2022; 24:e38584. [PMID: 35658098 PMCID: PMC9301549 DOI: 10.2196/38584] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2022] [Revised: 05/20/2022] [Accepted: 05/30/2022] [Indexed: 12/05/2022]
Abstract
Background Multiple types of biomedical associations of knowledge graphs, including COVID-19–related ones, are constructed based on co-occurring biomedical entities retrieved from recent literature. However, the applications derived from these raw graphs (eg, association predictions among genes, drugs, and diseases) have a high probability of false-positive predictions as co-occurrences in the literature do not always mean there is a true biomedical association between two entities. Objective Data quality plays an important role in training deep neural network models; however, most of the current work in this area has been focused on improving a model’s performance with the assumption that the preprocessed data are clean. Here, we studied how to remove noise from raw knowledge graphs with limited labeled information. Methods The proposed framework used generative-based deep neural networks to generate a graph that can distinguish the unknown associations in the raw training graph. Two generative adversarial network models, NetGAN and Cross-Entropy Low-rank Logits (CELL), were adopted for the edge classification (ie, link prediction), leveraging unlabeled link information based on a real knowledge graph built from LitCovid and Pubtator. Results The performance of link prediction, especially in the extreme case of training data versus test data at a ratio of 1:9, demonstrated that the proposed method still achieved favorable results (area under the receiver operating characteristic curve >0.8 for the synthetic data set and 0.7 for the real data set), despite the limited amount of testing data available. Conclusions Our preliminary findings showed the proposed framework achieved promising results for removing noise during data preprocessing of the biomedical knowledge graph, potentially improving the performance of downstream applications by providing cleaner data.
Affiliation(s)
- Victoria Ngo
  - University of California Davis Health, Sacramento, US
- Yue Yu
  - Mayo Clinic, Rochester, US
- Nansu Zong
  - Mayo Clinic, 205 3rd Ave SW, Rochester, US
9
Gu Y, Zheng S, Xu Z, Yin Q, Li L, Li J. An efficient curriculum learning-based strategy for molecular graph learning. Brief Bioinform 2022; 23:6562682. [DOI: 10.1093/bib/bbac099] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/06/2021] [Revised: 01/18/2022] [Accepted: 02/27/2022] [Indexed: 12/14/2022]
Abstract
Computational methods have been widely applied to resolve various core issues in drug discovery, such as molecular property prediction. In recent years, deep learning, a data-driven computational method, has achieved a number of impressive successes in various domains. In drug discovery, graph neural networks (GNNs) take molecular graph data as input and learn graph-level representations in non-Euclidean space. A large number of well-performing GNNs have been proposed for molecular graph learning. However, the efficient use of molecular data during training has not received enough attention. Curriculum learning (CL) has been proposed as a training strategy that rearranges the training queue based on the calculated difficulty of each sample, yet the effectiveness of CL has not been established in molecular graph learning. In this study, inspired by chemical domain knowledge and task prior information, we proposed a novel CL-based training strategy, called CurrMG, to improve the training efficiency of molecular graph learning. Consisting of a difficulty measurer and a training scheduler, CurrMG is designed as a plug-and-play module that is model-independent and easy to use on molecular data. Extensive experiments demonstrated that molecular graph learning models could benefit from CurrMG, with noticeable improvement on five GNN models and eight molecular property prediction tasks (overall improvement of 4.08%). We further observed CurrMG's encouraging potential in resource-constrained molecular property prediction. These results indicate that CurrMG can be used as a reliable and efficient training strategy for molecular graph learning.
Availability: The source code is available at https://github.com/gu-yaowen/CurrMG.
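The two components named in the abstract, a difficulty measurer and a training scheduler, can be sketched generically as follows. This is not the CurrMG implementation: the linear pacing schedule, the function names, and the use of SMILES string length as a difficulty proxy are all assumptions chosen only to show the general shape of curriculum learning.

```python
def linear_pacing(epoch, total_epochs, start_fraction=0.2):
    """Fraction of the difficulty-sorted data made available at a given epoch."""
    if total_epochs <= 1:
        return 1.0
    progress = epoch / (total_epochs - 1)
    return min(1.0, start_fraction + (1.0 - start_fraction) * progress)


def curriculum_batches(samples, difficulty, total_epochs, start_fraction=0.2):
    """Yield (epoch, training subset), growing from the easiest samples to the full set."""
    ordered = sorted(samples, key=difficulty)  # difficulty measurer: easy first
    for epoch in range(total_epochs):
        fraction = linear_pacing(epoch, total_epochs, start_fraction)
        cutoff = max(1, int(round(fraction * len(ordered))))  # training scheduler
        yield epoch, ordered[:cutoff]


# Hypothetical difficulty proxy: longer SMILES strings are treated as harder molecules.
molecules = ["CCO", "C", "c1ccccc1O", "CC(=O)OC1=CC=CC=C1C(=O)O", "CCN"]
schedule = list(curriculum_batches(molecules, len, total_epochs=3))
for epoch, subset in schedule:
    print(epoch, subset)
```

Because the scheduler only decides which samples are visible at each epoch, a sketch like this stays independent of the model being trained, which matches the plug-and-play framing in the abstract.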
Affiliation(s)
- Yaowen Gu
  - Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing 100020, China
- Si Zheng
  - Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing 100020, China
  - Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
- Zidu Xu
  - Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing 100020, China
- Qijin Yin
  - Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
- Liang Li
  - Key Laboratory of Antibiotic Bioengineering of National Health and Family Planning Commission (NHFPC), Institute of Medicinal Biotechnology (IMB), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing 100020, China
- Jiao Li
  - Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing 100020, China
10
Xuan P, Hu K, Cui H, Zhang T, Nakaguchi T. Learning multi-scale heterogeneous representations and global topology for drug-target interaction prediction. IEEE J Biomed Health Inform 2021; 26:1891-1902. [PMID: 34673498 DOI: 10.1109/jbhi.2021.3121798] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022]
Abstract
Identification of drug-target interactions (DTIs) plays a critical role in drug discovery and repositioning. Deep integration of inter-connections and intra-similarities between heterogeneous multi-source data related to drugs and targets, however, is a challenging issue. We propose a DTI prediction model by learning from drug and protein related multi-scale attributes and global topology formed by heterogeneous connections. A drug-protein-disease heterogeneous network (RPD-Net) is firstly constructed to associate diverse similarities, interactions and associations across nodes. Secondly, we propose a multi-scale pairwise deep representation learning module consisting of a new embedding strategy to integrate diverse inter-relations and intra-relations, and dilation convolutions for multi-scale deep representation extraction. A global topology learning module is proposed which is composed of strategy based on non-negative matrix factorization (NMF) to extract topology from RPD-Net, and a new relational-level attention mechanism for discriminative topology embedding. Experimental results using public dataset demonstrate improved performance over state-of-the-art methods and contributions of our major innovations. Evaluation results by top k recall rates and case studies on five drugs further show the effectiveness in retrieving potential target candidates for drugs.
11
Yi HC, You ZH, Huang DS, Kwoh CK. Graph representation learning in bioinformatics: trends, methods and applications. Brief Bioinform 2021; 23:6361044. [PMID: 34471921 DOI: 10.1093/bib/bbab340] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Received: 06/24/2021] [Revised: 07/18/2021] [Accepted: 08/02/2021] [Indexed: 12/12/2022] Open
Abstract
A graph is a natural data structure for describing complex systems, as it contains a set of objects and their relationships. Many real-life biomedical problems can be modeled as graph analytics tasks. Machine learning, especially deep learning, has succeeded in many bioinformatics scenarios where data are represented in the Euclidean domain. However, rich relational information between biological elements is retained in non-Euclidean biomedical graphs, which are not learning-friendly to classic machine learning methods. Graph representation learning aims to embed a graph into a low-dimensional space while preserving graph topology and node properties. It bridges biomedical graphs and modern machine learning methods and has recently attracted widespread interest in both the machine learning and bioinformatics communities. In this work, we summarize the advances of graph representation learning and its representative applications in bioinformatics. To provide a comprehensive and structured analysis and perspective, we first categorize and analyze both graph embedding methods (homogeneous graph embedding, heterogeneous graph embedding, attribute graph embedding) and graph neural networks. Furthermore, we summarize their representative applications from the molecular level to the genomics, pharmaceutical and healthcare systems levels. Moreover, we provide open resource platforms and libraries for implementing these graph representation learning methods and discuss the challenges and opportunities of graph representation learning in bioinformatics. This work provides a comprehensive survey of emerging graph representation learning algorithms and their applications in bioinformatics. It is anticipated that it could bring valuable insights for researchers to contribute their knowledge to graph representation learning and future-oriented bioinformatics studies.
Affiliation(s)
- Hai-Cheng Yi
- Chinese Academy of Sciences, Xinjiang Technical Institute of Physics and Chemistry, Urumqi 830011, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Zhu-Hong You
- School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, China
- De-Shuang Huang
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
- Chee Keong Kwoh
- School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore
12
Zong N, Ngo V, Stone DJ, Wen A, Zhao Y, Yu Y, Liu S, Huang M, Wang C, Jiang G. Leveraging Genetic Reports and Electronic Health Records for the Prediction of Primary Cancers: Algorithm Development and Validation Study. JMIR Med Inform 2021; 9:e23586. [PMID: 34032581 PMCID: PMC8188315 DOI: 10.2196/23586] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 08/17/2020] [Revised: 01/07/2021] [Accepted: 01/27/2021] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Precision oncology has the potential to leverage clinical and genomic data in advancing disease prevention, diagnosis, and treatment. A key research area focuses on the early detection of primary cancers and potential prediction of cancers of unknown primary in order to facilitate optimal treatment decisions. OBJECTIVE This study presents a methodology to harmonize phenotypic and genetic data features to classify primary cancer types and predict cancers of unknown primaries. METHODS We extracted genetic data elements from oncology genetic reports of 1011 patients with cancer and their corresponding phenotypical data from Mayo Clinic's electronic health records. We modeled both genetic and electronic health record data with HL7 Fast Healthcare Interoperability Resources. The semantic web Resource Description Framework was employed to generate the network-based data representation (ie, patient-phenotypic-genetic network). Based on the Resource Description Framework data graph, Node2vec graph-embedding algorithm was applied to generate features. Multiple machine learning and deep learning backbone models were compared for cancer prediction performance. RESULTS With 6 machine learning tasks designed in the experiment, we demonstrated the proposed method achieved favorable results in classifying primary cancer types (area under the receiver operating characteristic curve [AUROC] 96.56% for all 9 cancer predictions on average based on the cross-validation) and predicting unknown primaries (AUROC 80.77% for all 8 cancer predictions on average for real-patient validation). To demonstrate the interpretability, 17 phenotypic and genetic features that contributed the most to the prediction of each cancer were identified and validated based on a literature review. CONCLUSIONS Accurate prediction of cancer types can be achieved with existing electronic health record data with satisfactory precision. The integration of genetic reports improves prediction, illustrating the translational values of incorporating genetic tests early at the diagnosis stage for patients with cancer.
Affiliation(s)
- Nansu Zong
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
- Victoria Ngo
- University of California Davis Health, Sacramento, CA, United States
- Daniel J Stone
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
- Andrew Wen
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
- Yiqing Zhao
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
- Yue Yu
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
- Sijia Liu
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
- Ming Huang
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
- Chen Wang
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
- Guoqian Jiang
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
13
Liang Y, Zhou R, Liang X, Kong X, Yang B. Pharmacological targets and molecular mechanisms of plumbagin to treat colorectal cancer: A systematic pharmacology study. Eur J Pharmacol 2020; 881:173227. [PMID: 32505664 DOI: 10.1016/j.ejphar.2020.173227] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 09/27/2019] [Revised: 05/14/2020] [Accepted: 05/28/2020] [Indexed: 12/14/2022]
Abstract
Plumbagin (PL) exerts anti-proliferative effects in cancer cells, including effective suppression of colorectal cancer (CRC). However, the exact molecular mechanism by which PL treats CRC remains unclear. Using the SwissTargetPrediction and SuperPred databases, the anti-cancer biotargets of PL were identified, and the CRC-related targets were obtained from the DisGeNET database. The biological processes and signaling pathways of PL in treating CRC were identified and visualized. Further, clinical and cell culture data were used to validate some bioinformatic findings. The bioinformatic analysis yielded 64 predicted biotargets of PL for treating CRC, of which the 7 most important were tumor protein p53 (TP53), glyceraldehyde-3-phosphate dehydrogenase (GAPDH), mitogen-activated protein kinase 1 (MAPK1), E1A-associated protein p300 (EP300), poly (ADP-ribose) polymerase 1 (PARP1), nuclear factor kappa p65 protein (RELA), and Bcl-2 like protein 1 (BCL2L1). In addition, the top 20 functional biological processes and signaling pathways of PL in treating CRC were screened and prioritized. In the human study, CRC samples showed elevated expression of neoplastic MAPK1 and PARP1 mRNAs and a reduced EP300 mRNA level. In the cell culture study, PL treatment of CRC cells down-regulated MAPK1 and PARP1 mRNA expression and up-regulated the EP300 mRNA level, accompanied by suppressed cell proliferation. Taken together, the therapeutic biotargets and molecular mechanisms of PL in treating CRC were screened and identified using a systematic pharmacology analysis, and some bioinformatic findings were validated in clinical and cell line experiments. Potentially, these hub biotargets may serve as biomarkers for CRC detection and treatment.
Affiliation(s)
- Yujia Liang
- College of Pharmacy, Guangxi Medical University, Guangxi, Nanning, PR China
- Rui Zhou
- Department of Hepatobiliary Surgery, Guigang City People's Hospital, The Eighth Affiliated Hospital of Guangxi Medical University, Guigang, Guangxi, PR China
- Xiaoliu Liang
- College of Pharmacy, Guangxi Medical University, Guangxi, Nanning, PR China
- Xiaolong Kong
- College of Pharmacy, Guangxi Medical University, Guangxi, Nanning, PR China
- Bin Yang
- College of Pharmacy, Guangxi Medical University, Guangxi, Nanning, PR China