1
|
Hu J, Li Z, Rao B, Thafar MA, Arif M. Improving protein-protein interaction prediction using protein language model and protein network features. Anal Biochem 2024; 693:115550. [PMID: 38679191 DOI: 10.1016/j.ab.2024.115550] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2024] [Revised: 04/12/2024] [Accepted: 04/25/2024] [Indexed: 05/01/2024]
Abstract
Interactions between proteins are ubiquitous in a wide variety of biological processes. Accurately identifying the protein-protein interaction (PPI) is of significant importance for understanding the mechanisms of protein functions and facilitating drug discovery. Although the wet-lab technological methods are the best way to identify PPI, their major constraints are their time-consuming nature, high cost, and labor-intensiveness. Hence, lots of efforts have been made towards developing computational methods to improve the performance of PPI prediction. In this study, we propose a novel hybrid computational method (called KSGPPI) that aims at improving the prediction performance of PPI via extracting the discriminative information from protein sequences and interaction networks. The KSGPPI model comprises two feature extraction modules. In the first feature extraction module, a large protein language model, ESM-2, is employed to exploit the global complex patterns concealed within protein sequences. Subsequently, feature representations are further extracted through CKSAAP, and a two-dimensional convolutional neural network (CNN) is utilized to capture local information. In the second feature extraction module, the query protein acquires its similar protein from the STRING database via the sequence alignment tool NW-align and then captures the graph embedding feature for the query protein in the protein interaction network of the similar protein using the algorithm of Node2vec. Finally, the features of these two feature extraction modules are efficiently fused; the fused features are then fed into the multilayer perceptron to predict PPI. The results of five-fold cross-validation on the used benchmarked datasets demonstrate that KSGPPI achieves an average prediction accuracy of 88.96 %. Additionally, the average Matthews correlation coefficient value (0.781) of KSGPPI is significantly higher than that of those state-of-the-art PPI prediction methods. The standalone package of KSGPPI is freely downloaded at https://github.com/rickleezhe/KSGPPI.
Collapse
Affiliation(s)
- Jun Hu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou, 310023, China.
| | - Zhe Li
- College of Information Engineering, Zhejiang University of Technology, Hangzhou, 310023, China
| | - Bing Rao
- Engineering Research Center of Integration and Application of Digital Learning Technology, Ministry of Education, Beijing, 100039, China.
| | - Maha A Thafar
- Computer Science Department, College of Computers and Information Technology, Taif University, Taif, Saudi Arabia
| | - Muhammad Arif
- College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar.
| |
Collapse
|
2
|
Chaturvedi J, Wang T, Velupillai S, Stewart R, Roberts A. Development of a Knowledge Graph Embeddings Model for Pain. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2024; 2023:299-308. [PMID: 38222382 PMCID: PMC10785867] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 01/16/2024]
Abstract
Pain is a complex concept that can interconnect with other concepts such as a disorder that might cause pain, a medication that might relieve pain, and so on. To fully understand the context of pain experienced by either an individual or across a population, we may need to examine all concepts related to pain and the relationships between them. This is especially useful when modeling pain that has been recorded in electronic health records. Knowledge graphs represent concepts and their relations by an interlinked network, enabling semantic and context-based reasoning in a computationally tractable form. These graphs can, however, be too large for efficient computation. Knowledge graph embeddings help to resolve this by representing the graphs in a low-dimensional vector space. These embeddings can then be used in various downstream tasks such as classification and link prediction. The various relations associated with pain which are required to construct such a knowledge graph can be obtained from external medical knowledge bases such as SNOMED CT, a hierarchical systematic nomenclature of medical terms. A knowledge graph built in this way could be further enriched with real-world examples of pain and its relations extracted from electronic health records. This paper describes the construction of such knowledge graph embedding models of pain concepts, extracted from the unstructured text of mental health electronic health records, combined with external knowledge created from relations described in SNOMED CT, and their evaluation on a subject-object link prediction task. The performance of the models was compared with other baseline models.
Collapse
Affiliation(s)
- Jaya Chaturvedi
- Institute of Psychiatry, Psychology and Neurosciences, King's College London, London, United Kingdom
| | - Tao Wang
- Institute of Psychiatry, Psychology and Neurosciences, King's College London, London, United Kingdom
| | - Sumithra Velupillai
- Institute of Psychiatry, Psychology and Neurosciences, King's College London, London, United Kingdom
| | - Robert Stewart
- Institute of Psychiatry, Psychology and Neurosciences, King's College London, London, United Kingdom
- South London and Maudsley NHS Foundation Trust, London, United Kingdom
| | - Angus Roberts
- Institute of Psychiatry, Psychology and Neurosciences, King's College London, London, United Kingdom
- South London and Maudsley NHS Foundation Trust, London, United Kingdom
| |
Collapse
|
3
|
Wang W, Yu M, Sun B, Li J, Liu D, Zhang H, Wang X, Zhou Y. SMGCN: Multiple Similarity and Multiple Kernel Fusion Based Graph Convolutional Neural Network for Drug-Target Interactions Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:143-154. [PMID: 38051618 DOI: 10.1109/tcbb.2023.3339645] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/07/2023]
Abstract
Accurately identifying potential drug-target interactions (DTIs) is a critical step in accelerating drug discovery. Despite many studies that have been conducted over the past decades, detecting DTIs remains a highly challenging and complicated process. Therefore, we propose a novel method called SMGCN, which combines multiple similarity and multiple kernel fusion based on Graph Convolutional Network (GCN) to predict DTIs. In order to capture the features of the network structure and fully explore direct or indirect relationships between nodes, we propose the method of multiple similarity, which combines similarity fusion matrices with Random Walk with Restart (RWR) and cosine similarity. Then, we use GCN to extract multi-layer low-dimensional embedding features. Unlike traditional GCN methods, we incorporate Multiple Kernel Learning (MKL). Finally, we use the Dual Laplace Regularized Least Squares method to predict novel DTIs through combinatorial kernels in drug and target spaces. We conduct experiments on a golden standard dataset, and demonstrate the effectiveness of our proposed model in predicting DTIs through showing significant improvements in Area Under the Curve (AUC) and Area Under the Precision-Recall Curve (AUPR). In addition, our model can also discover some new DTIs, which can be verified by the KEGG BRITE Database and relevant literature.
Collapse
|
4
|
Daza D, Alivanistos D, Mitra P, Pijnenburg T, Cochez M, Groth P. BioBLP: a modular framework for learning on multimodal biomedical knowledge graphs. J Biomed Semantics 2023; 14:20. [PMID: 38066573 PMCID: PMC10709903 DOI: 10.1186/s13326-023-00301-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2023] [Accepted: 11/29/2023] [Indexed: 12/18/2023] Open
Abstract
BACKGROUND Knowledge graphs (KGs) are an important tool for representing complex relationships between entities in the biomedical domain. Several methods have been proposed for learning embeddings that can be used to predict new links in such graphs. Some methods ignore valuable attribute data associated with entities in biomedical KGs, such as protein sequences, or molecular graphs. Other works incorporate such data, but assume that entities can be represented with the same data modality. This is not always the case for biomedical KGs, where entities exhibit heterogeneous modalities that are central to their representation in the subject domain. OBJECTIVE We aim to understand how to incorporate multimodal data into biomedical KG embeddings, and analyze the resulting performance in comparison with traditional methods. We propose a modular framework for learning embeddings in KGs with entity attributes, that allows encoding attribute data of different modalities while also supporting entities with missing attributes. We additionally propose an efficient pretraining strategy for reducing the required training runtime. We train models using a biomedical KG containing approximately 2 million triples, and evaluate the performance of the resulting entity embeddings on the tasks of link prediction, and drug-protein interaction prediction, comparing against methods that do not take attribute data into account. RESULTS In the standard link prediction evaluation, the proposed method results in competitive, yet lower performance than baselines that do not use attribute data. When evaluated in the task of drug-protein interaction prediction, the method compares favorably with the baselines. Further analyses show that incorporating attribute data does outperform baselines over entities below a certain node degree, comprising approximately 75% of the diseases in the graph. We also observe that optimizing attribute encoders is a challenging task that increases optimization costs. Our proposed pretraining strategy yields significantly higher performance while reducing the required training runtime. CONCLUSION BioBLP allows to investigate different ways of incorporating multimodal biomedical data for learning representations in KGs. With a particular implementation, we find that incorporating attribute data does not consistently outperform baselines, but improvements are obtained on a comparatively large subset of entities below a specific node-degree. Our results indicate a potential for improved performance in scientific discovery tasks where understudied areas of the KG would benefit from link prediction methods.
Collapse
Affiliation(s)
- Daniel Daza
- Vrije Universiteit Amsterdam, Amsterdam, The Netherlands.
- University of Amsterdam, Amsterdam, The Netherlands.
- Discovery Lab, Elsevier, Amsterdam, The Netherlands.
| | - Dimitrios Alivanistos
- Vrije Universiteit Amsterdam, Amsterdam, The Netherlands.
- Discovery Lab, Elsevier, Amsterdam, The Netherlands.
| | | | | | - Michael Cochez
- Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
- Discovery Lab, Elsevier, Amsterdam, The Netherlands
| | - Paul Groth
- University of Amsterdam, Amsterdam, The Netherlands
- Discovery Lab, Elsevier, Amsterdam, The Netherlands
| |
Collapse
|
5
|
Salcedo MV, Gravel N, Keshavarzi A, Huang LC, Kochut KJ, Kannan N. Predicting protein and pathway associations for understudied dark kinases using pattern-constrained knowledge graph embedding. PeerJ 2023; 11:e15815. [PMID: 37868056 PMCID: PMC10590106 DOI: 10.7717/peerj.15815] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2023] [Accepted: 07/10/2023] [Indexed: 10/24/2023] Open
Abstract
The 534 protein kinases encoded in the human genome constitute a large druggable class of proteins that include both well-studied and understudied "dark" members. Accurate prediction of dark kinase functions is a major bioinformatics challenge. Here, we employ a graph mining approach that uses the evolutionary and functional context encoded in knowledge graphs (KGs) to predict protein and pathway associations for understudied kinases. We propose a new scalable graph embedding approach, RegPattern2Vec, which employs regular pattern constrained random walks to sample diverse aspects of node context within a KG flexibly. RegPattern2Vec learns functional representations of kinases, interacting partners, post-translational modifications, pathways, cellular localization, and chemical interactions from a kinase-centric KG that integrates and conceptualizes data from curated heterogeneous data resources. By contextualizing information relevant to prediction, RegPattern2Vec improves accuracy and efficiency in comparison to other random walk-based graph embedding approaches. We show that the predictions produced by our model overlap with pathway enrichment data produced using experimentally validated Protein-Protein Interaction (PPI) data from both publicly available databases and experimental datasets not used in training. Our model also has the advantage of using the collected random walks as biological context to interpret the predicted protein-pathway associations. We provide high-confidence pathway predictions for 34 dark kinases and present three case studies in which analysis of meta-paths associated with the prediction enables biological interpretation. Overall, RegPattern2Vec efficiently samples multiple node types for link prediction on biological knowledge graphs and the predicted associations between understudied kinases, pseudokinases, and known pathways serve as a conceptual starting point for hypothesis generation and testing.
Collapse
Affiliation(s)
- Mariah V. Salcedo
- Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA, United States of America
| | - Nathan Gravel
- Institute of Bioinformatics, University of Georgia, Athens, GA, United States of America
| | - Abbas Keshavarzi
- School of Computing, University of Georgia, Athens, GA, United States of America
| | - Liang-Chin Huang
- Institute of Bioinformatics, University of Georgia, Athens, GA, United States of America
| | - Krzysztof J. Kochut
- School of Computing, University of Georgia, Athens, GA, United States of America
| | - Natarajan Kannan
- Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA, United States of America
- Institute of Bioinformatics, University of Georgia, Athens, GA, United States of America
| |
Collapse
|
6
|
Li J, Wang Y, Li Z, Lin H, Wu B. LM-DTI: a tool of predicting drug-target interactions using the node2vec and network path score methods. Front Genet 2023; 14:1181592. [PMID: 37229202 PMCID: PMC10203599 DOI: 10.3389/fgene.2023.1181592] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2023] [Accepted: 04/13/2023] [Indexed: 05/27/2023] Open
Abstract
Introduction: Drug-target interaction (DTI) prediction is a key step in drug function discovery and repositioning. The emergence of large-scale heterogeneous biological networks provides an opportunity to identify drug-related target genes, which led to the development of several computational methods for DTI prediction. Methods: Considering the limitations of conventional computational methods, a novel tool named LM-DTI based on integrated information related to lncRNAs and miRNAs was proposed, which adopted the graph embedding (node2vec) and the network path score methods. First, LM-DTI innovatively constructed a heterogeneous information network containing eight networks composed of four types of nodes (drug, target, lncRNA, and miRNA). Next, the node2vec method was used to obtain feature vectors of drug as well as target nodes, and the path score vector of each drug-target pair was calculated using the DASPfind method. Finally, the feature vectors and path score vectors were merged and input into the XGBoost classifier to predict potential drug-target interactions. Results and Discussion: The 10-fold cross validations evaluate the classification accuracies of the LM-DTI. The prediction performance of LM-DTI in AUPR reached 0.96, which showed a significant improvement compared with those of conventional tools. The validity of LM-DTI has also been verified by manually searching literature and various databases. LM-DTI is scalable and computing efficient; thus representing a powerful drug relocation tool that can be accessed for free at http://www.lirmed.com:5038/lm_dti.
Collapse
Affiliation(s)
- Jianwei Li
- School of Artificial Intelligence, Institute of Computational Medicine, Hebei University of Technology, Tianjin, China
- School of Electronic and Information Engineering, Hebei University of Technology, Tianjin, China
| | - Yinfei Wang
- School of Artificial Intelligence, Institute of Computational Medicine, Hebei University of Technology, Tianjin, China
| | - Zhiguang Li
- School of Artificial Intelligence, Institute of Computational Medicine, Hebei University of Technology, Tianjin, China
| | - Hongxin Lin
- School of Artificial Intelligence, Institute of Computational Medicine, Hebei University of Technology, Tianjin, China
| | - Baoqin Wu
- School of Artificial Intelligence, Institute of Computational Medicine, Hebei University of Technology, Tianjin, China
| |
Collapse
|
7
|
Thafar MA, Albaradei S, Uludag M, Alshahrani M, Gojobori T, Essack M, Gao X. OncoRTT: Predicting novel oncology-related therapeutic targets using BERT embeddings and omics features. Front Genet 2023; 14:1139626. [PMID: 37091791 PMCID: PMC10117673 DOI: 10.3389/fgene.2023.1139626] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2023] [Accepted: 03/24/2023] [Indexed: 04/08/2023] Open
Abstract
Late-stage drug development failures are usually a consequence of ineffective targets. Thus, proper target identification is needed, which may be possible using computational approaches. The reason being, effective targets have disease-relevant biological functions, and omics data unveil the proteins involved in these functions. Also, properties that favor the existence of binding between drug and target are deducible from the protein’s amino acid sequence. In this work, we developed OncoRTT, a deep learning (DL)-based method for predicting novel therapeutic targets. OncoRTT is designed to reduce suboptimal target selection by identifying novel targets based on features of known effective targets using DL approaches. First, we created the “OncologyTT” datasets, which include genes/proteins associated with ten prevalent cancer types. Then, we generated three sets of features for all genes: omics features, the proteins’ amino-acid sequence BERT embeddings, and the integrated features to train and test the DL classifiers separately. The models achieved high prediction performances in terms of area under the curve (AUC), i.e., AUC greater than 0.88 for all cancer types, with a maximum of 0.95 for leukemia. Also, OncoRTT outperformed the state-of-the-art method using their data in five out of seven cancer types commonly assessed by both methods. Furthermore, OncoRTT predicts novel therapeutic targets using new test data related to the seven cancer types. We further corroborated these results with other validation evidence using the Open Targets Platform and a case study focused on the top-10 predicted therapeutic targets for lung cancer.
Collapse
Affiliation(s)
- Maha A. Thafar
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- College of Computers and Information Technology, Computer Science Department, Taif University, Taif, Saudi Arabia
| | - Somayah Albaradei
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Mahmut Uludag
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
| | - Mona Alshahrani
- National Center for Artificial Intelligence (NCAI), Saudi Data and Artificial Intelligence Authority (SDAIA), Riyadh, Saudi Arabia
| | - Takashi Gojobori
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
| | - Magbubah Essack
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- *Correspondence: Xin Gao, ; Magbubah Essack,
| | - Xin Gao
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- *Correspondence: Xin Gao, ; Magbubah Essack,
| |
Collapse
|
8
|
Li M, Yang H, Liu Y. Biomedical named entity recognition based on fusion multi-features embedding. Technol Health Care 2023; 31:111-121. [PMID: 37038786 DOI: 10.3233/thc-] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/12/2023]
Abstract
BACKGROUND With the exponential increase in the volume of biomedical literature, text mining tasks are becoming increasingly important in the medical domain. Named entities are the primary identification tasks in text mining, prerequisites and critical parts for building medical domain knowledge graphs, medical question and answer systems, medical text classification. OBJECTIVE The study goal is to recognize biomedical entities effectively by fusing multi-feature embedding. Multiple features provide more comprehensive information so that better predictions can be obtained. METHODS Firstly, three different kinds of features are generated, including deep contextual word-level features, local char-level features, and part-of-speech features at the word representation layer. The word representation vectors are inputs into BiLSTM as features to obtain the dependency information. Finally, the CRF algorithm is used to learn the features of the state sequences to obtain the global optimal tagging sequences. RESULTS The experimental results showed that the model outperformed other state-of-the-art methods for all-around performance in six datasets among eight of four biomedical entity types. CONCLUSION The proposed method has a positive effect on the prediction results. It comprehensively considers the relevant factors of named entity recognition because the semantic information is enhanced by fusing multi-features embedding.
Collapse
|
9
|
Wei Z, Guo W, Zhang Y, Zhang J, Zhao J. Bidirectional matching and aggregation network for few-shot relation extraction. PeerJ Comput Sci 2023; 9:e1272. [PMID: 37346532 PMCID: PMC10280494 DOI: 10.7717/peerj-cs.1272] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2022] [Accepted: 02/10/2023] [Indexed: 06/23/2023]
Abstract
Few-shot relation extraction is used to solve the problem of long tail distribution of data by matching between query instances and support instances. Existing methods focus only on the single direction process of matching, ignoring the symmetry of the data in the process. To address this issue, we propose the bidirectional matching and aggregation network (BMAN), which is particularly powerful when the training data is symmetrical. This model not only tries to extract relations for query instances, but also seeks relational prototypes about the query instances to validate the feature representation of the support set. Moreover, to avoid overfitting in bidirectional matching, the data enhancement method was designed to scale up the number of instances while maintaining the scope of the instance relation class. Extensive experiments on FewRel and FewRel2.0 public datasets are conducted and evaluate the effectiveness of BMAN.
Collapse
Affiliation(s)
- Zhongcheng Wei
- School of Information and Electrical Engineering, Hebei University of Engineering, Handan, Hebei, China
- Hebei Key Laboratory of Security & Protection Information Sensing and Processing, Handan, Hebei, China
| | - Wenjie Guo
- School of Information and Electrical Engineering, Hebei University of Engineering, Handan, Hebei, China
- Hebei Key Laboratory of Security & Protection Information Sensing and Processing, Handan, Hebei, China
| | - Yunping Zhang
- School of Information and Electrical Engineering, Hebei University of Engineering, Handan, Hebei, China
- Hebei Key Laboratory of Security & Protection Information Sensing and Processing, Handan, Hebei, China
| | - Jieying Zhang
- School of Information and Electrical Engineering, Hebei University of Engineering, Handan, Hebei, China
- Hebei Key Laboratory of Security & Protection Information Sensing and Processing, Handan, Hebei, China
| | - Jijun Zhao
- School of Information and Electrical Engineering, Hebei University of Engineering, Handan, Hebei, China
- Hebei Key Laboratory of Security & Protection Information Sensing and Processing, Handan, Hebei, China
| |
Collapse
|
10
|
Yang Y, Lu Y, Yan W. A comprehensive review on knowledge graphs for complex diseases. Brief Bioinform 2023; 24:6931722. [PMID: 36528805 DOI: 10.1093/bib/bbac543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/22/2022] [Revised: 11/02/2022] [Accepted: 11/10/2022] [Indexed: 12/23/2022] Open
Abstract
In recent years, knowledge graphs (KGs) have gained a great deal of popularity as a tool for storing relationships between entities and for performing higher level reasoning. KGs in biomedicine and clinical practice aim to provide an elegant solution for diagnosing and treating complex diseases more efficiently and flexibly. Here, we provide a systematic review to characterize the state-of-the-art of KGs in the area of complex disease research. We cover the following topics: (1) knowledge sources, (2) entity extraction methods, (3) relation extraction methods and (4) the application of KGs in complex diseases. As a result, we offer a complete picture of the domain. Finally, we discuss the challenges in the field by identifying gaps and opportunities for further research and propose potential research directions of KGs for complex disease diagnosis and treatment.
Collapse
Affiliation(s)
- Yang Yang
- School of Computer Science & Technology, Soochow University, Suzhou 215000, China.,Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
| | - Yuwei Lu
- School of Computer Science & Technology, Soochow University, Suzhou 215000, China.,Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
| | - Wenying Yan
- Department of Bioinformatics, School of Biology and Basic Medical Sciences, Medical College of Soochow University, and Center for Systems Biology, Soochow University, Suzhou 215123, China
| |
Collapse
|
11
|
Li M, Yang H, Liu Y. Biomedical named entity recognition based on fusion multi-features embedding. Technol Health Care 2023; 31:111-121. [PMID: 37038786 DOI: 10.3233/thc-236011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]
Abstract
BACKGROUND With the exponential increase in the volume of biomedical literature, text mining tasks are becoming increasingly important in the medical domain. Named entities are the primary identification tasks in text mining, prerequisites and critical parts for building medical domain knowledge graphs, medical question and answer systems, medical text classification. OBJECTIVE The study goal is to recognize biomedical entities effectively by fusing multi-feature embedding. Multiple features provide more comprehensive information so that better predictions can be obtained. METHODS Firstly, three different kinds of features are generated, including deep contextual word-level features, local char-level features, and part-of-speech features at the word representation layer. The word representation vectors are inputs into BiLSTM as features to obtain the dependency information. Finally, the CRF algorithm is used to learn the features of the state sequences to obtain the global optimal tagging sequences. RESULTS The experimental results showed that the model outperformed other state-of-the-art methods for all-around performance in six datasets among eight of four biomedical entity types. CONCLUSION The proposed method has a positive effect on the prediction results. It comprehensively considers the relevant factors of named entity recognition because the semantic information is enhanced by fusing multi-features embedding.
Collapse
|
12
|
Gavali S, Ross K, Chen C, Cowart J, Wu CH. A knowledge graph representation learning approach to predict novel kinase-substrate interactions. Mol Omics 2022; 18:853-864. [PMID: 35975455 PMCID: PMC9621340 DOI: 10.1039/d1mo00521a] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2021] [Accepted: 07/22/2022] [Indexed: 10/12/2023]
Abstract
The human proteome contains a vast network of interacting kinases and substrates. Even though some kinases have proven to be immensely useful as therapeutic targets, a majority are still understudied. In this work, we present a novel knowledge graph representation learning approach to predict novel interaction partners for understudied kinases. Our approach uses a phosphoproteomic knowledge graph constructed by integrating data from iPTMnet, protein ontology, gene ontology and BioKG. The representations of kinases and substrates in this knowledge graph are learned by performing directed random walks on triples coupled with a modified SkipGram or CBOW model. These representations are then used as an input to a supervised classification model to predict novel interactions for understudied kinases. We also present a post-predictive analysis of the predicted interactions and an ablation study of the phosphoproteomic knowledge graph to gain an insight into the biology of the understudied kinases.
Collapse
Affiliation(s)
- Sachin Gavali
- University of Delaware, Newark, DE 590 Avenue 1743, Suite 147, Newark, DE, USA.
| | - Karen Ross
- Georgetown University Medical Center, Washington DC, USA
| | - Chuming Chen
- University of Delaware, Newark, DE 590 Avenue 1743, Suite 147, Newark, DE, USA.
| | - Julie Cowart
- University of Delaware, Newark, DE 590 Avenue 1743, Suite 147, Newark, DE, USA.
| | - Cathy H Wu
- University of Delaware, Newark, DE 590 Avenue 1743, Suite 147, Newark, DE, USA.
| |
Collapse
|
13
|
Sosa DN, Altman RB. Contexts and contradictions: a roadmap for computational drug repurposing with knowledge inference. Brief Bioinform 2022; 23:bbac268. [PMID: 35817308 PMCID: PMC9294417 DOI: 10.1093/bib/bbac268] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2022] [Revised: 05/25/2022] [Accepted: 06/07/2022] [Indexed: 11/30/2022] Open
Abstract
The cost of drug development continues to rise and may be prohibitive in cases of unmet clinical need, particularly for rare diseases. Artificial intelligence-based methods are promising in their potential to discover new treatment options. The task of drug repurposing hypothesis generation is well-posed as a link prediction problem in a knowledge graph (KG) of interacting of drugs, proteins, genes and disease phenotypes. KGs derived from biomedical literature are semantically rich and up-to-date representations of scientific knowledge. Inference methods on scientific KGs can be confounded by unspecified contexts and contradictions. Extracting context enables incorporation of relevant pharmacokinetic and pharmacodynamic detail, such as tissue specificity of interactions. Contradictions in biomedical KGs may arise when contexts are omitted or due to contradicting research claims. In this review, we describe challenges to creating literature-scale representations of pharmacological knowledge and survey current approaches toward incorporating context and resolving contradictions.
Collapse
Affiliation(s)
- Daniel N Sosa
- Department of Biomedical Data Science, Stanford University, 443 Via Ortega, 94305, California, USA
| | - Russ B Altman
- Department of Biological Engineering; Department of Genetics; Department of Biomedical Data Science, Stanford University, 443 Via Ortega, 94305, California, USA
| |
Collapse
|
14
|
Alshahrani M, Almansour A, Alkhaldi A, Thafar MA, Uludag M, Essack M, Hoehndorf R. Combining biomedical knowledge graphs and text to improve predictions for drug-target interactions and drug-indications. PeerJ 2022; 10:e13061. [PMID: 35402106 PMCID: PMC8988936 DOI: 10.7717/peerj.13061] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2020] [Accepted: 02/13/2022] [Indexed: 01/11/2023] Open
Abstract
Biomedical knowledge is represented in structured databases and published in biomedical literature, and different computational approaches have been developed to exploit each type of information in predictive models. However, the information in structured databases and literature is often complementary. We developed a machine learning method that combines information from literature and databases to predict drug targets and indications. To effectively utilize information in published literature, we integrate knowledge graphs and published literature using named entity recognition and normalization before applying a machine learning model that utilizes the combination of graph and literature. We then use supervised machine learning to show the effects of combining features from biomedical knowledge and published literature on the prediction of drug targets and drug indications. We demonstrate that our approach using datasets for drug-target interactions and drug indications is scalable to large graphs and can be used to improve the ranking of targets and indications by exploiting features from either structure or unstructured information alone.
Collapse
Affiliation(s)
- Mona Alshahrani
- National Center for Artificial Intelligence (NCAI), Saudi Data and Artificial Intelligence Authority (SDAIA), Riyadh, Saudi Arabia
| | - Abdullah Almansour
- National Center for Artificial Intelligence (NCAI), Saudi Data and Artificial Intelligence Authority (SDAIA), Riyadh, Saudi Arabia
| | - Asma Alkhaldi
- National Center for Artificial Intelligence (NCAI), Saudi Data and Artificial Intelligence Authority (SDAIA), Riyadh, Saudi Arabia
| | - Maha A. Thafar
- College of Computers and Information Technology, Taif University, Taif, Saudi Arabia,Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Mahmut Uludag
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Magbubah Essack
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Robert Hoehndorf
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| |
Collapse
|
15
|
Affinity2Vec: drug-target binding affinity prediction through representation learning, graph mining, and machine learning. Sci Rep 2022; 12:4751. [PMID: 35306525 PMCID: PMC8934358 DOI: 10.1038/s41598-022-08787-9] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2021] [Accepted: 03/08/2022] [Indexed: 11/21/2022] Open
Abstract
Drug-target interaction (DTI) prediction plays a crucial role in drug repositioning and virtual drug screening. Most DTI prediction methods cast the problem as a binary classification task to predict if interactions exist or as a regression task to predict continuous values that indicate a drug's ability to bind to a specific target. The regression-based methods provide insight beyond the binary relationship. However, most of these methods require the three-dimensional (3D) structural information of targets which are still not generally available to the targets. Despite this bottleneck, only a few methods address the drug-target binding affinity (DTBA) problem from a non-structure-based approach to avoid the 3D structure limitations. Here we propose Affinity2Vec, as a novel regression-based method that formulates the entire task as a graph-based problem. To develop this method, we constructed a weighted heterogeneous graph that integrates data from several sources, including drug-drug similarity, target-target similarity, and drug-target binding affinities. Affinity2Vec further combines several computational techniques from feature representation learning, graph mining, and machine learning to generate or extract features, build the model, and predict the binding affinity between the drug and the target with no 3D structural data. We conducted extensive experiments to evaluate and demonstrate the robustness and efficiency of the proposed method on benchmark datasets used in state-of-the-art non-structured-based drug-target binding affinity studies. Affinity2Vec showed superior and competitive results compared to the state-of-the-art methods based on several evaluation metrics, including mean squared error, rm2, concordance index, and area under the precision-recall curve.
Collapse
|
16
|
Zhu C, Yang Z, Xia X, Li N, Zhong F, Liu L. Multimodal reasoning based on knowledge graph embedding for specific diseases. Bioinformatics 2022; 38:2235-2245. [PMID: 35150235 PMCID: PMC9004655 DOI: 10.1093/bioinformatics/btac085] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2021] [Revised: 01/06/2022] [Accepted: 02/07/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Knowledge Graph (KG) is becoming increasingly important in the biomedical field. Deriving new and reliable knowledge from existing knowledge by KG embedding technology is a cutting-edge method. Some add a variety of additional information to aid reasoning, namely multimodal reasoning. However, few works based on the existing biomedical KGs are focused on specific diseases. RESULTS This work develops a construction and multimodal reasoning process of Specific Disease Knowledge Graphs (SDKGs). We construct SDKG-11, a SDKG set including five cancers, six non-cancer diseases, a combined Cancer5 and a combined Diseases11, aiming to discover new reliable knowledge and provide universal pre-trained knowledge for that specific disease field. SDKG-11 is obtained through original triplet extraction, standard entity set construction, entity linking and relation linking. We implement multimodal reasoning by reverse-hyperplane projection for SDKGs based on structure, category and description embeddings. Multimodal reasoning improves pre-existing models on all SDKGs using entity prediction task as the evaluation protocol. We verify the model's reliability in discovering new knowledge by manually proofreading predicted drug-gene, gene-disease and disease-drug pairs. Using embedding results as initialization parameters for the biomolecular interaction classification, we demonstrate the universality of embedding models. AVAILABILITY AND IMPLEMENTATION The constructed SDKG-11 and the implementation by TensorFlow are available from https://github.com/ZhuChaoY/SDKG-11. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Chaoyu Zhu
- Institute of Biomedical Sciences and School of Basic Medical Science, Shanghai Medical College, Fudan University, Shanghai 200032, China
| | - Zhihao Yang
- College of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Xiaoqiong Xia
- Institute of Biomedical Sciences and School of Basic Medical Science, Shanghai Medical College, Fudan University, Shanghai 200032, China
| | - Nan Li
- College of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Fan Zhong
- To whom correspondence should be addressed. or
| | - Lei Liu
- To whom correspondence should be addressed. or
| |
Collapse
|
17
|
Thafar MA, Olayan RS, Albaradei S, Bajic VB, Gojobori T, Essack M, Gao X. DTi2Vec: Drug-target interaction prediction using network embedding and ensemble learning. J Cheminform 2021; 13:71. [PMID: 34551818 PMCID: PMC8459562 DOI: 10.1186/s13321-021-00552-w] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2020] [Accepted: 09/05/2021] [Indexed: 11/21/2022] Open
Abstract
Drug-target interaction (DTI) prediction is a crucial step in drug discovery and repositioning as it reduces experimental validation costs if done right. Thus, developing in-silico methods to predict potential DTI has become a competitive research niche, with one of its main focuses being improving the prediction accuracy. Using machine learning (ML) models for this task, specifically network-based approaches, is effective and has shown great advantages over the other computational methods. However, ML model development involves upstream hand-crafted feature extraction and other processes that impact prediction accuracy. Thus, network-based representation learning techniques that provide automated feature extraction combined with traditional ML classifiers dealing with downstream link prediction tasks may be better-suited paradigms. Here, we present such a method, DTi2Vec, which identifies DTIs using network representation learning and ensemble learning techniques. DTi2Vec constructs the heterogeneous network, and then it automatically generates features for each drug and target using the nodes embedding technique. DTi2Vec demonstrated its ability in drug-target link prediction compared to several state-of-the-art network-based methods, using four benchmark datasets and large-scale data compiled from DrugBank. DTi2Vec showed a statistically significant increase in the prediction performances in terms of AUPR. We verified the "novel" predicted DTIs using several databases and scientific literature. DTi2Vec is a simple yet effective method that provides high DTI prediction performance while being scalable and efficient in computation, translating into a powerful drug repositioning tool.
Collapse
Affiliation(s)
- Maha A Thafar
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
- College of Computers and Information Technology, Computer Science Department, Taif University, Taif, Kingdom of Saudi Arabia
| | - Rawan S Olayan
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Somayah Albaradei
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
- Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Kingdom of Saudi Arabia
| | - Vladimir B Bajic
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
| | - Takashi Gojobori
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
| | - Magbubah Essack
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia.
| | - Xin Gao
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia.
| |
Collapse
|