1
Defilippo A, Veltri P, Lió P, Guzzi PH. Leveraging graph neural networks for supporting automatic triage of patients. Sci Rep 2024; 14:12548. [PMID: 38822012 PMCID: PMC11143315 DOI: 10.1038/s41598-024-63376-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2024] [Accepted: 05/28/2024] [Indexed: 06/02/2024]
Abstract
Patient triage is crucial in emergency departments, ensuring timely and appropriate care based on a correct evaluation of the emergency grade of patient conditions. Triage is generally performed by a human operator, based on their own experience and on information gathered from the patient management process. Traditional triage methods therefore rely heavily on human decisions, which can be subjective and can produce errors in emergency-level assignment. Interest has recently grown in leveraging artificial intelligence (AI) to develop algorithms that maximize information gathering and minimize errors in patient triage processing. We define and implement an AI-based module to manage patients' emergency code assignments in emergency departments. It uses historical data from the emergency department to train the medical decision-making process. Data containing relevant patient information, such as vital signs, symptoms, and medical history, are used to accurately classify patients into triage categories. Experimental results demonstrate that the proposed algorithm achieves high accuracy, outperforming traditional triage methods. We claim that, by using the proposed method, healthcare professionals can predict a severity index to guide patient management processing and resource allocation.
Affiliation(s)
- Annamaria Defilippo
  - Dept. Medical and Surgical Sciences, Magna Graecia University of Catanzaro, Catanzaro, Italy
- Pierangelo Veltri
  - DIMES Department of Informatics, Modeling, Electronics and Systems, UNICAL, Rende, Cosenza, Italy
- Pietro Lió
  - Department of Computer Science and Technology, Cambridge University, Cambridge, UK
- Pietro Hiram Guzzi
  - Dept. Medical and Surgical Sciences, Magna Graecia University of Catanzaro, Catanzaro, Italy
2
Hayes WB. Exact p-values for global network alignments via combinatorial analysis of shared GO terms: REFANGO: Rigorous Evaluation of Functional Alignments of Networks using Gene Ontology. J Math Biol 2024; 88:50. [PMID: 38551701 PMCID: PMC10980677 DOI: 10.1007/s00285-024-02058-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/04/2020] [Revised: 01/21/2024] [Accepted: 02/05/2024] [Indexed: 04/01/2024]
Abstract
Network alignment aims to uncover topologically similar regions in the protein-protein interaction (PPI) networks of two or more species under the assumption that topologically similar regions tend to perform similar functions. Although there exist a plethora of both network alignment algorithms and measures of topological similarity, currently no "gold standard" exists for evaluating how well either is able to uncover functionally similar regions. Here we propose a formal, mathematically and statistically rigorous method for evaluating the statistical significance of shared GO terms in a global, 1-to-1 alignment between two PPI networks. Given an alignment in which k aligned protein pairs share a particular GO term g, we use a combinatorial argument to precisely quantify the p-value of that alignment with respect to g compared to a random alignment. The p-value of the alignment with respect to all GO terms, including their inter-relationships, is approximated using the Empirical Brown's Method. We note that, just as with BLAST's p-values, this method is not designed to guide an alignment algorithm towards a solution; instead, just as with BLAST, an alignment is guided by a scoring matrix or function; the p-values herein are computed after the fact, providing independent feedback to the user on the biological quality of the alignment that was generated by optimizing the scoring function. Importantly, we demonstrate that among all GO-based measures of network alignments, ours is the only one that correlates with the precision of GO annotation predictions, paving the way for network alignment-based protein function prediction.
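As a hedged illustration of the single-term case described in the abstract above, the sketch below computes a tail probability for k shared annotations under one simplifying assumption: every protein of the smaller network is mapped, uniformly at random and 1-to-1, into the larger network. Under that assumption the number of aligned pairs sharing g follows a hypergeometric distribution. The function name and the parameters `n2`, `lam1`, and `lam2` are illustrative and are not taken from the paper, whose exact combinatorial formulation may differ.

```python
from math import comb


def shared_term_p_value(n2: int, lam1: int, lam2: int, k: int) -> float:
    """Probability that a random 1-to-1 alignment shares term g in >= k pairs.

    Assumptions (illustrative, not from the paper):
      n2   -- number of proteins in the larger network
      lam1 -- proteins annotated with g in the smaller network
      lam2 -- proteins annotated with g in the larger network
      k    -- aligned pairs observed to share g
    Every protein of the smaller network is assumed to be aligned, so the
    images of the lam1 annotated proteins form a uniformly random
    lam1-subset of the n2 proteins in the larger network.
    """
    if not (0 <= lam1 <= n2 and 0 <= lam2 <= n2):
        raise ValueError("annotation counts must lie between 0 and n2")
    upper = min(lam1, lam2)          # largest possible number of shared pairs
    total = comb(n2, lam1)           # all possible image sets
    # Count image sets that hit exactly x annotated proteins, for x >= k.
    tail = sum(
        comb(lam2, x) * comb(n2 - lam2, lam1 - x)
        for x in range(max(k, 0), upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy numbers only: 4 proteins in the larger network, 2 annotated on each side.
    print(shared_term_p_value(n2=4, lam1=2, lam2=2, k=2))
```

This covers one GO term in isolation; combining terms that are not independent, which the abstract attributes to the Empirical Brown's Method, is not attempted here.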
Affiliation(s)
- Wayne B Hayes
  - Department of Computer Science, UC Irvine, Irvine, USA
3
Li W, Wang B, Dai J, Kou Y, Chen X, Pan Y, Hu S, Xu ZZ. Partial order relation-based gene ontology embedding improves protein function prediction. Brief Bioinform 2024; 25:bbae077. [PMID: 38446740 PMCID: PMC10917077 DOI: 10.1093/bib/bbae077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2023] [Revised: 01/22/2024] [Indexed: 03/08/2024]
Abstract
Protein annotation has long been a challenging task in computational biology. Gene Ontology (GO) has become one of the most popular frameworks to describe protein functions and their relationships. Prediction of a protein annotation with proper GO terms demands high-quality GO term representation learning, which aims to learn a low-dimensional dense vector representation with accompanying semantic meaning for each functional label, also known as embedding. However, existing GO term embedding methods, which mainly take into account ancestral co-occurrence information, have yet to capture the full topological information in the GO-directed acyclic graph (DAG). In this study, we propose a novel GO term representation learning method, PO2Vec, to utilize the partial order relationships to improve the GO term representations. Extensive evaluations show that PO2Vec achieves better outcomes than existing embedding methods in a variety of downstream biological tasks. Based on PO2Vec, we further developed a new protein function prediction method PO2GO, which demonstrates superior performance measured in multiple metrics and annotation specificity as well as few-shot prediction capability in the benchmarks. These results suggest that the high-quality representation of GO structure is critical for diverse biological tasks including computational protein annotation.
Affiliation(s)
- Wenjing Li
  - College of Computer Science and Software, Shenzhen University, Shenzhen, China
- Bin Wang
  - School of Mathematics and Computer Sciences, Nanchang University, Nanchang, China
- Jin Dai
  - Center for Quantum Technology Research and School of Physics, Beijing Institute of Technology, Beijing, China
- Yan Kou
  - Xbiome, Scientific Research Building, Tsinghua High-Tech Park, Shenzhen, China
- Xiaojun Chen
  - College of Computer Science and Software, Shenzhen University, Shenzhen, China
- Yi Pan
  - Faculty of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, 1068 Xueyuan Avenue, Shenzhen University Town, Shenzhen, China
- Shuangwei Hu
  - Xbiome, Scientific Research Building, Tsinghua High-Tech Park, Shenzhen, China
- Zhenjiang Zech Xu
  - School of Mathematics and Computer Sciences, Nanchang University, Nanchang, China
  - State Key Laboratory of Food Science and Technology, Nanchang University, Nanchang, China
4
Bandyopadhyay SS, Halder AK, Saha S, Chatterjee P, Nasipuri M, Basu S. Assessment of GO-Based Protein Interaction Affinities in the Large-Scale Human-Coronavirus Family Interactome. Vaccines (Basel) 2023; 11:549. [PMID: 36992133 PMCID: PMC10059867 DOI: 10.3390/vaccines11030549] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2023] [Revised: 02/19/2023] [Accepted: 02/23/2023] [Indexed: 03/03/2023]
Abstract
SARS-CoV-2 is a novel coronavirus that replicates by interacting with host proteins. As a result, identifying virus-host protein-protein interactions could help researchers better understand the transmission behavior of the disease and identify possible COVID-19 drugs. The International Committee on Virus Taxonomy has determined that nCoV is genetically 89% similar to the SARS-CoV of the 2003 epidemic. This paper focuses on assessing the host-pathogen protein interaction affinity of the coronavirus family, which has 44 different variants. In light of these considerations, a GO-semantic scoring function is provided, based on Gene Ontology (GO) graphs, for determining the binding affinity of any two proteins at the organism level. Based on the availability of GO annotations for the proteins, 11 of the 44 viral variants are considered, viz., SARS-CoV-2, SARS, MERS, Bat coronavirus HKU3, Bat coronavirus Rp3/2004, Bat coronavirus HKU5, Murine coronavirus, Bovine coronavirus, Rat coronavirus, Bat coronavirus HKU4, and Bat coronavirus 133/2005. The fuzzy scoring function has been applied to the entire host-pathogen network, with ~180 million potential interactions generated from 19,281 host proteins and around 242 viral proteins. About 4.5 million potential level-one host-pathogen interactions are computed based on the estimated interaction affinity threshold. The resulting host-pathogen interactome is also validated against state-of-the-art experimental networks. The study has been extended further toward drug repurposing by analyzing the FDA-listed COVID drugs.
Affiliation(s)
- Soumyendu Sekhar Bandyopadhyay
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
  - Department of Computer Science and Engineering, School of Engineering and Technology, Adamas University, Kolkata 700126, India
- Anup Kumar Halder
  - Faculty of Mathematics and Information Sciences, Warsaw University of Technology, 00-662 Warsaw, Poland
- Sovan Saha
  - Department of Computer Science and Engineering (Artificial Intelligence and Machine Learning), Techno Main Salt Lake, Sector V, Kolkata 700091, India
- Piyali Chatterjee
  - Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata 700152, India
- Mita Nasipuri
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
- Subhadip Basu
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
5
Joshi P, Banerjee S, Hu X, Khade PM, Friedberg I. GOThresher: a program to remove annotation biases from protein function annotation datasets. Bioinformatics 2023; 39:6998200. [PMID: 36688705 DOI: 10.1093/bioinformatics/btad048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/27/2022] [Revised: 11/30/2022] [Accepted: 01/20/2023] [Indexed: 01/24/2023]
Abstract
MOTIVATION Advances in sequencing technologies have led to a surge in genomic data, although the functions of many gene products coded by these genes remain unknown. While in-depth, targeted experiments that determine the functions of these gene products are crucial and routinely performed, they fail to keep up with the inflow of novel genomic data. In an attempt to address this gap, high-throughput experiments are being conducted in which a large number of genes are investigated in a single study. The annotations generated as a result of these experiments are generally biased towards a small subset of less informative Gene Ontology (GO) terms. Identifying and removing biases from protein function annotation databases is important since biases impact our understanding of protein function by providing a poor picture of the annotation landscape. Additionally, as machine learning methods for predicting protein function are becoming increasingly prevalent, it is essential that they are trained on unbiased datasets. Therefore, it is not only crucial to be aware of biases, but also to judiciously remove them from annotation datasets. RESULTS We introduce GOThresher, a Python tool that identifies and removes biases in function annotations from protein function annotation databases. AVAILABILITY AND IMPLEMENTATION GOThresher is written in Python and released via PyPI https://pypi.org/project/gothresher/ and on the Bioconda Anaconda channel https://anaconda.org/bioconda/gothresher. The source code is hosted on GitHub https://github.com/FriedbergLab/GOThresher and distributed under the GPL 3.0 license. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Parnal Joshi
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011, USA
  - Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, USA
- Sagnik Banerjee
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011, USA
  - Department of Statistics, Iowa State University, Ames, IA 50011, USA
- Xiao Hu
  - Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, USA
- Pranav M Khade
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011, USA
  - Roy J. Carver Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, IA 50011, USA
- Iddo Friedberg
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011, USA
  - Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, USA
6
Orientation algorithm for PPI networks based on network propagation approach. J Biosci 2022. [DOI: 10.1007/s12038-022-00284-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]
7
Wang S, Atkinson GRS, Hayes WB. SANA: cross-species prediction of Gene Ontology GO annotations via topological network alignment. NPJ Syst Biol Appl 2022; 8:25. [PMID: 35859153 PMCID: PMC9300714 DOI: 10.1038/s41540-022-00232-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/13/2019] [Accepted: 05/20/2022] [Indexed: 12/31/2022]
Abstract
Topological network alignment aims to align two networks node-wise in order to maximize the observed common connection (edge) topology between them. The topological alignment of two protein-protein interaction (PPI) networks should thus expose protein pairs with similar interaction partners allowing, for example, the prediction of common Gene Ontology (GO) terms. Unfortunately, no network alignment algorithm based on topology alone has been able to achieve this aim, though those that include sequence similarity have seen some success. We argue that this failure of topology alone is due to the sparsity and incompleteness of the PPI network data of almost all species, which provides the network topology with a small signal-to-noise ratio that is effectively swamped when sequence information is added to the mix. Here we show that the weak signal can be detected using multiple stochastic samples of "good" topological network alignments, which allows us to observe regions of the two networks that are robustly aligned across multiple samples. The resulting network alignment frequency (NAF) strongly correlates with GO-based Resnik semantic similarity and enables the first successful cross-species predictions of GO terms based on topology-only network alignments. Our best predictions have an AUPR of about 0.4, which is competitive with state-of-the-art algorithms, even when there is no observable sequence similarity and no known homology relationship. While our results provide only a "proof of concept" on existing network data, we hypothesize that predicting GO terms from topology-only network alignments will become increasingly practical as the volume and quality of PPI network data increase.
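The idea of observing pairs that are robustly aligned across multiple samples can be sketched in a few lines. The snippet below assumes, as an interpretation of the abstract rather than the authors' definition, that the alignment frequency of a protein pair is the fraction of sampled alignments in which that pair is aligned; the function name and the dictionary representation of an alignment are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Tuple

Alignment = Dict[Hashable, Hashable]  # node in network 1 -> node in network 2


def alignment_frequency(
    alignments: List[Alignment],
) -> Dict[Tuple[Hashable, Hashable], float]:
    """Fraction of sampled alignments in which each node pair is aligned.

    Assumption (not from the paper): each sample is a dict mapping a node
    of the first network to the node it is aligned with in the second.
    """
    if not alignments:
        raise ValueError("at least one alignment sample is required")
    counts: Counter = Counter()
    for alignment in alignments:
        # Each (u, v) item is one aligned pair in this sample.
        counts.update(alignment.items())
    n_samples = len(alignments)
    return {pair: count / n_samples for pair, count in counts.items()}


if __name__ == "__main__":
    # Three toy samples over made-up node labels.
    samples = [
        {"a": "x", "b": "y"},
        {"a": "x", "b": "z"},
        {"a": "x", "b": "y"},
    ]
    for pair, freq in sorted(alignment_frequency(samples).items()):
        print(pair, round(freq, 3))
```

Pairs with a frequency near 1 are the ones that recur across samples; how such frequencies are thresholded or used for GO term prediction is left to the paper.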
Affiliation(s)
- Siyue Wang
  - Department of Computer Science, University of California, Irvine, CA, 92697-3435, USA
- Giles R S Atkinson
  - Department of Computer Science, University of California, Irvine, CA, 92697-3435, USA
- Wayne B Hayes
  - Department of Computer Science, University of California, Irvine, CA, 92697-3435, USA
8
Network-Based Approaches for Disease-Gene Association Prediction Using Protein-Protein Interaction Networks. Int J Mol Sci 2022; 23:ijms23137411. [PMID: 35806415 PMCID: PMC9266751 DOI: 10.3390/ijms23137411] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 05/09/2022] [Revised: 06/25/2022] [Accepted: 06/30/2022] [Indexed: 01/02/2023]
Abstract
Genome-wide association studies (GWAS) can be used to infer genome intervals that are involved in genetic diseases. However, investigating a large number of putative mutations for GWAS is resource- and time-intensive. Network-based computational approaches are being used for efficient disease-gene association prediction. Network-based methods are based on the underlying assumption that the genes causing the same diseases are located close to each other in a molecular network, such as a protein-protein interaction (PPI) network. In this survey, we provide an overview of network-based disease-gene association prediction methods based on three categories: graph-theoretic algorithms, machine learning algorithms, and an integration of these two. We experimented with six selected methods to compare their prediction performance using a heterogeneous network constructed by combining a genome-wide weighted PPI network, an ontology-based disease network, and disease-gene associations. The experiment was conducted in two different settings according to the presence and absence of known disease-associated genes. The results revealed that HerGePred, an integrative method, outperformed in the presence of known disease-associated genes, whereas PRINCE, which adopted a network propagation algorithm, was the most competitive in the absence of known disease-associated genes. Overall, the results demonstrated that the integrative methods performed better than the methods using graph-theory only, and the methods using a heterogeneous network performed better than those using a homogeneous PPI network only.
9
Zhang Y, Duan L, Zheng H, Li-Ling J, Qin R, Chen Z, He C, Wang T. Mining Similar Aspects for Gene Similarity Explanation Based on Gene Information Network. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1734-1746. [PMID: 33259307 DOI: 10.1109/tcbb.2020.3041559] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/12/2023]
Abstract
Analysis of gene similarity not only can provide information on the understanding of the biological roles and functions of a gene, but may also reveal the relationships among various genes. In this paper, we introduce a novel idea of mining similar aspects from a gene information network, i.e., for a given gene pair, we want to know in which aspects (meta paths) they are most similar from the perspective of the gene information network. We defined a similarity metric based on the set of meta paths connecting the query genes in the gene information network and used the rank of similarity of a gene pair in a meta path set to measure the similarity significance in that aspect. A minimal set of gene meta paths where the query gene pair ranks the highest is a similar aspect, and the similar aspect of a query gene pair is far from trivial. We proposed a novel method, SCENARIO, to investigate minimal similar aspects. Our empirical study on the gene information network, constructed from six public gene-related databases, verified that our proposed method is effective, efficient, and useful.
10
Eid R, Landès C, Pernet A, Benoît E, Santagostini P, Ghaziri AE, Bourbeillon J. DIVIS: a semantic DIstance to improve the VISualisation of heterogeneous phenotypic datasets. BioData Min 2022; 15:10. [PMID: 35379292 PMCID: PMC8981856 DOI: 10.1186/s13040-022-00293-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/22/2021] [Accepted: 02/27/2022] [Indexed: 11/24/2022]
Abstract
Background Thanks to the wider spread of high-throughput experimental techniques, biologists are accumulating large numbers of datasets, which often mix quantitative and qualitative variables and are not always complete, in particular when they concern phenotypic traits. To get a first insight into these datasets and reduce the size of the data matrices, scientists often rely on multivariate analysis techniques. However, such approaches are not always easily practicable, in particular when faced with mixed datasets. Moreover, displaying large numbers of individuals leads to cluttered visualisations that are difficult to interpret. Results We introduced a new methodology to overcome these limits. Its main feature is a new semantic distance tailored for both quantitative and qualitative variables, which allows for a realistic representation of the relationships between individuals (phenotypic descriptions in our case). This semantic distance is based on ontologies that are engineered to represent real-life knowledge regarding the underlying variables. For easier handling by biologists, we incorporated its use into a complete tool, from raw data file to visualisation. Following the distance calculation, the next steps performed by the tool consist of (i) grouping similar individuals, (ii) representing each group by emblematic individuals we call archetypes and (iii) building sparse visualisations based on these archetypes. Our approach was implemented as a Python pipeline and applied to a rosebush dataset including passport and phenotypic data. Conclusions The introduction of our new semantic distance and of the archetype concept allowed us to build a comprehensive representation of an incomplete dataset characterised by a large proportion of qualitative data. The methodology described here could have wider use beyond information characterizing organisms or species and beyond plant science. Indeed, we could apply the same approach to any mixed dataset.
Supplementary Information The online version contains supplementary material available at (10.1186/s13040-022-00293-y).
Affiliation(s)
- Rayan Eid
  - Institut Agro, Univ Angers, INRAE, IRHS, SFR QuaSaV, Angers, 49000, France
- Claudine Landès
  - Institut Agro, Univ Angers, INRAE, IRHS, SFR QuaSaV, Angers, 49000, France
- Alix Pernet
  - Institut Agro, Univ Angers, INRAE, IRHS, SFR QuaSaV, Angers, 49000, France
- Julie Bourbeillon
  - Institut Agro, Univ Angers, INRAE, IRHS, SFR QuaSaV, Angers, 49000, France
11
Mallick K, Mallik S, Bandyopadhyay S, Chakraborty S. A Novel Graph Topology-Based GO-Similarity Measure for Signature Detection From Multi-Omics Data and its Application to Other Problems. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:773-785. [PMID: 32866101 DOI: 10.1109/tcbb.2020.3020537] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 06/11/2023]
Abstract
Large-scale multi-omics data analysis and signature prediction have been topics of interest over the last two decades. Although various traditional clustering- and correlation-based methods have been proposed, the overall prediction is not always satisfactory. To address these challenges, we propose a new approach that leverages Gene Ontology (GO) similarity combined with multi-omics data. A new GO similarity measure, ModSchlicker, is proposed, and the effectiveness of the proposed measure along with other standardized measures is reviewed using various graph topology-based Information Content (IC) values of GO terms. The proposed measure is deployed for PPI prediction. Furthermore, by involving GO similarity, we propose a new framework for stronger disease-based gene signature detection from multi-omics data. For the first objective, we predict interactions from various benchmark PPI datasets of yeast and human species. For the latter, gene expression and methylation profiles are used to identify Differentially Expressed and Methylated (DEM) genes. Thereafter, the GO similarity score along with a statistical method is used to determine the potential gene signature. Interestingly, the proposed method produces better performance (0.9 avg. accuracy and 0.95 AUC) than other existing related methods during the classification of the participating features (genes) of the signature. Moreover, the proposed method is highly useful in other prediction/classification problems for any kind of large-scale omics data.
12
Ray A. Machine learning in postgenomic biology and personalized medicine. Wiley Interdisciplinary Reviews. Data Mining and Knowledge Discovery 2022; 12:e1451. [PMID: 35966173 PMCID: PMC9371441 DOI: 10.1002/widm.1451] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2020] [Accepted: 12/22/2021] [Indexed: 06/15/2023]
Abstract
In recent years, artificial intelligence in the form of machine learning has been revolutionizing biology, biomedical sciences, and gene-based agricultural technology capabilities. On the one hand, the massive data generated in the biological sciences by rapid and deep gene sequencing and by protein or other molecular structure determination require data analysis capabilities using machine learning that are distinctly different from classical statistical methods; on the other, these large datasets are enabling the adoption of novel data-intensive machine learning algorithms for the solution of biological problems that until recently had relied on mechanistic model-based approaches that are computationally expensive. This review provides a bird's eye view of the applications of machine learning in post-genomic biology. An attempt is also made to indicate, as far as possible, the areas of research that are poised to make further impacts in these areas, including the importance of explainable artificial intelligence (XAI) in human health. Further contributions of machine learning are expected to transform medicine, public health, and agricultural technology, as well as to provide invaluable gene-based guidance for the management of complex environments in this age of global warming.
Affiliation(s)
- Animesh Ray
  - Riggs School of Applied Life Sciences, Keck Graduate Institute, 535 Watson Drive, Claremont, CA 91711, USA
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, USA
13
Edera AA, Milone DH, Stegmayer G. Anc2vec: embedding gene ontology terms by preserving ancestors relationships. Brief Bioinform 2022; 23:6523148. [PMID: 35136916 DOI: 10.1093/bib/bbac003] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/20/2021] [Revised: 12/13/2021] [Accepted: 01/04/2022] [Indexed: 12/11/2022]
Abstract
The gene ontology (GO) provides a hierarchical structure with a controlled vocabulary composed of terms describing functions and localization of gene products. Recent works propose vector representations, also known as embeddings, of GO terms that capture meaningful information about them. Significant performance improvements have been observed when these representations are used on diverse downstream tasks, such as the measurement of semantic similarity between GO terms and functional similarity between proteins. Despite the success shown by these approaches, existing embeddings of GO terms still fail to capture crucial structural features of the GO. Here, we present anc2vec, a novel protocol based on neural networks for constructing vector representations of GO terms by preserving three important ontological features: its ontological uniqueness, ancestors hierarchy and sub-ontology membership. The advantages of using anc2vec are demonstrated by systematic experiments on diverse tasks: visualization, sub-ontology prediction, inference of structurally related terms, retrieval of terms from aggregated embeddings, and prediction of protein-protein interactions. In these tasks, experimental results show that the performance of anc2vec representations is better than those of recent approaches. This demonstrates that higher performances on diverse tasks can be achieved by embeddings when the structure of the GO is better represented. Full source code and data are available at https://github.com/sinc-lab/anc2vec.
Affiliation(s)
- Alejandro A Edera
  - Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH-UNL, CONICET, Ciudad Universitaria UNL, 3000, Santa Fe, Argentina
- Diego H Milone
  - Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH-UNL, CONICET, Ciudad Universitaria UNL, 3000, Santa Fe, Argentina
- Georgina Stegmayer
  - Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH-UNL, CONICET, Ciudad Universitaria UNL, 3000, Santa Fe, Argentina
14
Guzzi PH, Tradigo G, Veltri P. Using dual-network-analyser for communities detecting in dual networks. BMC Bioinformatics 2022; 22:614. [PMID: 35012460 PMCID: PMC8750846 DOI: 10.1186/s12859-022-04564-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 12/09/2021] [Accepted: 01/03/2022] [Indexed: 12/25/2022]
Abstract
BACKGROUND Representations of the relationships among data using networks are widely used in several research fields such as computational biology, medical informatics and social network mining. Recently, complex networks have been introduced to better capture the insights of the modelled scenarios. Among others, dual networks (DNs) consist of mapping information as pairs of networks containing the same set of nodes but with different edges: one, called physical network, has unweighted edges, while the other, called conceptual network, has weighted edges. RESULTS We focus on DNs and we propose a tool to find common subgraphs (aka communities) in DNs with particular properties. The tool, called Dual-Network-Analyser, is based on the identification of communities that induce optimal modular subgraphs in the conceptual network and connected subgraphs in the physical one. It includes the Louvain algorithm applied to the considered case. The Dual-Network-Analyser can be used to study DNs, to find common modular communities. We report results on using the tool to identify communities on synthetic DNs as well as real cases in social networks and biological data. CONCLUSION The proposed method has been tested by using synthetic and biological networks. Results demonstrate that it is well able to detect meaningful information from DNs.
Affiliation(s)
- Pietro Hiram Guzzi
- Department of Surgical and Medical Sciences, Magna Graecia University, 88100 Catanzaro, Italy
- Pierangelo Veltri
- Department of Surgical and Medical Sciences, Magna Graecia University, 88100 Catanzaro, Italy
15
Lastra-Díaz JJ, Lara-Clares A, Garcia-Serrano A. HESML: a real-time semantic measures library for the biomedical domain with a reproducible survey. BMC Bioinformatics 2022; 23:23. [PMID: 34991460 PMCID: PMC8734250 DOI: 10.1186/s12859-021-04539-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2020] [Accepted: 12/15/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND Ontology-based semantic similarity measures based on SNOMED-CT, MeSH, and Gene Ontology are being extensively used in many applications in biomedical text mining and genomics respectively, which has encouraged the development of semantic measures libraries based on the aforementioned ontologies. However, current state-of-the-art semantic measures libraries have some performance and scalability drawbacks derived from their ontology representations based on relational databases, or naive in-memory graph representations. Likewise, a recent reproducible survey on word similarity shows that one hybrid IC-based measure which integrates a shortest-path computation sets the state of the art in the family of ontology-based semantic measures. However, the lack of an efficient shortest-path algorithm for their real-time computation prevents both their practical use in any application and the use of any other path-based semantic similarity measure. RESULTS To bridge the two aforementioned gaps, this work introduces for the first time an updated version of the HESML Java software library especially designed for the biomedical domain, which implements the most efficient and scalable ontology representation reported in the literature, together with a new method for the approximation of Dijkstra's algorithm for taxonomies, called Ancestors-based Shortest-Path Length (AncSPL), which allows the real-time computation of any path-based semantic similarity measure. CONCLUSIONS We introduce a set of reproducible benchmarks showing that HESML outperforms by several orders of magnitude the current state-of-the-art libraries in the three aforementioned biomedical ontologies, as well as the real-time performance and approximation quality of the new AncSPL shortest-path algorithm. Likewise, we show that AncSPL scales linearly with the dimension of the common ancestor subgraph regardless of the ontology size. Path-based measures based on the new AncSPL algorithm are up to six orders of magnitude faster than their exact implementation in large ontologies like SNOMED-CT and GO. Finally, we provide a detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results.
Affiliation(s)
- Juan J. Lastra-Díaz
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal 16, 28040 Madrid, Spain
- Alicia Lara-Clares
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal 16, 28040 Madrid, Spain
- Ana Garcia-Serrano
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal 16, 28040 Madrid, Spain
16
Pesaranghader A, Matwin S, Sokolova M, Grenier JC, Beiko RG, Hussin J. OUP accepted manuscript. Bioinformatics 2022; 38:3051-3061. [PMID: 35536192 PMCID: PMC9154256 DOI: 10.1093/bioinformatics/btac304] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/21/2021] [Revised: 02/12/2022] [Indexed: 11/24/2022]
Abstract
Motivation There is a plethora of measures to evaluate functional similarity (FS) of genes based on their co-expression, protein–protein interactions and sequence similarity. These measures are typically derived from hand-engineered and application-specific metrics to quantify the degree of shared information between two genes using their Gene Ontology (GO) annotations. Results We introduce deepSimDEF, a deep learning method to automatically learn FS estimation of gene pairs given a set of genes and their GO annotations. deepSimDEF’s key novelty is its ability to learn low-dimensional embedding vector representations of GO terms and gene products and then calculate FS using these learned vectors. We show that deepSimDEF can predict the FS of new genes using their annotations: it outperformed all other FS measures by >5–10% on yeast and human reference datasets on protein–protein interactions, gene co-expression and sequence homology tasks. Thus, deepSimDEF offers a powerful and adaptable deep neural architecture that can benefit a wide range of problems in genomics and proteomics, and its architecture is flexible enough to support its extension to any organism. Availability and implementation Source code and data are available at https://github.com/ahmadpgh/deepSimDEF Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Stan Matwin
- Faculty of Computer Science, Dalhousie University, Halifax B3H 4R2, Canada
- Institute for Big Data Analytics, Dalhousie University, Halifax B3H 4R2, Canada
- Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
- Marina Sokolova
- Institute for Big Data Analytics, Dalhousie University, Halifax B3H 4R2, Canada
- Faculty of Medicine and Faculty of Engineering, University of Ottawa, Ottawa K1H 8M5, Canada
- Robert G Beiko
- Faculty of Computer Science, Dalhousie University, Halifax B3H 4R2, Canada
- Institute for Big Data Analytics, Dalhousie University, Halifax B3H 4R2, Canada
17
Milano M. Using Gene Ontology to Annotate and Prioritize Microarray Data. Methods Mol Biol 2022; 2401:273-287. [PMID: 34902135 DOI: 10.1007/978-1-0716-1839-4_18] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
The results of high-throughput experiments consist of numerous candidate genes, proteins, or other molecules potentially associated with diseases. A challenge for omics science is extracting knowledge from these results and filtering promising gene or protein candidates. In clinical scenarios in particular, a current focus is highlighting the behavior of the few molecules related to a specific disease. In this context, different computational approaches, also referred to as gene prioritization methods, aim to identify the genes most related to a disease among a larger set of candidate genes. The identification requires the use of domain-specific knowledge that is often encoded into ontologies.
Affiliation(s)
- Marianna Milano
- Department of Medical and Surgical Sciences, University of Catanzaro, Catanzaro, Italy.
18
Jung YS, Kim Y, Cho YR. Comparative analysis of network-based approaches and machine learning algorithms for predicting drug-target interactions. Methods 2021; 198:19-31. [PMID: 34737033 DOI: 10.1016/j.ymeth.2021.10.007] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 05/31/2021] [Revised: 10/21/2021] [Accepted: 10/22/2021] [Indexed: 01/06/2023]
Abstract
Computational prediction of drug-target interactions (DTIs) is of particular importance in the process of drug repositioning because of its efficiency in selecting potential candidates for DTIs. A variety of computational methods for predicting DTIs have been proposed over the past decade. Our interest is which methods or techniques are the most advantageous for increasing prediction accuracy. This article provides a comprehensive overview of network-based, machine learning, and integrated DTI prediction methods. The network-based methods handle a DTI network along with drug and target similarities in a matrix form and apply graph-theoretic algorithms to identify new DTIs. Machine learning methods use known DTIs and the features of drugs and target proteins as training data to build a predictive model. Integrated methods combine these two techniques. We assessed the prediction performance of the selected state-of-the-art methods using two different benchmark datasets. Our experimental results demonstrate that the integrated methods outperform the others in general. Some previous methods showed low accuracy on predicting interactions of unknown drugs which do not exist in the training dataset. Combining similarity matrices from multiple features by data fusion was not beneficial in increasing prediction accuracy. Finally, we analyzed future directions for further improvements in DTI predictions.
Affiliation(s)
- Yi-Sue Jung
- Division of Software, Yonsei University - Mirae Campus, Republic of Korea
- Yoonbee Kim
- Division of Software, Yonsei University - Mirae Campus, Republic of Korea
- Young-Rae Cho
- Division of Software, Yonsei University - Mirae Campus, Republic of Korea; Division of Digital Healthcare, Yonsei University - Mirae Campus, Republic of Korea.
19
Dondi R, Hosseinzadeh MM, Guzzi PH. A novel algorithm for finding top-k weighted overlapping densest connected subgraphs in dual networks. Applied Network Science 2021; 6:40. [PMID: 34124340 PMCID: PMC8179714 DOI: 10.1007/s41109-021-00381-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/26/2021] [Accepted: 05/20/2021] [Indexed: 06/12/2023]
Abstract
The use of networks for modelling and analysing relations among data is currently growing. Recently, the use of a single network for capturing all the aspects of some complex scenarios has shown some limitations. Consequently, it has been proposed to use Dual Networks (DN), a pair of related networks, to analyse complex systems. The two graphs in a DN have the same set of vertices and different edge sets. Common subgraphs among these networks may convey some insights about the modelled scenarios. For instance, the detection of the Top-k Densest Connected subgraphs, i.e. a set of k subgraphs having the largest density in the conceptual network which are also connected in the physical network, may reveal sets of highly related nodes. After proposing a formalisation of the approach, we propose a heuristic to find a solution, since the problem is computationally hard. A set of experiments on synthetic and real networks is also presented to support our approach.
Affiliation(s)
- Riccardo Dondi
- Department of Science, University of Bergamo, Bergamo, Italy
- Pietro H. Guzzi
- Department of Surgical and Medical Sciences, Magna Graecia University, Catanzaro, Italy
20
Liu-Wei W, Kafkas Ş, Chen J, Dimonaco NJ, Tegnér J, Hoehndorf R. DeepViral: prediction of novel virus-host interactions from protein sequences and infectious disease phenotypes. Bioinformatics 2021; 37:2722-2729. [PMID: 33682875 PMCID: PMC8428617 DOI: 10.1093/bioinformatics/btab147] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 08/13/2020] [Revised: 01/18/2021] [Accepted: 03/01/2021] [Indexed: 11/12/2022]
Abstract
MOTIVATION Infectious diseases caused by novel viruses have become a major public health concern. Rapid identification of virus-host interactions can reveal mechanistic insights into infectious diseases and shed light on potential treatments. Current computational prediction methods for novel viruses are based mainly on protein sequences. However, it is not clear to what extent other important features, such as the symptoms caused by the viruses, could contribute to a predictor. Disease phenotypes (i.e., signs and symptoms) are readily accessible from clinical diagnosis and we hypothesize that they may act as a potential proxy and an additional source of information for the underlying molecular interactions between the pathogens and hosts. RESULTS We developed DeepViral, a deep learning based method that predicts protein-protein interactions (PPI) between humans and viruses. Motivated by the potential utility of infectious disease phenotypes, we first embedded human proteins and viruses in a shared space using their associated phenotypes and functions, supported by formalized background knowledge from biomedical ontologies. By jointly learning from protein sequences and phenotype features, DeepViral significantly improves over existing sequence-based methods for intra- and inter-species PPI prediction. AVAILABILITY Code and datasets for reproduction and customization are available at https://github.com/bio-ontology-research-group/DeepViral. Prediction results for 14 virus families are available at https://doi.org/10.5281/zenodo.4429824.
Affiliation(s)
- Wang Liu-Wei
- Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
- Şenay Kafkas
- Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia; Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
- Jun Chen
- Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
- Nicholas J Dimonaco
- Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, SY23 3BQ, Wales, UK
- Jesper Tegnér
- Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia; Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
- Robert Hoehndorf
- Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia; Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
21
Gnanavel M, Murugesan A, Konda Mani S, Yli-Harja O, Kandhavelu M. Identifying the miRNA Signature Association with Aging-Related Senescence in Glioblastoma. Int J Mol Sci 2021; 22:517. [PMID: 33419230 PMCID: PMC7825621 DOI: 10.3390/ijms22020517] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 12/14/2020] [Revised: 12/30/2020] [Accepted: 01/04/2021] [Indexed: 12/13/2022]
Abstract
Glioblastoma (GBM) is the most common malignant brain tumor, and its malignant phenotypic characteristics are classified as grade IV tumors. Molecular interactions, such as protein–protein, protein–ncRNA, and protein–peptide interactions, are crucial to transfer the signaling communications in cellular signaling pathways. Evidence suggests that signaling pathways of stem cells are also activated, which helps the propagation of GBM. Hence, it is important to identify a common signaling pathway that could be visible from multiple GBM gene expression data. microRNA signaling is considered important in GBM signaling, which needs further validation. We performed a high-throughput analysis using microarray expression profiles from 574 samples to explore the role of non-coding RNAs in the disease progression and unique signaling communication in GBM. A series of computational methods involving miRNA expression, gene ontology (GO) based gene enrichment, pathway mapping, annotation from metabolic pathway databases, and network analysis were used for the analysis. Our study revealed the physiological roles of many known and novel miRNAs in cancer signaling, especially concerning signaling in cancer progression and proliferation. Overall, the results revealed a strong connection with stress-induced senescence, significant miRNA targets for cell cycle arrest, and many common signaling pathways to GBM in the network.
Affiliation(s)
- Mutharasu Gnanavel
- BioMediTech Institute, Faculty of Medicine and Health Technology, Tampere University, Arvo Ylpön katu 34, 33520 Tampere, Finland
- Akshaya Murugesan
- BioMediTech Institute, Faculty of Medicine and Health Technology, Tampere University, Arvo Ylpön katu 34, 33520 Tampere, Finland
- Molecular Signalling Lab, Faculty of Medicine and Health Technology, Tampere University, P.O. Box 553, 33101 Tampere, Finland
- Department of Biotechnology, Lady Doak College, Thallakulam, Madurai 625002, India
- Saravanan Konda Mani
- Center for High Performance Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
- Olli Yli-Harja
- BioMediTech Institute, Faculty of Medicine and Health Technology, Tampere University, Arvo Ylpön katu 34, 33520 Tampere, Finland
- Computational Systems Biology Group, Faculty of Medicine and Health Technology, Tampere University, P.O. Box 553, 33101 Tampere, Finland
- Institute for Systems Biology, 1441 N 34th Street, Seattle, WA 98109, USA
- Meenakshisundaram Kandhavelu
- BioMediTech Institute, Faculty of Medicine and Health Technology, Tampere University, Arvo Ylpön katu 34, 33520 Tampere, Finland
- Molecular Signalling Lab, Faculty of Medicine and Health Technology, Tampere University, P.O. Box 553, 33101 Tampere, Finland
- Science Center, Tampere University Hospital, Arvo Ylpön katu 34, 33520 Tampere, Finland
22
Milano M, Milenković T, Cannataro M, Guzzi PH. L-HetNetAligner: A novel algorithm for Local Alignment of Heterogeneous Biological Networks. Sci Rep 2020; 10:3901. [PMID: 32127586 PMCID: PMC7054427 DOI: 10.1038/s41598-020-60737-5] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 11/08/2018] [Accepted: 02/11/2020] [Indexed: 11/10/2022]
Abstract
Networks are widely used for modelling and analysing a wide range of biological data. As a consequence, many different research efforts have resulted in the introduction of a large number of algorithms for analysis and comparison of networks. Many of these algorithms can deal with networks with a single class of nodes and edges, also referred to as homogeneous networks. Recently, many different approaches tried to integrate into a single model the interplay of different molecules. A possible formalism to model such a scenario comes from node/edge-coloured networks (also known as heterogeneous networks) implemented as node/edge-coloured graphs. Therefore, the need for the introduction of algorithms able to compare heterogeneous networks arises. We here focus on the local comparison of heterogeneous networks, and we formulate it as a network alignment problem. To the best of our knowledge, the local alignment of heterogeneous networks has not been explored in the past. We here propose L-HetNetAligner, a novel algorithm that receives as input two heterogeneous networks (node-coloured graphs) and builds a local alignment of them. We also implemented and tested our algorithm. Our results confirm that our method builds high-quality alignments. The following website contains Supplementary File 1 material and the code.
Affiliation(s)
- Marianna Milano
- Department of Surgical and Medical Sciences, University of Catanzaro, Catanzaro, 88040, Italy
- Tijana Milenković
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, Indiana, USA
- Mario Cannataro
- Department of Surgical and Medical Sciences, University of Catanzaro, Catanzaro, 88040, Italy
- Data Analytics Research Center, University of Catanzaro, Catanzaro, Italy
- Pietro Hiram Guzzi
- Department of Surgical and Medical Sciences, University of Catanzaro, Catanzaro, 88040, Italy.
- Data Analytics Research Center, University of Catanzaro, Catanzaro, Italy.
23
Sousa RT, Silva S, Pesquita C. Evolving knowledge graph similarity for supervised learning in complex biomedical domains. BMC Bioinformatics 2020; 21:6. [PMID: 31900127 PMCID: PMC6942314 DOI: 10.1186/s12859-019-3296-1] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 05/06/2019] [Accepted: 11/27/2019] [Indexed: 01/22/2023]
Abstract
Background In recent years, biomedical ontologies have become important for describing existing biological knowledge in the form of knowledge graphs. Data mining approaches that work with knowledge graphs have been proposed, but they are based on vector representations that do not capture the full underlying semantics. An alternative is to use machine learning approaches that explore semantic similarity. However, since ontologies can model multiple perspectives, semantic similarity computations for a given learning task need to be fine-tuned to account for this. Obtaining the best combination of semantic similarity aspects for each learning task is not trivial and typically depends on expert knowledge. Results We have developed a novel approach, evoKGsim, that applies Genetic Programming over a set of semantic similarity features, each based on a semantic aspect of the data, to obtain the best combination for a given supervised learning task. The approach was evaluated on several benchmark datasets for protein-protein interaction prediction using the Gene Ontology as the knowledge graph to support semantic similarity, and it outperformed competing strategies, including manually selected combinations of semantic aspects emulating expert knowledge. evoKGsim was also able to learn species-agnostic models with different combinations of species for training and testing, effectively addressing the limitations of predicting protein-protein interactions for species with fewer known interactions. Conclusions evoKGsim can overcome one of the limitations in knowledge graph-based semantic similarity applications: the need to expertly select which aspects should be taken into account for a given application. Applying this methodology to protein-protein interaction prediction proved successful, paving the way to broader applications.
Affiliation(s)
- Rita T Sousa
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal.
- Sara Silva
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal
- Catia Pesquita
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal
24
Cardoso C, Sousa RT, Köhler S, Pesquita C. A Collection of Benchmark Data Sets for Knowledge Graph-based Similarity in the Biomedical Domain. Database (Oxford) 2020; 2020:baaa078. [PMID: 33181823 PMCID: PMC7661097 DOI: 10.1093/database/baaa078] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 06/22/2020] [Revised: 08/13/2020] [Accepted: 08/24/2020] [Indexed: 01/12/2023]
Abstract
The ability to compare entities within a knowledge graph is a cornerstone technique for several applications, ranging from the integration of heterogeneous data to machine learning. It is of particular importance in the biomedical domain, where semantic similarity can be applied to the prediction of protein-protein interactions, associations between diseases and genes, cellular localization of proteins, among others. In recent years, several knowledge graph-based semantic similarity measures have been developed, but building a gold standard data set to support their evaluation is non-trivial. We present a collection of 21 benchmark data sets that aim at circumventing the difficulties in building benchmarks for large biomedical knowledge graphs by exploiting proxies for biomedical entity similarity. These data sets include data from two successful biomedical ontologies, Gene Ontology and Human Phenotype Ontology, and explore proxy similarities calculated based on protein sequence similarity, protein family similarity, protein-protein interactions and phenotype-based gene similarity. Data sets have varying sizes and cover four different species at different levels of annotation completion. For each data set, we also provide semantic similarity computations with state-of-the-art representative measures. Database URL: https://github.com/liseda-lab/kgsim-benchmark.
Affiliation(s)
- Carlota Cardoso
- Departamento de informática, LASIGE Faculdade de Ciências da Universidade de Lisboa, 1749-016 Lisboa, Portugal
- Rita T Sousa
- Departamento de informática, LASIGE Faculdade de Ciências da Universidade de Lisboa, 1749-016 Lisboa, Portugal
- Catia Pesquita
- Departamento de informática, LASIGE Faculdade de Ciências da Universidade de Lisboa, 1749-016 Lisboa, Portugal
25
Yu G, Wang K, Fu G, Guo M, Wang J. NMFGO: Gene Function Prediction via Nonnegative Matrix Factorization with Gene Ontology. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:238-249. [PMID: 30059316 DOI: 10.1109/tcbb.2018.2861379] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 06/08/2023]
Abstract
Gene Ontology (GO) is a controlled vocabulary of terms that describe the molecular functions, biological roles, and cellular locations of gene products (i.e., proteins and RNAs); it hierarchically organizes more than 43,000 GO terms via a directed acyclic graph. A gene is generally annotated with several of these GO terms. Therefore, accurately predicting the associations between genes and this massive set of terms is a difficult challenge. To address this challenge, we propose a matrix factorization based approach called NMFGO. NMFGO stores the available GO annotations of genes in a gene-term association matrix and adopts an ontological structure based taxonomic similarity measure to capture the GO hierarchy. Next, it factorizes the association matrix into two low-rank matrices via nonnegative matrix factorization regularized with the GO hierarchy. After that, it employs a semantic similarity based k nearest neighbor classifier in the subspace approximated by the low-rank matrices to predict gene functions. An empirical study on three model species (S. cerevisiae, H. sapiens, and A. thaliana) shows that NMFGO is robust to the input parameters and achieves significantly better prediction performance than GIC, TO, dRW-kNN, and NtN, which were re-implemented based on the instructions of the original papers. The supplementary file and demo codes of NMFGO are available at http://mlda.swu.edu.cn/codes.php?name=NMFGO.
26
Maskey S, Cho YR. LePrimAlign: local entropy-based alignment of PPI networks to predict conserved modules. BMC Genomics 2019; 20:964. [PMID: 31874635 PMCID: PMC6929407 DOI: 10.1186/s12864-019-6271-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/30/2022]
Abstract
Background Cross-species analysis of protein-protein interaction (PPI) networks provides an effective means of detecting conserved interaction patterns. Identifying such conserved substructures between PPI networks of different species increases our understanding of the principles driving the evolution of cellular organizations and their functions at a system level. In recent years, network alignment techniques have been applied to genome-scale PPI networks to predict evolutionarily conserved modules. Although a wide variety of network alignment algorithms have been introduced, developing a scalable local network alignment algorithm with high accuracy is still challenging. Results We present a novel pairwise local network alignment algorithm, called LePrimAlign, to predict conserved modules between PPI networks of three different species. The proposed algorithm exploits the results of a pairwise global alignment algorithm with many-to-many node mapping. It also applies the concept of graph entropy to detect initial cluster pairs from two networks. Finally, the initial clusters are expanded to increase the local alignment score, which is formulated as a combination of intra-network and inter-network scores. The performance comparison with state-of-the-art approaches demonstrates that the proposed algorithm outperforms them in terms of the accuracy of identified protein complexes and the quality of alignments. Conclusion The proposed method produces local network alignments of higher accuracy in predicting conserved modules, even with large biological networks, at a reduced computational cost.
Affiliation(s)
- Sawal Maskey
- Department of Computer Science, Baylor University, One Bear Place #97141, Waco, 76798, TX, USA
- Young-Rae Cho
- Department of Computer Science, Baylor University, One Bear Place #97141, Waco, 76798, TX, USA; Bioinformatics Program, Baylor University, One Bear Place #97141, Waco, 76798, TX, USA.
27
28
Zhang J, Zhong C, Huang Y, Lin HX, Wang M. A method for identifying protein complexes with the features of joint co-localization and joint co-expression in static PPI networks. Comput Biol Med 2019; 111:103333. [PMID: 31376777 DOI: 10.1016/j.compbiomed.2019.103333] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 02/18/2019] [Revised: 06/01/2019] [Accepted: 06/17/2019] [Indexed: 02/09/2023]
Abstract
Identifying protein complexes in static protein-protein interaction (PPI) networks is essential for understanding the underlying mechanism of biological processes. Proteins in a complex are co-localized at the same place and co-expressed at the same time. We propose a novel method to identify protein complexes with the features of joint co-localization and joint co-expression in static PPI networks. To achieve this goal, we define a joint localization vector to construct a joint co-localization criterion of a protein group, and define a joint gene expression to construct a joint co-expression criterion of a gene group. Moreover, the functional similarity of proteins in a complex is an important characteristic. Thus, we use the CC-based, MF-based, and BP-based protein similarities to devise a functional similarity criterion to determine whether a protein is functionally similar to a protein cluster. Based on the core-attachment structure and following a seed-expanding strategy, we use four types of biological data, including PPI data with reliability scores, protein localization data, gene expression data, and gene ontology annotations, to identify protein complexes. The experimental results on yeast data show that, compared with existing methods, our proposed method can efficiently and exactly identify more protein complexes, especially more protein complexes of sizes from 2 to 6. Furthermore, the enrichment analysis demonstrates that the protein complexes identified by our method have significant biological meaning.
Affiliation(s)
- Jinxiong Zhang
- School of Computer Science and Engineering, South China University of Technology, Guangzhou, China; School of Computer, Electronics and Information, Guangxi University, Nanning, China.
| | - Cheng Zhong
- School of Computer, Electronics and Information, Guangxi University, Nanning, China.
| | - Yiran Huang
- School of Computer, Electronics and Information, Guangxi University, Nanning, China.
| | - Hai Xiang Lin
- Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Delft, the Netherlands.
| | - Mian Wang
- College of Life Science and Technology, Guangxi University, Nanning, China.
| |
|
29
|
Díaz-Montaña JJ, Díaz-Díaz N, Barranco CD, Ponzoni I. Development and use of a Cytoscape app for GRNCOP2. Computer Methods and Programs in Biomedicine 2019; 177:211-218. [PMID: 31319950 DOI: 10.1016/j.cmpb.2019.05.030] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 09/17/2018] [Revised: 05/05/2019] [Accepted: 05/29/2019] [Indexed: 06/10/2023]
Abstract
BACKGROUND AND OBJECTIVE Gene regulatory networks (GRNs) are essential for understanding most molecular processes. In this context, the so-called model-free approaches have an advantage in modeling the complex topologies behind these dynamic molecular networks, since most GRNs are difficult to map correctly by any other mathematical model. Abstract model-free approaches, also known as rule-based extraction methods, offer valuable benefits when performing data-driven analysis, such as requiring the least amount of data and simplifying the inference of large models at a faster analysis speed. In particular, GRNCOP2 is a combinatorial optimization method with an adaptive criterion for the discretization of gene expression data and high performance, in contrast to other rule-based extraction methods for discovering GRNs. However, the analysis of the large relational structures of the networks inferred by GRNCOP2 requires the support of effective tools for interactive network visualization and topological analysis of the extracted associations. This need motivated the integration of GRNCOP2 into the Cytoscape ecosystem in order to benefit from Cytoscape's core functionality, as well as all the other apps in its ecosystem. METHODS In this paper, we introduce the implementation of a GRNCOP2 Cytoscape app. This incorporation into the Cytoscape platform includes new functionality for GRN visualizations, dynamic user interaction and integration with other apps for topological analysis of the networks. RESULTS In order to demonstrate the usefulness of integrating GRNCOP2 in Cytoscape, the new app was used to tackle a novel use case for GRNCOP2: the analysis of crosstalk between pathways. In this regard, datasets associated with Alzheimer's disease (AD) were analyzed using the GRNCOP2 app and other apps of the Cytoscape ecosystem by performing a topological analysis of the AD progression and its synchronization with the Ubiquitin Mediated Proteolysis pathway. Finally, the biological relevance of the findings achieved by this new app was evaluated by searching for evidence in the literature. CONCLUSIONS The proposed crosstalk analysis with the new GRNCOP2 app focused on assessing the phase of Alzheimer's disease progression where the coordination with the Ubiquitin Mediated Proteolysis pathway increases, and identifying the genes that explain the signalling between these cellular processes. Both questions were explored by topological contrastive analysis of the GRNs generated by the GRNCOP2 app, where several facilities of Cytoscape were exploited. The topological patterns inferred by this new app are consistent with biological evidence reported in the scientific literature, illustrating the effectiveness of using the new GRNCOP2 app in pathway analysis. AVAILABILITY The GRNCOP2 app is freely available at the official Cytoscape app store: http://apps.cytoscape.org/apps/grncop2.
Affiliation(s)
- Juan J Díaz-Montaña
- Intelligent Data Analysis (DATAi), Division of Computer Science, Pablo de Olavide University, Seville ES-41013, Spain.
| | - Norberto Díaz-Díaz
- Intelligent Data Analysis (DATAi), Division of Computer Science, Pablo de Olavide University, Seville ES-41013, Spain.
| | - Carlos D Barranco
- Intelligent Data Analysis (DATAi), Division of Computer Science, Pablo de Olavide University, Seville ES-41013, Spain.
| | - Ignacio Ponzoni
- Instituto de Ciencias e Ingeniería de la Computación (UNS, CONICET), Departamento de Ciencias e Ingeniería de la Computación, Universidad Nacional del Sur (UNS), Bahía Blanca, Argentina.
| |
|
30
|
Guzzi PH, Milenkovic T. Survey of local and global biological network alignment: the need to reconcile the two sides of the same coin. Brief Bioinform 2019; 19:472-481. [PMID: 28062413 DOI: 10.1093/bib/bbw132] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 07/31/2016] [Indexed: 12/23/2022] Open
Abstract
Analogous to genomic sequence alignment that allows for across-species transfer of biological knowledge between conserved sequence regions, biological network alignment can be used to guide the knowledge transfer between conserved regions of molecular networks of different species. Hence, biological network alignment can be used to redefine the traditional notion of a sequence-based homology to a new notion of network-based homology. Analogous to genomic sequence alignment, there exist local and global biological network alignments. Here, we survey prominent and recent computational approaches of each network alignment type and discuss their (dis)advantages. Then, as it was recently shown that the two approach types are complementary, in the sense that they capture different slices of cellular functioning, we discuss the need to reconcile the two network alignment types and present a recent first step in this direction. We conclude with some open research problems on this topic and comment on the usefulness of network alignment in other domains besides computational biology.
Affiliation(s)
- Pietro Hiram Guzzi
- Department of Surgical and Medical Sciences, University Magna Graecia, Catanzaro, 88100 Italy
| | - Tijana Milenkovic
- Department of Computer Science and Engineering, Interdisciplinary Center for Network Science and Applications (iCeNSA), ECK Institute for Global Health, University of Notre Dame, Notre Dame, IN 46556, USA
| |
|
31
|
Hayes WB, Mamano N. SANA NetGO: a combinatorial approach to using Gene Ontology (GO) terms to score network alignments. Bioinformatics 2019; 34:1345-1352. [PMID: 29228175 DOI: 10.1093/bioinformatics/btx716] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 04/06/2017] [Accepted: 12/04/2017] [Indexed: 01/05/2023] Open
Abstract
Motivation Gene Ontology (GO) terms are frequently used to score alignments between protein-protein interaction (PPI) networks. Methods exist to measure GO similarity between proteins in isolation, but proteins in a network alignment are not isolated: each pairing is dependent on every other via the alignment itself. Existing measures fail to take into account the frequency of GO terms across networks, instead imposing arbitrary rules on when to allow GO terms. Results Here we develop NetGO, a new measure that naturally weighs infrequent, informative GO terms more heavily than frequent, less informative GO terms, without arbitrary cutoffs, instead downweighting GO terms according to their frequency in the networks being aligned. This is a global measure applicable only to alignments, independent of pairwise GO measures, in the same sense that the edge-based EC or S3 scores are global measures of topological similarity independent of pairwise topological similarities. We demonstrate the superiority of NetGO in alignments of predetermined quality and show that NetGO correlates with alignment quality better than any existing GO-based alignment measures. We also demonstrate that NetGO provides a measure of taxonomic similarity between species, consistent with existing taxonomic measures, a feature not shared with existing GO-based network alignment measures. Finally, we re-score alignments produced by almost a dozen aligners from a previous study and show that NetGO does a better job at separating good alignments from bad ones. Availability and implementation Available as part of SANA. Contact whayes@uci.edu. Supplementary information Supplementary data are available at Bioinformatics online.
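The idea of downweighting GO terms by their frequency in the networks being aligned can be illustrated with a small sketch. This is not the NetGO formula, which the abstract does not give: the inverse-frequency weight, the unnormalized sum, and the names `term_weights` and `weighted_alignment_score` are assumptions chosen only to show why rare shared terms should count more than common ones.

```python
from collections import Counter

def term_weights(annotations):
    """Assumed weighting: 1 / (number of proteins annotated with the term)."""
    counts = Counter(term for terms in annotations.values() for term in terms)
    return {term: 1.0 / count for term, count in counts.items()}

def weighted_alignment_score(alignment, ann1, ann2):
    """Sum the weights of GO terms shared by each aligned protein pair."""
    # Protein names are assumed to be unique across the two networks.
    weights = term_weights({**ann1, **ann2})
    score = 0.0
    for p1, p2 in alignment:
        score += sum(weights[t] for t in ann1[p1] & ann2[p2])
    return score

# Hypothetical toy annotations for two small networks.
ann1 = {"a1": {"GO:rare", "GO:common"}, "a2": {"GO:common"}}
ann2 = {"b1": {"GO:rare", "GO:common"}, "b2": {"GO:common"}}

good = weighted_alignment_score([("a1", "b1"), ("a2", "b2")], ann1, ann2)
bad = weighted_alignment_score([("a1", "b2"), ("a2", "b1")], ann1, ann2)
```

Under these assumptions the alignment that pairs the two proteins sharing the rare term scores higher than the one that matches only on the common term.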
Affiliation(s)
- Wayne B Hayes
- Department of Computer Science, University of California, Irvine, CA 92697-3435, USA
| | - Nil Mamano
- Department of Computer Science, University of California, Irvine, CA 92697-3435, USA
| |
|
32
|
Handling Big Data Scalability in Biological Domain Using Parallel and Distributed Processing: A Case of Three Biological Semantic Similarity Measures. BioMed Research International 2019; 2019:6750296. [PMID: 30809545 PMCID: PMC6369486 DOI: 10.1155/2019/6750296] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 09/12/2018] [Accepted: 01/13/2019] [Indexed: 11/30/2022]
Abstract
In the field of biology, researchers need to compare genes or gene products using semantic similarity measures (SSM). Continuous data growth and diversity in data characteristics comprise what is called big data; current biological SSMs cannot handle big data. Therefore, these measures need the ability to control the size of big data. We used parallel and distributed processing by splitting data into multiple partitions and applied SSM measures to each partition; this approach helped manage big data scalability and computational problems. Our solution involves three steps: split gene ontology (GO), data clustering, and semantic similarity calculation. To test this method, split GO and data clustering algorithms were defined and assessed for performance in the first two steps. Three of the best SSMs in biology [Resnik, Shortest Semantic Differentiation Distance (SSDD), and SORA] are enhanced by introducing threaded parallel processing, which is used in the third step. Our results demonstrate that introducing threads in SSMs reduced the time of calculating semantic similarity between gene pairs and improved performance of the three SSMs. Average time was reduced by 24.51% for Resnik, 22.93% for SSDD, and 33.68% for SORA. Total time was reduced by 8.88% for Resnik, 23.14% for SSDD, and 39.27% for SORA. Using these threaded measures in the distributed system, combined with using split GO and data clustering algorithms to split input data based on their similarity, reduced the average time more than did the approach of equally dividing input data. Time reduction increased with increasing number of splits. Time reduction percentage was 24.1%, 39.2%, and 66.6% for Threaded SSDD; 33.0%, 78.2%, and 93.1% for Threaded SORA in the case of 2, 3, and 4 slaves, respectively; and 92.04% for Threaded Resnik in the case of four slaves.
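The partition-then-thread approach described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `jaccard` is a stand-in similarity (it is not Resnik, SSDD, or SORA), the round-robin split replaces the paper's split-GO and clustering steps, and `score_partition`, `threaded_similarity`, and the toy annotations are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def jaccard(terms_a, terms_b):
    """Stand-in similarity: overlap of two GO-term sets."""
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0

def score_partition(pairs, annotations):
    """Score every gene pair in one partition."""
    return [(g1, g2, jaccard(annotations[g1], annotations[g2])) for g1, g2 in pairs]

def threaded_similarity(pairs, annotations, n_partitions=2):
    """Split the pairs round-robin and score each partition in its own thread."""
    chunks = [pairs[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        parts = list(pool.map(lambda chunk: score_partition(chunk, annotations), chunks))
    return [row for part in parts for row in part]

# Hypothetical toy data.
annotations = {
    "geneA": {"GO:1", "GO:2"},
    "geneB": {"GO:2", "GO:3"},
    "geneC": {"GO:4"},
}
pairs = [("geneA", "geneB"), ("geneA", "geneC"), ("geneB", "geneC")]
scores = threaded_similarity(pairs, annotations)
```

In CPython, threads give a real speed-up mainly when the per-pair work releases the global interpreter lock or waits on I/O; for pure-Python CPU-bound measures, a process pool is the usual alternative.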
|
33
|
Luecken MD, Page MJT, Crosby AJ, Mason S, Reinert G, Deane CM. CommWalker: correctly evaluating modules in molecular networks in light of annotation bias. Bioinformatics 2019; 34:994-1000. [PMID: 29112702 PMCID: PMC5860269 DOI: 10.1093/bioinformatics/btx706] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 09/14/2016] [Accepted: 11/02/2017] [Indexed: 11/24/2022] Open
Abstract
Motivation Detecting novel functional modules in molecular networks is an important step in biological research. In the absence of gold standard functional modules, functional annotations are often used to verify whether detected modules/communities have biological meaning. However, as we show, the uneven distribution of functional annotations means that such evaluation methods favor communities of well-studied proteins. Results We propose a novel framework for the evaluation of communities as functional modules. Our proposed framework, CommWalker, takes communities as inputs and evaluates them in their local network environment by performing short random walks. We test CommWalker’s ability to overcome annotation bias using input communities from four community detection methods on two protein interaction networks. We find that modules accepted by CommWalker are similarly co-expressed as those accepted by current methods. Crucially, CommWalker performs well not only in well-annotated regions, but also in regions otherwise obscured by poor annotation. CommWalker community prioritization both faithfully captures well-validated communities and identifies functional modules that may correspond to more novel biology. Availability and implementation The CommWalker algorithm is freely available at opig.stats.ox.ac.uk/resources or as a docker image on the Docker Hub at hub.docker.com/r/lueckenmd/commwalker/. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- M D Luecken
- Department of Statistics, University of Oxford, Oxford, UK
- Doctoral Training Centre, University of Oxford, Oxford, UK
| | - M J T Page
- Department of Informatics, UCB Pharma, Slough, UK
| | - A J Crosby
- Immunology Therapeutic Area, UCB Pharma, Slough, UK
| | - S Mason
- Immunology Therapeutic Area, UCB Pharma, Slough, UK
| | - G Reinert
- Department of Statistics, University of Oxford, Oxford, UK
| | - C M Deane
- Department of Statistics, University of Oxford, Oxford, UK
- Doctoral Training Centre, University of Oxford, Oxford, UK
- To whom correspondence should be addressed.
| |
|
34
|
Acharya S, Saha S, Pradhan P. Novel symmetry-based gene-gene dissimilarity measures utilizing Gene Ontology: Application in gene clustering. Gene 2018; 679:341-351. [PMID: 30184472 DOI: 10.1016/j.gene.2018.08.062] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 06/11/2018] [Revised: 08/21/2018] [Accepted: 08/21/2018] [Indexed: 11/25/2022]
Abstract
In recent years, DNA microarray technology, which generates high-volume biological data, has gained significant attention. Clustering is a powerful tool for analyzing such high-volume gene-expression data. The efficiency of any clustering algorithm depends largely on the underlying similarity/dissimilarity measure. During the analysis of such data, there is often a need to explore the similarity of genes not only with respect to their expression values but also with respect to their functional annotations, which can be obtained from Gene Ontology (GO) databases. In the existing literature, several novel clustering and bi-clustering approaches were proposed to identify co-regulated genes from gene-expression datasets. Identifying co-regulated genes from gene expression data alone misses some important biological information about the functionalities of genes, which is necessary to identify semantically related genes. In this paper, we propose sixteen different semantic gene-gene dissimilarity measures utilizing biological information of genes retrieved from a global biological database, namely Gene Ontology (GO). Four proximity measures, viz. Euclidean, cosine, point symmetry and line symmetry, are combined with four different representations of gene-GO-term annotation vectors to develop a total of sixteen gene-gene dissimilarity measures. To illustrate the usefulness of the developed dissimilarity measures, several multi-objective and single-objective clustering algorithms are applied with the proposed measures to identify functionally similar genes from Mouse genome and Yeast datasets. Furthermore, we compare the performance of our proposed sixteen dissimilarity measures with three existing state-of-the-art semantic similarity and distance measures.
Affiliation(s)
- Sudipta Acharya
- Department of Computer Science and Engineering, IIT Patna, India.
| | - Sriparna Saha
- Department of Computer Science and Engineering, IIT Patna, India
| | - Prasanna Pradhan
- Department of Computer Applications, Sikkim Manipal Institute of Technology, India
| |
|
35
|
Ayllón-Benítez A, Mougin F, Allali J, Thiébaut R, Thébault P. A new method for evaluating the impacts of semantic similarity measures on the annotation of gene sets. PLoS One 2018; 13:e0208037. [PMID: 30481204 PMCID: PMC6258551 DOI: 10.1371/journal.pone.0208037] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 07/18/2018] [Accepted: 11/09/2018] [Indexed: 01/01/2023] Open
Abstract
MOTIVATION The recent revolution in new sequencing technologies, as a part of the continuous process of adopting new innovative protocols, has strongly impacted the interpretation of relations between phenotype and genotype. Thus, understanding the resulting gene sets has become a bottleneck that needs to be addressed. Automatic methods have been proposed to facilitate the interpretation of gene sets. While statistical functional enrichment analyses are currently well known, they tend to focus on well-known genes and to ignore new information from less-studied genes. To address such issues, applying semantic similarity measures is logical if the knowledge source used to annotate the gene sets is hierarchically structured. In this work, we propose a new method for analyzing the impact of different semantic similarity measures on gene set annotations. RESULTS We evaluated the impact of each measure by taking into consideration the following two features, which correspond to relevant criteria for a "good" synthetic gene set annotation: (i) the number of annotation terms has to be drastically reduced and the representative terms must be retained while annotating the gene set, and (ii) the number of genes described by the selected terms should be as large as possible. Thus, we analyzed nine semantic similarity measures to identify the best possible compromise between both features while maintaining a sufficient level of detail. Using Gene Ontology to annotate the gene sets, we obtained better results with node-based measures that use the terms' characteristics than with measures based on edges that link the terms. The annotation of the gene sets achieved with the node-based measures did not exhibit major differences regardless of the characteristics of terms used.
Affiliation(s)
- Aarón Ayllón-Benítez
- Univ. Bordeaux, Inserm UMR 1219, Bordeaux Population Health Research Center, team ERIAS, Bordeaux, France
- Univ. Bordeaux, CNRS UMR 5800, LaBRI, Bordeaux, France
- * E-mail: (AA); (PT)
| | - Fleur Mougin
- Univ. Bordeaux, Inserm UMR 1219, Bordeaux Population Health Research Center, team ERIAS, Bordeaux, France
- Univ. Bordeaux, CNRS UMR 5800, LaBRI, Bordeaux, France
| | - Julien Allali
- Univ. Bordeaux, CNRS UMR 5800, LaBRI, Bordeaux, France
| | - Rodolphe Thiébaut
- Univ. Bordeaux, Inserm UMR 1219, INRIA SISTM, Bordeaux, France
- CHU de Bordeaux, Pole de sante publique, Service d’information medicale, Bordeaux, France
- Vaccine Research Institute, Creteil, France
| | - Patricia Thébault
- Univ. Bordeaux, CNRS UMR 5800, LaBRI, Bordeaux, France
- * E-mail: (AA); (PT)
| |
|
36
|
PWCDA: Path Weighted Method for Predicting circRNA-Disease Associations. Int J Mol Sci 2018; 19:ijms19113410. [PMID: 30384427 PMCID: PMC6274797 DOI: 10.3390/ijms19113410] [Citation(s) in RCA: 59] [Impact Index Per Article: 9.8] [Received: 10/04/2018] [Revised: 10/25/2018] [Accepted: 10/26/2018] [Indexed: 12/22/2022] Open
Abstract
CircRNAs have a particular biological structure and have proven to play important roles in diseases. It is time-consuming and costly to identify circRNA-disease associations by biological experiments. Therefore, it is appealing to develop computational methods for predicting circRNA-disease associations. In this study, we propose a new computational path weighted method for predicting circRNA-disease associations. First, we calculate the functional similarity scores of diseases based on disease-related gene annotations and the semantic similarity scores of circRNAs based on circRNA-related gene ontology. To address missing similarity scores of diseases and circRNAs, we calculate the Gaussian Interaction Profile (GIP) kernel similarity scores for diseases and circRNAs, based on the circRNA-disease associations downloaded from the circR2Disease database (http://bioinfo.snnu.edu.cn/CircR2Disease/). Then, we integrate disease functional similarity scores and circRNA semantic similarity scores with their related GIP kernel similarity scores to construct a heterogeneous network made up of three sub-networks: a disease similarity network, a circRNA similarity network and a circRNA-disease association network. Finally, we compute an association score for each circRNA-disease pair based on the paths connecting them in the heterogeneous network to determine whether the pair is associated. We adopt leave-one-out cross-validation (LOOCV) and five-fold cross-validation to evaluate the performance of our proposed method. In addition, three common diseases, Breast Cancer, Gastric Cancer and Colorectal Cancer, are used for case studies. Experimental results illustrate the reliability and usefulness of our computational method in terms of different validation measures, which indicates PWCDA can effectively predict potential circRNA-disease associations.
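The Gaussian Interaction Profile kernel mentioned above can be illustrated with a small sketch. It assumes the commonly used formulation, in which an entity's interaction profile is its binary row (or column) of the association matrix and the bandwidth is normalized by the mean squared norm of all profiles. The abstract does not state the parameters PWCDA uses, so `gamma_prime = 1.0`, the function name `gip_kernel`, and the toy matrix are assumptions.

```python
import math

def gip_kernel(profiles, gamma_prime=1.0):
    """GIP kernel: K[i][j] = exp(-gamma * ||profile_i - profile_j||^2)."""
    n = len(profiles)
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = gamma_prime / mean_sq_norm  # bandwidth normalized by mean profile norm
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-gamma * sq_dist)
    return kernel

# Hypothetical association matrix: rows are circRNAs, columns are diseases.
associations = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
]
circrna_sim = gip_kernel(associations)
# Disease profiles are the columns of the same matrix.
disease_sim = gip_kernel([list(col) for col in zip(*associations)])
```

Because the kernel depends only on known associations, it yields a similarity value for every pair, which is why it is useful for filling in pairs that lack a functional or semantic score.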
|
37
|
GOGO: An improved algorithm to measure the semantic similarity between gene ontology terms. Sci Rep 2018; 8:15107. [PMID: 30305653 PMCID: PMC6180005 DOI: 10.1038/s41598-018-33219-y] [Citation(s) in RCA: 49] [Impact Index Per Article: 8.2] [Received: 10/31/2017] [Accepted: 09/24/2018] [Indexed: 01/29/2023] Open
Abstract
Measuring the semantic similarity between Gene Ontology (GO) terms is an essential step in functional bioinformatics research. We implemented a software tool named GOGO for calculating the semantic similarity between GO terms. GOGO has the advantages of both information-content-based and hybrid methods, such as Resnik's and Wang's methods. Moreover, GOGO is relatively fast and does not need to calculate information content (IC) from a large gene annotation corpus but still has the advantage of using IC. This is achieved by considering the number of child nodes in the GO directed acyclic graphs when calculating the semantic contribution that an ancestor node gives to its descendant nodes. GOGO can calculate functional similarities between genes and then cluster genes based on their functional similarities. Evaluations performed on multiple pathways retrieved from the Saccharomyces Genome Database (SGD) show that GOGO can accurately and robustly cluster genes based on functional similarities. We release GOGO as a web server and also as a stand-alone tool, which allows convenient execution of the tool for a small number of GO terms or integration of the tool into bioinformatics pipelines for large-scale calculations. GOGO can be freely accessed or downloaded from http://dna.cs.miami.edu/GOGO/.
|
38
|
Kim J, Fischer M, Helms V. Prediction of Synergistic Toxicity of Binary Mixtures to Vibrio fischeri Based on Biomolecular Interaction Networks. Chem Res Toxicol 2018; 31:1138-1150. [DOI: 10.1021/acs.chemrestox.8b00164] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 02/02/2023]
Affiliation(s)
- Jongwoon Kim
- Environmental Safety Group, Korea Institute of Science and Technology (KIST) Europe, Campus E 7.1, 66123 Saarbruecken, Germany
| | - Max Fischer
- Environmental Safety Group, Korea Institute of Science and Technology (KIST) Europe, Campus E 7.1, 66123 Saarbruecken, Germany
- Center for Bioinformatics, Saarland University, E 2.1, 66041 Saarbruecken, Germany
| | - Volkhard Helms
- Center for Bioinformatics, Saarland University, E 2.1, 66041 Saarbruecken, Germany
| |
|
39
|
Saghaeian Jazi M, Samaei NM, Mowla SJ, Arefnezhad B, Kouhsar M. SOX2OT knockdown derived changes in mitotic regulatory gene network of cancer cells. Cancer Cell Int 2018; 18:129. [PMID: 30202240 PMCID: PMC6126007 DOI: 10.1186/s12935-018-0618-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 01/20/2018] [Accepted: 08/14/2018] [Indexed: 01/24/2023] Open
Abstract
Background SOX2 overlapping transcript (SOX2OT) is a long non-coding RNA, over-expressed in human tumor tissues and embryonic cells. Evidence supports its function in the cell cycle; however, there is no clear mechanism explaining its function in the regulation of cell proliferation. Here we investigated the cancer cell response to SOX2OT knockdown by RNA sequencing. Methods SOX2OT expression was inhibited by siRNA in two cancer cell lines (A549, U-87 MG), then the RNA of treated cells was used for cDNA library synthesis and RNA sequencing. The differentially expressed genes were used for functional enrichment, and the gene expression network was analyzed to find the biological process most relevant to SOX2OT function. Furthermore, the expression change of candidate genes was measured by qRT-PCR for further confirmation, and the cell cycle was monitored by PI staining. Results Our findings showed that SOX2OT knockdown broadly affects cellular gene expression, with enrichment of cell proliferation and development biological processes. In particular, the expression of cell cycle and mitotic regulatory genes, including CDK2, CDK2AP2, and ACTR3, and of chromosome structure associated genes such as SMC4, INCENP and GNL3L, is changed in treated cancer cells. Conclusion Our results suggest an association of SOX2OT with cell cycle and mitosis regulation in cancer cells. Electronic supplementary material The online version of this article (10.1186/s12935-018-0618-8) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Marie Saghaeian Jazi
- Metabolic Disorders Research Center, Golestan University of Medical Sciences, Gorgan, Iran
| | - Nader Mansour Samaei
- Stem Cell Research Center, Golestan University of Medical Sciences, PO Box: 4934174611, Gorgan, Iran
| | - Seyed Javad Mowla
- Department of Molecular Genetics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran
| | | | - Morteza Kouhsar
- Laboratory of System Biology and Bioinformatics (LBB), University of Tehran, Institute of Biochemistry and Biophysics, Tehran, Iran
| |
|
40
|
Zhang J, Jia K, Jia J, Qian Y. An improved approach to infer protein-protein interaction based on a hierarchical vector space model. BMC Bioinformatics 2018; 19:161. [PMID: 29699476 PMCID: PMC5921294 DOI: 10.1186/s12859-018-2152-z] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 02/15/2017] [Accepted: 04/09/2018] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Comparing and classifying functions of gene products are important in today's biomedical research. The semantic similarity derived from the Gene Ontology (GO) annotation has been regarded as one of the most widely used indicators for protein interaction. Among the various approaches proposed, those based on the vector space model are relatively simple, but their effectiveness is far from satisfactory. RESULTS We propose a Hierarchical Vector Space Model (HVSM) for computing semantic similarity between different genes or their products, which enhances the basic vector space model by introducing the relation between GO terms. Besides the directly annotated terms, HVSM also takes their ancestors and descendants related by "is_a" and "part_of" relations into account. Moreover, HVSM introduces the concept of a Certainty Factor to calibrate the semantic similarity based on the number of terms annotated to genes. To assess the performance of our method, we applied HVSM to Homo sapiens and Saccharomyces cerevisiae protein-protein interaction datasets. Compared with TCSS, Resnik, and other classic similarity measures, HVSM achieved significant improvement in distinguishing positive from negative protein interactions. We also tested its correlation with sequence, EC, and Pfam similarity using the online tool CESSM. CONCLUSIONS HVSM showed an improvement of up to 4% compared to TCSS, 8% compared to IntelliGO, 12% compared to basic VSM, 6% compared to Resnik, 8% compared to Lin, 11% compared to Jiang, 8% compared to Schlicker, and 11% compared to SimGIC using AUC scores. The CESSM test showed HVSM was comparable to SimGIC, and superior to all other similarity measures in CESSM as well as TCSS. Supplementary information and the software are available at https://github.com/kejia1215/HVSM.
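Why bringing related GO terms into a vector space model helps can be seen in a small sketch. It is not the published HVSM: it uses unweighted binary vectors, expands annotations with ancestors only, and omits the Certainty Factor. The toy ontology `PARENTS` and the function names are hypothetical.

```python
import math

# Hypothetical toy ontology: term -> parent terms (is_a / part_of edges).
PARENTS = {
    "GO:C": {"GO:A"},
    "GO:D": {"GO:A"},
    "GO:E": {"GO:B"},
    "GO:A": {"GO:ROOT"},
    "GO:B": {"GO:ROOT"},
}

def with_ancestors(terms):
    """Return the annotated terms plus all of their ancestors."""
    seen = set(terms)
    stack = list(terms)
    while stack:
        for parent in PARENTS.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def cosine(terms_a, terms_b):
    """Cosine similarity of two binary term vectors, given as sets."""
    if not terms_a or not terms_b:
        return 0.0
    return len(terms_a & terms_b) / math.sqrt(len(terms_a) * len(terms_b))

gene1 = {"GO:C"}
gene2 = {"GO:D"}
basic = cosine(gene1, gene2)
hierarchical = cosine(with_ancestors(gene1), with_ancestors(gene2))
```

The basic model sees no overlap between sibling terms, while the expanded vectors share the parent and the root. A practical measure would weight shared terms, since the root is shared by every pair.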
Affiliation(s)
- Jiongmin Zhang
- Department of Computer Science & Technology, East China Normal University, North Zhongshan Road, Shanghai, 200062 China
| | - Ke Jia
- Department of Computer Science & Technology, East China Normal University, North Zhongshan Road, Shanghai, 200062 China
| | - Jinmeng Jia
- School of life science, East China Normal University, Dongchuan Road, Shanghai, 200241 China
| | - Ying Qian
- Department of Computer Science & Technology, East China Normal University, North Zhongshan Road, Shanghai, 200062 China
| |
|
41
|
Mehrotra P, Ami VKG, Srinivasan N. Clustering of multi-domain protein sequences. Proteins 2018; 86:759-776. [PMID: 29675880 DOI: 10.1002/prot.25510] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 10/03/2017] [Revised: 04/09/2018] [Accepted: 04/16/2018] [Indexed: 11/06/2022]
Abstract
The overall function of a multi-domain protein is determined by the functional and structural interplay of its constituent domains. Traditional sequence alignment-based methods commonly utilize domain-level information and provide classification only at the level of domains. Such methods cannot take into account the contributions of other domains and of domain-linker regions in the protein, and thus cannot classify multi-domain proteins. An alignment-free protein sequence comparison tool, CLAP (CLAssification of Proteins), was previously developed in our laboratory specifically to handle multi-domain protein sequences without requiring the definition of domain boundaries or the sequential order of domains. Through this method we aim to achieve a biologically meaningful classification scheme for multi-domain protein sequences. In this article, CLAP-based classification has been explored on 5 datasets of multi-domain proteins, and we present a detailed analysis for proteins containing (1) Tyrosine phosphatase and (2) SH3 domain. At the domain level, the CLAP-based classification scheme resulted in a clustering similar to that obtained from an alignment-based method. CLAP-based clusters obtained for full-length datasets were shown to comprise proteins with similar functions and domain architectures. Our study demonstrates that multi-domain proteins can be classified effectively by considering full-length sequences without requiring identification of domains in the sequence.
Affiliation(s)
- Prachi Mehrotra
  - Indian Institute of Science Mathematics Initiative, Bangalore, 560012, India
  - Molecular Biophysics Unit, Indian Institute of Science, Bangalore, 560012, India
- Vimla Kany G Ami
  - Institute of Bioinformatics and Applied Biotechnology, Bangalore, 560100, India
|
42
|
Zhao Y, Fu G, Wang J, Guo M, Yu G. Gene function prediction based on Gene Ontology Hierarchy Preserving Hashing. Genomics 2018; 111:334-342. [PMID: 29477548 DOI: 10.1016/j.ygeno.2018.02.008] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 11/20/2017] [Revised: 02/02/2018] [Accepted: 02/16/2018] [Indexed: 12/27/2022]
Abstract
Gene Ontology (GO) uses structured vocabularies (or terms) to describe the molecular functions, biological roles, and cellular locations of gene products in a hierarchical ontology. GO annotations associate genes with GO terms and indicate that the given gene products carry out the biological functions described by the relevant terms. However, predicting correct GO annotations for genes from the massive set of terms defined by GO is a difficult challenge. To address this challenge, we introduce a Gene Ontology Hierarchy Preserving Hashing (HPHash) based semantic method for gene function prediction. HPHash first measures the taxonomic similarity between GO terms. It then uses a hierarchy preserving hashing technique to keep the hierarchical order between GO terms, and to optimize a series of hashing functions that encode massive GO terms via compact binary codes. After that, HPHash utilizes these hashing functions to project the gene-term association matrix into a low-dimensional one and performs semantic similarity based gene function prediction in the low-dimensional space. Experimental results on three model species (Homo sapiens, Mus musculus and Rattus norvegicus) for interspecies gene function prediction show that HPHash performs better than other related approaches and that it is robust to the number of hash functions. In addition, we also take HPHash as a plugin for BLAST based gene function prediction. In these experiments, HPHash again significantly improves the prediction performance. The code of HPHash is available at: http://mlda.swu.edu.cn/codes.php?name=HPHash.
Affiliation(s)
- Yingwen Zhao
  - College of Computer and Information Science, Southwest University, Chongqing 400715, China
- Guangyuan Fu
  - College of Computer and Information Science, Southwest University, Chongqing 400715, China
- Jun Wang
  - College of Computer and Information Science, Southwest University, Chongqing 400715, China
- Maozu Guo
  - School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
  - Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing 100044, China
- Guoxian Yu
  - College of Computer and Information Science, Southwest University, Chongqing 400715, China
|
43
|
Rodríguez-García MÁ, Hoehndorf R. Inferring ontology graph structures using OWL reasoning. BMC Bioinformatics 2018; 19:7. [PMID: 29304741 PMCID: PMC5756413 DOI: 10.1186/s12859-017-1999-8] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 05/24/2017] [Accepted: 12/13/2017] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND Ontologies are representations of a conceptualization of a domain. Traditionally, ontologies in biology were represented as directed acyclic graphs (DAG) which represent the backbone taxonomy and additional relations between classes. These graphs are widely exploited for data analysis in the form of ontology enrichment or computation of semantic similarity. More recently, ontologies have been developed in a formal language such as the Web Ontology Language (OWL) and consist of a set of axioms through which classes are defined or constrained. While the taxonomy of an ontology can be inferred directly from its axioms as one of the standard OWL reasoning tasks, creating general graph structures from OWL ontologies that exploit the ontologies' semantic content remains a challenge. RESULTS We developed a method to transform ontologies into graphs using an automated reasoner while taking into account all relations between classes. By searching for (existential) patterns in the deductive closure of ontologies, we can identify relations between classes that are implied but not asserted and generate graph structures that encode a large part of the ontologies' semantic content. We demonstrate the advantages of our method by applying it to the inference of protein-protein interactions through semantic similarity over the Gene Ontology, and show that performance increases when graph structures are inferred using deductive inference according to our method. Our software and experiment results are available at http://github.com/bio-ontology-research-group/Onto2Graph. CONCLUSIONS Onto2Graph is a method to generate graph structures from OWL ontologies using automated reasoning. The resulting graphs can be used for improved ontology visualization and ontology-based data analysis.
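The idea of relations that are "implied but not asserted" can be illustrated with a small sketch. It applies a single hand-written rule (a class inherits every existential restriction asserted on its superclasses) to toy axioms; it does not use an OWL reasoner and is not the Onto2Graph implementation. The class names, data layout, and function names are all hypothetical.

```python
from collections import defaultdict

# Toy axioms (hypothetical class names, not taken from any real ontology).
subclass_of = {            # asserted "X SubClassOf Y"
    "Mitochondrion": ["Organelle"],
    "Organelle": ["CellPart"],
}
some_values = [            # asserted "X SubClassOf (relation some Y)"
    ("CellPart", "part_of", "Cell"),
]


def ancestors(cls, parents):
    """All superclasses of cls reachable through asserted subclass axioms."""
    seen, stack = set(), list(parents.get(cls, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def inferred_edges(parents, restrictions):
    """Edges (X, relation, Y) that hold for X only because a superclass asserts them."""
    by_class = defaultdict(list)
    for cls, rel, filler in restrictions:
        by_class[cls].append((rel, filler))
    classes = set(parents) | {p for ps in parents.values() for p in ps}
    edges = set()
    for cls in classes:
        for sup in ancestors(cls, parents):
            for rel, filler in by_class[sup]:
                edges.add((cls, rel, filler))
    return edges


edges = inferred_edges(subclass_of, some_values)
```

In this toy case the only asserted relation is on `CellPart`, yet the resulting graph also links `Organelle` and `Mitochondrion` to `Cell` through `part_of`. A real reasoner would cover many more entailment patterns than this one rule.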
Affiliation(s)
- Miguel Ángel Rodríguez-García
  - Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955-6900 Kingdom of Saudi Arabia
  - Computer, Electrical and Mathematical Sciences & Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955-6900 Kingdom of Saudi Arabia
- Robert Hoehndorf
  - Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955-6900 Kingdom of Saudi Arabia
  - Computer, Electrical and Mathematical Sciences & Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955-6900 Kingdom of Saudi Arabia
|
44
|
Mazandu GK, Chimusa ER, Mulder NJ. Gene Ontology semantic similarity tools: survey on features and challenges for biological knowledge discovery. Brief Bioinform 2017; 18:886-901. [PMID: 27473066 DOI: 10.1093/bib/bbw067] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Received: 01/27/2016] [Indexed: 01/02/2023] Open
Abstract
Gene Ontology (GO) semantic similarity tools enable retrieval of semantic similarity scores, which incorporate biological knowledge embedded in the GO structure for comparing or classifying different proteins or lists of proteins based on their GO annotations. This facilitates a better understanding of the biological phenomena underlying the corresponding experiment and enables the identification of processes pertinent to different biological conditions. Currently, about 14 tools are available, which may play an important role in improving protein analyses at the functional level using different GO semantic similarity measures. Here we survey these tools to provide a comprehensive view of the challenges and advances made in this area, so as to avoid redundant effort in developing features that already exist or implementing ideas already proven to be obsolete in the context of GO. This helps researchers, tool developers, and end users understand the underlying semantic similarity measures implemented, through knowledge of the pertinent features of, and issues related to, a particular tool. This should empower users to make appropriate choices for their biological applications and ensure effective knowledge discovery based on GO annotations.
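To make the notion of a GO semantic similarity measure concrete, the sketch below computes Resnik's information-content measure, one well-known member of this family, on a toy hierarchy. The abstract does not say which measures the surveyed tools implement, so the choice of Resnik is an assumption; the term names, annotation counts, and function names are likewise hypothetical.

```python
import math

# Toy is-a hierarchy (child -> parents) and annotation counts; all hypothetical.
parents = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
}
direct_counts = {"root": 0, "binding": 2, "catalysis": 8, "dna_binding": 5, "rna_binding": 5}


def ancestors_or_self(term):
    """The term itself plus every term reachable by following parent links."""
    result, stack = {term}, list(parents[term])
    while stack:
        node = stack.pop()
        if node not in result:
            result.add(node)
            stack.extend(parents[node])
    return result


def information_content():
    """IC(t) = -log(p(t)), where p(t) counts annotations to t or any descendant."""
    cumulative = dict.fromkeys(parents, 0)
    for term, count in direct_counts.items():
        for anc in ancestors_or_self(term):
            cumulative[anc] += count
    total = cumulative["root"]
    return {t: -math.log(c / total) for t, c in cumulative.items() if c > 0}


def resnik(t1, t2, ic):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors_or_self(t1) & ancestors_or_self(t2)
    return max(ic[t] for t in common)


ic = information_content()
sim_siblings = resnik("dna_binding", "rna_binding", ic)
sim_distant = resnik("dna_binding", "catalysis", ic)
```

The two sibling terms share the informative ancestor `binding` and so score above zero, while terms that meet only at the root score zero, since the root carries no information.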
|
45
|
HashGO: hashing gene ontology for protein function prediction. Comput Biol Chem 2017; 71:264-273. [DOI: 10.1016/j.compbiolchem.2017.09.010] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 09/07/2017] [Accepted: 09/25/2017] [Indexed: 10/18/2022]
|
46
|
Acharya S, Saha S, Nikhil N. Unsupervised gene selection using biological knowledge: application in sample clustering. BMC Bioinformatics 2017; 18:513. [PMID: 29166852 PMCID: PMC5700545 DOI: 10.1186/s12859-017-1933-0] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 08/08/2017] [Accepted: 11/08/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Classification of biological samples from gene expression data is a basic building block in solving several problems in bioinformatics, such as diagnosing cancer and other diseases and devising a proper treatment plan. One big challenge in sample classification is handling high-dimensional and redundant gene expression data. Gene/feature selection plays a major role in reducing the complexity of handling such high-dimensional data. RESULTS The current paper explores the use of biological knowledge acquired from the Gene Ontology database in selecting a proper subset of genes, which can then be used in clustering of samples. The proposed feature selection technique is unsupervised in nature, as it does not utilize any class label information in the process of gene selection. Finally, a multi-objective clustering approach is deployed to cluster the available set of samples in the reduced gene space. CONCLUSIONS Reported results show that considering biological knowledge in the gene selection technique not only reduces the feature space dimensionality to a great extent but also improves the accuracy of sample classification. The obtained reduced gene space is validated using strong biological significance tests. To demonstrate the advantage of the proposed gene selection based sample clustering technique, a thorough comparative analysis has also been performed with state-of-the-art techniques.
Affiliation(s)
- Sudipta Acharya
  - IIT Patna, Department of Computer Science and Engineering, Patna, India
- Sriparna Saha
  - IIT Patna, Department of Computer Science and Engineering, Patna, India
- N Nikhil
  - IIT Ropar, Department of Computer Science and Engineering, Punjab, India
|
47
|
Yea SJ, Kim BY, Kim C, Yi MY. A framework for the targeted selection of herbs with similar efficacy by exploiting drug repositioning technique and curated biomedical knowledge. Journal of Ethnopharmacology 2017; 208:117-128. [PMID: 28687508 DOI: 10.1016/j.jep.2017.06.048] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/28/2017] [Revised: 06/27/2017] [Accepted: 06/27/2017] [Indexed: 06/07/2023]
Abstract
ETHNOPHARMACOLOGICAL RELEVANCE Plants have been the most important natural resources for traditional medicine and for the modern pharmaceutical industry. There is demand for finding alternative medicinal herbs with similar efficacy. Due to the very low probability of discovering useful compounds by random screening, researchers have advocated using targeted selection approaches. Furthermore, because drug repositioning can speed up the process of drug development, an integrated technique that exploits chemical, genetic, and disease information has recently been developed. Building upon these findings, in this paper we propose a novel framework for the targeted selection of herbs with similar efficacy by exploiting the drug repositioning technique and curated modern scientific biomedical knowledge, with the goal of improving the possibility of inferring the traditional empirical ethnopharmacological knowledge. MATERIALS AND METHODS To rank candidate herbs on the basis of their similarity to a target herb, we proposed and evaluated a framework that comprises the following four layers: links, extract, similarity, and model. In the framework, multiple databases are linked to build an herb-compound-protein-disease network, composed of one tripartite network and two bipartite networks, allowing comprehensive and detailed information to be extracted. Further, various similarity scores between herbs are calculated, and prediction models are then trained and tested on the basis of these similarity features. RESULTS The proposed framework has been found to be feasible in terms of link loss. Out of the 50 similarities, the best one enhanced the performance of ranking herbs with similar efficacy by about 120-320% compared with our previous study. Also, the prediction model showed improved performance by about 180-480%. While building the prediction model, we identified the compound information as being the most important knowledge source and structural similarity as the most useful measure. CONCLUSIONS In the proposed framework, we took the knowledge of herbal medicine, chemistry, biology, and medicine into consideration to rank herbs with similar efficacy among the candidates. The experimental results demonstrated that the framework outperformed the baselines and identified the important knowledge source and useful similarity measure.
Affiliation(s)
- Sang-Jun Yea
  - Graduate School of Knowledge Service Engineering, Korea Advanced Institute of Science and Technology, Republic of Korea
  - K-herb Research Center, Korea Institute of Oriental Medicine, Republic of Korea
- Bu-Yeo Kim
  - KM Convergence Research Division, Korea Institute of Oriental Medicine, Republic of Korea
- Chul Kim
  - K-herb Research Center, Korea Institute of Oriental Medicine, Republic of Korea
- Mun Yong Yi
  - Graduate School of Knowledge Service Engineering, Korea Advanced Institute of Science and Technology, Republic of Korea
|
48
|
Yu G, Lu C, Wang J. NoGOA: predicting noisy GO annotations using evidences and sparse representation. BMC Bioinformatics 2017; 18:350. [PMID: 28732468 PMCID: PMC5521088 DOI: 10.1186/s12859-017-1764-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 05/02/2017] [Accepted: 07/14/2017] [Indexed: 01/11/2023] Open
Abstract
BACKGROUND Gene Ontology (GO) is a community effort to represent functional features of gene products. GO annotations (GOA) provide functional associations between GO terms and gene products. Owing to limited resources, only a small portion of annotations are manually checked by curators, and the others are electronically inferred. Although quality control techniques have been applied to ensure the quality of annotations, the community consistently reports that there are still considerable noisy (or incorrect) annotations. Given the wide application of annotations, how to identify noisy annotations is an important yet seldom studied open problem. RESULTS We introduce a novel approach called NoGOA to predict noisy annotations. First, NoGOA applies sparse representation on the gene-term association matrix to reduce the impact of noisy annotations, and takes advantage of the sparse representation coefficients to measure the semantic similarity between genes. Second, it preliminarily predicts noisy annotations of a gene based on aggregated votes from the semantic neighborhood genes of that gene. Next, NoGOA estimates the ratio of noisy annotations for each evidence code based on direct annotations in GOA files archived at different periods, and then weights entries of the association matrix via the estimated ratios and propagates weights to ancestors of direct annotations using the GO hierarchy. Finally, it integrates the evidence-weighted association matrix and the aggregated votes to predict noisy annotations. Experiments on archived GOA files of six model species (H. sapiens, A. thaliana, S. cerevisiae, G. gallus, B. taurus and M. musculus) demonstrate that NoGOA achieves significantly better results than other related methods and that removing noisy annotations improves the performance of gene function prediction. CONCLUSIONS The comparative study justifies the effectiveness of integrating evidence codes with sparse representation for predicting noisy GO annotations. Codes and datasets are available at http://mlda.swu.edu.cn/codes.php?name=NoGOA.
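The neighbourhood-voting step described above can be sketched on a toy gene-term matrix. This is only an illustration under simplifying assumptions: plain cosine similarity stands in for the sparse representation coefficients, the evidence-code weighting and ancestor propagation are omitted, and the data, the neighbourhood size k = 2, and the 0.5 threshold are all hypothetical.

```python
import math

# Toy gene-term annotation matrix (rows: genes, columns: GO terms); hypothetical data.
genes = ["g1", "g2", "g3", "g4"]
terms = ["t1", "t2", "t3"]
A = [
    [1, 1, 0],   # g1
    [1, 1, 0],   # g2
    [1, 1, 1],   # g3: t3 is not held by its most similar genes
    [0, 0, 1],   # g4
]


def cosine(u, v):
    """Cosine similarity between two annotation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def neighbor_votes(matrix, k=2):
    """For each gene, similarity-weighted share of its k nearest genes holding each term."""
    votes = []
    for i, row in enumerate(matrix):
        sims = sorted(
            ((cosine(row, other), j) for j, other in enumerate(matrix) if j != i),
            reverse=True,
        )[:k]
        weight = sum(s for s, _ in sims)
        votes.append([
            sum(s * matrix[j][t] for s, j in sims) / weight if weight else 0.0
            for t in range(len(row))
        ])
    return votes


votes = neighbor_votes(A)
# Annotations a gene holds but its neighbours do not vote for are flagged as candidates.
candidates = [
    (genes[i], terms[t])
    for i, row in enumerate(A)
    for t, has in enumerate(row)
    if has and votes[i][t] < 0.5
]
```

Here the annotation of `g3` with `t3` is the only one flagged, because neither of its two nearest neighbours carries that term.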
Affiliation(s)
- Guoxian Yu
  - College of Computer and Information Sciences, Southwest University, Chongqing, China
- Chang Lu
  - College of Computer and Information Sciences, Southwest University, Chongqing, China
- Jun Wang
  - College of Computer and Information Sciences, Southwest University, Chongqing, China
|
49
|
Kang H, Gong Y. Developing a similarity searching module for patient safety event reporting system using semantic similarity measures. BMC Med Inform Decis Mak 2017; 17:75. [PMID: 28699567 PMCID: PMC5506579 DOI: 10.1186/s12911-017-0467-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND The most important knowledge in the field of patient safety concerns the prevention and reduction of patient safety events (PSE) during treatment and care. The similarities and patterns among the events may go unnoticed if they are not properly reported and analyzed. There is an urgent need to develop a PSE reporting system that can dynamically measure the similarities of the events and thus promote event analysis and learning. METHODS In this study, three prevailing semantic similarity algorithms were implemented to measure the similarities of 366 PSE annotated by the taxonomy of the Agency for Healthcare Research and Quality (AHRQ). The performance of each algorithm was then evaluated by a group of domain experts based on a 4-point Likert scale. The consistency between the scales of the algorithms and those of the experts was measured and compared with randomly assigned scales. The similarity algorithms and scores, as a self-learning and self-updating module, were then integrated into the system. RESULTS The results show that the similarity scores are more consistent with the experts' review than randomly assigned scores are. Moreover, incorporating the algorithms into our reporting system enables a mechanism to learn and update based upon PSE similarity. CONCLUSION Integrating semantic similarity algorithms into a PSE reporting system can help us learn from previous events and provide timely knowledge support to the reporters. With the knowledge base in the PSE domain, the new generation reporting system holds promise in educating healthcare providers and preventing the recurrence and serious consequences of PSE.
Affiliation(s)
- Hong Kang
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 7000 Fannin St., Houston, TX, 77030, USA
- Yang Gong
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 7000 Fannin St., Houston, TX, 77030, USA
|
50
|
Malod-Dognin N, Pržulj N. Omics Data Complementarity Underlines Functional Cross-Communication in Yeast. J Integr Bioinform 2017; 14. [PMID: 28600905 PMCID: PMC6042824 DOI: 10.1515/jib-2017-0018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2017] [Accepted: 04/18/2017] [Indexed: 11/26/2022] Open
Abstract
Mapping the complete functional layout of a cell and understanding the cross-talk between different processes are fundamental challenges. They elude us because of the incompleteness and noisiness of molecular data and because of the computational intractability of finding the exact answer. We perform a simple integration of three types of baker’s yeast omics data to elucidate the functional organization and lines of cross-functional communication. We examine protein–protein interaction (PPI), co-expression (COEX) and genetic interaction (GI) data, and explore their relationship with the gold standard of functional organization, the Gene Ontology (GO). We utilize a simple framework that identifies functional cross-communication lines in each of the three data types, in GO, and collectively in the integrated model of the three omics data types; we present each of them in our new Functional Organization Map (FOM) model. We compare the FOMs of the three omics datasets with the FOM of GO and find that GI is in best agreement with GO, followed by COEX and PPI. We integrate the three FOMs into a unified FOM and find that it is in better agreement with the FOM of GO than that of any single omics dataset, demonstrating the functional complementarity of different omics data.
|