1. He F, Liu K, Yang Z, Chen Y, Hammer RD, Xu D, Popescu M. pathCLIP: Detection of Genes and Gene Relations From Biological Pathway Figures Through Image-Text Contrastive Learning. IEEE J Biomed Health Inform 2024; 28:5007-5019. [PMID: 38568768 DOI: 10.1109/jbhi.2024.3383610] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/05/2024]
Abstract
In biomedical literature, biological pathways are commonly described through a combination of images and text. These pathways contain valuable information, including genes and their relationships, which provide insight into biological mechanisms and precision medicine. Curating pathway information across the literature enables the integration of this information to build a comprehensive knowledge base. While some studies have extracted pathway information from images and text independently, they often overlook the correspondence between the two modalities. In this paper, we present a pathway figure curation system named pathCLIP for identifying genes and gene relations from pathway figures. Our key innovation is the use of an image-text contrastive learning model to learn coordinated embeddings of image snippets and text descriptions of genes and gene relations, thereby improving curation. Our validation results, using pathway figures from PubMed, showed that our multimodal model outperforms models using only a single modality. Additionally, our system effectively curates genes and gene relations from multiple literature sources. Two case studies on extracting pathway information from literature of non-small cell lung cancer and Alzheimer's disease further demonstrate the usefulness of our curated pathway information in enhancing related pathways in the KEGG database.
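The abstract names image-text contrastive learning but does not state the loss. As a rough illustration only, the sketch below uses the symmetric cross-entropy objective popularized by CLIP, which is an assumption here and not necessarily what pathCLIP implements. The function names, the temperature of 0.07, and the toy 2-D embeddings are all hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch of matched pairs.

    Assumption: image_embs[k] and text_embs[k] describe the same gene or
    gene relation; every other pairing in the batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j]: cosine similarity of image i and text j, scaled by temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch with made-up embeddings: matched pairs versus swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[0.9, 0.1], [0.1, 0.9]])
shuffled = contrastive_loss(images, [[0.1, 0.9], [0.9, 0.1]])
```

The loss is small when each image snippet is closest to its own text description and large when the pairs are swapped, which is the property that pushes the two modalities into a coordinated embedding space.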
2. Cousins H, Hall T, Guo Y, Tso L, Tzeng KTH, Cong L, Altman RB. Gene set proximity analysis: expanding gene set enrichment analysis through learned geometric embeddings, with drug-repurposing applications in COVID-19. Bioinformatics 2023; 39:btac735. [PMID: 36394254 PMCID: PMC9805577 DOI: 10.1093/bioinformatics/btac735] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2022] [Revised: 09/27/2022] [Accepted: 11/16/2022] [Indexed: 11/18/2022]
Abstract
MOTIVATION: Gene set analysis methods rely on knowledge-based representations of genetic interactions in the form of both gene set collections and protein-protein interaction (PPI) networks. However, explicit representations of genetic interactions often fail to capture complex interdependencies among genes, limiting the analytic power of such methods.
RESULTS: We propose an extension of gene set enrichment analysis to a latent embedding space reflecting PPI network topology, called gene set proximity analysis (GSPA). Compared with existing methods, GSPA provides improved ability to identify disease-associated pathways in disease-matched gene expression datasets, while improving reproducibility of enrichment statistics for similar gene sets. GSPA is statistically straightforward, reducing to a version of traditional gene set enrichment analysis through a single user-defined parameter. We apply our method to identify novel drug associations with SARS-CoV-2 viral entry. Finally, we validate our drug association predictions through retrospective clinical analysis of claims data from 8 million patients, supporting a role for gabapentin as a risk factor and metformin as a protective factor for severe COVID-19.
AVAILABILITY AND IMPLEMENTATION: GSPA is available for download as a command-line Python package at https://github.com/henrycousins/gspa.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Henry Cousins
  - Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA 94305, USA
- Taryn Hall
  - Optum Labs at UnitedHealth Group, Minneapolis, MN 55343, USA
- Yinglong Guo
  - Optum Labs at UnitedHealth Group, Minneapolis, MN 55343, USA
- Luke Tso
  - Optum Labs at UnitedHealth Group, Minneapolis, MN 55343, USA
- Kathy T H Tzeng
  - Optum Labs at UnitedHealth Group, Minneapolis, MN 55343, USA
- Le Cong
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
  - Department of Pathology, Stanford University School of Medicine, Stanford, CA 94305, USA
- Russ B Altman
  - Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA 94305, USA
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
  - Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
  - Department of Bioengineering, Stanford University, Stanford, CA 94305, USA
3. Wang Y, Sun Z, He Q, Li J, Ni M, Yang M. Self-supervised graph representation learning integrates multiple molecular networks and decodes gene-disease relationships. Patterns (New York, N.Y.) 2022; 4:100651. [PMID: 36699743 PMCID: PMC9868676 DOI: 10.1016/j.patter.2022.100651] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2022] [Revised: 05/19/2022] [Accepted: 11/07/2022] [Indexed: 12/12/2022]
Abstract
Leveraging molecular networks to discover disease-relevant modules is a long-standing challenge. With the accumulation of interactomes, there is a pressing need for powerful computational approaches to handle the inevitable noise and context-specific nature of biological networks. Here, we introduce Graphene, a two-step self-supervised representation learning framework tailored to concisely integrate multiple molecular networks and adapted to gene functional analysis via downstream re-training. In practice, we first leverage GNN (graph neural network) pre-training techniques to obtain initial node embeddings, followed by re-training Graphene using a graph attention architecture, achieving superior performance over competing methods for pathway gene recovery, disease gene reprioritization, and comorbidity prediction. Graphene successfully recapitulates tissue-specific gene expression across the disease spectrum and demonstrates shared heritability of common mental disorders. Graphene can be updated with new interactomes or other omics features. Graphene holds promise to decipher gene function in a network context and refine GWAS (genome-wide association study) hits, and offers mechanistic insights via decoding diseases from genome to networks to phenotypes.
Affiliation(s)
- Yi Wang
  - MGI, BGI-Shenzhen, Shenzhen, China
- Zijun Sun
  - Computer Center, Peking University, Beijing, China
- Jiwei Li
  - Department of Computer Science, Zhejiang University, Hangzhou, China
- Ming Ni
  - MGI, BGI-Shenzhen, Shenzhen, China
  - MGI-QingDao, BGI-Shenzhen, Qingdao, China
- Meng Yang
  - MGI, BGI-Shenzhen, Shenzhen, China
  - Corresponding author
4. Ghandikota S, Jegga AG. gene2gauss: A multi-view gaussian gene embedding learner for analyzing transcriptomic networks. AMIA ... Annual Symposium Proceedings. AMIA Symposium 2022; 2022:206-215. [PMID: 35854722 PMCID: PMC9285176] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/01/2023]
Abstract
Analyzing gene co-expression networks can help in the discovery of biological processes and regulatory mechanisms underlying normal or perturbed states. Unlike standard differential analysis, network-based approaches consider the interactions between the genes involved, leading to biologically relevant results. Applying such network-based methods to jointly analyze multiple transcriptomic networks representing independent disease cohorts or studies could lead to the identification of more robust gene modules or gene regulatory networks. We present gene2gauss, a novel feature learning framework that is capable of embedding genes as multivariate Gaussian distributions by taking into account their long-range interaction neighborhoods across multiple transcriptomic studies. Using multiple gene co-expression networks from idiopathic pulmonary fibrosis, we demonstrate that these multi-dimensional Gaussian features are suitable for identifying regulons of known transcription factors (TFs). Using standard TF-target libraries, we demonstrate that the features from our method are highly relevant in comparison with other feature learning approaches on transcriptomic data.
Affiliation(s)
- Sudhir Ghandikota
  - Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, USA
  - Department of Electrical Engineering and Computer Science, University of Cincinnati College of Engineering, Cincinnati, Ohio, USA
- Anil G Jegga
  - Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, USA
  - Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, Ohio, USA
5. Fernández-Torras A, Comajuncosa-Creus A, Duran-Frigola M, Aloy P. Connecting chemistry and biology through molecular descriptors. Curr Opin Chem Biol 2021; 66:102090. [PMID: 34626922 DOI: 10.1016/j.cbpa.2021.09.001] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 07/11/2021] [Revised: 08/23/2021] [Accepted: 09/03/2021] [Indexed: 01/14/2023]
Abstract
Through the representation of small molecule structures as numerical descriptors and the exploitation of the similarity principle, chemoinformatics has made paramount contributions to drug discovery, from unveiling mechanisms of action and repurposing approved drugs to de novo crafting of molecules with desired properties and tailored targets. Yet, the inherent complexity of biological systems has fostered the implementation of large-scale experimental screenings seeking a deeper understanding of the targeted proteins, the disrupted biological processes and the systemic responses of cells to chemical perturbations. After this wealth of data, a new generation of data-driven descriptors has arisen providing a rich portrait of small molecule characteristics that goes beyond chemical properties. Here, we give an overview of biologically relevant descriptors, covering chemical compounds, proteins and other biological entities, such as diseases and cell lines, while aligning them to the major contributions in the field from disciplines, such as natural language processing or computer vision. We now envision a new scenario for chemical and biological entities where they both are translated into a common numerical format. In this computational framework, complex connections between entities can be unveiled by means of simple arithmetic operations, such as distance measures, additions, and subtractions.
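The closing idea of this abstract, that relations between entities can be read off a shared embedding space with distance measures, additions, and subtractions, can be sketched with a few lines of vector arithmetic. Everything below is illustrative: the entity names and 3-D vectors are invented, and cosine distance is only one of several possible distance measures.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def add(u, v):
    """Element-wise sum of two embeddings."""
    return [a + b for a, b in zip(u, v)]

def subtract(u, v):
    """Element-wise difference of two embeddings."""
    return [a - b for a, b in zip(u, v)]

def nearest(query, embeddings, exclude=()):
    """Name of the entity whose embedding is closest to the query vector."""
    candidates = {k: v for k, v in embeddings.items() if k not in exclude}
    return min(candidates, key=lambda name: cosine_distance(query, candidates[name]))

# Hypothetical entities placed in one common numerical format (values made up).
embeddings = {
    "drug_A": [1.0, 0.2, 0.0],
    "target_A": [0.9, 0.0, 0.1],
    "target_B": [0.0, 0.9, 0.1],
    "drug_B": [0.1, 1.1, 0.0],
    "cell_line_X": [0.0, 0.1, 1.0],
}

# Analogy-style query: "drug_A is to target_A as ? is to target_B".
query = add(subtract(embeddings["drug_A"], embeddings["target_A"]),
            embeddings["target_B"])
match = nearest(query, embeddings, exclude=("drug_A", "target_A", "target_B"))
```

In this toy setup the vectors are chosen so that the query lands on `drug_B`; with real descriptors the nearest neighbour would only be a candidate to examine, not a confirmed relationship.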
Affiliation(s)
- Adrià Fernández-Torras
  - Joint IRB-BSC-CRG Program in Computational Biology, Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain
- Arnau Comajuncosa-Creus
  - Joint IRB-BSC-CRG Program in Computational Biology, Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain
- Miquel Duran-Frigola
  - Joint IRB-BSC-CRG Program in Computational Biology, Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain
  - Ersilia Open Source Initiative, Cambridge, United Kingdom
- Patrick Aloy
  - Joint IRB-BSC-CRG Program in Computational Biology, Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain
  - Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Catalonia, Spain
6.
Abstract
Gene expression signatures (GES) connect phenotypes to differential messenger RNA (mRNA) expression of genes, providing a powerful approach to define cellular identity, function, and the effects of perturbations. The use of GES has suffered from vague assessment criteria and limited reproducibility. Because the structure of proteins defines the functional capability of genes, we hypothesized that enrichment of structural features could be a generalizable representation of gene sets. We derive structural gene expression signatures (sGES) using features from multiple levels of protein structure (e.g., domain and fold) encoded by the mRNAs in GES. Comprehensive analyses of data from the Genotype-Tissue Expression Project (GTEx), the all RNA-seq and ChIP-seq sample and signature search (ARCHS4) database, and mRNA expression of drug effects on cardiomyocytes show that sGES are useful for characterizing biological phenomena. sGES enable phenotypic characterization across experimental platforms, facilitate interoperability of expression datasets, and describe drug action on cells.
7. Embedding gene sets in low-dimensional space. Nat Mach Intell 2020. [DOI: 10.1038/s42256-020-0204-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022]