1
Brechtmann F, Bechtler T, Londhe S, Mertes C, Gagneur J. Evaluation of input data modality choices on functional gene embeddings. NAR Genom Bioinform 2023; 5:lqad095. [PMID: 37942285 PMCID: PMC10629286 DOI: 10.1093/nargab/lqad095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2023] [Revised: 09/07/2023] [Accepted: 09/28/2023] [Indexed: 11/10/2023]
Abstract
Functional gene embeddings, numerical vectors capturing gene function, provide a promising way to integrate functional gene information into machine learning models. These embeddings are learnt by applying self-supervised machine-learning algorithms on various data types including quantitative omics measurements, protein-protein interaction networks and literature. However, downstream evaluations comparing alternative data modalities used to construct functional gene embeddings have been lacking. Here we benchmarked functional gene embeddings obtained from various data modalities for predicting disease-gene lists, cancer drivers, phenotype-gene associations and scores from genome-wide association studies. Off-the-shelf predictors trained on precomputed embeddings matched or outperformed dedicated state-of-the-art predictors, demonstrating their high utility. Embeddings based on literature and protein-protein interactions inferred from low-throughput experiments outperformed embeddings derived from genome-wide experimental data (transcriptomics, deletion screens and protein sequence) when predicting curated gene lists. In contrast, they did not perform better when predicting genome-wide association signals and were biased towards highly-studied genes. These results indicate that embeddings derived from literature and low-throughput experiments appear favourable in many existing benchmarks because they are biased towards well-studied genes and should therefore be considered with caution. Altogether, our study and precomputed embeddings will facilitate the development of machine-learning models in genetics and related fields.
Affiliation(s)
- Felix Brechtmann
  - TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
  - Munich Center for Machine Learning, Munich, Germany
- Thibault Bechtler
  - TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Shubhankar Londhe
  - TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Christian Mertes
  - TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
  - Munich Data Science Institute, Technical University of Munich, Garching, Germany
  - Institute of Human Genetics, School of Medicine, Technical University of Munich, Munich, Germany
- Julien Gagneur
  - TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
  - Institute of Human Genetics, School of Medicine, Technical University of Munich, Munich, Germany
  - Computational Health Center, Helmholtz Center Munich, Neuherberg, Germany
2
Murakami Y, Mizuguchi K. Recent developments of sequence-based prediction of protein-protein interactions. Biophys Rev 2022; 14:1393-1411. [PMID: 36589735 PMCID: PMC9789376 DOI: 10.1007/s12551-022-01038-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Accepted: 12/08/2022] [Indexed: 12/25/2022]
Abstract
The identification of protein-protein interactions (PPIs) can lead to a better understanding of the cellular functions and biological processes of proteins and contribute to the design of drugs that target disease-causing PPIs. In addition, targeting host-pathogen PPIs is useful for elucidating infection mechanisms. Although several experimental methods have been used to identify PPIs, these methods have yet to draw complete PPI networks. Hence, computational techniques are increasingly required for the prediction of potential PPIs that have not yet been observed experimentally. Recent high-performance sequence-based methods have contributed to the construction of PPI networks and the elucidation of pathogenetic mechanisms in specific diseases. However, the usefulness of these methods depends on the quality and quantity of the PPI training data. In this brief review, we introduce currently available PPI databases and recent sequence-based methods for predicting PPIs. We also discuss key issues in this field and present future perspectives on sequence-based PPI prediction.
Affiliation(s)
- Yoichi Murakami
  - Tokyo University of Information Sciences, 4-1 Onaridai, Wakaba-Ku, Chiba, 265-8501 Japan
- Kenji Mizuguchi
  - Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita-Shi, Osaka, 565-0871 Japan
  - National Institutes of Biomedical Innovation, Health and Nutrition, 7-6-8 Saito Asagi, Ibaraki, Osaka 567-0085 Japan
3
Robin V, Bodein A, Scott-Boyer MP, Leclercq M, Périn O, Droit A. Overview of methods for characterization and visualization of a protein–protein interaction network in a multi-omics integration context. Front Mol Biosci 2022; 9:962799. [PMID: 36158572 PMCID: PMC9494275 DOI: 10.3389/fmolb.2022.962799] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2022] [Accepted: 08/16/2022] [Indexed: 11/26/2022]
Abstract
Protein–protein interactions (PPIs) play a significant role at the heart of the cellular machinery by regulating cellular functions. PPIs can be analyzed with network approaches: taken together, all PPIs form a network, and constructing that network requires prediction of the interactions. Different biases, such as lack of data, recurrence of information, and false interactions, make the network unstable. Integrated strategies help to address these challenges. These approaches have shown encouraging results for the understanding of molecular mechanisms, drug action mechanisms, and identification of target genes. To weight the importance of an interaction, it is evaluated by different confidence scores. These scores allow the network to be filtered and thus facilitate its representation, both essential steps in the identification and understanding of molecular mechanisms. In this review, we discuss the main computational methods for predicting PPIs, including those confirming an interaction, as well as the integration of PPIs into a network, and we discuss visualization of these complex data.
Affiliation(s)
- Vivian Robin
  - Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
- Antoine Bodein
  - Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
- Marie-Pier Scott-Boyer
  - Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
- Mickaël Leclercq
  - Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
- Olivier Périn
  - Digital Sciences Department, L'Oréal Advanced Research, Aulnay-sous-bois, France
- Arnaud Droit
  - Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
  - *Correspondence: Arnaud Droit,
4
Ahmed SAJA, Bapatdhar N, Kumar BP, Ghosh S, Yachie A, Palaniappan SK. Large scale text mining for deriving useful insights: A case study focused on microbiome. Front Physiol 2022; 13:933069. [PMID: 36117696 PMCID: PMC9473635 DOI: 10.3389/fphys.2022.933069] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/30/2022] [Accepted: 07/18/2022] [Indexed: 11/23/2022]
Abstract
Text mining has been shown to be an auxiliary but key driver for modeling, data harmonization, and interpretation in biomedicine. Scientific literature holds a wealth of information, embodies cumulative knowledge, and remains the core basis on which mechanistic pathways, molecular databases, and models are built and refined. Text mining provides the necessary tools to automatically harness the potential of text. In this study, we show the potential of large-scale text mining for deriving novel insights, with a focus on the growing field of the microbiome. We first collected the complete set of abstracts relevant to the microbiome from PubMed and used our text mining and intelligence platform Taxila for analysis. We demonstrate the usefulness of text mining using two case studies. First, we analyze the geographical distribution of research and study locations for the field of the microbiome by extracting geo mentions from text. Using this analysis, we were able to draw insights on the state of microbiome research with respect to geographical distributions and economic drivers. Next, to understand the relationships between diseases, microbiome, and food, which are central to the field, we construct semantic relationship networks between these concepts. We show how such networks can be used to derive insight with no prior knowledge encoded.
Affiliation(s)
- Samik Ghosh
  - SBX Corporation Inc., Tokyo, Japan
  - The NLP Group, The Systems Biology Institute, Tokyo, Japan
- Ayako Yachie
  - SBX Corporation Inc., Tokyo, Japan
  - The NLP Group, The Systems Biology Institute, Tokyo, Japan
- Sucheendra K. Palaniappan
  - SBX Corporation Inc., Tokyo, Japan
  - The NLP Group, The Systems Biology Institute, Tokyo, Japan
  - *Correspondence: Sucheendra K. Palaniappan,
5
Ravanmehr V, Blau H, Cappelletti L, Fontana T, Carmody L, Coleman B, George J, Reese J, Joachimiak M, Bocci G, Hansen P, Bult C, Rueter J, Casiraghi E, Valentini G, Mungall C, Oprea TI, Robinson PN. Supervised learning with word embeddings derived from PubMed captures latent knowledge about protein kinases and cancer. NAR Genom Bioinform 2021; 3:lqab113. [PMID: 34888523 PMCID: PMC8652379 DOI: 10.1093/nargab/lqab113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2021] [Revised: 10/14/2021] [Accepted: 11/24/2021] [Indexed: 11/17/2022]
Abstract
Inhibiting protein kinases (PKs) that cause cancers has been an important topic in cancer therapy for years. So far, almost 8% of >530 PKs have been targeted by FDA-approved medications, and around 150 protein kinase inhibitors (PKIs) have been tested in clinical trials. We present an approach based on natural language processing and machine learning to investigate the relations between PKs and cancers, predicting PKs whose inhibition would be efficacious to treat a certain cancer. Our approach represents PKs and cancers as semantically meaningful 100-dimensional vectors based on word and concept neighborhoods in PubMed abstracts. We use information about phase I-IV trials in ClinicalTrials.gov to construct a training set for random forest classification. Our results with historical data show that associations between PKs and specific cancers can be predicted years in advance with good accuracy. Our tool can be used to predict the relevance of inhibiting PKs for specific cancers and to support the design of well-focused clinical trials to discover novel PKIs for cancer therapy.
Affiliation(s)
- Vida Ravanmehr
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Hannah Blau
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Luca Cappelletti
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Italy
- Tommaso Fontana
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Italy
- Leigh Carmody
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Ben Coleman
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
  - University of Connecticut Health Center, Department of Genetics and Genome Sciences, Farmington, CT 06030, USA
- Joshy George
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Justin Reese
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94710, USA
- Marcin Joachimiak
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94710, USA
- Giovanni Bocci
  - Department of Internal Medicine and UNM Comprehensive Cancer Center, UNM School of Medicine, Albuquerque, NM 87102, USA
- Peter Hansen
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Carol Bult
  - The Jackson Laboratory for Mammalian Genetics, Bar Harbor, ME 04609, USA
- Jens Rueter
  - The Jackson Laboratory for Mammalian Genetics, Bar Harbor, ME 04609, USA
- Elena Casiraghi
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Italy
- Giorgio Valentini
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Italy
- Christopher Mungall
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94710, USA
- Tudor I Oprea
  - Department of Internal Medicine and UNM Comprehensive Cancer Center, UNM School of Medicine, Albuquerque, NM 87102, USA
- Peter N Robinson
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
  - Institute for Systems Genomics, University of Connecticut, Farmington, CT 06032, USA