1. Databases of ligand-binding pockets and protein-ligand interactions. Comput Struct Biotechnol J 2024; 23:1320-1338. [PMID: 38585646 PMCID: PMC10997877 DOI: 10.1016/j.csbj.2024.03.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Revised: 03/16/2024] [Accepted: 03/17/2024] [Indexed: 04/09/2024] Open
Abstract
Many research groups and institutions have created a variety of databases curating experimental and predicted data related to protein-ligand binding. The landscape of available databases is dynamic, with new databases emerging and established databases becoming defunct. Here, we review the current state of databases that contain binding pockets and protein-ligand binding interactions. We have compiled a list of such databases, fifty-three of which are currently available for use. We discuss variation in how binding pockets are defined and summarize pocket-finding methods. We organize the fifty-three databases into subgroups based on goals and contents, and describe standard use cases. We also illustrate that pockets within the same protein are characterized differently across different databases. Finally, we assess critical issues of sustainability, accessibility and redundancy.
2. Elucidating the semantics-topology trade-off for knowledge inference-based pharmacological discovery. J Biomed Semantics 2024; 15:5. [PMID: 38693563 PMCID: PMC11064343 DOI: 10.1186/s13326-024-00308-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2024] [Accepted: 04/21/2024] [Indexed: 05/03/2024] Open
Abstract
Leveraging AI for synthesizing the deluge of biomedical knowledge has great potential for pharmacological discovery with applications including developing new therapeutics for untreated diseases and repurposing drugs as emergent pandemic treatments. Creating knowledge graph representations of interacting drugs, diseases, genes, and proteins enables discovery via embedding-based ML approaches and link prediction. Previously, it has been shown that these predictive methods are susceptible to biases from network structure, namely that they are driven not by discovering nuanced biological understanding of mechanisms, but based on high-degree hub nodes. In this work, we study the confounding effect of network topology on biological relation semantics by creating an experimental pipeline of knowledge graph semantic and topological perturbations. We show that the drop in drug repurposing performance from ablating meaningful semantics increases by 21% and 38% when mitigating topological bias in two networks. We demonstrate that new methods for representing knowledge and inferring new knowledge must be developed for making use of biomedical semantics for pharmacological innovation, and we suggest fruitful avenues for their development.
3. Computational Approaches to Drug Repurposing: Methods, Challenges, and Opportunities. Annu Rev Biomed Data Sci 2024. [PMID: 38598857 DOI: 10.1146/annurev-biodatasci-110123-025333] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/12/2024]
Abstract
Drug repurposing refers to the inference of therapeutic relationships between a clinical indication and existing compounds. As an emerging paradigm in drug development, drug repurposing enables more efficient treatment of rare diseases, stratified patient populations, and urgent threats to public health. However, prioritizing well-suited drug candidates from among a nearly infinite number of repurposing options continues to represent a significant challenge in drug development. Over the past decade, advances in genomic profiling, database curation, and machine learning techniques have enabled more accurate identification of drug repurposing candidates for subsequent clinical evaluation. This review outlines the major methodologic classes that these approaches comprise, which rely on (a) protein structure, (b) genomic signatures, (c) biological networks, and (d) real-world clinical data. We propose that realizing the full impact of drug repurposing methodologies requires a multidisciplinary understanding of each method's advantages and limitations with respect to clinical practice.
4. Leveraging large-scale biobank EHRs to enhance pharmacogenetics of cardiometabolic disease medications. medRxiv: The Preprint Server for Health Sciences 2024:2024.04.06.24305415. [PMID: 38633781 PMCID: PMC11023668 DOI: 10.1101/2024.04.06.24305415] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/19/2024]
Abstract
Electronic health records (EHRs) coupled with large-scale biobanks offer great promise for unraveling the genetic underpinnings of treatment efficacy. However, medication-induced biomarker trajectories stemming from such records remain poorly studied. Here, we extract clinical and medication prescription data from EHRs and conduct GWAS and rare variant burden tests in the UK Biobank (discovery) and the All of Us program (replication) on ten cardiometabolic drug response outcomes including lipid response to statins, HbA1c response to metformin and blood pressure response to antihypertensives (N = 740-26,669). Our findings at genome-wide significance level recover previously reported pharmacogenetic signals and also include novel associations for lipid response to statins (N = 26,669) near LDLR and ZNF800. Importantly, these associations are treatment-specific and not associated with biomarker progression in medication-naive individuals. Furthermore, we demonstrate that individuals with higher genetically determined low-density and total cholesterol baseline levels experience increased absolute, albeit lower relative biomarker reduction following statin treatment. In summary, we systematically investigated the common and rare pharmacogenetic contribution to cardiometabolic drug response phenotypes in over 50,000 UK Biobank and All of Us participants with EHR and identified clinically relevant genetic predictors for improved personalized treatment strategies.
5. CAGI, the Critical Assessment of Genome Interpretation, establishes progress and prospects for computational genetic variant interpretation methods. Genome Biol 2024; 25:53. [PMID: 38389099 PMCID: PMC10882881 DOI: 10.1186/s13059-023-03113-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2023] [Accepted: 11/17/2023] [Indexed: 02/24/2024] Open
Abstract
BACKGROUND: The Critical Assessment of Genome Interpretation (CAGI) aims to advance the state-of-the-art for computational prediction of genetic variant impact, particularly where relevant to disease. The five complete editions of the CAGI community experiment comprised 50 challenges, in which participants made blind predictions of phenotypes from genetic data, and these were evaluated by independent assessors. RESULTS: Performance was particularly strong for clinical pathogenic variants, including some difficult-to-diagnose cases, and extends to interpretation of cancer-related variants. Missense variant interpretation methods were able to estimate biochemical effects with increasing accuracy. Assessment of methods for regulatory variants and complex trait disease risk was less definitive and indicates performance potentially suitable for auxiliary use in the clinic. CONCLUSIONS: Results show that while current methods are imperfect, they have major utility for research and clinical applications. Emerging methods and increasingly large, robust datasets for training and assessment promise further progress ahead.
6. A mitochondrial inside-out iron-calcium signal reveals drug targets for Parkinson's disease. Cell Rep 2023; 42:113544. [PMID: 38060381 PMCID: PMC10804639 DOI: 10.1016/j.celrep.2023.113544] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2023] [Revised: 09/11/2023] [Accepted: 11/17/2023] [Indexed: 12/30/2023] Open
Abstract
Dysregulated iron or Ca2+ homeostasis has been reported in Parkinson's disease (PD) models. Here, we discover a connection between these two metals at the mitochondria. Elevation of iron levels causes inward mitochondrial Ca2+ overflow, through an interaction of Fe2+ with mitochondrial calcium uniporter (MCU). In PD neurons, iron accumulation-triggered Ca2+ influx across the mitochondrial surface leads to spatially confined Ca2+ elevation at the outer mitochondrial membrane, which is subsequently sensed by Miro1, a Ca2+-binding protein. A Miro1 blood test distinguishes PD patients from controls and responds to drug treatment. Miro1-based drug screens in PD cells discover Food and Drug Administration-approved T-type Ca2+-channel blockers. Human genetic analysis reveals enrichment of rare variants in T-type Ca2+-channel subtypes associated with PD status. Our results identify a molecular mechanism in PD pathophysiology and drug targets and candidates coupled with a convenient stratification method.
7. Deep Learning for Localized Detection of Optic Disc Hemorrhages. Am J Ophthalmol 2023; 255:161-169. [PMID: 37490992 DOI: 10.1016/j.ajo.2023.07.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2023] [Revised: 06/12/2023] [Accepted: 07/05/2023] [Indexed: 07/27/2023]
Abstract
PURPOSE: To develop an automated deep learning system for detecting the presence and location of disc hemorrhages in optic disc photographs. DESIGN: Development and testing of a deep learning algorithm. METHODS: Optic disc photos (597 images with at least 1 disc hemorrhage and 1075 images without any disc hemorrhage from 1562 eyes) from 5 institutions were classified by expert graders based on the presence or absence of disc hemorrhage. The images were split into training (n = 1340), validation (n = 167), and test (n = 165) datasets. Two state-of-the-art deep learning algorithms based on either object-level detection or image-level classification were trained on the dataset. These models were compared to one another and against 2 independent glaucoma specialists. We evaluated model performance by the area under the receiver operating characteristic curve (AUC). AUCs were compared with the Hanley-McNeil method. RESULTS: The object detection model achieved an AUC of 0.936 (95% CI = 0.857-0.964) across all held-out images (n = 165 photographs), which was significantly superior to the image classification model (AUC = 0.845, 95% CI = 0.740-0.912; P = .006). At an operating point selected for high specificity, the model achieved a specificity of 94.3% and a sensitivity of 70.0%, which was statistically indistinguishable from an expert clinician (P = .7). At an operating point selected for high sensitivity, the model achieved a sensitivity of 96.7% and a specificity of 73.3%. CONCLUSIONS: An autonomous object detection model is superior to an image classification model for detecting disc hemorrhages, and performed comparably to 2 clinicians.
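As a reading aid, the dataset split reported in this abstract can be checked with a few lines of Python. This is a minimal sketch that uses only the image counts quoted above; the variable and function names are illustrative and do not come from the paper.

```python
# Image counts quoted in the abstract above.
with_hemorrhage = 597
without_hemorrhage = 1075
splits = {"training": 1340, "validation": 167, "test": 165}

total_images = with_hemorrhage + without_hemorrhage


def split_fractions(counts):
    """Return each split's share of the total number of images."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


fractions = split_fractions(splits)

# The three splits should account for every labeled image.
assert sum(splits.values()) == total_images

for name, frac in fractions.items():
    print(f"{name}: {splits[name]} images ({frac:.1%})")
```

The split sizes sum to the number of labeled images, which works out to roughly an 80/10/10 division.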
8. Integrative analyses highlight functional regulatory variants associated with neuropsychiatric diseases. Nat Genet 2023; 55:1876-1891. [PMID: 37857935 PMCID: PMC10859123 DOI: 10.1038/s41588-023-01533-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2021] [Accepted: 09/15/2023] [Indexed: 10/21/2023]
Abstract
Noncoding variants of presumed regulatory function contribute to the heritability of neuropsychiatric disease. A total of 2,221 noncoding variants connected to risk for ten neuropsychiatric disorders, including autism spectrum disorder, attention deficit hyperactivity disorder, bipolar disorder, borderline personality disorder, major depression, generalized anxiety disorder, panic disorder, post-traumatic stress disorder, obsessive-compulsive disorder and schizophrenia, were studied in developing human neural cells. Integrating epigenomic and transcriptomic data with massively parallel reporter assays identified differentially-active single-nucleotide variants (daSNVs) in specific neural cell types. Expression-gene mapping, network analyses and chromatin looping nominated candidate disease-relevant target genes modulated by these daSNVs. Follow-up integration of daSNV gene editing with clinical cohort analyses suggested that magnesium transport dysfunction may increase neuropsychiatric disease risk and indicated that common genetic pathomechanisms may mediate specific symptoms that are shared across multiple neuropsychiatric diseases.
9. Explainable protein function annotation using local structure embeddings. bioRxiv: The Preprint Server for Biology 2023:2023.10.13.562298. [PMID: 37905033 PMCID: PMC10614799 DOI: 10.1101/2023.10.13.562298] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 11/02/2023]
Abstract
The rapid expansion of protein sequence and structure databases has resulted in a significant number of proteins with ambiguous or unknown function. While advances in machine learning techniques hold great potential to fill this annotation gap, current methods for function prediction are unable to associate global function reliably to the specific residues responsible for that function. We address this issue by introducing PARSE (Protein Annotation by Residue-Specific Enrichment), a knowledge-based method which combines pre-trained embeddings of local structural environments with traditional statistical techniques to identify enriched functions with residue-level explainability. For the task of predicting the catalytic function of enzymes, PARSE achieves comparable or superior global performance to state-of-the-art machine learning methods (F1 score > 85%) while simultaneously annotating the specific residues involved in each function with much greater precision. Since it does not require supervised training, our method can make one-shot predictions for very rare functions and is not limited to a particular type of functional label (e.g. Enzyme Commission numbers or Gene Ontology codes). Finally, we leverage the AlphaFold Structure Database to perform functional annotation at a proteome scale. By applying PARSE to the dark proteome (predicted structures which cannot be classified into known structural families), we predict several novel bacterial metalloproteases. Each of these proteins shares a strongly conserved catalytic site despite highly divergent sequences and global folds, illustrating the value of local structure representations for new function discovery.
10. A Holy Grail - The Prediction of Protein Structure. N Engl J Med 2023; 389:1431-1434. [PMID: 37732608 DOI: 10.1056/nejmcibr2307735] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/22/2023]
11. Stronger regulation of AI in biomedicine. Sci Transl Med 2023; 15:eadi0336. [PMID: 37703349 PMCID: PMC10977140 DOI: 10.1126/scitranslmed.adi0336] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/15/2023]
Abstract
Regulatory agencies need to ensure the safety and equity of AI in biomedicine, and the time to do so is now.
12. The phenotype-genotype reference map: Improving biobank data science through replication. Am J Hum Genet 2023; 110:1522-1533. [PMID: 37607538 PMCID: PMC10502848 DOI: 10.1016/j.ajhg.2023.07.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2023] [Revised: 07/26/2023] [Accepted: 07/27/2023] [Indexed: 08/24/2023] Open
Abstract
Population-scale biobanks linked to electronic health record data provide vast opportunities to extend our knowledge of human genetics and discover new phenotype-genotype associations. Given their dense phenotype data, biobanks can also facilitate replication studies on a phenome-wide scale. Here, we introduce the phenotype-genotype reference map (PGRM), a set of 5,879 genetic associations from 523 GWAS publications that can be used for high-throughput replication experiments. PGRM phenotypes are standardized as phecodes, ensuring interoperability between biobanks. We applied the PGRM to five ancestry-specific cohorts from four independent biobanks and found evidence of robust replications across a wide array of phenotypes. We show how the PGRM can be used to detect data corruption and to empirically assess parameters for phenome-wide studies. Finally, we use the PGRM to explore factors associated with replicability of GWAS results.
13. Associating biological context with protein-protein interactions through text mining at PubMed scale. J Biomed Inform 2023; 145:104474. [PMID: 37572825 DOI: 10.1016/j.jbi.2023.104474] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/14/2022] [Revised: 08/03/2023] [Accepted: 08/05/2023] [Indexed: 08/14/2023]
Abstract
Inferring knowledge from known relationships between drugs, proteins, genes, and diseases has great potential for clinical impact, such as predicting which existing drugs could be repurposed to treat rare diseases. Incorporating key biological context such as cell type or tissue of action into representations of extracted biomedical knowledge is essential for principled pharmacological discovery. Existing global, literature-derived knowledge graphs of interactions between drugs, proteins, genes, and diseases lack this essential information. In this study, we frame the task of associating biological context with protein-protein interactions extracted from text as a classification task using syntactic, semantic, and novel meta-discourse features. We introduce the Insider corpora, which are automatically generated PubMed-scale corpora for training classifiers for the context association task. These corpora are created by searching for precise syntactic cues of cell type and tissue relevancy to extracted regulatory relations. We report F1 scores of 0.955 and 0.862 for identifying relevant cell types and tissues, respectively, for our identified relations. By classifying with this framework, we demonstrate that the problem of context association can be addressed using intuitive, interpretable features. We demonstrate the potential of this approach to enrich text-derived knowledge bases with biological detail by incorporating cell type context into a protein-protein network for dengue fever.
14. Genetic Correlations Among Corneal Biophysical Parameters and Anthropometric Traits. Transl Vis Sci Technol 2023; 12:8. [PMID: 37561511 PMCID: PMC10424803 DOI: 10.1167/tvst.12.8.8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2023] [Accepted: 07/10/2023] [Indexed: 08/11/2023] Open
Abstract
Purpose: The genetic architecture of corneal dysfunction remains poorly understood. Epidemiological and clinical evidence suggests a relationship between corneal structural features and anthropometric measures. We used global and local genetic similarity analysis to identify genomic features that may underlie structural corneal dysfunction. Methods: We assembled genome-wide association study summary statistics for corneal features (central corneal thickness, corneal hysteresis [CH], corneal resistance factor [CRF], and the 3 mm index of keratometry) and anthropometric traits (body mass index, weight, and height) in Europeans. We calculated global genetic correlations (rg) between traits using linkage disequilibrium (LD) score regression and local genetic covariance using ρ-HESS, which partitions the genome and performs regression with LD regions. Finally, we identified genes located within regions of significant genetic covariance and analyzed patterns of tissue expression and pathway enrichment. Results: Global LD score regression revealed significant negative correlations between height and both CH (rg = -0.12; P = 2.0 × 10⁻⁷) and CRF (rg = -0.11; P = 6.9 × 10⁻⁷). Local analysis revealed 68 genomic regions exhibiting significant local genetic covariance between CRF and height, containing 2874 unique genes. Pathway analysis of genes in regions with significant local rg revealed enrichment among signaling pathways with known keratoconus associations, including cadherin and Wnt signaling, as well as enrichment of genes modulated by copper and zinc ions. Conclusions: Corneal biophysical parameters and height share a common genomic architecture, which may facilitate identification of disease-associated genes and therapies for corneal ectasias. Translational Relevance: Local genetic covariance analysis enables the identification of associated genes and therapeutic targets for corneal ectatic disease.
15. Integrative analysis of functional genomic screening and clinical data identifies a protective role for spironolactone in severe COVID-19. Cell Reports Methods 2023; 3:100503. [PMID: 37529368 PMCID: PMC10243122 DOI: 10.1016/j.crmeth.2023.100503] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2022] [Revised: 04/01/2023] [Accepted: 05/23/2023] [Indexed: 08/03/2023]
Abstract
We demonstrate that integrative analysis of CRISPR screening datasets enables network-based prioritization of prescription drugs modulating viral entry in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) by developing a network-based approach called Rapid proXimity Guidance for Repurposing Investigational Drugs (RxGRID). We use our results to guide a propensity-score-matched, retrospective cohort study of 64,349 COVID-19 patients, showing that a top candidate drug, spironolactone, is associated with improved clinical prognosis, measured by intensive care unit (ICU) admission and mechanical ventilation rates. Finally, we show that spironolactone exerts a dose-dependent inhibitory effect on viral entry in human lung epithelial cells. Our RxGRID method presents a computational framework, implemented as an open-source software package, enabling genomics researchers to identify drugs likely to modulate a molecular phenotype of interest based on high-throughput screening data. Our results, derived from this method and supported by experimental and clinical analysis, add additional supporting evidence for a potential protective role of the potassium-sparing diuretic spironolactone in severe COVID-19.
16. Association between spironolactone use and COVID-19 outcomes in population-scale claims data: a retrospective cohort study. medRxiv: The Preprint Server for Health Sciences 2023:2023.02.28.23286515. [PMID: 36909470 PMCID: PMC10002773 DOI: 10.1101/2023.02.28.23286515] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/06/2023]
Abstract
Background: Spironolactone has been proposed as a potential modulator of SARS-CoV-2 cellular entry. We aimed to measure the effect of spironolactone use on the risk of adverse outcomes following COVID-19 hospitalization. Methods: We performed a retrospective cohort study of COVID-19 outcomes for patients with or without exposure to spironolactone, using population-scale claims data from the Komodo Healthcare Map. We identified all patients with a hospital admission for COVID-19 in the study window, defining treatment status based on spironolactone prescription orders. The primary outcomes were progression to respiratory ventilation or mortality during the hospitalization. Odds ratios (OR) were estimated following either 1:1 propensity score matching (PSM) or multivariable regression. Subgroup analysis was performed based on age, gender, body mass index (BMI), and dominant SARS-CoV-2 variant. Findings: Among 898,303 eligible patients with a COVID-19-related hospitalization, 16,324 patients (1.8%) had a spironolactone prescription prior to hospitalization. 59,937 patients (6.7%) met the ventilation endpoint, and 26,515 patients (3.0%) met the mortality endpoint. Spironolactone use was associated with a significant reduction in odds of both ventilation (OR 0.82; 95% CI: 0.75-0.88; p < 0.001) and mortality (OR 0.88; 95% CI: 0.78-0.99; p = 0.033) in the PSM analysis, supported by the regression analysis. Spironolactone use was associated with significantly reduced odds of ventilation for all age groups, men, women, and non-obese patients, with the greatest protective effects in younger patients, men, and non-obese patients. Interpretation: Spironolactone use was associated with a protective effect against ventilation and mortality following COVID-19 infection, amounting to up to 64% of the protective effect of vaccination against ventilation and consistent with an androgen-dependent mechanism.
The findings warrant initiation of large-scale randomized controlled trials to establish a potential therapeutic role for spironolactone in COVID-19 patients.
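The cohort percentages quoted in this abstract follow directly from the reported counts. The sketch below recomputes them. It uses only the figures stated above, and the dictionary keys and helper name are illustrative rather than taken from the study.

```python
# Counts and percentages quoted in the abstract above.
eligible_patients = 898_303
reported = {
    "spironolactone prescription": (16_324, 1.8),
    "ventilation endpoint": (59_937, 6.7),
    "mortality endpoint": (26_515, 3.0),
}


def percent_of_cohort(count, total):
    """Express a patient count as a percentage of the eligible cohort."""
    return 100.0 * count / total


computed = {
    label: round(percent_of_cohort(count, eligible_patients), 1)
    for label, (count, _) in reported.items()
}

for label, (count, stated) in reported.items():
    print(f"{label}: {count:,} patients, computed {computed[label]}%, stated {stated}%")
```

Rounded to one decimal place, the computed values match the stated 1.8%, 6.7%, and 3.0%.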
17. Using GPT-3 to Build a Lexicon of Drugs of Abuse Synonyms for Social Media Pharmacovigilance. Biomolecules 2023; 13:biom13020387. [PMID: 36830756 PMCID: PMC9953178 DOI: 10.3390/biom13020387] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2023] [Revised: 02/09/2023] [Accepted: 02/16/2023] [Indexed: 02/22/2023] Open
Abstract
Drug abuse is a serious problem in the United States, with over 90,000 drug overdose deaths nationally in 2020. A key step in combating drug abuse is detecting, monitoring, and characterizing its trends over time and location, also known as pharmacovigilance. While federal reporting systems accomplish this to a degree, they often have high latency and incomplete coverage. Social-media-based pharmacovigilance has zero latency, is easily accessible and unfiltered, and benefits from drug users being willing to share their experiences online pseudo-anonymously. However, unlike highly structured official data sources, social media text is rife with misspellings and slang, making automated analysis difficult. Generative Pretrained Transformer 3 (GPT-3) is a large autoregressive language model specialized for few-shot learning that was trained on text from the entire internet. We demonstrate that GPT-3 can be used to generate slang and common misspellings of terms for drugs of abuse. We repeatedly queried GPT-3 for synonyms of drugs of abuse and filtered the generated terms using automated Google searches and cross-references to known drug names. When generated terms for alprazolam were manually labeled, we found that our method produced 269 synonyms for alprazolam, 221 of which were new discoveries not included in an existing drug lexicon for social media. We repeated this process for 98 drugs of abuse, of which 22 are widely-discussed drugs of abuse, building a lexicon of colloquial drug synonyms that can be used for pharmacovigilance on social media.
18. Multilingual translation for zero-shot biomedical classification using BioTranslator. Nat Commun 2023; 14:738. [PMID: 36759510 PMCID: PMC9911740 DOI: 10.1038/s41467-023-36476-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/05/2022] [Accepted: 02/01/2023] [Indexed: 02/11/2023] Open
Abstract
Existing annotation paradigms rely on controlled vocabularies, where each data instance is classified into one term from a predefined set of controlled vocabularies. This paradigm restricts the analysis to concepts that are known and well-characterized. Here, we present the novel multilingual translation method BioTranslator to address this problem. BioTranslator takes a user-written textual description of a new concept and then translates this description to a non-text biological data instance. The key idea of BioTranslator is to develop a multilingual translation framework, where multiple modalities of biological data are all translated to text. We demonstrate how BioTranslator enables the identification of novel cell types using only a textual description and how BioTranslator can be further generalized to protein function prediction and drug target identification. Our tool frees scientists from limiting their analyses within predefined controlled vocabularies, enabling them to interact with biological data using free text.
19. Promises and challenges in pharmacoepigenetics. Cambridge Prisms: Precision Medicine 2023; 1:e18. [PMID: 37560024 PMCID: PMC10406571 DOI: 10.1017/pcm.2023.6] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 11/16/2022] [Revised: 01/27/2023] [Accepted: 01/31/2023] [Indexed: 08/11/2023]
Abstract
Pharmacogenetics, the study of how interindividual genetic differences affect drug response, does not explain all observed heritable variance in drug response. Epigenetic mechanisms, such as DNA methylation and histone acetylation, may account for some of the unexplained variance. Epigenetic mechanisms modulate gene expression, can be suitable drug targets, and can impact the action of nonepigenetic drugs. Pharmacoepigenetics is the field that studies the relationship between epigenetic variability and drug response. Much of this research focuses on compounds targeting epigenetic mechanisms, called epigenetic drugs, which are used to treat cancers, immune disorders, and other diseases. Several studies also suggest an epigenetic role in classical drug response; however, we know little about this area. The amount of information correlating epigenetic biomarkers to molecular datasets has recently expanded due to technological advances, and novel computational approaches have emerged to better identify and predict epigenetic interactions. We propose that the relationship between epigenetics and classical drug response may be examined using data already available by (1) finding regions of epigenetic variance, (2) pinpointing key epigenetic biomarkers within these regions, and (3) mapping these biomarkers to a drug-response phenotype. This approach expands on existing knowledge to generate putative pharmacoepigenetic relationships, which can be tested experimentally. Epigenetic modifications are involved in disease and drug response. Therefore, understanding how epigenetic drivers impact the response to classical drugs is important for improving drug design and administration to better treat disease.
|
20
|
COLLAPSE: A representation learning framework for identification and characterization of protein structural sites. Protein Sci 2023; 32:e4541. [PMID: 36519247 PMCID: PMC9847082 DOI: 10.1002/pro.4541] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2022] [Revised: 12/02/2022] [Accepted: 12/08/2022] [Indexed: 12/23/2022]
Abstract
The identification and characterization of the structural sites which contribute to protein function are crucial for understanding biological mechanisms, evaluating disease risk, and developing targeted therapies. However, the quantity of known protein structures is rapidly outpacing our ability to functionally annotate them. Existing methods for function prediction either do not operate on local sites, suffer from high false positive or false negative rates, or require large site-specific training datasets, necessitating the development of new computational methods for annotating functional sites at scale. We present COLLAPSE (Compressed Latents Learned from Aligned Protein Structural Environments), a framework for learning deep representations of protein sites. COLLAPSE operates directly on the 3D positions of atoms surrounding a site and uses evolutionary relationships between homologous proteins as a self-supervision signal, enabling learned embeddings to implicitly capture structure-function relationships within each site. Our representations generalize across disparate tasks in a transfer learning context, achieving state-of-the-art performance on standardized benchmarks (protein-protein interactions and mutation stability) and on the prediction of functional sites from the Prosite database. We use COLLAPSE to search for similar sites across large protein datasets and to annotate proteins based on a database of known functional sites. These methods demonstrate that COLLAPSE is computationally efficient, tunable, and interpretable, providing a general-purpose platform for computational protein analysis.
|
21
|
Genetic association studies using disease liabilities from deep neural networks. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2023:2023.01.18.23284383. [PMID: 36712099 PMCID: PMC9882423 DOI: 10.1101/2023.01.18.23284383] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/21/2023]
Abstract
The case-control study is a widely used method for investigating the genetic landscape of binary traits. However, the health-related outcome or disease status of participants in long-term, prospective cohort studies such as the UK Biobank is subject to change. Here, we develop an approach for the genetic association study leveraging disease liabilities computed from a deep patient phenotyping framework (AI-based liability). Analyzing 44 common traits in 261,807 participants from the UK Biobank, we identified novel loci compared to the conventional case-control (CC) association studies. Our results showed that combining liability scores with CC status was more powerful than the CC-GWAS in detecting independent genetic loci across different diseases. This boost in statistical power was further reflected in increased SNP-based heritability estimates. Moreover, polygenic risk scores calculated from AI-based liabilities better identified newly diagnosed cases in the 2022 release of the UK Biobank that served as controls in the 2019 version (6.2% percentile rank increase on average). These findings demonstrate the utility of deep neural networks that are able to model disease liabilities from high-dimensional phenotypic data in large-scale population cohorts. Our pipeline of genome-wide association studies with disease liabilities can be applied to other biobanks with rich phenotype and genotype data.
|
22
|
POPDx: an automated framework for patient phenotyping across 392 246 individuals in the UK Biobank study. J Am Med Inform Assoc 2023; 30:245-255. [PMID: 36469791 PMCID: PMC9846671 DOI: 10.1093/jamia/ocac226] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/24/2022] [Revised: 10/19/2022] [Accepted: 11/18/2022] [Indexed: 12/12/2022] Open
Abstract
OBJECTIVE For the UK Biobank, standardized phenotype codes are associated with patients who have been hospitalized but are missing for many patients who have been treated exclusively in an outpatient setting. We describe a method for phenotype recognition that imputes phenotype codes for all UK Biobank participants. MATERIALS AND METHODS POPDx (Population-based Objective Phenotyping by Deep Extrapolation) is a bilinear machine learning framework for simultaneously estimating the probabilities of 1538 phenotype codes. We extracted phenotypic and health-related information of 392 246 individuals from the UK Biobank for POPDx development and evaluation. A total of 12 803 ICD-10 diagnosis codes of the patients were converted to 1538 phecodes as gold standard labels. The POPDx framework was evaluated and compared to other available methods on automated multiphenotype recognition. RESULTS POPDx can predict phenotypes that are rare or even unobserved in training. We demonstrate substantial improvement of automated multiphenotype recognition across 22 disease categories, and its application in identifying key epidemiological features associated with each phenotype. CONCLUSIONS POPDx helps provide well-defined cohorts for downstream studies. It is a general-purpose method that can be applied to other biobanks with diverse but incomplete data.
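As an illustration of what "bilinear" can mean in this setting, the sketch below scores a patient against a phenotype code as sigmoid(x^T W e), where x is a patient feature vector and e is an embedding of the code. This is a generic bilinear form and not the POPDx implementation; the feature values, the matrix W, the embeddings, and the code names are all hypothetical. It shows one reason such a model can score a code with no training cases: the code only needs an embedding.

```python
import math


def bilinear_score(x, W, e):
    """Return sigmoid(x^T W e) for patient features x and label embedding e."""
    z = sum(x[i] * W[i][j] * e[j] for i in range(len(x)) for j in range(len(e)))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical 3-feature patient vector.
patient = [1.0, 0.0, 2.0]

# Hypothetical learned 3x2 interaction matrix (features x embedding dims).
W = [[0.5, -0.2],
     [0.1, 0.3],
     [-0.4, 0.6]]

# Hypothetical 2-D phenotype embeddings; "unseen_code" stands for a code
# with no training cases that still has an embedding.
phenotype_embeddings = {
    "seen_code": [1.0, 0.0],
    "unseen_code": [0.0, 1.0],
}

probs = {code: bilinear_score(patient, W, e)
         for code, e in phenotype_embeddings.items()}
print(probs)
```

Because all codes share W, adding a new code requires no new parameters, only a new embedding vector.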
|
23
|
Gene set proximity analysis: expanding gene set enrichment analysis through learned geometric embeddings, with drug-repurposing applications in COVID-19. Bioinformatics 2023; 39:btac735. [PMID: 36394254 PMCID: PMC9805577 DOI: 10.1093/bioinformatics/btac735] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2022] [Revised: 09/27/2022] [Accepted: 11/16/2022] [Indexed: 11/18/2022] Open
Abstract
MOTIVATION Gene set analysis methods rely on knowledge-based representations of genetic interactions in the form of both gene set collections and protein-protein interaction (PPI) networks. However, explicit representations of genetic interactions often fail to capture complex interdependencies among genes, limiting the analytic power of such methods. RESULTS We propose an extension of gene set enrichment analysis to a latent embedding space reflecting PPI network topology, called gene set proximity analysis (GSPA). Compared with existing methods, GSPA provides improved ability to identify disease-associated pathways in disease-matched gene expression datasets, while improving reproducibility of enrichment statistics for similar gene sets. GSPA is statistically straightforward, reducing to a version of traditional gene set enrichment analysis through a single user-defined parameter. We apply our method to identify novel drug associations with SARS-CoV-2 viral entry. Finally, we validate our drug association predictions through retrospective clinical analysis of claims data from 8 million patients, supporting a role for gabapentin as a risk factor and metformin as a protective factor for severe COVID-19. AVAILABILITY AND IMPLEMENTATION GSPA is available for download as a command-line Python package at https://github.com/henrycousins/gspa. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
24
|
Mapping transcriptional heterogeneity and metabolic networks in fatty livers at single-cell resolution. iScience 2022; 26:105802. [PMID: 36636354 PMCID: PMC9830221 DOI: 10.1016/j.isci.2022.105802] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/29/2022] [Revised: 11/15/2022] [Accepted: 12/09/2022] [Indexed: 12/23/2022] Open
Abstract
Non-alcoholic fatty liver disease (NAFLD) is a heterogeneous disease with unclear underlying molecular mechanisms. Here, we perform single-cell RNA sequencing of hepatocytes and hepatic non-parenchymal cells to map the lipid signatures in mice with NAFLD. We uncover previously unidentified clusters of hepatocytes characterized by either high or low Srebp1 expression. Surprisingly, the canonical lipid synthesis driver Srebp1 is not predictive of hepatic lipid accumulation, suggesting other drivers of lipid metabolism. By combining transcriptional data at single-cell resolution with computational network analyses, we find that NAFLD is associated with high constitutive androstane receptor (CAR) expression. Mechanistically, CAR interacts with four functional modules: cholesterol homeostasis, bile acid metabolism, fatty acid metabolism, and estrogen response. Nuclear expression of CAR positively correlates with steatohepatitis in human livers. These findings demonstrate significant cellular differences in lipid signatures and identify functional networks linked to hepatic steatosis in mice and humans.
|
25
|
Functional genomics of OCTN2 variants informs protein-specific variant effect predictor for Carnitine Transporter Deficiency. Proc Natl Acad Sci U S A 2022; 119:e2210247119. [PMID: 36343260 PMCID: PMC9674959 DOI: 10.1073/pnas.2210247119] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 06/16/2022] [Accepted: 09/16/2022] [Indexed: 11/09/2022] Open
Abstract
Genetic variants in SLC22A5, encoding the membrane carnitine transporter OCTN2, cause the rare metabolic disorder Carnitine Transporter Deficiency (CTD). CTD is potentially lethal but actionable if detected early, with confirmatory diagnosis involving sequencing of SLC22A5. Interpretation of missense variants of uncertain significance (VUSs) is a major challenge. In this study, we sought to characterize the largest set to date (n = 150) of OCTN2 variants identified in diverse ancestral populations, with the goals of furthering our understanding of the mechanisms leading to OCTN2 loss-of-function (LOF) and creating a protein-specific variant effect prediction model for OCTN2 function. Uptake assays with 14C-carnitine revealed that 105 variants (70%) significantly reduced transport of carnitine compared to wild-type OCTN2, and 37 variants (25%) severely reduced function to less than 20%. All ancestral populations harbored LOF variants; 62% of green fluorescent protein (GFP)-tagged variants impaired OCTN2 localization to the plasma membrane of human embryonic kidney (HEK293T) cells, and subcellular localization significantly associated with function, revealing a major LOF mechanism of interest for CTD. With these data, we trained a model to classify variants as functional (>20% function) or LOF (<20% function). Our model outperformed existing state-of-the-art methods as evaluated by multiple performance metrics, with mean area under the receiver operating characteristic curve (AUROC) of 0.895 ± 0.025. In summary, in this study we generated a rich dataset of OCTN2 variant function and localization, revealed important disease-causing mechanisms, and improved upon machine learning-based prediction of OCTN2 variant function to aid in variant interpretation in the diagnosis and treatment of CTD.
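The labeling rule and the evaluation metric named in this abstract can be sketched in a few lines. The example below labels variants as LOF when measured function is below 20% of wild-type and computes AUROC as the probability that a randomly chosen LOF variant receives a higher predicted score than a randomly chosen functional one. The uptake values and predicted scores are hypothetical toy numbers, not data from the study, and treating exactly 20% as functional is an assumption, since the abstract does not specify that boundary.

```python
from typing import Sequence


def label_lof(function_pct: float, threshold: float = 20.0) -> int:
    """Return 1 for loss-of-function (< threshold % of wild-type), else 0."""
    return 1 if function_pct < threshold else 0


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC via pairwise comparison: the fraction of (positive, negative)
    pairs in which the positive has the higher score; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical carnitine uptake (% of wild-type) for six variants.
uptake_pct = [5.0, 12.0, 18.0, 35.0, 60.0, 95.0]
# Hypothetical predicted probability of LOF from some model.
predicted_lof = [0.9, 0.4, 0.7, 0.5, 0.2, 0.1]

labels = [label_lof(u) for u in uptake_pct]
print(labels)                        # three LOF, three functional
print(auroc(labels, predicted_lof))  # 8 of 9 pairs ranked correctly
```

The pairwise form is quadratic in the number of variants, which is acceptable for a set of about 150 variants; a rank-based formulation would scale better.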
|
26
|
A cis-regulatory lexicon of DNA motif combinations mediating cell-type-specific gene regulation. CELL GENOMICS 2022; 2:100191. [PMID: 36742369 PMCID: PMC9894309 DOI: 10.1016/j.xgen.2022.100191] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
Abstract
Gene expression is controlled by transcription factors (TFs) that bind cognate DNA motif sequences in cis-regulatory elements (CREs). The combinations of DNA motifs acting within homeostasis and disease, however, are unclear. Gene expression, chromatin accessibility, TF footprinting, and H3K27ac-dependent DNA looping data were generated and a random-forest-based model was applied to identify 7,531 cell-type-specific cis-regulatory modules (CRMs) across 15 diploid human cell types. A co-enrichment framework within CRMs nominated 838 cell-type-specific, recurrent heterotypic DNA motif combinations (DMCs), which were functionally validated using massively parallel reporter assays. Cancer cells engaged DMCs linked to neoplasia-enabling processes operative in normal cells while also activating new DMCs only seen in the neoplastic state. This integrative approach identifies cell-type-specific cis-regulatory combinatorial DNA motifs in diverse normal and diseased human cells and represents a general framework for deciphering cis-regulatory sequence logic in gene regulation.
|
27
|
A network paradigm predicts drug synergistic effects using downstream protein-protein interactions. CPT Pharmacometrics Syst Pharmacol 2022; 11:1527-1538. [PMID: 36204824 PMCID: PMC9662203 DOI: 10.1002/psp4.12861] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/06/2022] [Revised: 08/05/2022] [Accepted: 08/11/2022] [Indexed: 11/16/2022] Open
Abstract
In some cases, drug combinations affect adverse outcome phenotypes by binding the same protein; however, drug-binding proteins are associated through protein-protein interaction (PPI) networks within the cell, suggesting that drug phenotypes may result from long-range network effects. We first used PPI network analysis to classify drugs based on proteins downstream of their targets and next predicted drug combination effects where drugs shared network proteins but had distinct binding proteins (e.g., targets, enzymes, or transporters). By classifying drugs using their downstream proteins, we had an 80.7% sensitivity for predicting rare drug combination effects documented in gold-standard datasets. We further measured the effect of predicted drug combinations on adverse outcome phenotypes using novel observational studies in the electronic health record. We tested predictions for 60 network-drug classes on seven adverse outcomes and measured changes in clinical outcomes for predicted combinations. These results demonstrate a novel paradigm for anticipating drug synergistic effects using proteins downstream of drug targets.
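The selection rule described in this abstract (drugs that share downstream network proteins but have distinct binding proteins) can be sketched with a breadth-first search. The network, the drug names, the binding proteins, and the two-hop cutoff below are all hypothetical; the abstract does not state how "downstream" was bounded, so `max_hops` is an assumption made for illustration.

```python
from collections import deque
from itertools import combinations

# Hypothetical directed PPI edges: protein -> proteins downstream of it.
PPI = {
    "T1": ["P1"],
    "T2": ["P1", "P2"],
    "T3": ["P3"],
    "P1": ["P4"],
    "P2": [],
    "P3": [],
    "P4": [],
}

# Hypothetical drug -> binding proteins (targets, enzymes, or transporters).
BINDING = {
    "drugA": {"T1"},
    "drugB": {"T2"},
    "drugC": {"T3"},
    "drugD": {"T1"},
}


def downstream(binding, ppi, max_hops=2):
    """Breadth-first search: proteins reachable from the binding proteins
    within max_hops, excluding the binding proteins themselves."""
    seen = set(binding)
    queue = deque((p, 0) for p in binding)
    reached = set()
    while queue:
        protein, hops = queue.popleft()
        if hops == max_hops:
            continue
        for nxt in ppi.get(protein, []):
            if nxt not in seen:
                seen.add(nxt)
                reached.add(nxt)
                queue.append((nxt, hops + 1))
    return reached


def candidate_pairs(binding_map, ppi, max_hops=2):
    """Drug pairs with disjoint binding proteins but shared downstream proteins."""
    down = {d: downstream(b, ppi, max_hops) for d, b in binding_map.items()}
    pairs = []
    for a, b in combinations(sorted(binding_map), 2):
        shared = down[a] & down[b]
        if shared and not (binding_map[a] & binding_map[b]):
            pairs.append((a, b, sorted(shared)))
    return pairs


print(candidate_pairs(BINDING, PPI))
```

In this toy network, drugA and drugD are excluded as a pair because they bind the same protein, while drugA and drugB qualify because they converge on P1 and P4 from different binding proteins.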
|
28
|
Contexts and contradictions: a roadmap for computational drug repurposing with knowledge inference. Brief Bioinform 2022; 23:6640007. [PMID: 35817308 PMCID: PMC9294417 DOI: 10.1093/bib/bbac268] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2022] [Revised: 05/25/2022] [Accepted: 06/07/2022] [Indexed: 11/30/2022] Open
Abstract
The cost of drug development continues to rise and may be prohibitive in cases of unmet clinical need, particularly for rare diseases. Artificial intelligence-based methods are promising in their potential to discover new treatment options. The task of drug repurposing hypothesis generation is well-posed as a link prediction problem in a knowledge graph (KG) of interacting drugs, proteins, genes and disease phenotypes. KGs derived from biomedical literature are semantically rich and up-to-date representations of scientific knowledge. Inference methods on scientific KGs can be confounded by unspecified contexts and contradictions. Extracting context enables incorporation of relevant pharmacokinetic and pharmacodynamic detail, such as tissue specificity of interactions. Contradictions in biomedical KGs may arise when contexts are omitted or due to contradicting research claims. In this review, we describe challenges to creating literature-scale representations of pharmacological knowledge and survey current approaches toward incorporating context and resolving contradictions.
|
29
|
Training data composition affects performance of protein structure analysis algorithms. PACIFIC SYMPOSIUM ON BIOCOMPUTING. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2022; 27:10-21. [PMID: 34890132 PMCID: PMC8669736] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022]
Abstract
The three-dimensional structures of proteins are crucial for understanding their molecular mechanisms and interactions. Machine learning algorithms that are able to learn accurate representations of protein structures are therefore poised to play a key role in protein engineering and drug development. The accuracy of such models in deployment is directly influenced by training data quality. The use of different experimental methods for protein structure determination may introduce bias into the training data. In this work, we evaluate the magnitude of this effect across three distinct tasks: estimation of model accuracy, protein sequence design, and catalytic residue prediction. Most protein structures are derived from X-ray crystallography, nuclear magnetic resonance (NMR), or cryo-electron microscopy (cryo-EM); we trained each model on datasets consisting of either all three structure types or of only X-ray data. We find that across these tasks, models consistently perform worse on test sets derived from NMR and cryo-EM than they do on test sets of structures derived from X-ray crystallography, but that the difference can be mitigated when NMR and cryo-EM structures are included in the training set. Importantly, we show that including all three types of structures in the training set does not degrade test performance on X-ray structures, and in some cases even increases it. Finally, we examine the relationship between model performance and the biophysical properties of each method, and recommend that the biochemistry of the task of interest should be considered when composing training sets.
|
30
|
Challenges and opportunities in network-based solutions for biological questions. Brief Bioinform 2021; 23:6438103. [PMID: 34849568 PMCID: PMC8769687 DOI: 10.1093/bib/bbab437] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/23/2021] [Revised: 09/09/2021] [Accepted: 09/22/2021] [Indexed: 11/28/2022] Open
Abstract
Network biology is useful for modeling complex biological phenomena; it has attracted attention with the advent of novel graph-based machine learning methods. However, biological applications of network methods often suffer from inadequate follow-up. In this perspective, we discuss obstacles for contemporary network approaches—particularly focusing on challenges representing biological concepts, applying machine learning methods, and interpreting and validating computational findings about biology—in an effort to catalyze actionable biological discovery.
|
31
|
Quantifying the Severity of Adverse Drug Reactions Using Social Media: Network Analysis. J Med Internet Res 2021; 23:e27714. [PMID: 34673524 PMCID: PMC8569532 DOI: 10.2196/27714] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/03/2021] [Revised: 05/25/2021] [Accepted: 06/14/2021] [Indexed: 01/30/2023] Open
Abstract
BACKGROUND Adverse drug reactions (ADRs) affect the health of hundreds of thousands of individuals annually in the United States, with associated costs of hundreds of billions of dollars. The monitoring and analysis of the severity of ADRs is limited by the current qualitative and categorical systems of severity classification. Previous efforts have generated quantitative estimates for a subset of ADRs but were limited in scope because of the time and costs associated with the efforts. OBJECTIVE The aim of this study is to increase the number of ADRs for which there are quantitative severity estimates while improving the quality of these severity estimates. METHODS We present a semisupervised approach that estimates ADR severity by using social media word embeddings to construct a lexical network of ADRs and perform label propagation. We used this method to estimate the severity of 28,113 ADRs, representing 12,198 unique ADR concepts from the Medical Dictionary for Regulatory Activities. RESULTS Our Severity of Adverse Events Derived from Reddit (SAEDR) scores have good correlations with real-world outcomes. The SAEDR scores had Spearman correlations of 0.595, 0.633, and -0.748 for death, serious outcome, and no outcome, respectively, with ADR case outcomes in the Food and Drug Administration Adverse Event Reporting System. We investigated different methods for defining initial seed term sets and evaluated their impact on the severity estimates. We analyzed severity distributions for ADRs based on their appearance in boxed warning drug label sections, as well as for ADRs with sex-specific associations. We found that ADRs discovered in the postmarketing period had significantly greater severity than those discovered during the clinical trial (P<.001). 
We created quantitative drug-risk profile (DRIP) scores for 968 drugs that had a Spearman correlation of 0.377 with drugs ranked by the Food and Drug Administration Adverse Event Reporting System cases resulting in death, where the given drug was the primary suspect. CONCLUSIONS Our SAEDR and DRIP scores are well correlated with the real-world outcomes of the entities they represent and have demonstrated utility in pharmacovigilance research. We make the SAEDR scores for 12,198 ADRs and the DRIP scores for 968 drugs publicly available to enable more quantitative analysis of pharmacovigilance data.
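The semisupervised idea in the METHODS section (build a similarity network from word embeddings, then spread a few known severity labels across it) can be sketched as follows. This is a generic clamped label-propagation scheme, not the SAEDR pipeline: the 2-D vectors, the seed terms and their scores, the cosine-similarity cutoff `min_sim`, and the [0, 1] severity scale are all hypothetical choices made for illustration.

```python
import math

# Hypothetical 2-D "embeddings" for five ADR terms (real embeddings would be
# high-dimensional vectors learned from social media text).
EMB = {
    "death": (1.0, 0.0),
    "cardiac arrest": (0.9, 0.2),
    "chest pain": (0.6, 0.6),
    "nausea": (0.2, 0.9),
    "dry mouth": (0.0, 1.0),
}

# Hypothetical seed severities in [0, 1]; all other terms start unlabeled.
SEEDS = {"death": 1.0, "dry mouth": 0.0}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def propagate(emb, seeds, min_sim=0.5, iters=100):
    """Repeatedly set each unlabeled term to the similarity-weighted mean of
    its neighbors' scores, keeping the seed scores fixed."""
    terms = list(emb)
    # Lexical network: keep an edge only if cosine similarity >= min_sim.
    weights = {t: {} for t in terms}
    for t in terms:
        for u in terms:
            if u != t:
                sim = cosine(emb[t], emb[u])
                if sim >= min_sim:
                    weights[t][u] = sim
    score = {t: seeds.get(t, 0.5) for t in terms}
    for _ in range(iters):
        updated = {}
        for t in terms:
            if t in seeds or not weights[t]:
                updated[t] = score[t]  # seeds and isolated terms stay put
            else:
                total = sum(weights[t].values())
                updated[t] = sum(w * score[u] for u, w in weights[t].items()) / total
        score = updated
    return score


scores = propagate(EMB, SEEDS)
for term, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {value:.3f}")
```

In this toy network the unlabeled terms settle between the two seeds in order of their similarity to each, which is the behavior the approach relies on; how the seed set is chosen matters, as the abstract notes when it evaluates different seed term sets.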
|
32
|
Abstract
Single cell technologies are rapidly generating large amounts of data that enable us to understand biological systems at single-cell resolution. However, joint analysis of datasets generated by independent labs remains challenging due to a lack of consistent terminology to describe cell types. Here, we present OnClass, an algorithm and accompanying software for automatically classifying cells into cell types that are part of the controlled vocabulary that forms the Cell Ontology. A key advantage of OnClass is its capability to classify cells into cell types not present in the training data because it uses the Cell Ontology graph to infer cell type relationships. Furthermore, OnClass can be used to identify marker genes for all the Cell Ontology categories, regardless of whether the cell types are present or absent in the training data. This suggests that OnClass goes beyond a simple annotation tool for single cell datasets: it is the first algorithm capable of identifying marker genes specific to all terms of the Cell Ontology, and it offers the possibility of refining the Cell Ontology using a data-centric approach.
|
33
|
Genomewide Association Studies in Pharmacogenomics. Clin Pharmacol Ther 2021; 110:637-648. [PMID: 34185318 PMCID: PMC8376796 DOI: 10.1002/cpt.2349] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 04/02/2021] [Accepted: 06/15/2021] [Indexed: 12/24/2022]
Abstract
The increasing availability of genotype data linked with information about drug-response phenotypes has enabled genomewide association studies (GWAS) that uncover genetic determinants of drug response. GWAS have discovered associations between genetic variants and both drug efficacy and adverse drug reactions. Despite these successes, the design of GWAS in pharmacogenomics (PGx) faces unique challenges. In this review, we analyze the last decade of GWAS in PGx. We review trends in publications over time, including the drugs and drug classes studied and the clinical phenotypes used. Several data sharing consortia have contributed substantially to the PGx GWAS literature. We anticipate increased focus on biobanks and highlight phenotypes that would best enable future PGx discoveries.
|
34
|
Scientific considerations for global drug development. Sci Transl Med 2021; 12:12/554/eaax2550. [PMID: 32727913 DOI: 10.1126/scitranslmed.aax2550] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/07/2019] [Revised: 06/05/2019] [Accepted: 11/05/2019] [Indexed: 12/12/2022]
Abstract
Requiring regional or in-country confirmatory clinical trials before approval of drugs already approved elsewhere delays access to medicines in low- and middle-income countries and raises drug costs. Here, we discuss the scientific and technological advances that may reduce the need for in-country or in-region clinical trials for drugs approved in other countries and limitations of these advances that could necessitate in-region clinical studies.
|
35
|
Distinct clinical phenotypes for Crohn's disease derived from patient surveys. BMC Gastroenterol 2021; 21:160. [PMID: 33836648 PMCID: PMC8034169 DOI: 10.1186/s12876-021-01740-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/18/2020] [Accepted: 03/25/2021] [Indexed: 11/14/2022] Open
Abstract
Background Defining clinical phenotypes provides opportunities for new diagnostics and may provide insights into early intervention and disease prevention. There is increasing evidence that patient-derived health data may contain information that complements traditional methods of clinical phenotyping. The utility of these data for defining meaningful phenotypic groups is of great interest because social media and online resources make it possible to query large cohorts of patients with health conditions. Methods We evaluated the degree to which patient-reported categorical data is useful for discovering subclinical phenotypes and evaluated its utility for discovering new measures of disease severity, treatment response and genetic architecture. Specifically, we examined the responses of 1961 patients with inflammatory bowel disease to questionnaires in search of sub-phenotypes. We applied machine learning methods to identify novel subtypes of Crohn’s disease and studied their associations with drug responses. Results Using the patients’ self-reported information, we identified two subpopulations of Crohn’s disease; these subpopulations differ in disease severity, associations with smoking, and genetic transmission patterns. We also identified distinct features of drug response for the two Crohn’s disease subtypes. These subtypes show a trend towards differential genotype signatures. Conclusion Our findings suggest that patient-defined data can have unplanned utility for defining disease subtypes and may be useful for guiding treatment approaches. Supplementary Information The online version contains supplementary material available at 10.1186/s12876-021-01740-6.
|
36
|
Opportunities and challenges for the computational interpretation of rare variation in clinically important genes. Am J Hum Genet 2021; 108:535-548. [PMID: 33798442 PMCID: PMC8059338 DOI: 10.1016/j.ajhg.2021.03.003] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Indexed: 12/13/2022] Open
Abstract
Genome sequencing is enabling precision medicine: tailoring treatment to the unique constellation of variants in an individual's genome. The impact of recurrent pathogenic variants is often understood; however, there is a long tail of rare genetic variants that are uncharacterized. The problem of uncharacterized rare variation is especially acute when it occurs in genes of known clinical importance with functionally consequential variants and associated mechanisms. Variants of uncertain significance (VUSs) in these genes are discovered at a rate that outpaces current ability to classify them with databases of previous cases, experimental evaluation, and computational predictors. Clinicians are thus left without guidance about the significance of variants that may have actionable consequences. Computational prediction of the impact of rare genetic variation is becoming an increasingly important capability. In this paper, we review the technical and ethical challenges of interpreting the function of rare variants in two settings: inborn errors of metabolism in newborns and pharmacogenomics. We propose a framework for a genomic learning healthcare system with an initial focus on early-onset treatable disease in newborns and actionable pharmacogenomics. We argue that (1) a genomic learning healthcare system must allow for continuous collection and assessment of rare variants, (2) emerging machine learning methods will enable algorithms to predict the clinical impact of rare variants on protein function, and (3) ethical considerations must inform the construction and deployment of all rare-variation triage strategies, particularly with respect to health disparities arising from unbalanced ancestry representation.
|
37
|
Large-scale labeling and assessment of sex bias in publicly available expression data. BMC Bioinformatics 2021; 22:168. [PMID: 33784977 PMCID: PMC8011224 DOI: 10.1186/s12859-021-04070-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/01/2020] [Accepted: 03/08/2021] [Indexed: 01/09/2023] Open
Abstract
BACKGROUND Women are at more than 1.5-fold higher risk for clinically relevant adverse drug events. While this higher prevalence is partially due to gender-related effects, biological sex differences likely also impact drug response. Publicly available gene expression databases provide a unique opportunity for examining drug response at a cellular level. However, missingness and heterogeneity of metadata prevent large-scale identification of drug exposure studies and limit assessments of sex bias. To address this, we trained organism-specific models to infer sample sex from gene expression data, and used entity normalization to map metadata cell line and drug mentions to existing ontologies. Using this method, we inferred sex labels for 450,371 human and 245,107 mouse microarray and RNA-seq samples from refine.bio. RESULTS Overall, we find slight female bias (52.1%) in human samples and male bias (62.5%) in mouse samples; this corresponds to a majority of mixed sex studies in humans and single sex studies in mice, split between female-only and male-only (25.8% vs. 18.9% in human and 21.6% vs. 31.1% in mouse, respectively). In drug studies, we find limited evidence for sex-sampling bias overall; however, specific categories of drugs, including human cancer and mouse nervous system drugs, are enriched in female-only and male-only studies, respectively. We leverage our expression-based sex labels to further examine the complexity of cell line sex and assess the frequency of metadata sex label misannotations (2-5%). CONCLUSIONS Our results demonstrate limited overall sex bias, while highlighting high bias in specific subfields and underscoring the importance of including sex labels to better understand the underlying biology. We make our inferred and normalized labels, along with flags for misannotated samples, publicly available to catalyze the routine use of sex as a study variable in future analyses.
|
38
|
Search and visualization of gene-drug-disease interactions for pharmacogenomics and precision medicine research using GeneDive. J Biomed Inform 2021; 117:103732. [PMID: 33737208 DOI: 10.1016/j.jbi.2021.103732] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2020] [Revised: 12/10/2020] [Accepted: 02/28/2021] [Indexed: 10/21/2022]
Abstract
BACKGROUND Understanding the relationships between genes, drugs, and disease states is at the core of pharmacogenomics. Two leading approaches for identifying these relationships in medical literature are human expert-led manual curation efforts and modern data-mining-based automated approaches. The former generates small amounts of high-quality data, and the latter offers large volumes of mixed-quality data. The algorithmically extracted relationships are often accompanied by supporting evidence, such as confidence scores, source articles, and surrounding contexts (excerpts) from the articles, that can be used as data quality indicators. Tools that can leverage these quality indicators to help the user gain access to larger and high-quality data are needed. APPROACH We introduce GeneDive, a web application for pharmacogenomics researchers and precision medicine practitioners that makes gene, disease, and drug interactions data easily accessible and usable. GeneDive is designed to meet three key objectives: (1) provide functionality to manage the information-overload problem and facilitate easy assimilation of supporting evidence, (2) support longitudinal and exploratory research investigations, and (3) offer integration of user-provided interactions data without requiring data sharing. RESULTS GeneDive offers multiple search modalities, visualizations, and other features that guide the user efficiently to the information of interest. To facilitate exploratory research, GeneDive makes the supporting evidence and context for each interaction readily available and allows the data quality threshold to be controlled by the user as per their risk tolerance level. The interactive search-visualization loop enables relationship discoveries between diseases, genes, and drugs that might not be explicitly described in literature but are emergent from the source medical corpus and deductive reasoning. The ability to use a user's data either in combination with the GeneDive native datasets or in isolation promotes richer data-driven exploration and discovery. These functionalities along with GeneDive's applicability for precision medicine, bringing the knowledge contained in biomedical literature to bear on particular clinical situations and improving patient care, are illustrated through detailed use cases. CONCLUSION GeneDive is a comprehensive, broad-use biological interactions browser. The GeneDive application and information about its underlying system architecture are available at http://www.genedive.net. A GeneDive Docker image is also available for download at this URL, allowing users to (1) import their own interaction data securely and privately; and (2) generate and test hypotheses across their own and other datasets.
|
39
|
Abstract
The global SARS-CoV-2 pandemic has caused a surge in research exploring all aspects of the virus and its effects on human health. The overwhelming rate of publications means that human researchers are unable to keep abreast of the research. To ameliorate this, we present the CoronaCentral resource which uses machine learning to process the research literature on SARS-CoV-2 along with articles on SARS-CoV and MERS-CoV. We break the literature down into useful categories and enable analysis of the contents, pace, and emphasis of research during the crisis. These categories cover therapeutics, forecasting as well as growing areas such as “Long Covid” and studies of inequality and misinformation. Using this data, we compare the topics that appear in original research articles with those in commentaries and other article types. Finally, using Altmetric data, we identify the topics that have gained the most media attention. This resource, available at https://coronacentral.ai, is updated multiple times per day and provides an easy-to-navigate system to find papers in different categories, focussing on different aspects of the virus along with currently trending articles.
|
40
|
Pharmacogenetics at Scale: An Analysis of the UK Biobank. Clin Pharmacol Ther 2020; 109:1528-1537. [PMID: 33237584 DOI: 10.1002/cpt.2122] [Citation(s) in RCA: 64] [Impact Index Per Article: 16.0] [Received: 07/10/2020] [Accepted: 10/22/2020] [Indexed: 01/06/2023]
Abstract
Pharmacogenetics (PGx) studies the influence of genetic variation on drug response. Clinically actionable associations inform guidelines created by the Clinical Pharmacogenetics Implementation Consortium (CPIC), but the broad impact of genetic variation on entire populations is not well understood. We analyzed PGx allele and phenotype frequencies for 487,409 participants in the UK Biobank, the largest PGx study to date. For 14 CPIC pharmacogenes known to influence human drug response, we find that 99.5% of individuals may have an atypical response to at least 1 drug; on average they may have an atypical response to 10.3 drugs. Nearly 24% of participants have been prescribed a drug for which they are predicted to have an atypical response. Non-European populations carry a greater frequency of variants that are predicted to be functionally deleterious; many of these are not captured by current PGx allele definitions. Strategies for detecting and interpreting rare variation will be critical for enabling broad application of pharmacogenetics.
|
41
|
Pharmacogenomic polygenic response score predicts ischaemic events and cardiovascular mortality in clopidogrel-treated patients. Eur Heart J Cardiovasc Pharmacother 2020; 6:203-210. [PMID: 31504375 DOI: 10.1093/ehjcvp/pvz045] [Citation(s) in RCA: 55] [Impact Index Per Article: 13.8] [Received: 07/23/2019] [Revised: 08/15/2019] [Accepted: 08/29/2019] [Indexed: 01/23/2023]
Abstract
AIMS Clopidogrel is prescribed for the prevention of atherothrombotic events. While investigations have identified genetic determinants of inter-individual variability in on-treatment platelet inhibition (e.g. CYP2C19*2), evidence that these variants have clinical utility to predict major adverse cardiovascular events (CVEs) remains controversial. METHODS AND RESULTS We assessed the impact of 31 candidate gene polymorphisms on adenosine diphosphate (ADP)-stimulated platelet reactivity in 3391 clopidogrel-treated coronary artery disease patients of the International Clopidogrel Pharmacogenomics Consortium (ICPC). The influence of these polymorphisms on CVEs was tested in 2134 ICPC patients (N = 129 events) in whom clinical event data were available. Several variants were associated with on-treatment ADP-stimulated platelet reactivity (CYP2C19*2, P = 8.8 × 10⁻⁵⁴; CES1 G143E, P = 1.3 × 10⁻¹⁶; CYP2C19*17, P = 9.5 × 10⁻¹⁰; CYP2B6 1294+53 C>T, P = 3.0 × 10⁻⁴; CYP2B6 516 G>T, P = 1.0 × 10⁻³; CYP2C9*2, P = 1.2 × 10⁻³; and CYP2C9*3, P = 1.5 × 10⁻³). While no individual variant was associated with CVEs, generation of a pharmacogenomic polygenic response score (PgxRS) revealed that patients who carried a greater number of alleles associated with increased on-treatment platelet reactivity were more likely to experience CVEs (β = 0.17, SE 0.06, P = 0.01) and cardiovascular-related death (β = 0.43, SE 0.16, P = 0.007). Patients who carried eight or more risk alleles were significantly more likely to experience CVEs [odds ratio (OR) = 1.78, 95% confidence interval (CI) 1.14-2.76, P = 0.01] and cardiovascular death (OR = 4.39, 95% CI 1.35-14.27, P = 0.01) compared to patients who carried six or fewer of these alleles. CONCLUSION Several polymorphisms impact clopidogrel response and PgxRS is a predictor of cardiovascular outcomes. Additional investigations that identify novel determinants of clopidogrel response and validate polygenic models may facilitate future precision medicine strategies.
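The scoring idea in this abstract (count risk alleles per patient, then compare a high-score group with a low-score group) can be illustrated with a short Python sketch. This is not the authors' code. It assumes an unweighted count of risk-allele dosages, uses the "eight or more" versus "six or fewer" grouping quoted above, and computes a standard Wald confidence interval for an odds ratio from a 2x2 table. The variant names and all counts in the usage note are hypothetical placeholders, not study data.

```python
from math import exp, log, sqrt


def risk_allele_count(genotype, risk_variants):
    """Sum risk-allele dosages (0, 1, or 2 per variant) over the listed variants.

    `genotype` maps a variant name to its dosage; variants missing from the
    mapping are treated as dosage 0 (an assumption made for this sketch).
    """
    return sum(genotype.get(variant, 0) for variant in risk_variants)


def score_group(score, high_cutoff=8, low_cutoff=6):
    """Assign the groups compared in the abstract: >= 8 versus <= 6 risk alleles.

    Scores strictly between the two cutoffs belong to neither group.
    """
    if score >= high_cutoff:
        return "high"
    if score <= low_cutoff:
        return "low"
    return None


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table with non-zero cells.

    a: high-score patients with an event,  b: high-score without an event,
    c: low-score patients with an event,   d: low-score without an event.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper
```

For example, `risk_allele_count({"var_a": 2, "var_b": 1}, ["var_a", "var_b", "var_c"])` gives a score of 3, and `odds_ratio_with_ci(20, 80, 10, 90)` gives an odds ratio of 2.25 for that toy table. The published analysis may have used regression models with covariates rather than a raw 2x2 table, so the sketch only shows the arithmetic behind the reported quantities.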
|
42
|
Transfer learning enables prediction of CYP2D6 haplotype function. PLoS Comput Biol 2020; 16:e1008399. [PMID: 33137098 PMCID: PMC7660895 DOI: 10.1371/journal.pcbi.1008399] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 03/01/2020] [Revised: 11/12/2020] [Accepted: 09/24/2020] [Indexed: 12/31/2022] Open
Abstract
Cytochrome P450 2D6 (CYP2D6) is a highly polymorphic gene whose protein product metabolizes more than 20% of clinically used drugs. Genetic variations in CYP2D6 are responsible for interindividual heterogeneity in drug response that can lead to drug toxicity and ineffective treatment, making CYP2D6 one of the most important pharmacogenes. Prediction of CYP2D6 phenotype relies on curation of literature-derived functional studies to assign a functional status to CYP2D6 haplotypes. As the number of large-scale sequencing efforts grows, new haplotypes continue to be discovered, and assignment of function is challenging to maintain. To address this challenge, we have trained a convolutional neural network to predict functional status of CYP2D6 haplotypes, called Hubble.2D6. Hubble.2D6 predicts haplotype function from sequence data and was trained using two pre-training steps with a combination of real and simulated data. We find that Hubble.2D6 predicts CYP2D6 haplotype functional status with 88% accuracy in a held-out test set and explains 47.5% of the variance in in vitro functional data among star alleles with unknown function. Hubble.2D6 may be a useful tool for assigning function to haplotypes with uncurated function and for screening individuals who are at risk of being poor metabolizers.
|
43
|
PharmGKB Tutorial for Pharmacogenomics of Drugs Potentially Used in the Context of COVID-19. Clin Pharmacol Ther 2020; 109:116-122. [PMID: 32978778 PMCID: PMC7537078 DOI: 10.1002/cpt.2067] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 07/02/2020] [Accepted: 09/17/2020] [Indexed: 12/03/2022]
Abstract
Pharmacogenomics (PGx) is a key area of precision medicine, which is already being implemented in some health systems and may help guide clinicians toward effective therapies for individual patients. Over the last 2 decades, the Pharmacogenomics Knowledgebase (PharmGKB) has built a unique repository of PGx knowledge, including annotations of clinical guideline and regulator‐approved drug labels in addition to evidence‐based drug pathways and annotations of the scientific literature. All of this knowledge is freely accessible on the PharmGKB website. In the first of a series of PharmGKB tutorials, we introduce the PharmGKB coronavirus disease 2019 (COVID‐19) portal and, using examples of drugs found in the portal, demonstrate some of the main features of PharmGKB. This paper is intended as a resource to help users become quickly acquainted with the wealth of information stored in PharmGKB.
|
44
|
Sex-specific genetic effects across biomarkers. Eur J Hum Genet 2020; 29:154-163. [PMID: 32873964 DOI: 10.1038/s41431-020-00712-w] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Received: 02/18/2020] [Revised: 07/28/2020] [Accepted: 08/04/2020] [Indexed: 11/09/2022] Open
Abstract
Sex differences have been shown in laboratory biomarkers; however, the extent to which this is due to genetics is unknown. In this study, we infer sex-specific genetic parameters (heritability and genetic correlation) across 33 quantitative biomarker traits in 181,064 females and 156,135 males from the UK Biobank study. We apply a Bayesian Mixture Model, Sex Effects Mixture Model (SEMM), to Genome-wide Association Study summary statistics in order to (1) estimate the contributions of sex to the genetic variance of these biomarkers and (2) identify variants whose statistical association with these traits is sex-specific. We find that the genetics of most biomarker traits are shared between males and females, with the notable exception of testosterone, where we identify 119 female and 445 male-specific variants. These include protein-altering variants in steroid hormone production genes (POR, UGT2B7). Using the sex-specific variants as genetic instruments for Mendelian randomization, we find evidence for causal links between testosterone levels and height, body mass index, waist and hip circumference, and type 2 diabetes. We also show that sex-specific polygenic risk score models for testosterone outperform a combined model. Overall, these results demonstrate that while sex has a limited role in the genetics of most biomarker traits, sex plays an important role in testosterone genetics.
|
45
|
Pharmacogenomics in Asian Subpopulations and Impacts on Commonly Prescribed Medications. Clin Transl Sci 2020; 13:861-870. [PMID: 32100936 PMCID: PMC7485947 DOI: 10.1111/cts.12771] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 11/13/2019] [Accepted: 01/07/2020] [Indexed: 12/17/2022] Open
Abstract
Asians as a group comprise > 60% of the world's population. There is an incredible amount of diversity in Asian and admixed populations that has not been addressed in a pharmacogenetic context. The known pharmacogenetic differences in Asian subgroups generally represent previously known variants that are present at much lower or higher frequencies in Asians compared with other populations. In this review we summarize the main drugs and known genes that appear to have differences in their pharmacogenetic properties in certain Asian populations. Evidence-based guidelines and summary statistics from the US Food and Drug Administration and the Clinical Pharmacogenetics Implementation Consortium were analyzed for ethnic differences in outcomes. Implicated drugs included commonly prescribed drugs such as warfarin, clopidogrel, carbamazepine, and allopurinol. The majority of these associations are due to Asians more commonly being poor metabolizers of cytochrome P450 (CYP) 2C19 and carriers of the human leukocyte antigen (HLA)-B*15:02 allele. The relative risk increase was shown to vary between genes and drugs, but could be > 100-fold higher in Asians. Specifically, there was a 172-fold increased risk of Stevens-Johnson syndrome and toxic epidermal necrolysis with carbamazepine use among HLA-B*15:02 carriers. The effects ranged from relatively benign reactions such as reduced drug efficacy to severe cutaneous reactions. These reactions are severe and prevalent enough to warrant pharmacogenetic testing and appropriate changes in dose and medication choice for at-risk populations. Further studies should be done on Asian cohorts to more fully understand pharmacogenetic variants in these populations and to clarify how such differences may influence drug response.
|
46
|
Genomewide Association Study of Platelet Reactivity and Cardiovascular Response in Patients Treated With Clopidogrel: A Study by the International Clopidogrel Pharmacogenomics Consortium. Clin Pharmacol Ther 2020; 108:1067-1077. [PMID: 32472697 PMCID: PMC7689744 DOI: 10.1002/cpt.1911] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 02/25/2020] [Accepted: 05/08/2020] [Indexed: 01/07/2023]
Abstract
Antiplatelet response to clopidogrel shows wide variation, and poor response is correlated with adverse clinical outcomes. CYP2C19 loss‐of‐function alleles play an important role in this response, but account for only a small proportion of variability in response to clopidogrel. An aim of the International Clopidogrel Pharmacogenomics Consortium (ICPC) is to identify other genetic determinants of clopidogrel pharmacodynamics and clinical response. A genomewide association study (GWAS) was performed using DNA from 2,750 European ancestry individuals, using adenosine diphosphate‐induced platelet reactivity and major cardiovascular and cerebrovascular events as outcome parameters. GWAS for platelet reactivity revealed a strong signal for CYP2C19*2 (P value = 1.67e−33). After correction for CYP2C19*2 no other single‐nucleotide polymorphism reached genomewide significance. GWAS for a combined clinical end point of cardiovascular death, myocardial infarction, or stroke (5.0% event rate), or a combined end point of cardiovascular death or myocardial infarction (4.7% event rate) showed no significant results, although in coronary artery disease, percutaneous coronary intervention, and acute coronary syndrome subgroups, mutations in SCOS5P1, CDC42BPA, and CTRAC1 showed genomewide significance (lowest P values: 1.07e−09, 4.53e−08, and 2.60e−10, respectively). CYP2C19*2 is the strongest genetic determinant of on‐clopidogrel platelet reactivity. We identified three novel associations in clinical outcome subgroups, suggestive for each of these outcomes.
|
47
|
Abstract
Gene sets, including protein complexes and signaling pathways, have proliferated greatly, in large part as a result of high-throughput biological data. Leveraging gene sets to gain insight into biological discovery requires computational methods for converting them into a useful form for available machine learning models. Here, we study the problem of embedding gene sets as compact features that are compatible with available machine learning codes. We present Set2Gaussian, a novel network-based gene set embedding approach, which represents each gene set as a multivariate Gaussian distribution rather than a single point in the low-dimensional space, according to the proximity of these genes in a protein-protein interaction network. We demonstrate that Set2Gaussian improves gene set member identification, accurately stratifies tumors, and finds concise gene sets for gene set enrichment analysis. We further show how Set2Gaussian allows us to identify a previously unknown clinical prognostic and predictive subnetwork around NEFM in sarcoma, which we validate in independent cohorts.
Collapse
|
48
|
High precision protein functional site detection using 3D convolutional neural networks. Bioinformatics 2020; 35:1503-1512. [PMID: 31051039 PMCID: PMC6499237 DOI: 10.1093/bioinformatics/bty813] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Received: 01/31/2018] [Revised: 08/14/2018] [Accepted: 09/19/2018] [Indexed: 12/02/2022] Open
Abstract
Motivation Accurate annotation of protein functions is fundamental for understanding molecular and cellular physiology. Data-driven methods hold promise for systematically deriving rules underlying the relationship between protein structure and function. However, the choice of protein structural representation is critical. Pre-defined biochemical features emphasize certain aspects of protein properties while ignoring others, and therefore may fail to capture critical information in complex protein sites. Results In this paper, we present a general framework that applies 3D convolutional neural networks (3DCNNs) to structure-based protein functional site detection. The framework can extract task-dependent features automatically from the raw atom distributions. We benchmarked our method against other methods and demonstrate better or comparable performance for site detection. Our deep 3DCNNs achieved an average recall of 0.955 at a precision threshold of 0.99 on PROSITE families, detected 98.89 and 92.88% of nitric oxide synthase and TRYPSIN-like enzyme sites in Catalytic Site Atlas, and showed good performance on challenging cases where sequence motifs are absent but a function is known to exist. Finally, we inspected the individual contributions of each atom to the classification decisions and show that our models successfully recapitulate known 3D features within protein functional sites. Availability and implementation The 3DCNN models described in this paper are available at https://simtk.org/projects/fscnn. Supplementary information Supplementary data are available at Bioinformatics online.
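The headline metric in this abstract, recall at a precision threshold of 0.99, can be made concrete with a small Python sketch. The abstract does not describe the authors' exact evaluation procedure, so the sketch assumes one common definition: rank predictions by score, consider every score cutoff, and report the highest recall among cutoffs whose precision meets the threshold. The function name and the example scores in the usage note are hypothetical.

```python
def recall_at_precision(scores, labels, min_precision=0.99):
    """Highest recall over score cutoffs whose precision is >= min_precision.

    `scores` are classifier outputs (higher means more likely a functional
    site) and `labels` are 1 for true sites, 0 otherwise. Returns 0.0 when
    no cutoff reaches the requested precision.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("need at least one positive label")

    # Walk down the ranking from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best_recall = 0.0
    true_pos = 0
    false_pos = 0
    i = 0
    while i < len(ranked):
        cutoff = ranked[i][0]
        # Examples tied at the same score are accepted together.
        while i < len(ranked) and ranked[i][0] == cutoff:
            if ranked[i][1]:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        precision = true_pos / (true_pos + false_pos)
        if precision >= min_precision:
            best_recall = max(best_recall, true_pos / total_positives)
    return best_recall
```

With toy inputs `scores = [0.9, 0.8, 0.7, 0.6]` and `labels = [1, 1, 0, 1]`, only the top two cutoffs keep precision at 1.0, so the recall at a 0.99 precision threshold is 2/3. Averaging this value across protein families would give a figure comparable in spirit to the reported average, under the stated assumption.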
Collapse
|
49
|
Classifying non-small cell lung cancer types and transcriptomic subtypes using convolutional neural networks. J Am Med Inform Assoc 2020; 27:757-769. [PMID: 32364237 PMCID: PMC7309263 DOI: 10.1093/jamia/ocz230] [Citation(s) in RCA: 54] [Impact Index Per Article: 13.5] [Received: 08/27/2019] [Revised: 11/22/2019] [Accepted: 03/05/2020] [Indexed: 12/26/2022] Open
Abstract
OBJECTIVE Non-small cell lung cancer is a leading cause of cancer death worldwide, and histopathological evaluation plays the primary role in its diagnosis. However, the morphological patterns associated with the molecular subtypes have not been systematically studied. To bridge this gap, we developed a quantitative histopathology analytic framework to identify the types and gene expression subtypes of non-small cell lung cancer objectively. MATERIALS AND METHODS We processed whole-slide histopathology images of lung adenocarcinoma (n = 427) and lung squamous cell carcinoma patients (n = 457) in the Cancer Genome Atlas. We built convolutional neural networks to classify histopathology images, evaluated their performance by the areas under the receiver-operating characteristic curves (AUCs), and validated the results in an independent cohort (n = 125). RESULTS To establish neural networks for quantitative image analyses, we first built convolutional neural network models to identify tumor regions from adjacent dense benign tissues (AUCs > 0.935) and recapitulated expert pathologists' diagnosis (AUCs > 0.877), with the results validated in an independent cohort (AUCs = 0.726-0.864). We further demonstrated that quantitative histopathology morphology features identified the major transcriptomic subtypes of both adenocarcinoma and squamous cell carcinoma (P < .01). DISCUSSION Our study is the first to classify the transcriptomic subtypes of non-small cell lung cancer using fully automated machine learning methods. Our approach does not rely on prior pathology knowledge and can discover novel clinically relevant histopathology patterns objectively. The developed procedure is generalizable to other tumor types or diseases.
Collapse
|
50
|
Examining the Use of Real-World Evidence in the Regulatory Process. Clin Pharmacol Ther 2020; 107:843-852. [PMID: 31562770 PMCID: PMC7093234 DOI: 10.1002/cpt.1658] [Citation(s) in RCA: 76] [Impact Index Per Article: 19.0] [Received: 08/20/2019] [Accepted: 09/17/2019] [Indexed: 12/12/2022]
Abstract
The 21st Century Cures Act passed by the United States Congress mandates the US Food and Drug Administration to develop guidance to evaluate the use of real-world evidence (RWE) to support the regulatory process. RWE has generated important medical discoveries, especially in areas where traditional clinical trials would be unethical or infeasible. However, RWE suffers from several issues that hinder its ability to provide proof of treatment efficacy at a level comparable to randomized controlled trials. In this review article, we summarized the advantages and limitations of RWE, identified the key opportunities for RWE, and pointed the way forward to maximize the potential of RWE for regulatory purposes.
Collapse
|