1
|
Gargano MA, Matentzoglu N, Coleman B, Addo-Lartey EB, Anagnostopoulos A, Anderton J, Avillach P, Bagley AM, Bakštein E, Balhoff JP, Baynam G, Bello SM, Berk M, Bertram H, Bishop S, Blau H, Bodenstein DF, Botas P, Boztug K, Čady J, Callahan TJ, Cameron R, Carbon S, Castellanos F, Caufield JH, Chan LE, Chute C, Cruz-Rojo J, Dahan-Oliel N, Davids JR, de Dieuleveult M, de Souza V, de Vries BBA, de Vries E, DePaulo JR, Derfalvi B, Dhombres F, Diaz-Byrd C, Dingemans AJM, Donadille B, Duyzend M, Elfeky R, Essaid S, Fabrizzi C, Fico G, Firth HV, Freudenberg-Hua Y, Fullerton JM, Gabriel DL, Gilmour K, Giordano J, Goes FS, Moses RG, Green I, Griese M, Groza T, Gu W, Guthrie J, Gyori B, Hamosh A, Hanauer M, Hanušová K, He Y(O, Hegde H, Helbig I, Holasová K, Hoyt CT, Huang S, Hurwitz E, Jacobsen JOB, Jiang X, Joseph L, Keramatian K, King B, Knoflach K, Koolen DA, Kraus M, Kroll C, Kusters M, Ladewig MS, Lagorce D, Lai MC, Lapunzina P, Laraway B, Lewis-Smith D, Li X, Lucano C, Majd M, Marazita ML, Martinez-Glez V, McHenry TH, McInnis MG, McMurry JA, Mihulová M, Millett CE, Mitchell PB, Moslerová V, Narutomi K, Nematollahi S, Nevado J, Nierenberg AA, Čajbiková NN, Nurnberger JI, Ogishima S, Olson D, Ortiz A, Pachajoa H, Perez de Nanclares G, Peters A, Putman T, Rapp CK, Rath A, Reese J, Rekerle L, Roberts A, Roy S, Sanders SJ, Schuetz C, Schulte EC, Schulze TG, Schwarz M, Scott K, Seelow D, Seitz B, Shen Y, Similuk MN, Simon ES, Singh B, Smedley D, Smith CL, Smolinsky JT, Sperry S, Stafford E, Stefancsik R, Steinhaus R, Strawbridge R, Sundaramurthi JC, Talapova P, Tenorio Castano JA, Tesner P, Thomas RH, Thurm A, Turnovec M, van Gijn ME, Vasilevsky NA, Vlčková M, Walden A, Wang K, Wapner R, Ware JS, Wiafe AA, Wiafe SA, Wiggins LD, Williams AE, Wu C, Wyrwoll MJ, Xiong H, Yalin N, Yamamoto Y, Yatham LN, Yocum AK, Young AH, Yüksel Z, Zandi PP, Zankl A, Zarante I, Zvolský M, Toro S, Carmody LC, Harris NL, Munoz-Torres MC, Danis D, Mungall CJ, Köhler S, Haendel MA, Robinson PN. 
The Human Phenotype Ontology in 2024: phenotypes around the world. Nucleic Acids Res 2024; 52:D1333-D1346. [PMID: 37953324 PMCID: PMC10767975 DOI: 10.1093/nar/gkad1005] [Citation(s) in RCA: 75] [Impact Index Per Article: 75.0] [Received: 09/13/2023] [Revised: 10/12/2023] [Accepted: 10/19/2023] [Indexed: 11/14/2023]
Abstract
The Human Phenotype Ontology (HPO) is a widely used resource that comprehensively organizes and defines the phenotypic features of human disease, enabling computational inference and supporting genomic and phenotypic analyses through semantic similarity and machine learning algorithms. The HPO has widespread applications in clinical diagnostics and translational research, including genomic diagnostics, gene-disease discovery, and cohort analytics. In recent years, groups around the world have developed translations of the HPO from English to other languages, and the HPO browser has been internationalized, allowing users to view HPO term labels and in many cases synonyms and definitions in ten languages in addition to English. Since our last report, a total of 2239 new HPO terms and 49235 new HPO annotations were developed, many in collaboration with external groups in the fields of psychiatry, arthrogryposis, immunology and cardiology. The Medical Action Ontology (MAxO) is a new effort to model treatments and other measures taken for clinical management. Finally, the HPO consortium is contributing to efforts to integrate the HPO and the GA4GH Phenopacket Schema into electronic health records (EHRs) with the goal of more standardized and computable integration of rare disease data in EHRs.
|
research-article |
1 |
75 |
2
|
Peng Y, Tang Y, Lee S, Zhu Y, Summers RM, Lu Z. COVID-19-CT-CXR: A Freely Accessible and Weakly Labeled Chest X-Ray and CT Image Collection on COVID-19 From Biomedical Literature. IEEE TRANSACTIONS ON BIG DATA 2021; 7:3-12. [PMID: 33997112 PMCID: PMC8117951 DOI: 10.1109/tbdata.2020.3035935] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 06/29/2020] [Revised: 10/09/2020] [Accepted: 10/19/2020] [Indexed: 05/06/2023]
Abstract
The latest threat to global health is the COVID-19 outbreak. Although there exist large datasets of chest X-rays (CXR) and computed tomography (CT) scans, few COVID-19 image collections are currently available due to patient privacy. At the same time, there is a rapid growth of COVID-19-relevant articles in the biomedical literature, including those that report findings on radiographs. Here, we present COVID-19-CT-CXR, a public database of COVID-19 CXR and CT images, which are automatically extracted from COVID-19-relevant articles from the PubMed Central Open Access (PMC-OA) Subset. We extracted figures, associated captions, and relevant figure descriptions in the article and separated compound figures into subfigures. Because a large portion of figures in COVID-19 articles are not CXR or CT, we designed a deep-learning model to distinguish them from other figure types and to classify them accordingly. The final database includes 1,327 CT and 263 CXR images (as of May 9, 2020) with their relevant text. To demonstrate the utility of COVID-19-CT-CXR, we conducted four case studies. (1) We show that COVID-19-CT-CXR, when used as additional training data, is able to contribute to improved deep-learning (DL) performance for the classification of COVID-19 and non-COVID-19 CT. (2) We collected CT images of influenza, another common infectious respiratory illness that may present similarly to COVID-19, and fine-tuned a baseline deep neural network to distinguish a diagnosis of COVID-19, influenza, or normal or other types of diseases on CT. (3) We fine-tuned an unsupervised one-class classifier from non-COVID-19 CXR and performed anomaly detection to detect COVID-19 CXR. (4) From text-mined captions and figure descriptions, we compared 15 clinical symptoms and 20 clinical findings of COVID-19 versus those of influenza to demonstrate the disease differences in the scientific publications. 
Our database is unique, as the figures are retrieved along with relevant text with fine-grained descriptions, and it can be extended easily in the future. We believe that our work is complementary to existing resources and hope that it will contribute to medical image analysis of the COVID-19 pandemic. The dataset, code, and DL models are publicly available at https://github.com/ncbi-nlp/COVID-19-CT-CXR.
|
research-article |
4 |
25 |
3
|
Richman I, Tessier-Sherman B, Galusha D, Oladele CR, Wang K. Breast cancer screening during the COVID-19 pandemic: moving from disparities to health equity. J Natl Cancer Inst 2023; 115:139-145. [PMID: 36069622 PMCID: PMC9494402 DOI: 10.1093/jnci/djac172] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 05/25/2022] [Revised: 08/02/2022] [Accepted: 08/29/2022] [Indexed: 11/19/2022]
Abstract
The COVID-19 pandemic created unprecedented disruptions to routine health care in the United States. Screening mammography, a cornerstone of breast cancer control and prevention, was completely halted in the spring of 2020, and screening programs have continued to face challenges with subsequent COVID-19 waves. Although screening mammography rates decreased for all women during the pandemic, a number of studies have now clearly documented that reductions in screening have been greater for some populations than others. Specifically, minoritized women have been screened at lower rates than White women across studies, although the specific patterns of disparity vary depending on the populations and communities studied. We posit that these disparities are likely due to a variety of structural and contextual factors, including the differential impact of COVID-19 on communities. We also outline key considerations for closing gaps in screening mammography. First, practices, health systems, and communities must measure screening mammography use to identify whether gaps exist and which populations are most affected. Second, we propose that strategies to close disparities in breast cancer screening must be multifaceted, targeting the health system or practice, but also structural factors at the policy level. Health disparities arise from a complex set of conditions, and multimodal solutions that address the complex, multifactorial conditions that lead to disparities may be more likely to succeed and are necessary for promoting health equity.
|
Research Support, N.I.H., Extramural |
2 |
16 |
4
|
Rodriguez VA, Bhave S, Chen R, Pang C, Hripcsak G, Sengupta S, Elhadad N, Green R, Adelman J, Metitiri KS, Elias P, Groves H, Mohan S, Natarajan K, Perotte A. Development and validation of prediction models for mechanical ventilation, renal replacement therapy, and readmission in COVID-19 patients. J Am Med Inform Assoc 2021; 28:1480-1488. [PMID: 33706377 PMCID: PMC7989331 DOI: 10.1093/jamia/ocab029] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 09/20/2020] [Revised: 01/09/2021] [Accepted: 02/05/2021] [Indexed: 12/28/2022]
Abstract
OBJECTIVE Coronavirus disease 2019 (COVID-19) patients are at risk for resource-intensive outcomes including mechanical ventilation (MV), renal replacement therapy (RRT), and readmission. Accurate outcome prognostication could facilitate hospital resource allocation. We develop and validate predictive models for each outcome using retrospective electronic health record data for COVID-19 patients treated between March 2 and May 6, 2020. MATERIALS AND METHODS For each outcome, we trained 3 classes of prediction models using clinical data for a cohort of SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2)-positive patients (n = 2256). Cross-validation was used to select the best-performing models per the areas under the receiver-operating characteristic and precision-recall curves. Models were validated using a held-out cohort (n = 855). We measured each model's calibration and evaluated feature importances to interpret model output. RESULTS The predictive performance for our selected models on the held-out cohort was as follows: area under the receiver-operating characteristic curve: MV 0.743 (95% CI, 0.682-0.812), RRT 0.847 (95% CI, 0.772-0.936), readmission 0.871 (95% CI, 0.830-0.917); area under the precision-recall curve: MV 0.137 (95% CI, 0.047-0.175), RRT 0.325 (95% CI, 0.117-0.497), readmission 0.504 (95% CI, 0.388-0.604). Predictions were well calibrated, and the most important features within each model were consistent with clinical intuition. DISCUSSION Our models produce performant, well-calibrated, and interpretable predictions for COVID-19 patients at risk for the target outcomes. They demonstrate the potential to accurately estimate outcome prognosis in resource-constrained care sites managing COVID-19 patients. CONCLUSIONS We develop and validate prognostic models targeting MV, RRT, and readmission for hospitalized COVID-19 patients which produce accurate, interpretable predictions. Additional external validation studies are needed to further verify the generalizability of our results.
|
Research Support, N.I.H., Extramural |
4 |
16 |
5
|
Merle JL, Li D, Keiser B, Zamantakis A, Queiroz A, Gallo CG, Villamar JA, McKay V, Zapata JP, Mustanski B, Benbow N, Smith JD. Categorising implementation determinants and strategies within the US HIV implementation literature: a systematic review protocol. BMJ Open 2023; 13:e070216. [PMID: 36927593 PMCID: PMC10030793 DOI: 10.1136/bmjopen-2022-070216] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 11/17/2022] [Accepted: 03/06/2023] [Indexed: 03/18/2023]
Abstract
INTRODUCTION Despite decreased rates of new infections, HIV/AIDS continues to impact certain US populations. In order to achieve the goals laid out in the Ending the HIV Epidemic (EHE) in the US initiative, implementation science is needed to expand the sustained use of effective prevention and treatment interventions, particularly among priority populations at risk for and living with HIV/AIDS. Over 200 HIV-related implementation studies have been funded by the US National Institutes of Health. Therefore, a comprehensive review of the literature identifying implementation determinants (barriers and facilitators) and categorising implementation strategies across the continuum of HIV prevention and care in the USA is appropriate and needed to enhance current knowledge and help achieve the goals laid out in the EHE national strategic plan. METHODS AND ANALYSIS This systematic review protocol follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines. Between November 2020 and January 2022, a broad database search strategy of Ovid MEDLINE, PsycINFO and Web of Science was conducted to capture implementation-related studies along the HIV prevention and care continuum. Articles were eligible for inclusion if they were: conducted in the USA, published after the year 2000, written in English, related to HIV/AIDS, focused on outcomes related to dissemination and implementation (ie, did not test/evaluate/explore implementation determinants or strategies) and were behavioural studies (ie, not basic science). We plan to conduct three systematic reviews to identify and categorise determinants and strategies associated with three HIV focus areas: pre-exposure prophylaxis, testing/diagnosing and linkage to care, and treatment. Determinants will be coded according to an adapted Consolidated Framework for Implementation Research 2.0. Implementation strategies and outcomes will be categorised in accordance with existing taxonomies and frameworks. 
ETHICS AND DISSEMINATION Ethics approval is not applicable. No original data will be collected. Results will be disseminated through peer-reviewed publications, conference presentations and via online tools. PROSPERO REGISTRATION NUMBER CRD42021233089.
|
Research Support, N.I.H., Extramural |
2 |
9 |
6
|
Elworth RAL, Wang Q, Kota PK, Barberan CJ, Coleman B, Balaji A, Gupta G, Baraniuk RG, Shrivastava A, Treangen T. To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics. Nucleic Acids Res 2020; 48:5217-5234. [PMID: 32338745 PMCID: PMC7261164 DOI: 10.1093/nar/gkaa265] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 12/18/2019] [Revised: 03/20/2020] [Accepted: 04/04/2020] [Indexed: 02/01/2023]
Abstract
As computational biologists continue to be inundated by ever increasing amounts of metagenomic data, the need for data analysis approaches that keep up with the pace of sequence archives has remained a challenge. In recent years, the accelerated pace of genomic data availability has been accompanied by the application of a wide array of highly efficient approaches from other fields to the field of metagenomics. For instance, sketching algorithms such as MinHash have seen a rapid and widespread adoption. These techniques handle increasingly large datasets with minimal sacrifices in quality for tasks such as sequence similarity calculations. Here, we briefly review the fundamentals of the most impactful probabilistic and signal processing algorithms. We also highlight more recent advances to augment previous reviews in these areas that have taken a broader approach. We then explore the application of these techniques to metagenomics, discuss their pros and cons, and speculate on their future directions.
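The abstract above names MinHash as a sketching algorithm used for sequence similarity. As a rough illustration only, and not code from the reviewed article, the following Python sketch estimates the Jaccard similarity of two k-mer sets from fixed-size signatures. The function names, the choice of BLAKE2b as the seeded hash, and the default of 128 hash functions are all assumptions made for this example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _seeded_hash(item, seed):
    """Deterministic 64-bit hash of item; the seed selects the hash function."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over the set."""
    return [min(_seeded_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison on small inputs."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Hypothetical toy sequences, chosen only to exercise the functions.
    a = kmers("ACGTACGTGACCTGAACGT", 4)
    b = kmers("ACGTACGTGACCTGATTGC", 4)
    est = estimate_jaccard(minhash_signature(a), minhash_signature(b))
    print(f"estimated={est:.3f} exact={exact_jaccard(a, b):.3f}")
```

The signature length stays fixed regardless of how many k-mers a sequence contains, which is what makes this kind of sketch attractive for large datasets; the cost is that the similarity is an estimate whose error shrinks as the number of hash functions grows.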
|
Research Support, N.I.H., Extramural |
5 |
8 |
7
|
Huang Y, Liu Y, Steel PAD, Axsom KM, Lee JR, Tummalapalli SL, Wang F, Pathak J, Subramanian L, Zhang Y. Deep significance clustering: a novel approach for identifying risk-stratified and predictive patient subgroups. J Am Med Inform Assoc 2021; 28:2641-2653. [PMID: 34571540 PMCID: PMC8500061 DOI: 10.1093/jamia/ocab203] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 05/27/2021] [Revised: 08/04/2021] [Accepted: 09/02/2021] [Indexed: 12/13/2022]
Abstract
OBJECTIVE Deep significance clustering (DICE) is a self-supervised learning framework. DICE identifies clinically similar and risk-stratified subgroups that neither unsupervised clustering algorithms nor supervised risk prediction algorithms alone are guaranteed to generate. MATERIALS AND METHODS Enabled by an optimization process that enforces statistical significance between the outcome and subgroup membership, DICE jointly trains 3 components, representation learning, clustering, and outcome prediction while providing interpretability to the deep representations. DICE also allows unseen patients to be predicted into trained subgroups for population-level risk stratification. We evaluated DICE using electronic health record datasets derived from 2 urban hospitals. Outcomes and patient cohorts used include discharge disposition to home among heart failure (HF) patients and acute kidney injury among COVID-19 (Cov-AKI) patients, respectively. RESULTS Compared to baseline approaches including principal component analysis, DICE demonstrated superior performance in the cluster purity metrics: Silhouette score (0.48 for HF, 0.51 for Cov-AKI), Calinski-Harabasz index (212 for HF, 254 for Cov-AKI), and Davies-Bouldin index (0.86 for HF, 0.66 for Cov-AKI), and prediction metric: area under the Receiver operating characteristic (ROC) curve (0.83 for HF, 0.78 for Cov-AKI). Clinical evaluation of DICE-generated subgroups revealed more meaningful distributions of member characteristics across subgroups, and higher risk ratios between subgroups. Furthermore, DICE-generated subgroup membership alone was moderately predictive of outcomes. DISCUSSION DICE addresses a gap in current machine learning approaches where predicted risk may not lead directly to actionable clinical steps. CONCLUSION DICE demonstrated the potential to apply in heterogeneous populations, where having the same quantitative risk does not equate with having a similar clinical profile.
|
Research Support, N.I.H., Extramural |
4 |
8 |
8
|
Zhan Z, Jing Z, He B, Hosseini N, Westerhoff M, Choi EY, Garmire LX. Two-stage Cox-nnet: biologically interpretable neural-network model for prognosis prediction and its application in liver cancer survival using histopathology and transcriptomic data. NAR Genom Bioinform 2021; 3:lqab015. [PMID: 33778491 PMCID: PMC7985035 DOI: 10.1093/nargab/lqab015] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 07/31/2020] [Revised: 02/01/2021] [Accepted: 02/24/2021] [Indexed: 12/11/2022]
Abstract
Pathological images are easily accessible data with the potential of prognostic biomarkers. Moreover, integration of heterogeneous data types from multi-modality, such as pathological image and gene expression data, is invaluable to help predict cancer patient survival. However, the analytical challenges are significant. Here, we take the hepatocellular carcinoma (HCC) pathological image features extracted by CellProfiler, and apply them as the input for Cox-nnet, a neural network-based prognosis prediction model. We compare this model with the conventional Cox proportional hazards (Cox-PH) model, CoxBoost, Random Survival Forests and DeepSurv, using C-index and log-rank P-values. The results show that Cox-nnet is significantly more accurate than Cox-PH and Random Survival Forests models and comparable with CoxBoost and DeepSurv models, on pathological image features. Further, to integrate pathological image and gene expression data of the same patients, we innovatively construct a two-stage Cox-nnet model, and compare it with another complex neural-network model called PAGE-Net. The two-stage Cox-nnet complex model combining histopathology image and transcriptomic RNA-seq data achieves much better prognosis prediction, with a median C-index of 0.75 and log-rank P-value of 6e-7 in the testing datasets, compared to PAGE-Net (median C-index of 0.68 and log-rank P-value of 0.03). Imaging features present additional predictive information to gene expression features, as the combined model is more accurate than the model with gene expression alone (median C-index 0.70). Pathological image features are correlated with gene expression, as genes correlated to top imaging features present known associations with HCC patient survival and morphogenesis of liver tissue. This work proposes two-stage Cox-nnet, a new class of biologically relevant and interpretable models, to integrate multiple types of heterogeneous data for survival prediction.
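The abstract above compares survival models by C-index. For readers unfamiliar with the metric, here is a minimal Python sketch of Harrell's concordance index for right-censored data. It is a plain illustration under stated assumptions, not the authors' evaluation code: the function name, the argument layout, and the convention of counting tied risk scores as half-concordant are choices made for this example.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    times:       observed follow-up time per subject
    events:      1 if the event was observed, 0 if the subject was censored
    risk_scores: model output, where a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's observed time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # earlier event received higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Hypothetical toy data: four subjects, the third one censored.
    times = [2, 4, 6, 8]
    events = [1, 1, 0, 1]
    print(concordance_index(times, events, [0.9, 0.7, 0.5, 0.1]))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of comparable pairs, which is the scale on which the median C-index values quoted in the abstract are reported.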
|
research-article |
4 |
7 |
9
|
Hu G, Liu L, Xu D. On the Responsible Use of Chatbots in Bioinformatics. GENOMICS, PROTEOMICS & BIOINFORMATICS 2024; 22:qzae002. [PMID: 38862428 PMCID: PMC11104453 DOI: 10.1093/gpbjnl/qzae002] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 07/01/2023] [Revised: 11/08/2023] [Accepted: 11/14/2023] [Indexed: 06/13/2024]
|
Research Support, N.I.H., Extramural |
1 |
3 |
10
|
Delos Santos NP, Duttke S, Heinz S, Benner C. MEPP: more transparent motif enrichment by profiling positional correlations. NAR Genom Bioinform 2022; 4:lqac075. [PMID: 36267125 PMCID: PMC9575187 DOI: 10.1093/nargab/lqac075] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/06/2022] [Revised: 08/18/2022] [Accepted: 09/23/2022] [Indexed: 11/11/2022]
Abstract
Score-based motif enrichment analysis (MEA) is typically applied to regulatory DNA to infer transcription factors (TFs) that may modulate transcription and chromatin state in different conditions. Most MEA methods determine motif enrichment independent of motif position within a sequence, even when those sequences harbor anchor points that motifs and their bound TFs may functionally interact with in a distance-dependent fashion, such as other TF binding motifs, transcription start sites (TSS), sequencing assay cleavage sites, or other biologically meaningful features. We developed motif enrichment positional profiling (MEPP), a novel MEA method that outputs a positional enrichment profile of a given TF's binding motif relative to key anchor points (e.g. transcription start sites, or other motifs) within the analyzed sequences while accounting for lower-order nucleotide bias. Using transcription initiation and TF binding as test cases, we demonstrate MEPP's utility in determining the sequence positions where motif presence correlates with measures of biological activity, inferring positional dependencies of binding site function. We demonstrate how MEPP can be applied to interpretation and hypothesis generation from experiments that quantify transcription initiation, chromatin structure, or TF binding measurements. MEPP is available for download from https://github.com/npdeloss/mepp.
|
research-article |
3 |
3 |
11
|
Cooley NP, Wright ES. Accurate annotation of protein coding sequences with IDTAXA. NAR Genom Bioinform 2021; 3:lqab080. [PMID: 34541527 PMCID: PMC8445202 DOI: 10.1093/nargab/lqab080] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/17/2021] [Revised: 07/07/2021] [Accepted: 08/25/2021] [Indexed: 11/12/2022]
Abstract
The observed diversity of protein coding sequences continues to increase far more rapidly than knowledge of their functions, making classification algorithms essential for assigning a function to proteins using only their sequence. Most pipelines for annotating proteins rely on searches for homologous sequences in databases of previously annotated proteins using BLAST or HMMER. Here, we develop a new approach for classifying proteins into a taxonomy of functions and demonstrate its utility for genome annotation. Our algorithm, IDTAXA, was more accurate than BLAST or HMMER at assigning sequences to KEGG ortholog groups. Moreover, IDTAXA correctly avoided classifying sequences with novel functions to existing groups, which is a common error mode for classification approaches that rely on E-values as a proxy for confidence. We demonstrate IDTAXA's utility for annotating eukaryotic and prokaryotic genomes by assigning functions to proteins within a multi-level ontology and applied IDTAXA to detect genome contamination in eukaryotic genomes. Finally, we re-annotated 8604 microbial genomes with known antibiotic resistance phenotypes to discover two novel associations between proteins and antibiotic resistance. IDTAXA is available as a web tool (http://DECIPHER.codes/Classification.html) or as part of the open source DECIPHER R package from Bioconductor.
|
research-article |
4 |
1 |
12
|
Cirnaru MD, Song S, Tshilenge KT, Corwin C, Mleczko J, Galicia Aguirre C, Benlhabib H, Bendl J, Apontes P, Fullard J, Creus-Muncunill J, Reyahi A, Nik AM, Carlsson P, Roussos P, Mooney SD, Ellerby LM, Ehrlich ME. Unbiased identification of novel transcription factors in striatal compartmentation and striosome maturation. eLife 2021; 10:e65979. [PMID: 34609283 PMCID: PMC8492065 DOI: 10.7554/elife.65979] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/21/2020] [Accepted: 08/20/2021] [Indexed: 02/06/2023]
Abstract
Many diseases are linked to dysregulation of the striatum. Striatal function depends on neuronal compartmentation into striosomes and matrix. Striatal projection neurons are GABAergic medium spiny neurons (MSNs), subtyped by selective expression of receptors, neuropeptides, and other gene families. Neurogenesis of the striosome and matrix occurs in separate waves, but the factors regulating compartmentation and neuronal differentiation are largely unidentified. We performed RNA- and ATAC-seq on sorted striosome and matrix cells at postnatal day 3, using the Nr4a1-EGFP striosome reporter mouse. Focusing on the striosome, we validated the localization and/or role of Irx1, Foxf2, Olig2, and Stat1/2 in the developing striosome and the in vivo enhancer function of a striosome-specific open chromatin region 4.4 Kb downstream of Olig2. These data provide novel tools to dissect and manipulate the networks regulating MSN compartmentation and differentiation, including in human iPSC-derived striatal neurons for disease modeling and drug discovery.
|
Research Support, N.I.H., Extramural |
4 |
1 |
13
|
Ye Z, Mayer J, Leary EJ, Kitchner T, Dart RA, Brilliant MH, Hebbring SJ. Estimating the efficacy of pharmacogenomics over a lifetime. Front Med (Lausanne) 2023; 10:1006743. [PMID: 38020121 PMCID: PMC10645151 DOI: 10.3389/fmed.2023.1006743] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/29/2022] [Accepted: 07/10/2023] [Indexed: 12/01/2023]
Abstract
It is well known that common variants in specific genes influence drug metabolism and response, but it is currently unknown what fraction of patients are given prescriptions over a lifetime that could be contraindicated by their pharmacogenomic profiles. To determine the clinical utility of pharmacogenomics over a lifetime in a general patient population, we sequenced the genomes of 300 deceased Marshfield Clinic patients linked to lifelong medical records. Genetic variants in 33 pharmacogenes were evaluated for their lifetime impact on drug prescribing using extensive electronic health records. Results show that 93% of the 300 deceased patients carried clinically relevant variants. Nearly 80% were prescribed approximately three medications on average that may have been impacted by these variants. Longitudinal data suggested that the optimal age for pharmacogenomic testing was prior to age 50, but the optimal age is greatly influenced by the stability of the population in the healthcare system. This study emphasizes the broad clinical impact of pharmacogenomic testing over a lifetime and demonstrates the potential application of genomic medicine in a general patient population for the advancement of precision medicine.
|
brief-report |
2 |
1 |
14
|
Sun H, Vargas-Blanco D, Zhou Y, Masiello C, Kelly J, Moy J, Korkin D, Shell S. Diverse intrinsic properties shape transcript stability and stabilization in Mycolicibacterium smegmatis. NAR Genom Bioinform 2024; 6:lqae147. [PMID: 39498432 PMCID: PMC11532794 DOI: 10.1093/nargab/lqae147] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2024] [Revised: 08/23/2024] [Accepted: 10/17/2024] [Indexed: 11/07/2024]
Abstract
Mycobacteria regulate transcript degradation to facilitate adaptation to environmental stress. However, the mechanisms underlying this regulation are unknown. Here we sought to gain understanding of the mechanisms controlling mRNA stability by investigating the transcript properties associated with variance in transcript stability and stress-induced transcript stabilization. We measured mRNA half-lives transcriptome-wide in Mycolicibacterium smegmatis in log phase growth and hypoxia-induced growth arrest. The transcriptome was globally stabilized in response to hypoxia, but transcripts of essential genes were generally stabilized more than those of non-essential genes. We then developed machine learning models that enabled us to identify the non-linear collective effect of a compendium of transcript properties on transcript stability and stabilization. We identified properties that were more predictive of half-life in log phase as well as properties that were more predictive in hypoxia, and many of these varied between leadered and leaderless transcripts. In summary, we found that transcript properties are differentially associated with transcript stability depending on both the transcript type and the growth condition. Our results reveal the complex interplay between transcript features and microenvironment that shapes transcript stability in mycobacteria.
|
research-article |
1 |
|
15
|
Yang J, Mwangi AW, Kantor R, Dahabreh IJ, Nyambura M, Delong A, Hogan JW, Steingrimsson JA. Tree-based subgroup discovery using electronic health record data: heterogeneity of treatment effects for DTG-containing therapies. Biostatistics 2024; 25:323-335. [PMID: 37475638 PMCID: PMC11017113 DOI: 10.1093/biostatistics/kxad014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/26/2022] [Revised: 10/01/2022] [Accepted: 11/01/2022] [Indexed: 07/22/2023]
Abstract
The rich longitudinal individual level data available from electronic health records (EHRs) can be used to examine treatment effect heterogeneity. However, estimating treatment effects using EHR data poses several challenges, including time-varying confounding, repeated and temporally non-aligned measurements of covariates, treatment assignments and outcomes, and loss-to-follow-up due to dropout. Here, we develop the subgroup discovery for longitudinal data algorithm, a tree-based algorithm for discovering subgroups with heterogeneous treatment effects using longitudinal data by combining the generalized interaction tree algorithm, a general data-driven method for subgroup discovery, with longitudinal targeted maximum likelihood estimation. We apply the algorithm to EHR data to discover subgroups of people living with human immunodeficiency virus who are at higher risk of weight gain when receiving dolutegravir (DTG)-containing antiretroviral therapies (ARTs) versus when receiving non-DTG-containing ARTs.
|
research-article |
1 |
|
16
|
Clark-Sevilla AO, Lin YC, Saxena A, Yan Q, Wapner R, Raja A, Pe’er I, Salleb-Aouissi A. Diving into CDC pregnancy data in the United States: longitudinal study and interactive application. JAMIA Open 2024; 7:ooae024. [PMID: 38516346 PMCID: PMC10955523 DOI: 10.1093/jamiaopen/ooae024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Revised: 09/20/2023] [Accepted: 03/05/2024] [Indexed: 03/23/2024] Open
Abstract
Objective Preterm birth (PTB) is a major determinant of neonatal mortality, morbidity, and childhood disability. In this article, we present a longitudinal analysis of the risk factors associated with PTB and how they have varied over the years, starting from 1968, when the CDC first started reporting the natality data, up until 2021. Along with this article, we are also releasing an RShiny web application that will allow for easy consumption of this voluminous dataset by the research community. Further, we hope this tool can aid clinicians in the understanding and prevention of PTB. Materials and Methods This study used the CDC Natality data from 1968 to 2021 to analyze trends in PTB outcomes through the lens of various features, including race, maternal age, education, and interval length between pregnancies. Our interactive RShiny web application, CDC NatView, allows users to explore interactions between maternal risk factors and maternal morbidity conditions and the aforementioned features. Results Our study demonstrates how CDC data can be leveraged to conduct a longitudinal analysis of natality trends in the United States. Our key findings reveal an upward trend in late PTBs, which is concerning. Moreover, a significant disparity exists between African American and White populations in terms of PTB. These disparities persist in other areas, such as education, body-mass index, and access to prenatal care later in pregnancy. Discussion Another notable finding is the increase in maternal age over time. Additionally, we confirm that short interpregnancy intervals (IPIs) are a risk factor for PTBs. To facilitate the exploration of pregnancy risk factors, infections, and maternal morbidity, we developed an open-source RShiny tool called CDC NatView. This software offers a user-friendly interface to interact with and visualize the CDC natality data, which constitutes an invaluable resource.
Conclusion In conclusion, our study has shed light on the rise of late PTBs and the persistent disparities in PTB rates between African American and White populations in the US. The increase in maternal age and the confirmation of a short IPI as a risk factor for PTB are noteworthy findings. Our open-source tool, CDC NatView, can be a valuable resource for further exploration of the CDC natality data to enhance our understanding of pregnancy risk factors and the interaction of PTB outcomes and maternal morbidities.
|
research-article |
1 |
|
17
|
Kravchenko OV, Boyce RD, Gomez-Lumbreras A, Kocis PT, Villa Zapata L, Tan M, Leonard CE, Andersen KM, Mehta H, Alexander GC, Malone DC. Drug-drug interaction between dexamethasone and direct-acting oral anticoagulants: a nested case-control study in the National COVID Cohort Collaborative (N3C). BMJ Open 2022; 12:e066846. [PMID: 36581417 PMCID: PMC9806069 DOI: 10.1136/bmjopen-2022-066846] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/30/2022] Open
Abstract
OBJECTIVE The goal of this work is to evaluate whether there is an increase in the risk of thromboembolic events (TEEs) due to concomitant exposure to dexamethasone and apixaban or rivaroxaban. Direct oral anticoagulants (DOACs), as well as the corticosteroid dexamethasone, are commonly used to treat individuals hospitalised with COVID-19. Dexamethasone induces the cytochrome P450-3A4 enzyme, which also metabolises the DOACs apixaban and rivaroxaban. This raises a concern about a possible interaction between dexamethasone and DOACs that may reduce the efficacy of the DOACs and result in an increased risk of TEE. DESIGN We used a nested case-control study design. SETTING This study was conducted in the National COVID Cohort Collaborative (N3C), the largest electronic health records repository for COVID-19 in the USA. PARTICIPANTS Study participants were adults over 18 years who were exposed to a DOAC for 10 or more consecutive days. Exposure to dexamethasone was for 5 or more consecutive days. PRIMARY AND SECONDARY OUTCOME MEASURES Our primary exposure variable was concomitant exposure to dexamethasone for 5 or more days after exposure to either rivaroxaban or apixaban for 5 or more consecutive days. We used McNemar's χ2 test and adjusted logistic regression to evaluate the association between TEE and concomitant use of dexamethasone with either apixaban or rivaroxaban. RESULTS McNemar's χ2 test did not find a discernible association of TEE in patients concomitantly exposed to dexamethasone and a DOAC (χ2=0.5, df=1, p=0.48). In addition, a conditional logistic regression model did not find an increase in the risk of TEE (adjusted OR 1.15, 95% CI 0.32 to 4.18). CONCLUSION This nested case-control study did not find evidence of an association between concomitant exposure to dexamethasone and a DOAC with an increase in risk of TEE. Due to the small sample size, an association cannot be completely ruled out.
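To make the reported test concrete, the sketch below shows how a McNemar statistic is formed from discordant matched pairs and how its p-value follows from the chi-square distribution with 1 degree of freedom. The abstract gives only the statistic and p-value, so the discordant counts (5 and 3) are hypothetical, chosen only so that the statistic equals 0.5, and the uncorrected form of the test is an assumption.

```python
import math


def mcnemar_statistic(b, c):
    """McNemar chi-square (no continuity correction) from discordant pair counts.

    b: pairs where only the case was exposed
    c: pairs where only the control was exposed
    """
    if b + c == 0:
        raise ValueError("no discordant pairs")
    return (b - c) ** 2 / (b + c)


def chi2_upper_tail_df1(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom.

    For df = 1 this equals erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2))


# Hypothetical discordant counts, not taken from the study.
stat = mcnemar_statistic(5, 3)
p_value = chi2_upper_tail_df1(stat)
```

With a statistic of 0.5 and 1 degree of freedom, the p-value rounds to 0.48, which is consistent with the values reported in the abstract.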
|
Research Support, N.I.H., Extramural |
3 |
|
18
|
Unjitwattana T, Huang Q, Yang Y, Tao L, Yang Y, Zhou M, Du Y, Garmire LX. Single-cell RNA-seq data have prevalent blood contamination but can be rescued by Originator, a computational tool separating single-cell RNA-seq by genetic and contextual information. Genome Biol 2025; 26:52. [PMID: 40069819 PMCID: PMC11895284 DOI: 10.1186/s13059-025-03495-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2024] [Accepted: 02/05/2025] [Indexed: 03/15/2025] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) data from complex human tissues have prevalent blood cell contamination during the sample preparation process. They may also comprise cells of different genetic makeups. We propose a new computational framework, Originator, which deciphers single cells by genetic origin and separates immune cells of blood contamination from those of expected tissue-resident cells. We demonstrate the accuracy of Originator at separating immune cells from the blood and tissue as well as cells of different genetic origins, using a variety of artificially mixed and real datasets, including pancreatic cancer and placentas as examples.
|
research-article |
1 |
|
19
|
Newbury A, Liu H, Idnay B, Weng C. The suitability of UMLS and SNOMED-CT for encoding outcome concepts. J Am Med Inform Assoc 2023; 30:1895-1903. [PMID: 37615994 PMCID: PMC10654851 DOI: 10.1093/jamia/ocad161] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2023] [Revised: 06/14/2023] [Accepted: 08/02/2023] [Indexed: 08/25/2023] Open
Abstract
OBJECTIVE Outcomes are important clinical study information. Despite progress in automated extraction of PICO (Population, Intervention, Comparison, and Outcome) entities from PubMed, rarely are these entities encoded by standard terminology to achieve semantic interoperability. This study aims to evaluate the suitability of the Unified Medical Language System (UMLS) and SNOMED-CT in encoding outcome concepts in randomized controlled trial (RCT) abstracts. MATERIALS AND METHODS We iteratively developed and validated an outcome annotation guideline and manually annotated clinically significant outcome entities in the Results and Conclusions sections of 500 randomly selected RCT abstracts on PubMed. The extracted outcomes were fully, partially, or not mapped to the UMLS via MetaMap based on established heuristics. Manual UMLS browser search was performed for select unmapped outcome entities to further differentiate between UMLS and MetaMap errors. RESULTS Only 44% of 2617 outcome concepts were fully covered in the UMLS, among which 67% were complex concepts that required the combination of 2 or more UMLS concepts to represent them. SNOMED-CT was present as a source in 61% of the fully mapped outcomes. DISCUSSION Domains such as Metabolism and Nutrition, and Infections and Infectious Diseases need expanded outcome concept coverage in the UMLS and MetaMap. Future work is warranted to similarly assess the terminology coverage for P, I, C entities. CONCLUSION Computational representation of clinical outcomes is important for clinical evidence extraction and appraisal and yet faces challenges from the inherent complexity and lack of coverage of these concepts in UMLS and SNOMED-CT, as demonstrated in this study.
|
Research Support, N.I.H., Extramural |
2 |
|
20
|
Collins BX, Bélisle-Pipon JC, Evans BJ, Ferryman K, Jiang X, Nebeker C, Novak L, Roberts K, Were M, Yin Z, Ravitsky V, Coco J, Hendricks-Sturrup R, Williams I, Clayton EW, Malin BA. Addressing ethical issues in healthcare artificial intelligence using a lifecycle-informed process. JAMIA Open 2024; 7:ooae108. [PMID: 39553826 PMCID: PMC11565898 DOI: 10.1093/jamiaopen/ooae108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/17/2024] [Revised: 08/19/2024] [Accepted: 10/04/2024] [Indexed: 11/19/2024] Open
Abstract
Objectives Artificial intelligence (AI) proceeds through an iterative and evaluative process of development, use, and refinement which may be characterized as a lifecycle. Within this context, stakeholders can vary in their interests and perceptions of the ethical issues associated with this rapidly evolving technology in ways that can fail to identify and avert adverse outcomes. Identifying issues throughout the AI lifecycle in a systematic manner can facilitate better-informed ethical deliberation. Materials and Methods We analyzed existing lifecycles from the current literature on ethical issues of AI in healthcare to identify themes, which we relied upon to create a more comprehensive lifecycle that consolidates these themes. We then considered the potential benefits and harms of AI through this lifecycle to identify ethical questions that can arise at each step and to identify where conflicts and errors could arise in ethical analysis. We illustrated the approach in 3 case studies that highlight how different ethical dilemmas arise at different points in the lifecycle. Results, Discussion, and Conclusion Through case studies, we show how a systematic lifecycle-informed approach to the ethical analysis of AI enables mapping of the effects of AI onto different steps to guide deliberations on benefits and harms. The lifecycle-informed approach has broad applicability to different stakeholders and can facilitate communication on ethical issues for patients, healthcare professionals, research participants, and other stakeholders.
|
research-article |
1 |
|
21
|
McConeghy KW, Hur K, Dahabreh IJ, Jiang R, Pandey L, Gellad WF, Glassman P, Good CB, Miller DR, Zullo AR, Gravenstein S, Cunningham F. Early Mortality After the First Dose of COVID-19 Vaccination: A Target Trial Emulation. Clin Infect Dis 2024; 78:625-632. [PMID: 38319989 PMCID: PMC10954332 DOI: 10.1093/cid/ciad604] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2023] [Indexed: 02/08/2024] Open
Abstract
BACKGROUND Vaccine hesitancy persists alongside concerns about the safety of coronavirus disease 2019 (COVID-19) vaccines. We aimed to examine the effect of COVID-19 vaccination on risk of death among US veterans. METHODS We conducted a target trial emulation to estimate and compare risk of death up to 60 days under two COVID-19 vaccination strategies: vaccination within 7 days of enrollment versus no vaccination through follow-up. The study cohort included individuals aged ≥18 years enrolled in the Veterans Health Administration system and eligible to receive a COVID-19 vaccination according to guideline recommendations from 1 March 2021 through 1 July 2021. The outcomes of interest included deaths from any cause and excluding a COVID-19 diagnosis. Observations were cloned to both treatment strategies, censored, and weighted to estimate per-protocol effects. RESULTS We included 3 158 507 veterans. Under the vaccination strategy, 364 993 received vaccine within 7 days. At 60 days, there were 156 deaths per 100 000 veterans under the vaccination strategy versus 185 deaths under the no vaccination strategy, corresponding to an absolute risk difference of -25.9 (95% confidence limit [CL], -59.5 to 2.7) and relative risk of 0.86 (95% CL, .7 to 1.0). When those with a COVID-19 infection in the first 60 days were censored, the absolute risk difference was -20.6 (95% CL, -53.4 to 16.0) with a relative risk of 0.88 (95% CL, .7 to 1.1). CONCLUSIONS Vaccination against COVID-19 was associated with a lower but not statistically significantly different risk of death in the first 60 days. These results agree with prior scientific knowledge suggesting vaccination is safe with the potential for substantial health benefits.
|
Clinical Study |
1 |
|