1
|
Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics 2021; 37:2112-2120. [PMID: 33538820 PMCID: PMC11025658 DOI: 10.1093/bioinformatics/btab083] [Citation(s) in RCA: 310] [Impact Index Per Article: 77.5] [Received: 09/10/2020] [Revised: 12/31/2020] [Accepted: 02/01/2021] [Indexed: 12/19/2022]
Abstract
MOTIVATION Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios. RESULTS To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture global and transferable understanding of genomic DNA sequences based on up- and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory elements prediction and demonstrate its ease of use, accuracy and efficiency. We show that the single pre-trained transformers model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationship within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that pre-trained DNABERT with human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned to many other sequence analysis tasks. AVAILABILITY AND IMPLEMENTATION The source code, pretrained and fine-tuned models for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
research-article |
4 |
310 |
2
|
Yang S, Corbett SE, Koga Y, Wang Z, Johnson WE, Yajima M, Campbell JD. Decontamination of ambient RNA in single-cell RNA-seq with DecontX. Genome Biol 2020; 21:57. [PMID: 32138770 PMCID: PMC7059395 DOI: 10.1186/s13059-020-1950-6] [Citation(s) in RCA: 219] [Impact Index Per Article: 43.8] [Received: 09/30/2019] [Accepted: 01/29/2020] [Indexed: 12/26/2022]
Abstract
Droplet-based microfluidic devices have become widely used to perform single-cell RNA sequencing (scRNA-seq). However, ambient RNA present in the cell suspension can be aberrantly counted along with a cell's native mRNA and result in cross-contamination of transcripts between different cell populations. DecontX is a novel Bayesian method to estimate and remove contamination in individual cells. DecontX accurately predicts contamination levels in a mouse-human mixture dataset and removes aberrant expression of marker genes in PBMC datasets. We also compare the contamination levels between four different scRNA-seq protocols. Overall, DecontX can be incorporated into scRNA-seq workflows to improve downstream analyses.
|
Research Support, N.I.H., Extramural |
5 |
219 |
3
|
Sayers EW, Bolton EE, Brister J, Canese K, Chan J, Comeau D, Farrell C, Feldgarden M, Fine AM, Funk K, Hatcher E, Kannan S, Kelly C, Kim S, Klimke W, Landrum M, Lathrop S, Lu Z, Madden T, Malheiro A, Marchler-Bauer A, Murphy T, Phan L, Pujar S, Rangwala S, Schneider V, Tse T, Wang J, Ye J, Trawick B, Pruitt K, Sherry S. Database resources of the National Center for Biotechnology Information in 2023. Nucleic Acids Res 2023; 51:D29-D38. [PMID: 36370100 PMCID: PMC9825438 DOI: 10.1093/nar/gkac1032] [Citation(s) in RCA: 184] [Impact Index Per Article: 92.0] [Received: 09/17/2022] [Revised: 10/11/2022] [Accepted: 11/09/2022] [Indexed: 11/15/2022]
Abstract
The National Center for Biotechnology Information (NCBI) provides online information resources for biology, including the GenBank® nucleic acid sequence database and the PubMed® database of citations and abstracts published in life science journals. NCBI provides search and retrieval operations for most of these data from 35 distinct databases. The E-utilities serve as the programming interface for most of these databases. New resources include the Comparative Genome Resource (CGR) and the BLAST ClusteredNR database. Resources receiving significant updates in the past year include PubMed, PMC, Bookshelf, IgBLAST, GDV, RefSeq, NCBI Virus, GenBank type assemblies, iCn3D, ClinVar, GTR, dbGaP, ALFA, ClinicalTrials.gov, Pathogen Detection, antimicrobial resistance resources, and PubChem. These resources can be accessed through the NCBI home page at https://www.ncbi.nlm.nih.gov.
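As a rough illustration of the E-utilities mentioned in the abstract above, the sketch below only builds an ESearch request URL against the public E-utilities base address. The helper name `build_esearch_url` and the example search term are illustrative assumptions and do not come from the article. No request is sent.

```python
from urllib.parse import urlencode

# Public base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esearch_url(db, term, retmax=5):
    """Build an ESearch URL (hypothetical helper, for illustration only).

    db     -- Entrez database name, e.g. "pubmed"
    term   -- search term, passed through URL encoding
    retmax -- maximum number of record IDs to return
    """
    params = {"db": db, "term": term, "retmode": "json", "retmax": retmax}
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


# Example: a PubMed search for an arbitrary term.
url = build_esearch_url("pubmed", "DNABERT")
print(url)
```

Fetching the URL (for instance with `urllib.request.urlopen`) is deliberately left out, so the sketch runs without network access.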
|
research-article |
2 |
184 |
4
|
Hufsky F, Lamkiewicz K, Almeida A, Aouacheria A, Arighi C, Bateman A, Baumbach J, Beerenwinkel N, Brandt C, Cacciabue M, Chuguransky S, Drechsel O, Finn RD, Fritz A, Fuchs S, Hattab G, Hauschild AC, Heider D, Hoffmann M, Hölzer M, Hoops S, Kaderali L, Kalvari I, von Kleist M, Kmiecinski R, Kühnert D, Lasso G, Libin P, List M, Löchel HF, Martin MJ, Martin R, Matschinske J, McHardy AC, Mendes P, Mistry J, Navratil V, Nawrocki EP, O’Toole ÁN, Ontiveros-Palacios N, Petrov AI, Rangel-Pineros G, Redaschi N, Reimering S, Reinert K, Reyes A, Richardson L, Robertson DL, Sadegh S, Singer JB, Theys K, Upton C, Welzel M, Williams L, Marz M. Computational strategies to combat COVID-19: useful tools to accelerate SARS-CoV-2 and coronavirus research. Brief Bioinform 2021; 22:642-663. [PMID: 33147627 PMCID: PMC7665365 DOI: 10.1093/bib/bbaa232] [Citation(s) in RCA: 78] [Impact Index Per Article: 19.5] [Received: 05/27/2020] [Revised: 07/28/2020] [Accepted: 08/26/2020] [Indexed: 12/16/2022]
Abstract
SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is a novel virus of the family Coronaviridae. The virus causes the infectious disease COVID-19. The biology of coronaviruses has been studied for many years. However, bioinformatics tools designed explicitly for SARS-CoV-2 have only recently been developed as a rapid reaction to the need for fast detection, understanding and treatment of COVID-19. To control the ongoing COVID-19 pandemic, it is of utmost importance to get insight into the evolution and pathogenesis of the virus. In this review, we cover bioinformatics workflows and tools for the routine detection of SARS-CoV-2 infection, the reliable analysis of sequencing data, the tracking of the COVID-19 pandemic and evaluation of containment measures, the study of coronavirus evolution, the discovery of potential drug targets and development of therapeutic strategies. For each tool, we briefly describe its use case and how it advances research specifically for SARS-CoV-2. All tools are free to use and available online, either through web applications or public code repositories. Contact: evbc@uni-jena.de.
|
Research Support, N.I.H., Extramural |
4 |
78 |
5
|
van Smeden M, Lash TL, Groenwold RHH. Reflection on modern methods: five myths about measurement error in epidemiological research. Int J Epidemiol 2020; 49:338-347. [PMID: 31821469 PMCID: PMC7124512 DOI: 10.1093/ije/dyz251] [Citation(s) in RCA: 74] [Impact Index Per Article: 14.8] [Accepted: 11/16/2019] [Indexed: 02/02/2023]
Abstract
Epidemiologists are often confronted with datasets to analyse which contain measurement error due to, for instance, mistaken data entries, inaccurate recordings and measurement instrument or procedural errors. If the effect of measurement error is misjudged, the data analyses are hampered and the validity of the study's inferences may be affected. In this paper, we describe five myths that contribute to misjudgments about measurement error, regarding expected structure, impact and solutions to mitigate the problems resulting from mismeasurements. The aim is to clarify these measurement error misconceptions. We show that the influence of measurement error in an epidemiological data analysis can play out in ways that go beyond simple heuristics, such as heuristics about whether or not to expect attenuation of the effect estimates. Whereas we encourage epidemiologists to deliberate about the structure and potential impact of measurement error in their analyses, we also recommend exercising restraint when making claims about the magnitude or even direction of effect of measurement error if not accompanied by statistical measurement error corrections or quantitative bias analysis. Suggestions for alleviating the problems or investigating the structure and magnitude of measurement error are given.
|
Research Support, N.I.H., Extramural |
5 |
74 |
6
|
Tian S, Jin Q, Yeganova L, Lai PT, Zhu Q, Chen X, Yang Y, Chen Q, Kim W, Comeau DC, Islamaj R, Kapoor A, Gao X, Lu Z. Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Brief Bioinform 2023; 25:bbad493. [PMID: 38168838 PMCID: PMC10762511 DOI: 10.1093/bib/bbad493] [Citation(s) in RCA: 70] [Impact Index Per Article: 35.0] [Received: 06/28/2023] [Revised: 11/15/2023] [Accepted: 12/06/2023] [Indexed: 01/05/2024]
Abstract
ChatGPT has drawn considerable attention from both the general public and domain experts with its remarkable text generation capabilities. This has subsequently led to the emergence of diverse applications in the field of biomedicine and health. In this work, we examine the diverse applications of large language models (LLMs), such as ChatGPT, in biomedicine and health. Specifically, we explore the areas of biomedical information retrieval, question answering, medical text summarization, information extraction and medical education, and investigate whether LLMs possess the transformative power to revolutionize these tasks or whether the distinct complexities of the biomedical domain present unique challenges. Following an extensive literature survey, we find that significant advances have been made in the field of text generation tasks, surpassing the previous state-of-the-art methods. For other applications, the advances have been modest. Overall, LLMs have not yet revolutionized biomedicine, but recent rapid progress indicates that such methods hold great potential to provide valuable means for accelerating discovery and improving health. We also find that the use of LLMs, like ChatGPT, in the fields of biomedicine and health entails various risks and challenges, including fabricated information in their generated responses, as well as legal and privacy concerns associated with sensitive patient data. We believe this survey can provide a comprehensive and timely overview to biomedical researchers and healthcare practitioners on the opportunities and challenges associated with using ChatGPT and other LLMs for transforming biomedicine and health.
|
review-article |
2 |
70 |
7
|
Clough E, Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim I, Tomashevsky M, Marshall K, Phillippy K, Sherman P, Lee H, Zhang N, Serova N, Wagner L, Zalunin V, Kochergin A, Soboleva A. NCBI GEO: archive for gene expression and epigenomics data sets: 23-year update. Nucleic Acids Res 2024; 52:D138-D144. [PMID: 37933855 PMCID: PMC10767856 DOI: 10.1093/nar/gkad965] [Citation(s) in RCA: 66] [Impact Index Per Article: 66.0] [Received: 09/15/2023] [Revised: 10/10/2023] [Accepted: 10/16/2023] [Indexed: 11/08/2023]
Abstract
The Gene Expression Omnibus (GEO) is an international public repository that archives gene expression and epigenomics data sets generated by next-generation sequencing and microarray technologies. Data are typically submitted to GEO by researchers in compliance with widespread journal and funder mandates to make generated data publicly accessible. The resource handles raw data files, processed data files and descriptive metadata for over 200 000 studies and 6.5 million samples, all of which are indexed, searchable and downloadable. Additionally, GEO offers web-based tools that facilitate analysis and visualization of differential gene expression. This article presents the current status and recent advancements in GEO, including the generation of consistently computed gene expression count matrices for thousands of RNA-seq studies, and new interactive graphical plots in GEO2R that help users identify differentially expressed genes and assess data set quality. The GEO repository is built and maintained by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), and is publicly accessible at https://www.ncbi.nlm.nih.gov/geo/.
|
research-article |
1 |
66 |
8
|
Hripcsak G, Albers DJ, Perotte A. Parameterizing time in electronic health record studies. J Am Med Inform Assoc 2015; 22:794-804. [PMID: 25725004 PMCID: PMC6169471 DOI: 10.1093/jamia/ocu051] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.2] [Received: 05/15/2014] [Revised: 11/08/2014] [Accepted: 12/22/2014] [Indexed: 02/07/2023]
Abstract
BACKGROUND Fields like nonlinear physics offer methods for analyzing time series, but many methods require that the time series be stationary (no change in properties over time). OBJECTIVE Medicine is far from stationary, but the challenge may be ameliorated by reparameterizing time, because clinicians tend to measure patients more frequently when the patients are ill and more likely to vary. METHODS We compared time parameterizations, measuring variability of rate of change and magnitude of change, and looking for homogeneity of bins of temporal separation between pairs of time points. We studied four common laboratory tests drawn from 25 years of electronic health records on 4 million patients. RESULTS We found that sequence time (that is, simply counting the number of measurements from some start) produced more stationary time series, better explained the variation in values, and had more homogeneous bins than either traditional clock time or a recently proposed intermediate parameterization. Sequence time produced more accurate predictions in a single Gaussian process model experiment. CONCLUSIONS Of the three parameterizations, sequence time appeared to produce the most stationary series, possibly because clinicians adjust their sampling to the acuity of the patient. Parameterizing by sequence time may be applicable to association and clustering experiments on electronic health record data. A limitation of this study is that laboratory data were derived from only one institution. Sequence time appears to be an important potential parameterization.
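The difference between clock time and sequence time described in the abstract above can be sketched in a few lines. This is a minimal illustration under assumptions. The function name `parameterize` and the example draw dates are hypothetical, and the intermediate parameterization mentioned in the abstract is not reproduced because the abstract does not define it.

```python
from datetime import datetime


def parameterize(timestamps):
    """Return (clock_time_days, sequence_time) for one patient's measurements.

    clock_time_days -- elapsed days since the first measurement
    sequence_time   -- the measurement count from the start (0, 1, 2, ...)
    """
    ordered = sorted(timestamps)
    if not ordered:
        return [], []
    start = ordered[0]
    clock = [(t - start).total_seconds() / 86400.0 for t in ordered]
    sequence = list(range(len(ordered)))
    return clock, sequence


# Hypothetical lab draws: three daily draws during an acute episode,
# then one routine draw several months later.
draws = [
    datetime(2020, 1, 1),
    datetime(2020, 1, 2),
    datetime(2020, 1, 3),
    datetime(2020, 6, 1),
]
clock, sequence = parameterize(draws)
print(clock)     # elapsed days, unevenly spaced
print(sequence)  # evenly spaced measurement index
```

Under sequence time the long gap before the last draw collapses to a single step, which is the sense in which sampling that tracks patient acuity is absorbed into the time axis.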
|
Research Support, N.I.H., Extramural |
10 |
42 |
9
|
Liu Q, He D, Xie L. Prediction of off-target specificity and cell-specific fitness of CRISPR-Cas System using attention boosted deep learning and network-based gene feature. PLoS Comput Biol 2019; 15:e1007480. [PMID: 31658261 PMCID: PMC6837542 DOI: 10.1371/journal.pcbi.1007480] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Received: 05/22/2019] [Revised: 11/07/2019] [Accepted: 10/08/2019] [Indexed: 12/26/2022]
Abstract
CRISPR-Cas is a powerful genome editing technology and has great potential for in vivo gene therapy. Successful translational application of CRISPR-Cas to biomedicine still faces many safety concerns, including off-target side effects, cell fitness problems after CRISPR-Cas treatment, and on-target genome editing side effects in undesired tissues. To solve these issues, single-guide RNAs (sgRNAs) must be designed with high cell-specific efficacy and specificity. Existing sgRNA design tools mainly depend on the sgRNA sequence and the local information of the targeted genome, and thus are not sufficient to account for the difference in the cellular response of the same gene in different cell types. To incorporate cell-specific information into the sgRNA design, we develop novel interpretable machine learning models, which integrate features learned from an advanced transformer-based deep neural network with cell-specific gene properties derived from biological networks and gene expression profiles, for the prediction of CRISPR-Cas9 and CRISPR-Cas12a efficacy and specificity. In benchmark studies, our models significantly outperform state-of-the-art algorithms. Furthermore, we find that the network-based gene property is critical for the prediction of cell-specific post-treatment cellular response. Our results suggest that the design of efficient and safe CRISPR-Cas needs to consider cell-specific information of genes. Our findings may bolster the development of more accurate predictive models of CRISPR-Cas across a broad spectrum of biological conditions, as well as provide new insight into developing efficient and safe CRISPR-based gene therapy.
|
Research Support, N.I.H., Extramural |
6 |
40 |
10
|
Bolen CR, Rubelt F, Vander Heiden JA, Davis MM. The Repertoire Dissimilarity Index as a method to compare lymphocyte receptor repertoires. BMC Bioinformatics 2017; 18:155. [PMID: 28264647 PMCID: PMC5340033 DOI: 10.1186/s12859-017-1556-5] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.0] [Received: 10/29/2016] [Accepted: 02/21/2017] [Indexed: 01/15/2023]
Abstract
BACKGROUND The B and T cells of the human adaptive immune system leverage a highly diverse repertoire of antigen-specific receptors to protect the human body from pathogens. The sequencing and analysis of immune repertoires is emerging as an important tool to understand immune responses, whether beneficial or harmful (in the case of autoimmunity). However, methods for studying these repertoires, and for directly comparing different immune repertoires, are lacking. RESULTS In this paper, we present a non-parametric method for directly comparing sequencing repertoires, with the goal of rigorously quantifying differences in V, D, and J gene segment utilization. This method, referred to as the Repertoire Dissimilarity Index (RDI), uses a bootstrapped subsampling approach to account for variance in sequencing depth, and, coupled with a data simulation approach, allows for direct quantification of the average variation between repertoires. We use the RDI method to recapitulate known differences in the formation of the CD4+ and CD8+ T cell repertoires, and further show that antigen-driven activation of naïve CD8+ T cells is more selective than in the CD4+ repertoire, resulting in a more specialized CD8+ memory repertoire. CONCLUSIONS We prove that the RDI method is an accurate and versatile method for comparisons of immune repertoires. The RDI method has been implemented as an R package, and is available for download through Bitbucket.
|
research-article |
8 |
40 |
11
|
Thompson HM, Sharma B, Bhalla S, Boley R, McCluskey C, Dligach D, Churpek MM, Karnik NS, Afshar M. Bias and fairness assessment of a natural language processing opioid misuse classifier: detection and mitigation of electronic health record data disadvantages across racial subgroups. J Am Med Inform Assoc 2021; 28:2393-2403. [PMID: 34383925 PMCID: PMC8510285 DOI: 10.1093/jamia/ocab148] [Citation(s) in RCA: 38] [Impact Index Per Article: 9.5] [Received: 04/18/2021] [Revised: 06/28/2021] [Accepted: 07/01/2021] [Indexed: 12/24/2022]
Abstract
OBJECTIVES To assess fairness and bias of a previously validated machine learning opioid misuse classifier. MATERIALS & METHODS Two experiments were conducted with the classifier's original (n = 1000) and external validation (n = 53 974) datasets from 2 health systems. Bias was assessed via testing for differences in type II error rates across racial/ethnic subgroups (Black, Hispanic/Latinx, White, Other) using bootstrapped 95% confidence intervals. A local surrogate model was estimated to interpret the classifier's predictions by race and averaged globally from the datasets. Subgroup analyses and post-hoc recalibrations were conducted to attempt to mitigate biased metrics. RESULTS We identified bias in the false negative rate (FNR = 0.32) of the Black subgroup compared to the FNR (0.17) of the White subgroup. Top features included "heroin" and "substance abuse" across subgroups. Post-hoc recalibrations eliminated bias in FNR with minimal changes in other subgroup error metrics. The Black FNR subgroup had higher risk scores for readmission and mortality than the White FNR subgroup, and a higher mortality risk score than the Black true positive subgroup (P < .05). DISCUSSION The Black FNR subgroup had the greatest severity of disease and risk for poor outcomes. Similar features were present between subgroups for predicting opioid misuse, but inequities were present. Post-hoc mitigation techniques mitigated bias in type II error rate without creating substantial type I error rates. From model design through deployment, bias and data disadvantages should be systematically addressed. CONCLUSION Standardized, transparent bias assessments are needed to improve trustworthiness in clinical machine learning models.
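The subgroup comparison of type II error rates described in the abstract above can be illustrated with a percentile bootstrap. This is only a sketch under assumptions. The abstract does not specify the exact bootstrap procedure, the function names are hypothetical, and the labels below are toy data rather than study data.

```python
import random


def false_negative_rate(y_true, y_pred):
    """FNR = missed positives / all true positives (NaN if no positives)."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not preds_for_positives:
        return float("nan")
    return sum(1 for p in preds_for_positives if p == 0) / len(preds_for_positives)


def bootstrap_fnr_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the FNR (assumed procedure)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        fnr = false_negative_rate([y_true[i] for i in idx],
                                  [y_pred[i] for i in idx])
        if fnr == fnr:  # skip resamples with no positives (NaN)
            stats.append(fnr)
    if not stats:
        raise ValueError("no bootstrap resample contained a positive case")
    stats.sort()
    lower = stats[int((alpha / 2) * (len(stats) - 1))]
    upper = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lower, upper


# Toy subgroup: 50 true positives (15 missed) and 50 true negatives.
y_true = [1] * 50 + [0] * 50
y_pred = [0] * 15 + [1] * 35 + [0] * 50
point = false_negative_rate(y_true, y_pred)
lower, upper = bootstrap_fnr_ci(y_true, y_pred)
print(point, lower, upper)
```

Running the same two functions on each subgroup and checking whether the intervals overlap gives one simple way to flag a disparity. A bootstrap of the difference in FNR between two subgroups would be a natural alternative.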
|
Research Support, N.I.H., Extramural |
4 |
38 |
12
|
Yang Y, Yan W, Hall AB, Jiang X. Characterizing Transcriptional Regulatory Sequences in Coronaviruses and Their Role in Recombination. Mol Biol Evol 2021; 38:1241-1248. [PMID: 33146390 PMCID: PMC7665640 DOI: 10.1093/molbev/msaa281] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Indexed: 12/20/2022]
Abstract
Novel coronaviruses, including SARS-CoV-2, SARS, and MERS, often originate from recombination events. The mechanism of recombination in RNA viruses is template switching. Coronavirus transcription also involves template switching at specific regions, called transcriptional regulatory sequences (TRS). It is hypothesized but not yet verified that TRS sites are prone to recombination events. Here, we developed a tool called SuPER to systematically identify TRS in coronavirus genomes and then investigated whether recombination is more common at TRS. We ran SuPER on 506 coronavirus genomes and identified 465 TRS-L and 3,509 TRS-B. We found that the TRS-L core sequence (CS) and the secondary structure of the leader sequence are generally conserved within coronavirus genera but different between genera. By examining the location of recombination breakpoints with respect to TRS-B CS, we observed that recombination hotspots are more frequently colocated with TRS-B sites than expected.
|
Research Support, N.I.H., Intramural |
4 |
36 |
13
|
Schwartz JM, Moy AJ, Rossetti SC, Elhadad N, Cato KD. Clinician involvement in research on machine learning-based predictive clinical decision support for the hospital setting: A scoping review. J Am Med Inform Assoc 2021; 28:653-663. [PMID: 33325504 PMCID: PMC7936403 DOI: 10.1093/jamia/ocaa296] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Received: 08/19/2020] [Accepted: 11/30/2020] [Indexed: 01/03/2023]
Abstract
OBJECTIVE The study sought to describe the prevalence and nature of clinical expert involvement in the development, evaluation, and implementation of clinical decision support systems (CDSSs) that utilize machine learning to analyze electronic health record data to assist nurses and physicians in prognostic and treatment decision making (ie, predictive CDSSs) in the hospital. MATERIALS AND METHODS A systematic search of PubMed, CINAHL, and IEEE Xplore and hand-searching of relevant conference proceedings were conducted to identify eligible articles. Empirical studies of predictive CDSSs using electronic health record data for nurses or physicians in the hospital setting published in the last 5 years in peer-reviewed journals or conference proceedings were eligible for synthesis. Data from eligible studies regarding clinician involvement, stage in system design, predictive CDSS intention, and target clinician were charted and summarized. RESULTS Eighty studies met eligibility criteria. Clinical expert involvement was most prevalent at the beginning and late stages of system design. Most articles (95%) described developing and evaluating machine learning models, 28% of which described involving clinical experts, with nearly half functioning to verify the clinical correctness or relevance of the model (47%). DISCUSSION Involvement of clinical experts in predictive CDSS design should be explicitly reported in publications and evaluated for the potential to overcome predictive CDSS adoption challenges. CONCLUSIONS If present, clinical expert involvement is most prevalent when predictive CDSS specifications are made or when system implementations are evaluated. However, clinical experts are less prevalent in developmental stages to verify clinical correctness, select model features, preprocess data, or serve as a gold standard.
|
Research Support, N.I.H., Extramural |
4 |
35 |
14
|
Tang S, Davarmanesh P, Song Y, Koutra D, Sjoding MW, Wiens J. Democratizing EHR analyses with FIDDLE: a flexible data-driven preprocessing pipeline for structured clinical data. J Am Med Inform Assoc 2020; 27:1921-1934. [PMID: 33040151 PMCID: PMC7727385 DOI: 10.1093/jamia/ocaa139] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 12/14/2019] [Revised: 06/01/2020] [Accepted: 06/23/2020] [Indexed: 12/23/2022]
Abstract
OBJECTIVE In applying machine learning (ML) to electronic health record (EHR) data, many decisions must be made before any ML is applied; such preprocessing requires substantial effort and can be labor-intensive. As the role of ML in health care grows, there is an increasing need for systematic and reproducible preprocessing techniques for EHR data. Thus, we developed FIDDLE (Flexible Data-Driven Pipeline), an open-source framework that streamlines the preprocessing of data extracted from the EHR. MATERIALS AND METHODS Largely data-driven, FIDDLE systematically transforms structured EHR data into feature vectors, limiting the number of decisions a user must make while incorporating good practices from the literature. To demonstrate its utility and flexibility, we conducted a proof-of-concept experiment in which we applied FIDDLE to 2 publicly available EHR data sets collected from intensive care units: MIMIC-III and the eICU Collaborative Research Database. We trained different ML models to predict 3 clinically important outcomes: in-hospital mortality, acute respiratory failure, and shock. We evaluated models using the area under the receiver operating characteristics curve (AUROC), and compared it to several baselines. RESULTS Across tasks, FIDDLE extracted 2,528 to 7,403 features from MIMIC-III and eICU, respectively. On all tasks, FIDDLE-based models achieved good discriminative performance, with AUROCs of 0.757-0.886, comparable to the performance of MIMIC-Extract, a preprocessing pipeline designed specifically for MIMIC-III. Furthermore, our results showed that FIDDLE is generalizable across different prediction times, ML algorithms, and data sets, while being relatively robust to different settings of user-defined arguments. CONCLUSIONS FIDDLE, an open-source preprocessing pipeline, facilitates applying ML to structured EHR data. By accelerating and standardizing labor-intensive preprocessing, FIDDLE can help stimulate progress in building clinically useful ML tools for EHR data.
|
Research Support, N.I.H., Extramural |
5 |
33 |
15
|
Kieft K, Adams A, Salamzade R, Kalan L, Anantharaman K. vRhyme enables binning of viral genomes from metagenomes. Nucleic Acids Res 2022; 50:e83. [PMID: 35544285 PMCID: PMC9371927 DOI: 10.1093/nar/gkac341] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Received: 01/05/2022] [Revised: 04/17/2022] [Accepted: 04/22/2022] [Indexed: 01/11/2023]
Abstract
Genome binning has been essential for characterization of bacteria, archaea, and even eukaryotes from metagenomes. Yet, few approaches exist for viruses. We developed vRhyme, a fast and precise software tool for construction of viral metagenome-assembled genomes (vMAGs). vRhyme utilizes single- or multi-sample coverage effect size comparisons between scaffolds and employs supervised machine learning to identify nucleotide feature similarities, which are compiled into iterations of weighted networks and refined bins. To refine bins, vRhyme utilizes unique features of viral genomes, namely a protein redundancy scoring mechanism based on the observation that viruses seldom encode redundant genes. Using simulated viromes, we demonstrated superior performance of vRhyme compared to available binning tools in constructing more complete and uncontaminated vMAGs. When applied to 10,601 viral scaffolds from human skin, vRhyme advanced our understanding of resident viruses, highlighted by identification of a Herelleviridae vMAG composed of 22 scaffolds, and another vMAG encoding a nitrate reductase metabolic gene, representing near-complete genomes post-binning. vRhyme will enable a convention of binning uncultivated viral genomes and has the potential to transform metagenome-based viral ecology.
|
Research Support, N.I.H., Extramural |
3 |
32 |
16
|
Luo Y. Evaluating the state of the art in missing data imputation for clinical data. Brief Bioinform 2022; 23:bbab489. [PMID: 34882223 PMCID: PMC8769894 DOI: 10.1093/bib/bbab489] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 09/13/2021] [Revised: 10/14/2021] [Accepted: 10/24/2021] [Indexed: 11/14/2022]
Abstract
Clinical data are increasingly being mined to derive new medical knowledge with a goal of enabling greater diagnostic precision, better-personalized therapeutic regimens, improved clinical outcomes and more efficient utilization of health-care resources. However, clinical data are often only available at irregular intervals that vary between patients and type of data, with entries often being unmeasured or unknown. As a result, missing data often represent one of the major impediments to optimal knowledge derivation from clinical data. The Data Analytics Challenge on Missing data Imputation (DACMI) presented a shared clinical dataset with ground truth for evaluating and advancing the state of the art in imputing missing data for clinical time series. We extracted 13 commonly measured blood laboratory tests. To evaluate the imputation performance, we randomly removed one recorded result per laboratory test per patient admission and used them as the ground truth. DACMI is the first shared-task challenge on clinical time series imputation to our best knowledge. The challenge attracted 12 international teams spanning three continents across multiple industries and academia. The evaluation outcome suggests that competitive machine learning and statistical models (e.g. LightGBM, MICE and XGBoost) coupled with carefully engineered temporal and cross-sectional features can achieve strong imputation performance. However, care needs to be taken to prevent overblown model complexity. The challenge participating systems collectively experimented with a wide range of machine learning and probabilistic algorithms to combine temporal imputation and cross-sectional imputation, and their design principles will inform future efforts to better model clinical missing data.
|
Research Support, N.I.H., Extramural |
3 |
30 |
17
|
Lim H, He D, Qiu Y, Krawczuk P, Sun X, Xie L. Rational discovery of dual-indication multi-target PDE/Kinase inhibitor for precision anti-cancer therapy using structural systems pharmacology. PLoS Comput Biol 2019; 15:e1006619. [PMID: 31206508 PMCID: PMC6576746 DOI: 10.1371/journal.pcbi.1006619] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Received: 11/02/2018] [Accepted: 04/26/2019] [Indexed: 01/09/2023] Open
Abstract
Many complex diseases such as cancer are associated with multiple pathological manifestations. Moreover, the therapeutics for their treatments often lead to serious side effects. Thus, it is needed to develop multi-indication therapeutics that can simultaneously target multiple clinical indications of interest and mitigate the side effects. However, conventional one-drug-one-gene drug discovery paradigm and emerging polypharmacology approach rarely tackle the challenge of multi-indication drug design. For the first time, we propose a one-drug-multi-target-multi-indication strategy. We develop a novel structural systems pharmacology platform 3D-REMAP that uses ligand binding site comparison and protein-ligand docking to augment sparse chemical genomics data for the machine learning model of genome-scale chemical-protein interaction prediction. Experimentally validated predictions systematically show that 3D-REMAP outperforms state-of-the-art ligand-based, receptor-based, and machine learning methods alone. As a proof-of-concept, we utilize the concept of drug repurposing that is enabled by 3D-REMAP to design dual-indication anti-cancer therapy. The repurposed drug can demonstrate anti-cancer activity for cancers that do not have effective treatment as well as reduce the risk of heart failure that is associated with all types of existing anti-cancer therapies. We predict that levosimendan, a PDE inhibitor for heart failure, inhibits serine/threonine-protein kinase RIOK1 and other kinases. Subsequent experiments and systems biology analyses confirm this prediction, and suggest that levosimendan is active against multiple cancers, notably lymphoma, through the direct inhibition of RIOK1 and RNA processing pathway. We further develop machine learning models to predict cancer cell-line's and a patient's response to levosimendan. 
Our findings suggest that levosimendan can be a promising novel lead compound for the development of safe, effective, and precision multi-indication anti-cancer therapy. This study demonstrates the potential of structural systems pharmacology in designing polypharmacology for precision medicine. It may facilitate transforming the conventional one-drug-one-gene-one-disease drug discovery process and single-indication polypharmacology approach into a new one-drug-multi-target-multi-indication paradigm for complex diseases.
|
Research Support, N.I.H., Extramural |
6 |
29 |
18
|
Haft DH, Badretdin A, Coulouris G, DiCuccio M, Durkin A, Jovenitti E, Li W, Mersha M, O’Neill K, Virothaisakun J, Thibaud-Nissen F. RefSeq and the prokaryotic genome annotation pipeline in the age of metagenomes. Nucleic Acids Res 2024; 52:D762-D769. [PMID: 37962425 PMCID: PMC10767926 DOI: 10.1093/nar/gkad988] [Citation(s) in RCA: 28] [Impact Index Per Article: 28.0] [Received: 09/15/2023] [Revised: 10/13/2023] [Accepted: 10/18/2023] [Indexed: 11/15/2023] Open
Abstract
The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains over 315 000 bacterial and archaeal genomes and 236 million proteins with up-to-date and consistent annotation. In the past 3 years, we have expanded the diversity of the RefSeq collection by including the best quality metagenome-assembled genomes (MAGs) submitted to INSDC (DDBJ, ENA and GenBank), while maintaining its quality by adding validation checks. Assemblies are now more stringently evaluated for contamination and for completeness of annotation prior to acceptance into RefSeq. MAGs now account for over 17 000 assemblies in RefSeq, split over 165 orders and 362 families. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP), which is used to annotate nearly all RefSeq assemblies, include better detection of protein-coding genes. Nearly 83% of RefSeq proteins are now named by a curated Protein Family Model, a 4.7% increase in the past three years. In addition to literature citations, Enzyme Commission numbers, and gene symbols, Gene Ontology terms are now assigned to 48% of RefSeq proteins, allowing for easier multi-genome comparison. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/. PGAP is available as a stand-alone tool able to produce GenBank-ready files at https://github.com/ncbi/pgap.
|
research-article |
1 |
28 |
19
|
Jin Q, Yang Y, Chen Q, Lu Z. GeneGPT: augmenting large language models with domain tools for improved access to biomedical information. Bioinformatics 2024; 40:btae075. [PMID: 38341654 PMCID: PMC10904143 DOI: 10.1093/bioinformatics/btae075] [Citation(s) in RCA: 27] [Impact Index Per Article: 27.0] [Received: 11/13/2023] [Revised: 01/08/2024] [Accepted: 02/08/2024] [Indexed: 02/12/2024] Open
Abstract
MOTIVATION While large language models (LLMs) have been successfully applied to various tasks, they still face challenges with hallucinations. Augmenting LLMs with domain-specific tools such as database utilities can facilitate easier and more precise access to specialized knowledge. In this article, we present GeneGPT, a novel method for teaching LLMs to use the Web APIs of the National Center for Biotechnology Information (NCBI) for answering genomics questions. Specifically, we prompt Codex to solve the GeneTuring tests with NCBI Web APIs by in-context learning and an augmented decoding algorithm that can detect and execute API calls. RESULTS Experimental results show that GeneGPT achieves state-of-the-art performance on eight tasks in the GeneTuring benchmark with an average score of 0.83, largely surpassing retrieval-augmented LLMs such as the new Bing (0.44), biomedical LLMs such as BioMedLM (0.08) and BioGPT (0.04), as well as GPT-3 (0.16) and ChatGPT (0.12). Our further analyses suggest that: First, API demonstrations have good cross-task generalizability and are more useful than documentations for in-context learning; second, GeneGPT can generalize to longer chains of API calls and answer multi-hop questions in GeneHop, a novel dataset introduced in this work; finally, different types of errors are enriched in different tasks, providing valuable insights for future improvements. AVAILABILITY AND IMPLEMENTATION The GeneGPT code and data are publicly available at https://github.com/ncbi/GeneGPT.
|
research-article |
1 |
27 |
20
|
Rogers JR, Lee J, Zhou Z, Cheung YK, Hripcsak G, Weng C. Contemporary use of real-world data for clinical trial conduct in the United States: a scoping review. J Am Med Inform Assoc 2021; 28:144-154. [PMID: 33164065 PMCID: PMC7810452 DOI: 10.1093/jamia/ocaa224] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Received: 06/05/2020] [Revised: 08/11/2020] [Accepted: 09/02/2020] [Indexed: 12/28/2022] Open
Abstract
OBJECTIVE Real-world data (RWD), defined as routinely collected healthcare data, can be a potential catalyst for addressing challenges faced in clinical trials. We performed a scoping review of database-specific RWD applications within clinical trial contexts, synthesizing prominent uses and themes. MATERIALS AND METHODS Querying 3 biomedical literature databases, research articles using electronic health records, administrative claims databases, or clinical registries either within a clinical trial or in tandem with methodology related to clinical trials were included. Articles were required to use at least 1 US RWD source. All abstract screening, full-text screening, and data extraction was performed by 1 reviewer. Two reviewers independently verified all decisions. RESULTS Of 2020 screened articles, 89 qualified: 59 articles used electronic health records, 29 used administrative claims, and 26 used registries. Our synthesis was driven by the general life cycle of a clinical trial, culminating into 3 major themes: trial process tasks (51 articles); dissemination strategies (6); and generalizability assessments (34). Despite a diverse set of diseases studied, <10% of trials using RWD for trial process tasks evaluated medications or procedures (5/51). All articles highlighted data-related challenges, such as missing values. DISCUSSION Database-specific RWD have been occasionally leveraged for various clinical trial tasks. We observed underuse of RWD within conducted medication or procedure trials, though it is subject to the confounder of implicit report of RWD use. CONCLUSION Enhanced incorporation of RWD should be further explored for medication or procedure trials, including better understanding of how to handle related data quality issues to facilitate RWD use.
|
Research Support, N.I.H., Extramural |
4 |
26 |
21
|
Stemerman R, Arguello J, Brice J, Krishnamurthy A, Houston M, Kitzmiller R. Identification of social determinants of health using multi-label classification of electronic health record clinical notes. JAMIA Open 2021; 4:ooaa069. [PMID: 34514351 PMCID: PMC8423426 DOI: 10.1093/jamiaopen/ooaa069] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 09/16/2020] [Revised: 11/16/2020] [Accepted: 11/20/2020] [Indexed: 11/13/2022] Open
Abstract
OBJECTIVES Social determinants of health (SDH), key contributors to health, are rarely systematically measured and collected in the electronic health record (EHR). We investigate how to leverage clinical notes using novel applications of multi-label learning (MLL) to classify SDH in mental health and substance use disorder patients who frequent the emergency department. METHODS AND MATERIALS We labeled a gold-standard corpus of EHR clinical note sentences (N = 4063) with 6 identified SDH-related domains recommended by the Institute of Medicine for inclusion in the EHR. We then trained 5 classification models: linear-Support Vector Machine, K-Nearest Neighbors, Random Forest, XGBoost, and bidirectional Long Short-Term Memory (BI-LSTM). We adopted 5 common evaluation measures: accuracy, average precision-recall (AP), area under the curve receiver operating characteristic (AUC-ROC), Hamming loss, and log loss to compare the performance of different methods for MLL classification using the F1 score as the primary evaluation metric. RESULTS Our results suggested that, overall, BI-LSTM outperformed the other classification models in terms of AUC-ROC (93.9), AP (0.76), and Hamming loss (0.12). The AUC-ROC values of MLL models of SDH related domains varied between (0.59-1.0). We found that 44.6% of our study population (N = 1119) had at least one positive documentation of SDH. DISCUSSION AND CONCLUSION The proposed approach of training an MLL model on an SDH rich data source can produce a high performing classifier using only unstructured clinical notes. We also provide evidence that model performance is associated with lexical diversity by health professionals and the auto-generation of clinical note sentences to document SDH.
|
research-article |
4 |
23 |
22
|
Koonin EV. The meaning of biological information. Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 2016; 374:rsta.2015.0065. [PMID: 26857678 PMCID: PMC4760125 DOI: 10.1098/rsta.2015.0065] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Accepted: 07/27/2015] [Indexed: 06/05/2023]
Abstract
Biological information encoded in genomes is fundamentally different from and effectively orthogonal to Shannon entropy. The biologically relevant concept of information has to do with 'meaning', i.e. encoding various biological functions with various degree of evolutionary conservation. Apart from direct experimentation, the meaning, or biological information content, can be extracted and quantified from alignments of homologous nucleotide or amino acid sequences but generally not from a single sequence, using appropriately modified information theoretical formulae. For short, information encoded in genomes is defined vertically but not horizontally. Informally but substantially, biological information density seems to be equivalent to 'meaning' of genomic sequences that spans the entire range from sharply defined, universal meaning to effective meaninglessness. Large fractions of genomes, up to 90% in some plants, belong within the domain of fuzzy meaning. The sequences with fuzzy meaning can be recruited for various functions, with the meaning subsequently fixed, and also could perform generic functional roles that do not require sequence conservation. Biological meaning is continuously transferred between the genomes of selfish elements and hosts in the process of their coevolution. Thus, in order to adequately describe genome function and evolution, the concepts of information theory have to be adapted to incorporate the notion of meaning that is central to biology.
|
discussion |
9 |
23 |
23
|
Ding NS, Tassone D, Al Bakir I, Wu K, Thompson AJ, Connell WR, Malietzis G, Lung P, Singh S, Choi CHR, Gabe S, Jenkins JT, Hart A. Systematic Review: The Impact and Importance of Body Composition in Inflammatory Bowel Disease. J Crohns Colitis 2022; 16:1475-1492. [PMID: 35325076 PMCID: PMC9455788 DOI: 10.1093/ecco-jcc/jjac041] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 11/11/2021] [Revised: 02/06/2022] [Accepted: 03/10/2022] [Indexed: 12/31/2022]
Abstract
BACKGROUND AND AIMS Alterations in body composition are common in inflammatory bowel disease [IBD] and have been associated with differences in patient outcomes. We sought to consolidate knowledge on the impact and importance of body composition in IBD. METHODS We performed a systematic search of MEDLINE, EMBASE and conference proceedings by combining two key research themes: inflammatory bowel disease and body composition. RESULTS Fifty-five studies were included in this review. Thirty-one focused on the impact of IBD on body composition with a total of 2279 patients with a mean age 38.4 years. Of these, 1071 [47%] were male. In total, 1470 [64.5%] patients had Crohn's disease and 809 [35.5%] had ulcerative colitis. Notably, fat mass and fat-free mass were reduced, and higher rates of sarcopaenia were observed in those with active IBD compared with those in clinical remission and healthy controls. Twenty-four additional studies focused on the impact of derangements in body composition on IBD outcomes. Alterations in body composition in IBD are associated with poorer prognoses including higher rates of surgical intervention, post-operative complications and reduced muscle strength. In addition, higher rates of early treatment failure and primary non-response are seen in patients with myopaenia. CONCLUSIONS Patients with IBD have alterations in body composition parameters in active disease and clinical remission. The impacts of body composition on disease outcome and therapy are broad and require further investigation. The augmentation of body composition parameters in the clinical setting has the potential to improve IBD outcomes in the future.
|
Systematic Review |
3 |
21 |
24
|
Luo L, Yan S, Lai PT, Veltri D, Oler A, Xirasagar S, Ghosh R, Similuk M, Robinson PN, Lu Z. PhenoTagger: a hybrid method for phenotype concept recognition using human phenotype ontology. Bioinformatics 2021; 37:1884-1890. [PMID: 33471061 PMCID: PMC11025364 DOI: 10.1093/bioinformatics/btab019] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 09/18/2020] [Revised: 11/20/2020] [Accepted: 01/11/2021] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Automatic phenotype concept recognition from unstructured text remains a challenging task in biomedical text mining research. Previous works that address the task typically use dictionary-based matching methods, which can achieve high precision but suffer from lower recall. Recently, machine learning-based methods have been proposed to identify biomedical concepts, which can recognize more unseen concept synonyms by automatic feature learning. However, most methods require large corpora of manually annotated data for model training, which is difficult to obtain due to the high cost of human annotation. RESULTS In this article, we propose PhenoTagger, a hybrid method that combines both dictionary and machine learning-based methods to recognize Human Phenotype Ontology (HPO) concepts in unstructured biomedical text. We first use all concepts and synonyms in HPO to construct a dictionary, which is then used to automatically build a distantly supervised training dataset for machine learning. Next, a cutting-edge deep learning model is trained to classify each candidate phrase (n-gram from input sentence) into a corresponding concept label. Finally, the dictionary and machine learning-based prediction results are combined for improved performance. Our method is validated with two HPO corpora, and the results show that PhenoTagger compares favorably to previous methods. In addition, to demonstrate the generalizability of our method, we retrained PhenoTagger using the disease ontology MEDIC for disease concept recognition to investigate the effect of training on different ontologies. Experimental results on the NCBI disease corpus show that PhenoTagger without requiring manually annotated training data achieves competitive performance as compared with state-of-the-art supervised methods. AVAILABILITY AND IMPLEMENTATION The source code, API information and data for PhenoTagger are freely available at https://github.com/ncbi-nlp/PhenoTagger.
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
research-article |
4 |
20 |
25
|
Reichberg SB, Mitra PP, Haghamad A, Ramrattan G, Crawford JM, Berry GJ, Davidson KW, Drach A, Duong S, Juretschko S, Maria NI, Yang Y, Ziemba YC. Rapid Emergence of SARS-CoV-2 in the Greater New York Metropolitan Area: Geolocation, Demographics, Positivity Rates, and Hospitalization for 46 793 Persons Tested by Northwell Health. Clin Infect Dis 2020; 71:3204-3213. [PMID: 32640030 PMCID: PMC7454448 DOI: 10.1093/cid/ciaa922] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 05/12/2020] [Accepted: 07/07/2020] [Indexed: 01/19/2023] Open
Abstract
BACKGROUND In March 2020, the greater New York metropolitan area became an epicenter for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection. The initial evolution of case incidence has not been well characterized. METHODS Northwell Health Laboratories tested 46 793 persons for SARS-CoV-2 from 4 March through 10 April. The primary outcome measure was a positive reverse transcription-polymerase chain reaction test for SARS-CoV-2. The secondary outcomes included patient age, sex, and race, if stated; dates the specimen was obtained and the test result; clinical practice site sources; geolocation of patient residence; and hospitalization. RESULTS From 8 March through 10 April, a total of 26 735 of 46 793 persons (57.1%) tested positive for SARS-CoV-2. Males of each race were disproportionally more affected than females above age 25, with a progressive male predominance as age increased. Of the positive persons, 7292 were hospitalized directly upon presentation; an additional 882 persons tested positive in an ambulatory setting before subsequent hospitalization, a median of 4.8 days later. Total hospitalization rate was thus 8174 persons (30.6% of positive persons). There was a broad range (>10-fold) in the cumulative number of positive cases across individual zip codes following documented first case incidence. Test positivity was greater for persons living in zip codes with lower annual household income. CONCLUSIONS Our data reveal that SARS-CoV-2 incidence emerged rapidly and almost simultaneously across a broad demographic population in the region. These findings support the premise that SARS-CoV-2 infection was widely distributed prior to virus testing availability.
|
Research Support, N.I.H., Extramural |
5 |
19 |