1
Nguyen E, Poli M, Faizi M, Thomas A, Birch-Sykes C, Wornow M, Patel A, Rabideau C, Massaroli S, Bengio Y, Ermon S, Baccus SA, Ré C. HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution. ArXiv 2023:arXiv:2306.15794v2. [PMID: 37426456; PMCID: PMC10327243] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/11/2023]
Abstract
Genomic (DNA) sequences encode an enormous amount of information for gene regulation and protein synthesis. Similar to natural language models, researchers have proposed foundation models in genomics to learn generalizable features from unlabeled genome data that can then be fine-tuned for downstream tasks such as identifying regulatory elements. Due to the quadratic scaling of attention, previous Transformer-based genomic models have used 512 to 4k tokens as context (<0.001% of the human genome), significantly limiting the modeling of long-range interactions in DNA. In addition, these methods rely on tokenizers or fixed k-mers to aggregate meaningful DNA units, losing single nucleotide resolution, where subtle genetic variations can completely alter protein function via single nucleotide polymorphisms (SNPs). Recently, Hyena, a large language model based on implicit convolutions, was shown to match attention in quality while allowing longer context lengths and lower time complexity. Leveraging Hyena's new long-range capabilities, we present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide level, an up to 500x increase over previous dense attention-based models. HyenaDNA scales sub-quadratically in sequence length (training up to 160x faster than Transformer), uses single nucleotide tokens, and has full global context at each layer. We explore what longer context enables, including the first use of in-context learning in genomics. On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data. On the GenomicBenchmarks, HyenaDNA surpasses SotA on 7 of 8 datasets, on average by +10 accuracy points. Code at https://github.com/HazyResearch/hyena-dna.
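The contrast the abstract draws between fixed k-mers and single nucleotide tokens can be illustrated with a minimal sketch. It is not HyenaDNA's tokenizer: the function names, the toy sequences, and the approximate genome length of 3.1 billion base pairs are assumptions made for illustration only.

```python
# Illustrative sketch only; names and toy data are hypothetical,
# not taken from the HyenaDNA codebase.

APPROX_HUMAN_GENOME_BP = 3.1e9  # assumption: rough size of the human genome


def tokenize_nucleotides(seq):
    """One token per base, i.e. single nucleotide resolution."""
    return list(seq)


def tokenize_kmers(seq, k):
    """Non-overlapping fixed k-mers; a trailing partial k-mer is kept as-is."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def differing_positions(tokens_a, tokens_b):
    """Indices at which two equal-length token lists disagree."""
    return [i for i, (a, b) in enumerate(zip(tokens_a, tokens_b)) if a != b]


def context_fraction_percent(context_tokens):
    """Share of the genome covered if each token is one nucleotide."""
    return 100.0 * context_tokens / APPROX_HUMAN_GENOME_BP


reference = "ACGTACGTACGT"
variant = "ACGTACTTACGT"  # single-base substitution (G -> T) at index 6

# With one token per base, the variant is visible at exactly one position.
char_diffs = differing_positions(
    tokenize_nucleotides(reference), tokenize_nucleotides(variant)
)

# With 6-mers, the same variant swaps an entire token for a different one,
# so the position of the change inside the token is no longer explicit.
kmer_diffs = differing_positions(
    tokenize_kmers(reference, 6), tokenize_kmers(variant, 6)
)
```

Under these assumptions, `char_diffs` pinpoints the substituted base, while `kmer_diffs` only identifies which 6-mer changed. `context_fraction_percent(4096)` comes out below 0.001, consistent with the figure quoted in the abstract when each token is counted as one nucleotide.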
2
Wornow M, Xu Y, Thapa R, Patel B, Steinberg E, Fleming S, Pfeffer MA, Fries J, Shah NH. The shaky foundations of large language models and foundation models for electronic health records. NPJ Digit Med 2023; 6:135. [PMID: 37516790; PMCID: PMC10387101; DOI: 10.1038/s41746-023-00879-8] [Citation(s) in RCA: 23] [Impact Index Per Article: 23.0] [Received: 03/21/2023] [Accepted: 07/13/2023] [Indexed: 07/31/2023] Open
Abstract
The success of foundation models such as ChatGPT and AlphaFold has spurred significant interest in building similar models for electronic medical records (EMRs) to improve patient care and hospital operations. However, recent hype has obscured critical gaps in our understanding of these models' capabilities. In this narrative review, we examine 84 foundation models trained on non-imaging EMR data (i.e., clinical text and/or structured data) and create a taxonomy delineating their architectures, training data, and potential use cases. We find that most models are trained on small, narrowly-scoped clinical datasets (e.g., MIMIC-III) or broad, public biomedical corpora (e.g., PubMed) and are evaluated on tasks that do not provide meaningful insights on their usefulness to health systems. Considering these findings, we propose an improved evaluation framework for measuring the benefits of clinical foundation models that is more closely grounded to metrics that matter in healthcare.
Affiliation(s)
- Michael Wornow
  - Department of Computer Science, Stanford University, Stanford, CA, USA
- Yizhe Xu
  - Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, CA, USA
- Rahul Thapa
  - Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, CA, USA
- Birju Patel
  - Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, CA, USA
- Ethan Steinberg
  - Department of Computer Science, Stanford University, Stanford, CA, USA
- Scott Fleming
  - Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, CA, USA
- Michael A Pfeffer
  - Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, CA, USA
  - Technology and Digital Services, Stanford Health Care, Palo Alto, CA, USA
- Jason Fries
  - Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, CA, USA
- Nigam H Shah
  - Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, CA, USA
  - Technology and Digital Services, Stanford Health Care, Palo Alto, CA, USA
  - Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
  - Clinical Excellence Research Center, Stanford University School of Medicine, Stanford, CA, USA
3
Chen JC, Chen JP, Shen MW, Wornow M, Bae M, Yeh WH, Hsu A, Liu DR. Generating experimentally unrelated target molecule-binding highly functionalized nucleic-acid polymers using machine learning. Nat Commun 2022; 13:4541. [PMID: 35927274; PMCID: PMC9352670; DOI: 10.1038/s41467-022-31955-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/08/2021] [Accepted: 07/11/2022] [Indexed: 11/09/2022] Open
Abstract
In vitro selection queries large combinatorial libraries for sequence-defined polymers with target binding and reaction catalysis activity. While the total sequence space of these libraries can extend beyond 10^22 sequences, practical considerations limit starting sequences to ≤~10^15 distinct molecules. Selection-induced sequence convergence and limited sequencing depth further constrain experimentally observable sequence space. To address these limitations, we integrate experimental and machine learning approaches to explore regions of sequence space unrelated to experimentally derived variants. We perform in vitro selections to discover highly side-chain-functionalized nucleic acid polymers (HFNAPs) with potent affinities for a target small molecule (daunomycin K_D = 5-65 nM). We then use the selection data to train a conditional variational autoencoder (CVAE) machine learning model to generate diverse and unique HFNAP sequences with high daunomycin affinities (K_D = 9-26 nM), even though they are unrelated in sequence to experimental polymers. Coupling in vitro selection with a machine learning model thus enables direct generation of active variants, demonstrating a new approach to the discovery of functional biopolymers.
Affiliation(s)
- Jonathan C. Chen
  - Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA
  - Howard Hughes Medical Institute, Harvard University, Cambridge, MA, USA
- Jonathan P. Chen
  - Work conducted at Uber AI Labs, Uber Technologies, Inc., San Francisco, CA, USA
  - Meta Platforms, Menlo Park, CA, USA
- Max W. Shen
  - Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA
  - Howard Hughes Medical Institute, Harvard University, Cambridge, MA, USA
  - Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, MA, USA
- Michael Wornow
  - Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA
- Minwoo Bae
  - Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA
- Wei-Hsi Yeh
  - Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA
  - Howard Hughes Medical Institute, Harvard University, Cambridge, MA, USA
  - Program in Speech and Hearing Bioscience and Technology, Harvard Medical School, Boston, MA, USA
- Alvin Hsu
  - Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA
  - Howard Hughes Medical Institute, Harvard University, Cambridge, MA, USA
- David R. Liu
  - Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA
  - Howard Hughes Medical Institute, Harvard University, Cambridge, MA, USA
4
Yeh WH, Shubina-Oleinik O, Levy JM, Pan B, Newby GA, Wornow M, Burt R, Chen JC, Holt JR, Liu DR. In vivo base editing restores sensory transduction and transiently improves auditory function in a mouse model of recessive deafness. Sci Transl Med 2020; 12:eaay9101. [PMID: 32493795; DOI: 10.1126/scitranslmed.aay9101] [Citation(s) in RCA: 100] [Impact Index Per Article: 33.3] [Received: 07/28/2019] [Accepted: 04/05/2020] [Indexed: 12/11/2022]
Abstract
Most genetic diseases arise from recessive point mutations that require correction, rather than disruption, of the pathogenic allele to benefit patients. Base editing has the potential to directly repair point mutations and provide therapeutic restoration of gene function. Mutations of transmembrane channel-like 1 gene (TMC1) can cause dominant or recessive deafness. We developed a base editing strategy to treat Baringo mice, which carry a recessive, loss-of-function point mutation (c.A545G; resulting in the substitution p.Y182C) in Tmc1 that causes deafness. Tmc1 encodes a protein that forms mechanosensitive ion channels in sensory hair cells of the inner ear and is required for normal auditory function. We found that sensory hair cells of Baringo mice have a complete loss of auditory sensory transduction. To repair the mutation, we tested several optimized cytosine base editors (CBEmax variants) and guide RNAs in Baringo mouse embryonic fibroblasts. We packaged the most promising CBE, derived from an activation-induced cytidine deaminase (AID), into dual adeno-associated viruses (AAVs) using a split-intein delivery system. The dual AID-CBEmax AAVs were injected into the inner ears of Baringo mice at postnatal day 1. Injected mice showed up to 51% reversion of the Tmc1 c.A545G point mutation to wild-type sequence (c.A545A) in Tmc1 transcripts. Repair of Tmc1 in vivo restored inner hair cell sensory transduction and hair cell morphology and transiently rescued low-frequency hearing 4 weeks after injection. These findings provide a foundation for a potential one-time treatment for recessive hearing loss and support further development of base editing to correct pathogenic point mutations.
Affiliation(s)
- Wei-Hsi Yeh
  - Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA
  - Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA 02138, USA
  - Program in Speech and Hearing Bioscience and Technology, Harvard Medical School, Boston, MA 02115, USA
- Olga Shubina-Oleinik
  - Department of Otolaryngology, F.M. Kirby Neurobiology Center, Boston Children's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Jonathan M Levy
  - Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA
  - Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA 02138, USA
- Bifeng Pan
  - Department of Otolaryngology, F.M. Kirby Neurobiology Center, Boston Children's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Gregory A Newby
  - Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA
  - Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA 02138, USA
- Michael Wornow
  - Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA
  - Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA 02138, USA
- Rachel Burt
  - Murdoch Children's Research Institute, The Royal Children's Hospital, Parkville, VIC 3052, Australia
- Jonathan C Chen
  - Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA
  - Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA 02138, USA
- Jeffrey R Holt
  - Department of Otolaryngology, F.M. Kirby Neurobiology Center, Boston Children's Hospital and Harvard Medical School, Boston, MA 02115, USA
  - Department of Neurology, F.M. Kirby Neurobiology Center, Boston Children's Hospital and Harvard Medical School, Boston, MA 02115, USA
- David R Liu
  - Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA
  - Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA 02138, USA
  - Howard Hughes Medical Institute, Harvard University, Cambridge, MA 02138, USA
5
Michelson KA, Rees CA, Sarathy J, VonAchen P, Wornow M, Monuteaux MC, Neuman MI. Inter-Region Transfers for Pandemic Surges. Clin Infect Dis 2020; 73:e4103-e4110. [PMID: 33038215; PMCID: PMC7665371; DOI: 10.1093/cid/ciaa1549] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/31/2020] [Indexed: 11/12/2022] Open
Abstract
Background: Hospital inpatient and intensive care unit (ICU) bed shortfalls may arise due to regional surges in volume. We sought to determine how inter-region transfers could alleviate bed shortfalls during a pandemic. Methods: We used estimates of past and projected inpatient and ICU cases of COVID-19 from February 4, 2020 to October 1, 2020. For regions with bed shortfalls (where the number of patients exceeded bed capacity), transfers to the nearest region with unused beds were simulated using an algorithm that minimized total inter-region transfer distances across the U.S. Model scenarios used a range of predicted COVID-19 volumes (lower, mean, and upper bounds) and non-COVID-19 volumes (20%, 50%, or 80% of baseline hospital volumes). Scenarios were created for each day of data, and worst-case scenarios were created treating all regions' peak volumes as simultaneous. Mean per-patient transfer distances were calculated by scenario. Results: For the worst-case scenarios, national bed shortfalls ranged from 669 to 58,562 inpatient beds and 3,208 to 31,190 ICU beds, depending on model volume parameters. Mean transfer distances to alleviate daily bed shortfalls ranged from 23 to 352 miles for inpatient and 28 to 423 miles for ICU patients, depending on volume. Under all worst-case scenarios except the highest-volume ICU scenario, inter-regional transfers could fully resolve bed shortfalls. To do so, mean transfer distances would be 24 to 405 miles for inpatients and 73 to 476 miles for ICU patients. Conclusions: Inter-region transfers could mitigate regional bed shortfalls during pandemic hospital surges.
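The transfer simulation described in the Methods can be sketched in simplified form. The abstract does not specify the authors' algorithm beyond stating that it minimized total inter-region transfer distance, so what follows is a greedy nearest-pair heuristic swapped in for illustration; it is not guaranteed to find the minimum-distance assignment. All function names, region labels, and numbers are hypothetical.

```python
# Greedy heuristic sketch; NOT the study's optimization method.
# Region names, counts, and distances below are made-up toy values.

def greedy_transfers(shortfall, surplus, distance):
    """Move patients from over-capacity regions to regions with unused beds.

    shortfall: dict region -> patients exceeding bed capacity
    surplus:   dict region -> unused beds
    distance:  dict (from_region, to_region) -> miles
    Returns (transfers, unplaced), where transfers is a list of
    (from_region, to_region, patients, miles) tuples.
    """
    remaining_need = dict(shortfall)
    remaining_beds = dict(surplus)
    # Consider the shortest region pairs first.
    pairs = sorted((distance[(s, t)], s, t) for s in shortfall for t in surplus)
    transfers = []
    for miles, src, dst in pairs:
        moved = min(remaining_need[src], remaining_beds[dst])
        if moved > 0:
            transfers.append((src, dst, moved, miles))
            remaining_need[src] -= moved
            remaining_beds[dst] -= moved
    unplaced = sum(remaining_need.values())
    return transfers, unplaced


def mean_transfer_distance(transfers):
    """Mean per-patient transfer distance in miles (0.0 if nobody moved)."""
    total_patients = sum(n for _, _, n, _ in transfers)
    if total_patients == 0:
        return 0.0
    return sum(n * miles for _, _, n, miles in transfers) / total_patients


shortfall = {"A": 30, "B": 10}
surplus = {"C": 25, "D": 40}
distance = {("A", "C"): 50, ("A", "D"): 120, ("B", "C"): 80, ("B", "D"): 60}

transfers, unplaced = greedy_transfers(shortfall, surplus, distance)
mean_miles = mean_transfer_distance(transfers)
```

In this toy case every patient is placed (`unplaced` is 0) and the mean per-patient distance is 61.25 miles. A faithful reproduction of the study would more likely pose the assignment as a transportation problem and solve it with a linear-programming solver, which this sketch does not attempt.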
Affiliation(s)
- Kenneth A Michelson
  - Division of Emergency Medicine, Boston Children's Hospital, Boston, MA, United States
- Chris A Rees
  - Division of Emergency Medicine, Boston Children's Hospital, Boston, MA, United States
- Jayshree Sarathy
  - Harvard John A. Paulson School of Engineering and Applied Sciences, Cambridge, MA, United States
- Paige VonAchen
  - Department of Pediatrics, Boston Children's Hospital and Boston Medical Center, Boston, MA, United States
- Michael Wornow
  - Harvard John A. Paulson School of Engineering and Applied Sciences, Cambridge, MA, United States
- Michael C Monuteaux
  - Division of Emergency Medicine, Boston Children's Hospital, Boston, MA, United States
- Mark I Neuman
  - Division of Emergency Medicine, Boston Children's Hospital, Boston, MA, United States