1
Ribeiro-Dos-Santos AM, Maurano MT. Iterative improvement of deep learning models using synthetic regulatory genomics. bioRxiv 2025:2025.02.04.636130. [PMID: 39974895 PMCID: PMC11838587 DOI: 10.1101/2025.02.04.636130] [Indexed: 02/21/2025]
Abstract
Generative deep learning models can accurately reconstruct genome-wide epigenetic tracks from the reference genome sequence alone. But it is unclear what predictive power they have on sequence diverging from the reference, such as disease- and trait-associated variants or engineered sequences. Recent work has applied synthetic regulatory genomics to characterize dozens of deletions, inversions, and rearrangements of DNase I hypersensitive sites (DHSs). Here, we use the state-of-the-art model Enformer to predict DNA accessibility across these engineered sequences when delivered at their endogenous loci. At a high level, we observe a good correlation between accessibility predicted by Enformer and experimentally measured values. But model performance was best for sequences that more resembled the reference, such as single deletions or combinations of multiple DHSs. Predictive power was poorer for rearrangements affecting DHS order or orientation. We use these data to fine-tune Enformer, yielding significant reduction in prediction error. We show that this fine-tuning retains strong predictive performance for other tracks. Our results show that current deep learning models perform poorly when presented with novel sequence diverging in certain critical features from their training set. Thus, an iterative approach incorporating profiling of synthetic constructs can improve model generalizability, and ultimately enable functional classification of regulatory variants identified by population studies.
Affiliation(s)
- Matthew T Maurano
- Institute for Systems Genetics, NYU Grossman School of Medicine, New York, NY 10016, USA
- Department of Pathology, NYU Grossman School of Medicine, New York, NY 10016, USA
- Corresponding author:
2
Fishman V, Kuratov Y, Shmelev A, Petrov M, Penzar D, Shepelin D, Chekanov N, Kardymon O, Burtsev M. GENA-LM: a family of open-source foundational DNA language models for long sequences. Nucleic Acids Res 2025; 53:gkae1310. [PMID: 39817513 PMCID: PMC11734698 DOI: 10.1093/nar/gkae1310] [Received: 09/16/2024] [Accepted: 12/26/2024] [Indexed: 01/18/2025]
Abstract
Recent advancements in genomics, propelled by artificial intelligence, have unlocked unprecedented capabilities in interpreting genomic sequences, mitigating the need for exhaustive experimental analysis of complex, intertwined molecular processes inherent in DNA function. A significant challenge, however, resides in accurately decoding genomic sequences, which inherently involves comprehending rich contextual information dispersed across thousands of nucleotides. To address this need, we introduce GENA language model (GENA-LM), a suite of transformer-based foundational DNA language models capable of handling input lengths up to 36 000 base pairs. Notably, integrating the newly developed recurrent memory mechanism allows these models to process even larger DNA segments. We provide pre-trained versions of GENA-LM, including multispecies and taxon-specific models, demonstrating their capability for fine-tuning and addressing a spectrum of complex biological tasks with modest computational demands. While language models have already achieved significant breakthroughs in protein biology, GENA-LM showcases a similarly promising potential for reshaping the landscape of genomics and multi-omics data analysis. All models are publicly available on GitHub (https://github.com/AIRI-Institute/GENA_LM) and on HuggingFace (https://huggingface.co/AIRI-Institute). In addition, we provide a web service (https://dnalm.airi.net/) allowing user-friendly DNA annotation with GENA-LM models.
Affiliation(s)
- Veniamin Fishman
- AIRI, Presnenskaya embankment, 6 st22, Moscow, 123112, Russia
- Institute of Cytology and Genetics, Prospekt Akademika Lavrent'yeva, 10, Novosibirsk, 630090, Russia
- Yuri Kuratov
- AIRI, Presnenskaya embankment, 6 st22, Moscow, 123112, Russia
- Moscow Institute of Physics and Technology, 9 Institutskiy per., Dolgoprudny, Moscow, 141701, Russia
- Aleksei Shmelev
- AIRI, Presnenskaya embankment, 6 st22, Moscow, 123112, Russia
- HSE University, International laboratory of statistical and computational genomics, Moscow, 109028, Russia
- Maxim Petrov
- AIRI, Presnenskaya embankment, 6 st22, Moscow, 123112, Russia
- Dmitry Penzar
- AIRI, Presnenskaya embankment, 6 st22, Moscow, 123112, Russia
- Denis Shepelin
- AIRI, Presnenskaya embankment, 6 st22, Moscow, 123112, Russia
- Olga Kardymon
- AIRI, Presnenskaya embankment, 6 st22, Moscow, 123112, Russia
- Mikhail Burtsev
- London Institute for Mathematical Sciences Royal Institution, 21 Albemarle St, London W1S 4BS, UK
3
Koeppel J, Weller J, Vanderstichele T, Parts L. Engineering structural variants to interrogate genome function. Nat Genet 2024; 56:2623-2635. [PMID: 39533047 DOI: 10.1038/s41588-024-01981-7] [Received: 07/22/2024] [Accepted: 10/10/2024] [Indexed: 11/16/2024]
Abstract
Structural variation, such as deletions, duplications, inversions and complex rearrangements, can have profound effects on gene expression, genome stability, phenotypic diversity and disease susceptibility. Structural variants can encompass up to millions of bases and have the potential to rearrange substantial segments of the genome. They contribute considerably more to genetic diversity in human populations and have larger effects on phenotypic traits than point mutations. Until recently, our understanding of the effects of structural variants was driven mainly by studying naturally occurring variation. New genome-engineering tools capable of generating deletions, insertions, inversions and translocations, together with the discovery of new recombinases and advances in creating synthetic DNA constructs, now enable the design and generation of an extended range of structural variation. Here, we discuss these tools and examples of their application and highlight existing challenges that will need to be overcome to fully harness their potential.
4
Zhu Z, Han C, Huang S. New insights shed light on the enigma of genetic diversity and species complexity. Sci China Life Sci 2024; 67:2774-2776. [PMID: 39167323 DOI: 10.1007/s11427-023-2610-2] [Received: 10/09/2023] [Accepted: 05/04/2024] [Indexed: 08/23/2024]
Affiliation(s)
- Zuobin Zhu
- Xuzhou Engineering Research Center of Medical Genetics and Transformation, Key Laboratory of Genetic Foundation and Clinical Application, Xuzhou Medical University, Xuzhou, 221004, China.
- Conghui Han
- Department of Urology, Xuzhou Clinical School of Xuzhou Medical University, Xuzhou Central Hospital, Xuzhou, 221009, China.
- Shi Huang
- Xuzhou Engineering Research Center of Medical Genetics and Transformation, Key Laboratory of Genetic Foundation and Clinical Application, Xuzhou Medical University, Xuzhou, 221004, China.
- Center for Medical Genetics, School of Life Sciences, Central South University, Changsha, 410078, China.
5
Nyerges A, Chiappino-Pepe A, Budnik B, Baas-Thomas M, Flynn R, Yan S, Ostrov N, Liu M, Wang M, Zheng Q, Hu F, Chen K, Rudolph A, Chen D, Ahn J, Spencer O, Ayalavarapu V, Tarver A, Harmon-Smith M, Hamilton M, Blaby I, Yoshikuni Y, Hajian B, Jin A, Kintses B, Szamel M, Seregi V, Shen Y, Li Z, Church GM. Synthetic genomes unveil the effects of synonymous recoding. bioRxiv 2024:2024.06.16.599206. [PMID: 38915524 PMCID: PMC11195188 DOI: 10.1101/2024.06.16.599206] [Indexed: 06/26/2024]
Abstract
Engineering the genetic code of an organism provides the basis for (i) making any organism safely resistant to natural viruses and (ii) preventing genetic information flow into and out of genetically modified organisms while (iii) allowing the biosynthesis of genetically encoded unnatural polymers [1-4]. Achieving these three goals requires the reassignment of multiple of the 64 codons nature uses to encode proteins. However, synonymous codon replacement (recoding) is frequently lethal, and how recoding impacts fitness remains poorly explored. Here, we explore these effects using whole-genome synthesis, multiplexed directed evolution, and genome-transcriptome-translatome-proteome co-profiling on multiple recoded genomes. Using this information, we assemble a synthetic Escherichia coli genome in seven sections using only 57 codons to encode proteins. By discovering the rules responsible for the lethality of synonymous recoding and developing a data-driven multi-omics-based genome construction workflow that troubleshoots synthetic genomes, we overcome the lethal effects of 62,007 synonymous codon swaps and 11,108 additional genomic edits. We show that synonymous recoding induces transcriptional noise including new antisense RNAs, leading to drastic transcriptome and proteome perturbation. As the elimination of select codons from an organism's genetic code results in the widespread appearance of cryptic promoters, we show that synonymous codon choice may naturally evolve to minimize transcriptional noise. Our work provides the first genome-scale description of how synonymous codon changes influence organismal fitness and paves the way for the construction of functional genomes that provide genetic firewalls from natural ecosystems and safely produce biopolymers, drugs, and enzymes with an expanded chemistry.
Affiliation(s)
- Akos Nyerges
- Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
- Bogdan Budnik
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
- Regan Flynn
- Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
- Shirui Yan
- Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
- BGI Research, Shenzhen 518083, China
- Nili Ostrov
- Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
- Min Liu
- GenScript USA Inc., Piscataway, NJ 08854, USA
- Alexandra Rudolph
- Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
- Dawn Chen
- Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
- Jenny Ahn
- Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
- Owen Spencer
- Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
- Angela Tarver
- DOE Joint Genome Institute (JGI), Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Miranda Harmon-Smith
- DOE Joint Genome Institute (JGI), Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Matthew Hamilton
- DOE Joint Genome Institute (JGI), Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Ian Blaby
- DOE Joint Genome Institute (JGI), Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Yasuo Yoshikuni
- DOE Joint Genome Institute (JGI), Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Behnoush Hajian
- Center for the Development of Therapeutics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Adeline Jin
- GenScript USA Inc., Piscataway, NJ 08854, USA
- Balint Kintses
- Institute of Biochemistry, HUN-REN Biological Research Centre, Szeged, 6726, Hungary
- Monika Szamel
- Institute of Biochemistry, HUN-REN Biological Research Centre, Szeged, 6726, Hungary
- Viktoria Seregi
- Institute of Biochemistry, HUN-REN Biological Research Centre, Szeged, 6726, Hungary
- Yue Shen
- BGI Research, Shenzhen 518083, China
- BGI Research, Changzhou 213299, China
- Guangdong Provincial Key Laboratory of Genome Read and Write, BGI Research, Shenzhen 518083, China
- Zilong Li
- GenScript USA Inc., Piscataway, NJ 08854, USA
- George M. Church
- Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
6
Ordoñez R, Zhang W, Ellis G, Zhu Y, Ashe HJ, Ribeiro-Dos-Santos AM, Brosh R, Huang E, Hogan MS, Boeke JD, Maurano MT. Genomic context sensitizes regulatory elements to genetic disruption. Mol Cell 2024; 84:1842-1854.e7. [PMID: 38759624 PMCID: PMC11104518 DOI: 10.1016/j.molcel.2024.04.013] [Received: 10/10/2023] [Revised: 03/11/2024] [Accepted: 04/18/2024] [Indexed: 05/19/2024]
Abstract
Genomic context critically modulates regulatory function but is difficult to manipulate systematically. The murine insulin-like growth factor 2 (Igf2)/H19 locus is a paradigmatic model of enhancer selectivity, whereby CTCF occupancy at an imprinting control region directs downstream enhancers to activate either H19 or Igf2. We used synthetic regulatory genomics to repeatedly replace the native locus with 157-kb payloads, and we systematically dissected its architecture. Enhancer deletion and ectopic delivery revealed previously uncharacterized long-range regulatory dependencies at the native locus. Exchanging the H19 enhancer cluster with the Sox2 locus control region (LCR) showed that the H19 enhancers relied on their native surroundings while the Sox2 LCR functioned autonomously. Analysis of regulatory DNA actuation across cell types revealed that these enhancer clusters typify broader classes of context sensitivity genome wide. These results show that unexpected dependencies influence even well-studied loci, and our approach permits large-scale manipulation of complete loci to investigate the relationship between regulatory architecture and function.
Affiliation(s)
- Raquel Ordoñez
- Institute for Systems Genetics, NYU School of Medicine, New York, NY 10016, USA
- Weimin Zhang
- Institute for Systems Genetics, NYU School of Medicine, New York, NY 10016, USA
- Gwen Ellis
- Institute for Systems Genetics, NYU School of Medicine, New York, NY 10016, USA
- Yinan Zhu
- Institute for Systems Genetics, NYU School of Medicine, New York, NY 10016, USA
- Hannah J Ashe
- Institute for Systems Genetics, NYU School of Medicine, New York, NY 10016, USA
- Ran Brosh
- Institute for Systems Genetics, NYU School of Medicine, New York, NY 10016, USA
- Emily Huang
- Institute for Systems Genetics, NYU School of Medicine, New York, NY 10016, USA
- Megan S Hogan
- Institute for Systems Genetics, NYU School of Medicine, New York, NY 10016, USA
- Jef D Boeke
- Institute for Systems Genetics, NYU School of Medicine, New York, NY 10016, USA; Department of Biochemistry and Molecular Pharmacology, NYU School of Medicine, New York, NY 10016, USA; Department of Biomedical Engineering, NYU Tandon School of Engineering, Brooklyn, NY 11201, USA
- Matthew T Maurano
- Institute for Systems Genetics, NYU School of Medicine, New York, NY 10016, USA; Department of Pathology, NYU School of Medicine, New York, NY 10016, USA.
7
8
Musings on art and science. Nat Struct Mol Biol 2024; 31:391-392. [PMID: 38499831 DOI: 10.1038/s41594-024-01266-x] [Indexed: 03/20/2024]