1
Yu T, Cheng L, Khalitov R, Olsson EB, Yang Z. Self-distillation improves self-supervised learning for DNA sequence inference. Neural Netw 2025; 183:106978. [PMID: 39667220 DOI: 10.1016/j.neunet.2024.106978] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2024] [Revised: 10/28/2024] [Accepted: 11/26/2024] [Indexed: 12/14/2024]
Abstract
Self-supervised Learning (SSL) has been recognized as a method to enhance prediction accuracy in various downstream tasks. However, its efficacy for DNA sequences remains somewhat constrained. This limitation stems primarily from the fact that most existing SSL approaches in genomics focus on masked language modeling of individual sequences, neglecting the crucial aspect of encoding statistics across multiple sequences. To overcome this challenge, we introduce an innovative deep neural network model, which incorporates collaborative learning between a 'student' and a 'teacher' subnetwork. In this model, the student subnetwork employs masked learning on nucleotides and progressively adapts its parameters to the teacher subnetwork through an exponential moving average approach. Concurrently, both subnetworks engage in contrastive learning, deriving insights from two augmented representations of the input sequences. This self-distillation process enables our model to effectively assimilate both contextual information from individual sequences and distributional data across the sequence population. We validated our approach with preliminary pretraining using the human reference genome, followed by applying it to 20 downstream inference tasks. The empirical results from these experiments demonstrate that our novel method significantly boosts inference performance across the majority of these tasks. Our code is available at https://github.com/wiedersehne/FinDNA.
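The exponential moving average coupling mentioned in the abstract can be sketched in a few lines. This is only an illustration under a common self-distillation convention, in which the teacher's parameters are nudged toward the student's after each training step; the function name `ema_update`, the `decay` value, and the use of plain lists as parameters are assumptions and are not taken from the FinDNA code.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher parameters as an exponential moving average.

    Each teacher value keeps a `decay` share of itself and takes the
    remaining share from the matching student value (hypothetical helper).
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Toy parameters: the student is held fixed so the drift is easy to follow.
teacher = [0.0, 0.0]
student = [1.0, -2.0]
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.5)

print(teacher)
```

A decay close to 1 makes the teacher change slowly, which is the usual motivation for using it as a stable target.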
Affiliation(s)
- Tong Yu
  - Norwegian University of Science and Technology, Trondheim, Norway
- Lei Cheng
  - Norwegian University of Science and Technology, Trondheim, Norway
- Ruslan Khalitov
  - Norwegian University of Science and Technology, Trondheim, Norway
- Erland B Olsson
  - Norwegian University of Science and Technology, Trondheim, Norway
- Zhirong Yang
  - Norwegian University of Science and Technology, Trondheim, Norway
2
Friedman RZ, Ramu A, Lichtarge S, Wu Y, Tripp L, Lyon D, Myers CA, Granas DM, Gause M, Corbo JC, Cohen BA, White MA. Active learning of enhancers and silencers in the developing neural retina. Cell Syst 2025; 16:101163. [PMID: 39778579 DOI: 10.1016/j.cels.2024.12.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2024] [Revised: 10/17/2024] [Accepted: 12/06/2024] [Indexed: 01/11/2025]
Abstract
Deep learning is a promising strategy for modeling cis-regulatory elements. However, models trained on genomic sequences often fail to explain why the same transcription factor can activate or repress transcription in different contexts. To address this limitation, we developed an active learning approach to train models that distinguish between enhancers and silencers composed of binding sites for the photoreceptor transcription factor cone-rod homeobox (CRX). After training the model on nearly all bound CRX sites from the genome, we coupled synthetic biology with uncertainty sampling to generate additional rounds of informative training data. This allowed us to iteratively train models on data from multiple rounds of massively parallel reporter assays. The ability of the resulting models to discriminate between CRX sites with identical sequence but opposite functions establishes active learning as an effective strategy to train models of regulatory DNA. A record of this paper's transparent peer review process is included in the supplemental information.
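The uncertainty sampling step described in the abstract can be illustrated with a small sketch. The abstract does not say which uncertainty measure was used, so the binary entropy score below is an assumption, as are the function names and the toy sequence identifiers.

```python
import math


def binary_entropy(p):
    """Entropy (in bits) of a predicted probability; highest at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_most_uncertain(probs, k):
    """Return the k candidate names whose predictions are least certain.

    `probs` maps a candidate name to the model's predicted probability
    that the sequence is an enhancer (hypothetical interface).
    """
    ranked = sorted(probs, key=lambda name: binary_entropy(probs[name]), reverse=True)
    return ranked[:k]


# Toy predictions for four hypothetical candidate sequences.
predictions = {"seq_a": 0.95, "seq_b": 0.52, "seq_c": 0.10, "seq_d": 0.40}
next_round = select_most_uncertain(predictions, k=2)
print(next_round)
```

In an active learning loop, the selected candidates would be the ones sent for the next round of measurement before retraining.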
Affiliation(s)
- Ryan Z Friedman
  - The Edison Family Center for Genome Sciences & Systems Biology, Saint Louis, MO 63110, USA; Department of Genetics, Saint Louis, MO 63110, USA
- Avinash Ramu
  - The Edison Family Center for Genome Sciences & Systems Biology, Saint Louis, MO 63110, USA; Department of Genetics, Saint Louis, MO 63110, USA
- Sara Lichtarge
  - The Edison Family Center for Genome Sciences & Systems Biology, Saint Louis, MO 63110, USA; Department of Genetics, Saint Louis, MO 63110, USA
- Yawei Wu
  - The Edison Family Center for Genome Sciences & Systems Biology, Saint Louis, MO 63110, USA; Department of Genetics, Saint Louis, MO 63110, USA
- Lloyd Tripp
  - The Edison Family Center for Genome Sciences & Systems Biology, Saint Louis, MO 63110, USA; Department of Genetics, Saint Louis, MO 63110, USA
- Daniel Lyon
  - The Edison Family Center for Genome Sciences & Systems Biology, Saint Louis, MO 63110, USA; Department of Genetics, Saint Louis, MO 63110, USA
- Connie A Myers
  - Department of Pathology and Immunology, Washington University School of Medicine, Saint Louis, MO 63110, USA
- David M Granas
  - The Edison Family Center for Genome Sciences & Systems Biology, Saint Louis, MO 63110, USA; Department of Genetics, Saint Louis, MO 63110, USA
- Maria Gause
  - Department of Pathology and Immunology, Washington University School of Medicine, Saint Louis, MO 63110, USA
- Joseph C Corbo
  - Department of Pathology and Immunology, Washington University School of Medicine, Saint Louis, MO 63110, USA
- Barak A Cohen
  - The Edison Family Center for Genome Sciences & Systems Biology, Saint Louis, MO 63110, USA; Department of Genetics, Saint Louis, MO 63110, USA
- Michael A White
  - The Edison Family Center for Genome Sciences & Systems Biology, Saint Louis, MO 63110, USA; Department of Genetics, Saint Louis, MO 63110, USA
3
Fradkin P, Shi R, Isaev K, Frey BJ, Morris Q, Lee LJ, Wang B. Orthrus: Towards Evolutionary and Functional RNA Foundation Models. bioRxiv 2024:2024.10.10.617658. [PMID: 39416135 PMCID: PMC11482885 DOI: 10.1101/2024.10.10.617658] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2024]
Abstract
In the face of rapidly accumulating genomic data, our ability to accurately predict key mature RNA properties that underlie transcript function and regulation remains limited. Pre-trained genomic foundation models offer an avenue to adapt learned RNA representations to biological prediction tasks. However, existing genomic foundation models are trained using strategies borrowed from textual or visual domains that do not leverage biological domain knowledge. Here, we introduce Orthrus, a Mamba-based mature RNA foundation model pre-trained using a novel self-supervised contrastive learning objective with biological augmentations. Orthrus is trained by maximizing embedding similarity between curated pairs of RNA transcripts, where pairs are formed from splice isoforms of 10 model organisms and transcripts from orthologous genes in 400+ mammalian species from the Zoonomia Project. This training objective results in a latent representation that clusters RNA sequences with functional and evolutionary similarities. We find that the generalized mature RNA isoform representations learned by Orthrus significantly outperform existing genomic foundation models on five mRNA property prediction tasks, and require only a fraction of the fine-tuning data to do so. Finally, we show that Orthrus is capable of capturing the divergent biological functions of individual transcript isoforms.
Affiliation(s)
- Philip Fradkin
  - Vector Institute, Ontario, Canada
  - Computer Science, University of Toronto, Ontario, Canada
- Ruian Shi
  - Vector Institute, Ontario, Canada
  - Computer Science, University of Toronto, Ontario, Canada
  - Computational and Systems Biology Program, Sloan Kettering Institute, New York, United States
- Keren Isaev
  - New York Genome Center, New York, United States
  - Systems Biology, Columbia University, New York, United States
- Brendan J Frey
  - Vector Institute, Ontario, Canada
  - Computer Science, University of Toronto, Ontario, Canada
  - Electrical and Computer Engineering, University of Toronto, Ontario, Canada
- Quaid Morris
  - Computational and Systems Biology Program, Sloan Kettering Institute, New York, United States
- Leo J Lee
  - Vector Institute, Ontario, Canada
  - Electrical and Computer Engineering, University of Toronto, Ontario, Canada
- Bo Wang
  - Vector Institute, Ontario, Canada
  - Computer Science, University of Toronto, Ontario, Canada
  - Peter Munk Cardiac Center, University Health Network, Ontario, Canada
4
Zhou J, Rizzo K, Tang Z, Koo PK. Uncertainty-aware genomic deep learning with knowledge distillation. bioRxiv 2024:2024.11.13.623485. [PMID: 39605624 PMCID: PMC11601481 DOI: 10.1101/2024.11.13.623485] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2024]
Abstract
Deep neural networks (DNNs) have advanced predictive modeling for regulatory genomics, but challenges remain in ensuring the reliability of their predictions and understanding the key factors behind their decision making. Here we introduce DEGU (Distilling Ensembles for Genomic Uncertainty-aware models), a method that integrates ensemble learning and knowledge distillation to improve the robustness and explainability of DNN predictions. DEGU distills the predictions of an ensemble of DNNs into a single model, capturing both the average of the ensemble's predictions and the variability across them, with the latter representing epistemic (or model-based) uncertainty. DEGU also includes an optional auxiliary task to estimate aleatoric, or data-based, uncertainty by modeling variability across experimental replicates. By applying DEGU across various functional genomic prediction tasks, we demonstrate that DEGU-trained models inherit the performance benefits of ensembles in a single model, with improved generalization to out-of-distribution sequences and more consistent explanations of cis-regulatory mechanisms through attribution analysis. Moreover, DEGU-trained models provide calibrated uncertainty estimates, with conformal prediction offering coverage guarantees under minimal assumptions. Overall, DEGU paves the way for robust and trustworthy applications of deep learning in genomics research.
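The two distillation targets described in the abstract, the ensemble average and the variability across ensemble members, can be sketched as follows. This is not the DEGU implementation: the function name, the list-of-lists input layout, and the choice of population standard deviation as the variability measure are assumptions made for illustration.

```python
from statistics import mean, pstdev


def distillation_targets(ensemble_preds):
    """Compute per-sequence targets for a student model from an ensemble.

    `ensemble_preds` holds one list of predictions per ensemble member,
    all aligned to the same sequences (hypothetical layout).
    Returns (means, spreads), where the spread stands in for epistemic uncertainty.
    """
    per_sequence = list(zip(*ensemble_preds))  # regroup predictions by sequence
    means = [mean(values) for values in per_sequence]
    spreads = [pstdev(values) for values in per_sequence]
    return means, spreads


# Two toy ensemble members scoring the same two sequences.
means, spreads = distillation_targets([[1.0, 2.0], [3.0, 2.0]])
print(means, spreads)
```

A student trained on both outputs would predict the ensemble mean together with an estimate of how much the ensemble members disagree.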
Affiliation(s)
- Jessica Zhou
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, NY, USA
- Kaeli Rizzo
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, NY, USA
- Ziqi Tang
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, NY, USA
  - Currently at InstaDeep, Cambridge, MA, USA
- Peter K Koo
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, NY, USA
5
Phan H, Brouard C, Mourad R. Semi-supervised learning with pseudo-labeling compares favorably with large language models for regulatory sequence prediction. Brief Bioinform 2024; 25:bbae560. [PMID: 39489607 PMCID: PMC11531863 DOI: 10.1093/bib/bbae560] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2024] [Revised: 09/13/2024] [Accepted: 10/17/2024] [Indexed: 11/05/2024]
Abstract
Predicting molecular processes using deep learning is a promising approach to provide biological insights for non-coding single nucleotide polymorphisms identified in genome-wide association studies. However, most deep learning methods rely on supervised learning, which requires DNA sequences associated with functional data, the amount of which is severely limited by the finite size of the human genome. Conversely, the amount of mammalian DNA sequences is growing exponentially due to ongoing large-scale sequencing projects, but in most cases without functional data. To alleviate the limitations of supervised learning, we propose a novel semi-supervised learning (SSL) approach based on pseudo-labeling, which makes it possible to exploit unlabeled DNA sequences from numerous genomes during model pre-training. We further improved it by incorporating principles from the Noisy Student algorithm to predict the confidence in pseudo-labeled data used for pre-training, which showed improvements for transcription factors with very few binding sites (very small training data). The approach is very flexible and can be used to train any neural architecture, including state-of-the-art models, and shows in most cases strong predictive performance improvements compared to standard supervised learning. Moreover, small models trained by SSL showed similar or better performance than the large language model DNABERT2.
Affiliation(s)
- Han Phan
  - INRAE, MIAT, 31326 Castanet-Tolosan, France
- Raphaël Mourad
  - INRAE, MIAT, 31326 Castanet-Tolosan, France
  - University of Toulouse, UPS, 31062 Toulouse, France
6
Chen K, Nan J, Xiong X. Genetic regulation of m6A RNA methylation and its contribution in human complex diseases. Sci China Life Sci 2024; 67:1591-1600. [PMID: 38764000 DOI: 10.1007/s11427-024-2609-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Accepted: 05/02/2024] [Indexed: 05/21/2024]
Abstract
N6-methyladenosine (m6A) has been established as the most prevalent chemical modification in messenger RNA (mRNA), playing an essential role in determining the fate of RNA molecules. Dysregulation of m6A has been revealed to lead to abnormal physiological conditions and cause various types of human diseases. Recent studies have delineated the genetic regulatory maps for m6A methylation by mapping the quantitative trait loci of m6A (m6A-QTLs), thereby building up the regulatory circuits linking genetic variants, m6A, and human complex traits. Here, we review the recent discoveries concerning the genetic regulatory maps of m6A, describing the methodological and technical details of m6A-QTL identification, and introducing the key findings of the cis- and trans-acting drivers of m6A. We further delve into the tissue- and ethnicity-specificity of m6A-QTLs, the association with other molecular phenotypes in light of genetic regulation, the regulators underlying m6A genetics, and importantly, the functional roles of m6A in mediating human complex diseases. Lastly, we discuss potential research avenues that can accelerate the translation of m6A genetics studies toward the development of therapies for human genetic diseases.
Affiliation(s)
- Kexuan Chen
  - The Second Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 311121, China
  - State Key Laboratory of Transvascular Implantation Devices, The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, 311121, China
- Jiuhong Nan
  - The Second Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 311121, China
  - State Key Laboratory of Transvascular Implantation Devices, The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, 311121, China
- Xushen Xiong
  - The Second Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 311121, China
  - State Key Laboratory of Transvascular Implantation Devices, The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, 311121, China
7
Lee H, Ozbulak U, Park H, Depuydt S, De Neve W, Vankerschaver J. Assessing the reliability of point mutation as data augmentation for deep learning with genomic data. BMC Bioinformatics 2024; 25:170. [PMID: 38689247 PMCID: PMC11059627 DOI: 10.1186/s12859-024-05787-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2024] [Accepted: 04/15/2024] [Indexed: 05/02/2024]
Abstract
BACKGROUND Deep neural networks (DNNs) have the potential to revolutionize our understanding and treatment of genetic diseases. An inherent limitation of deep neural networks, however, is their high demand for data during training. To overcome this challenge, other fields, such as computer vision, use various data augmentation techniques to artificially increase the available training data for DNNs. Unfortunately, most data augmentation techniques used in other domains do not transfer well to genomic data. RESULTS Most genomic data possesses peculiar properties and data augmentations may significantly alter the intrinsic properties of the data. In this work, we propose a novel data augmentation technique for genomic data inspired by biology: point mutations. By employing point mutations as substitutes for codons, we demonstrate that our newly proposed data augmentation technique enhances the performance of DNNs across various genomic tasks that involve coding regions, such as translation initiation and splice site detection. CONCLUSION Silent and missense mutations are found to positively influence effectiveness, while nonsense mutations and random mutations in non-coding regions generally lead to degradation. Overall, point mutation-based augmentations in genomic datasets present valuable opportunities for improving the accuracy and reliability of predictive models for DNA sequences.
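A silent point mutation, the kind of augmentation the abstract reports as helpful, can be sketched as below. This is a minimal illustration and not the authors' code: the helper names are hypothetical, the standard genetic code is assumed, and the input is assumed to be an in-frame coding sequence over A, C, G, T.

```python
import random
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate an in-frame DNA sequence, ignoring any trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


def silent_point_mutation(seq, rng):
    """Change one nucleotide so that the encoded protein stays the same.

    Codons are visited in random order; the first one with a synonymous
    single-base variant is mutated. Returns `seq` unchanged if none exists.
    """
    starts = list(range(0, len(seq) - len(seq) % 3, 3))
    rng.shuffle(starts)
    for start in starts:
        codon = seq[start:start + 3]
        options = [
            codon[:pos] + base + codon[pos + 1:]
            for pos in range(3)
            for base in BASES
            if base != codon[pos]
        ]
        options = [alt for alt in options if CODON_TABLE[alt] == CODON_TABLE[codon]]
        if options:
            return seq[:start] + rng.choice(options) + seq[start + 3:]
    return seq


original = "ATGGCTGAA"  # Met-Ala-Glu
augmented = silent_point_mutation(original, random.Random(0))
print(original, augmented, translate(augmented))
```

Restricting the change to synonymous codons keeps the label-relevant protein sequence intact, which is one plausible reading of why silent mutations behave better than nonsense mutations as augmentations.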
Affiliation(s)
- Utku Ozbulak
  - Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
- Homin Park
  - Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
  - IDLab, Department of Electronics and Information Systems, Ghent University, Ghent, Belgium
- Stephen Depuydt
  - Erasmus Brussels University of Applied Sciences and Arts, Brussels, Belgium
- Wesley De Neve
  - Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
  - IDLab, Department of Electronics and Information Systems, Ghent University, Ghent, Belgium
- Joris Vankerschaver
  - Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium
8
Duncan AG, Mitchell JA, Moses AM. Improving the performance of supervised deep learning for regulatory genomics using phylogenetic augmentation. Bioinformatics 2024; 40:btae190. [PMID: 38588559 PMCID: PMC11042905 DOI: 10.1093/bioinformatics/btae190] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2023] [Revised: 01/12/2024] [Accepted: 04/05/2024] [Indexed: 04/10/2024]
Abstract
MOTIVATION Supervised deep learning is used to model the complex relationship between genomic sequence and regulatory function. Understanding how these models make predictions can provide biological insight into regulatory functions. Given the complexity of the sequence to regulatory function mapping (the cis-regulatory code), it has been suggested that the genome contains insufficient sequence variation to train models with suitable complexity. Data augmentation is a widely used approach to increase the data variation available for model training; however, current data augmentation methods for genomic sequence data are limited. RESULTS Inspired by the success of comparative genomics, we show that augmenting genomic sequences with evolutionarily related sequences from other species, which we term phylogenetic augmentation, improves the performance of deep learning models trained on regulatory genomic sequences to predict high-throughput functional assay measurements. Additionally, we show that phylogenetic augmentation can rescue model performance when the training set is down-sampled and permits deep learning on a real-world small dataset, demonstrating that this approach improves data efficiency. Overall, this data augmentation method represents a solution for improving model performance that is applicable to many supervised deep-learning problems in genomics. AVAILABILITY AND IMPLEMENTATION The open-source GitHub repository agduncan94/phylogenetic_augmentation_paper includes the code for rerunning the analyses here and recreating the figures.
Affiliation(s)
- Andrew G Duncan
  - Cell & Systems Biology, University of Toronto, Toronto, ON M5S 3G5, Canada
- Alan M Moses
  - Cell & Systems Biology, University of Toronto, Toronto, ON M5S 3G5, Canada
9
Unger M, Kather JN. Deep learning in cancer genomics and histopathology. Genome Med 2024; 16:44. [PMID: 38539231 PMCID: PMC10976780 DOI: 10.1186/s13073-024-01315-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/15/2023] [Accepted: 03/13/2024] [Indexed: 07/08/2024]
Abstract
Histopathology and genomic profiling are cornerstones of precision oncology and are routinely obtained for patients with cancer. Traditionally, histopathology slides are manually reviewed by highly trained pathologists. Genomic data, on the other hand, is evaluated by engineered computational pipelines. In both applications, the advent of modern artificial intelligence methods, specifically machine learning (ML) and deep learning (DL), has opened up a fundamentally new way of extracting actionable insights from raw data, which could augment and potentially replace some aspects of traditional evaluation workflows. In this review, we summarize current and emerging applications of DL in histopathology and genomics, including basic diagnostic as well as advanced prognostic tasks. Based on a growing body of evidence, we suggest that DL could be the groundwork for a new kind of workflow in oncology and cancer research. However, we also point out that DL models can have biases and other flaws that users in healthcare and research need to know about, and we propose ways to address them.
Affiliation(s)
- Michaela Unger
  - Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany
- Jakob Nikolas Kather
  - Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany
  - Department of Medicine I, University Hospital Dresden, Dresden, Germany
  - Medical Oncology, National Center for Tumor Diseases (NCT), University Hospital Heidelberg, Heidelberg, Germany
10
Yu Y, Muthukumar S, Koo PK. EvoAug-TF: extending evolution-inspired data augmentations for genomic deep learning to TensorFlow. Bioinformatics 2024; 40:btae092. [PMID: 38366935 PMCID: PMC10918628 DOI: 10.1093/bioinformatics/btae092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2023] [Revised: 02/01/2024] [Accepted: 02/14/2024] [Indexed: 02/19/2024]
Abstract
SUMMARY Deep neural networks (DNNs) have been widely applied to predict the molecular functions of the non-coding genome. DNNs are data hungry and thus require many training examples to fit data well. However, functional genomics experiments typically generate limited amounts of data, constrained by the activity levels of the molecular function under study inside the cell. Recently, EvoAug was introduced to train a genomic DNN with evolution-inspired augmentations. EvoAug-trained DNNs have demonstrated improved generalization and interpretability with attribution analysis. However, EvoAug only supports PyTorch-based models, which limits its applications to a broad class of genomic DNNs based in TensorFlow. Here, we extend EvoAug's functionality to TensorFlow in a new package we call EvoAug-TF. Through a systematic benchmark, we find that EvoAug-TF yields comparable performance with the original EvoAug package. AVAILABILITY AND IMPLEMENTATION EvoAug-TF is freely available for users and is distributed under an open-source MIT license. Researchers can access the open-source code on GitHub (https://github.com/p-koo/evoaug-tf). The pre-compiled package is provided via PyPI (https://pypi.org/project/evoaug-tf) with in-depth documentation on ReadTheDocs (https://evoaug-tf.readthedocs.io). The scripts for reproducing the results are available at https://github.com/p-koo/evoaug-tf_analysis.
Affiliation(s)
- Yiyang Yu
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, United States
- Peter K Koo
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, United States
11
Yu Y, Muthukumar S, Koo PK. EvoAug-TF: Extending evolution-inspired data augmentations for genomic deep learning to TensorFlow. bioRxiv 2024:2024.01.17.575961. [PMID: 38293144 PMCID: PMC10827165 DOI: 10.1101/2024.01.17.575961] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2024]
Abstract
Deep neural networks (DNNs) have been widely applied to predict the molecular functions of regulatory regions in the non-coding genome. DNNs are data hungry and thus require many training examples to fit data well. However, functional genomics experiments typically generate limited amounts of data, constrained by the activity levels of the molecular function under study inside the cell. Recently, EvoAug was introduced to train a genomic DNN with evolution-inspired augmentations. EvoAug-trained DNNs have demonstrated improved generalization and interpretability with attribution analysis. However, EvoAug only supports PyTorch-based models, which limits its applications to a broad class of genomic DNNs based in TensorFlow. Here, we extend EvoAug's functionality to TensorFlow in a new package we call EvoAug-TF. Through a systematic benchmark, we find that EvoAug-TF yields comparable performance with the original EvoAug package. AVAILABILITY EvoAug-TF is freely available for users and is distributed under an open-source MIT license. Researchers can access the open-source code on GitHub (https://github.com/p-koo/evoaug-tf). The pre-compiled package is provided via PyPI (https://pypi.org/project/evoaug-tf) with in-depth documentation on ReadTheDocs (https://evoaug-tf.readthedocs.io). The scripts for reproducing the results are available at https://github.com/p-koo/evoaug-tf_analysis.
Affiliation(s)
- Yiyang Yu
  - Columbia University, New York, NY, USA
- Peter K Koo
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, NY, USA