1
Gharavi E, LeRoy NJ, Zheng G, Zhang A, Brown DE, Sheffield NC. Joint Representation Learning for Retrieval and Annotation of Genomic Interval Sets. Bioengineering (Basel) 2024; 11:263. [PMID: 38534537 DOI: 10.3390/bioengineering11030263] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2023] [Revised: 02/20/2024] [Accepted: 02/22/2024] [Indexed: 03/28/2024]
Abstract
As available genomic interval data increase in scale, we require fast systems to search them. A common approach is simple string matching to compare a search term to metadata, but this is limited by incomplete or inaccurate annotations. An alternative is to compare data directly through genomic region overlap analysis, but this approach leads to challenges like sparsity, high dimensionality, and computational expense. We require novel methods to quickly and flexibly query large, messy genomic interval databases. Here, we develop a genomic interval search system using representation learning. We train numerical embeddings for a collection of region sets simultaneously with their metadata labels, capturing similarity between region sets and their metadata in a low-dimensional space. Using these learned co-embeddings, we develop a system that solves three related information retrieval tasks using embedding distance computations: retrieving region sets related to a user query string, suggesting new labels for database region sets, and retrieving database region sets similar to a query region set. We evaluate these use cases and show that jointly learned representations of region sets and metadata are a promising approach for fast, flexible, and accurate genomic region information retrieval.
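The three retrieval tasks named in the abstract can all be framed as nearest-neighbor lookups in a shared embedding space. The following is a minimal sketch of that idea, not the authors' implementation: the vectors are hand-written toy values rather than learned embeddings, cosine similarity is assumed as the distance measure, and the names `cosine`, `nearest`, `label_vecs`, and `region_set_vecs` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(query_vec, candidates, k=1, exclude=None):
    """Return the k candidate names most similar to query_vec."""
    ranked = sorted(
        (name for name in candidates if name != exclude),
        key=lambda name: cosine(query_vec, candidates[name]),
        reverse=True,
    )
    return ranked[:k]

# Toy 3-d vectors standing in for jointly learned co-embeddings (assumed values).
label_vecs = {"liver": [1.0, 0.1, 0.0], "blood": [0.0, 1.0, 0.1]}
region_set_vecs = {
    "setA.bed": [0.9, 0.2, 0.0],
    "setB.bed": [0.1, 0.9, 0.2],
    "setC.bed": [0.8, 0.0, 0.1],
}

# Task 1: retrieve region sets related to a query label.
hits_for_label = nearest(label_vecs["liver"], region_set_vecs, k=2)

# Task 2: suggest a label for a database region set.
suggested_label = nearest(region_set_vecs["setB.bed"], label_vecs, k=1)

# Task 3: retrieve region sets similar to a query region set.
similar_sets = nearest(
    region_set_vecs["setA.bed"], region_set_vecs, k=1, exclude="setA.bed"
)

print(hits_for_label, suggested_label, similar_sets)
```

Because labels and region sets share one space, the same `nearest` helper serves all three tasks; only the query vector and the candidate pool change.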
Affiliation(s)
- Erfaneh Gharavi
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Nathan J LeRoy
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- Guangtao Zheng
  - Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Aidong Zhang
  - School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
  - Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
  - Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Donald E Brown
  - School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
  - Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Nathan C Sheffield
  - Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
  - Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
  - Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
  - Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
  - Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
2
Kupkova K, Mosquera JV, Smith JP, Stolarczyk M, Danehy TL, Lawson JT, Xue B, Stubbs JT, LeRoy N, Sheffield NC. GenomicDistributions: fast analysis of genomic intervals with Bioconductor. BMC Genomics 2022; 23:299. [PMID: 35413804 PMCID: PMC9003978 DOI: 10.1186/s12864-022-08467-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 10/25/2021] [Accepted: 03/13/2022] [Indexed: 11/10/2022]
Abstract
Background: Epigenome analysis relies on defined sets of genomic regions output by widely used assays such as ChIP-seq and ATAC-seq. Statistical analysis and visualization of genomic region sets is essential to answer biological questions in gene regulation. As the epigenomics community continues generating data, there will be an increasing need for software tools that can efficiently deal with more abundant and larger genomic region sets. Here, we introduce GenomicDistributions, an R package for fast and easy summarization and visualization of genomic region data.
Results: GenomicDistributions offers a broad selection of functions to calculate properties of genomic region sets, such as feature distances, genomic partition overlaps, and more. GenomicDistributions functions are meticulously optimized for best-in-class speed and generally outperform comparable functions in existing R packages. GenomicDistributions also offers plotting functions that produce editable ggplot objects. All GenomicDistributions functions follow a uniform naming scheme and can handle either single or multiple region set inputs.
Conclusions: GenomicDistributions offers a fast and scalable tool for exploratory genomic region set analysis and visualization. GenomicDistributions excels in user-friendliness, flexibility of outputs, breadth of functions, and computational performance. GenomicDistributions is available from Bioconductor (https://bioconductor.org/packages/release/bioc/html/GenomicDistributions.html).
Supplementary Information: The online version contains supplementary material available at 10.1186/s12864-022-08467-y.
Affiliation(s)
- Kristyna Kupkova
  - Center for Public Health Genomics, University of Virginia, Charlottesville, USA
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, USA
- Jose Verdezoto Mosquera
  - Center for Public Health Genomics, University of Virginia, Charlottesville, USA
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, USA
- Jason P Smith
  - Center for Public Health Genomics, University of Virginia, Charlottesville, USA
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, USA
- Michał Stolarczyk
  - Center for Public Health Genomics, University of Virginia, Charlottesville, USA
- Tessa L Danehy
  - Center for Public Health Genomics, University of Virginia, Charlottesville, USA
- John T Lawson
  - Center for Public Health Genomics, University of Virginia, Charlottesville, USA
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, USA
- Bingjie Xue
  - Center for Public Health Genomics, University of Virginia, Charlottesville, USA
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, USA
- John T Stubbs
  - Center for Public Health Genomics, University of Virginia, Charlottesville, USA
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, USA
- Nathan LeRoy
  - Center for Public Health Genomics, University of Virginia, Charlottesville, USA
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, USA
- Nathan C Sheffield
  - Center for Public Health Genomics, University of Virginia, Charlottesville, USA
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, USA
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, USA
  - Department of Public Health Sciences, University of Virginia, Charlottesville, USA
3
Gu A, Cho HJ, Sheffield NC. Bedshift: perturbation of genomic interval sets. Genome Biol 2021; 22:238. [PMID: 34416909 PMCID: PMC8379854 DOI: 10.1186/s13059-021-02440-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/12/2020] [Accepted: 07/26/2021] [Indexed: 12/25/2022]
Abstract
Functional genomics experiments, like ChIP-Seq or ATAC-Seq, produce results that are summarized as a region set. There is no way to objectively evaluate the effectiveness of region set similarity metrics. We present Bedshift, a tool for perturbing BED files by randomly shifting, adding, and dropping regions from a reference file. The perturbed files can be used to benchmark similarity metrics, as well as for other applications. We highlight differences in behavior between metrics, such as that the Jaccard score is most sensitive to added or dropped regions, while coverage score is most sensitive to shifted regions.
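To make the benchmarking idea concrete, the sketch below perturbs a small region set and scores the result against the original with a base-pair Jaccard index. It is an illustration only and does not use or reproduce the Bedshift tool or its interface: the helper names (`merge`, `covered`, `jaccard`, `perturb`), the single-chromosome `(start, end)` representation, and the rate parameters are all assumptions, and adding regions is omitted for brevity.

```python
import random

def merge(regions):
    """Sort and merge overlapping half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def covered(regions):
    """Number of bases covered by at least one region."""
    return sum(end - start for start, end in merge(regions))

def jaccard(a, b):
    """Base-pair Jaccard: shared covered bases / bases covered by either set."""
    union = covered(list(a) + list(b))
    inter = covered(a) + covered(b) - union  # inclusion-exclusion
    return inter / union if union else 1.0

def perturb(regions, rng, shift_rate=0.0, drop_rate=0.0, max_shift=50):
    """Randomly drop and/or shift regions (hypothetical parameters)."""
    out = []
    for start, end in regions:
        if rng.random() < drop_rate:
            continue  # region dropped
        if rng.random() < shift_rate:
            length = end - start
            start = max(0, start + rng.randint(-max_shift, max_shift))
            end = start + length  # shifting preserves region length
        out.append((start, end))
    return out

reference = [(100, 200), (400, 500), (800, 900)]
rng = random.Random(0)  # fixed seed so a run is repeatable

shifted = perturb(reference, rng, shift_rate=1.0)
dropped = perturb(reference, rng, drop_rate=0.5)
print("shifted:", jaccard(reference, shifted))
print("dropped:", jaccard(reference, dropped))
```

Sweeping the rates and plotting each metric against the known perturbation level is one way such perturbed files could be used to compare how different similarity metrics respond.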
Affiliation(s)
- Aaron Gu
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
  - Department of Computer Science, University of Virginia School of Engineering, Charlottesville, VA, USA
- Hyun Jae Cho
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
  - Department of Computer Science, University of Virginia School of Engineering, Charlottesville, VA, USA
- Nathan C Sheffield
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
  - Department of Public Health Sciences, University of Virginia, Charlottesville, VA, USA
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA, USA