1
Gharavi E, LeRoy NJ, Zheng G, Zhang A, Brown DE, Sheffield NC. Joint Representation Learning for Retrieval and Annotation of Genomic Interval Sets. Bioengineering (Basel) 2024; 11:263. [PMID: 38534537 DOI: 10.3390/bioengineering11030263] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2023] [Revised: 02/20/2024] [Accepted: 02/22/2024] [Indexed: 03/28/2024] Open
Abstract
As available genomic interval data increase in scale, we require fast systems to search them. A common approach is simple string matching to compare a search term to metadata, but this is limited by incomplete or inaccurate annotations. An alternative is to compare data directly through genomic region overlap analysis, but this approach leads to challenges like sparsity, high dimensionality, and computational expense. We require novel methods to quickly and flexibly query large, messy genomic interval databases. Here, we develop a genomic interval search system using representation learning. We train numerical embeddings for a collection of region sets simultaneously with their metadata labels, capturing similarity between region sets and their metadata in a low-dimensional space. Using these learned co-embeddings, we develop a system that solves three related information retrieval tasks using embedding distance computations: retrieving region sets related to a user query string, suggesting new labels for database region sets, and retrieving database region sets similar to a query region set. We evaluate these use cases and show that jointly learned representations of region sets and metadata are a promising approach for fast, flexible, and accurate genomic region information retrieval.
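The retrieval step described in this abstract, ranking database items by embedding distance to a query, can be illustrated with a minimal sketch. It assumes region sets and metadata labels have already been embedded in the same vector space; the vectors, the names `setA`/`setB`/`setC`, and the helper functions are hypothetical toy values, not the authors' model or data.

```python
import numpy as np

def cosine_similarity(query, matrix):
    """Cosine similarity between one query vector and each row of a matrix."""
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix @ query

def top_k(query, matrix, names, k=2):
    """Return the k database items most similar to the query embedding."""
    scores = cosine_similarity(query, matrix)
    order = np.argsort(-scores)[:k]
    return [(names[i], float(scores[i])) for i in order]

# Hypothetical co-embeddings: three region sets and one metadata label
# living in the same 3-dimensional space (toy values for illustration).
region_set_names = ["setA", "setB", "setC"]
region_set_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
label_vec = np.array([1.0, 0.1, 0.0])

# Text-to-region-set retrieval: rank region sets against the label embedding.
hits = top_k(label_vec, region_set_vecs, region_set_names, k=2)
print(hits)
```

Under the same assumption, swapping which side supplies the query and which supplies the database would cover the other two tasks (label suggestion and region-set-to-region-set search).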
Affiliation(s)
- Erfaneh Gharavi
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
| | - Nathan J LeRoy
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
| | - Guangtao Zheng
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
| | - Aidong Zhang
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
| | - Donald E Brown
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA 22908, USA
| | - Nathan C Sheffield
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
2
Wang Y, Wei Z, Su J, Coenen F, Meng J. RgnTX: Colocalization analysis of transcriptome elements in the presence of isoform heterogeneity and ambiguity. Comput Struct Biotechnol J 2023; 21:4110-4117. [PMID: 37671241 PMCID: PMC10475473 DOI: 10.1016/j.csbj.2023.08.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2023] [Revised: 08/13/2023] [Accepted: 08/23/2023] [Indexed: 09/07/2023] Open
Abstract
Colocalization analysis of genomic region sets has been widely adopted to unveil potential functional interactions between corresponding biological attributes, which often serves as the basis for further investigation. A number of methods have been developed for colocalization analysis of genomic elements. However, none of them explicitly considers transcriptome heterogeneity and isoform ambiguity, making them less appropriate for analyzing transcriptome elements. Here, we developed RgnTX, an R/Bioconductor tool for the colocalization analysis of transcriptome elements with permutation tests. Unlike existing approaches, RgnTX directly takes advantage of transcriptome annotation and offers high flexibility in the null model to simulate a realistic transcriptome-wide background, such as complex alternative splicing patterns. Importantly, it supports the testing of transcriptome elements without clear isoform association, which is often the real scenario due to technical limitations. The package offers a wide selection of pre-defined functions that users can easily apply to visualize permutation results, calculate shifted z-scores, and conduct multiple hypothesis testing under Benjamini-Hochberg correction. Moreover, with synthetic and real datasets, we show that RgnTX's novel testing modes return distinct and more significant results compared to existing genome-based methods. We believe RgnTX will be a useful tool for characterizing the randomness of the transcriptome and for conducting statistical association analysis of genomic region sets within the heterogeneous transcriptome. The package has been accepted by Bioconductor and is freely available at: https://bioconductor.org/packages/RgnTX.
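Two generic statistical steps named in this abstract, an empirical permutation p-value and Benjamini-Hochberg correction, can be sketched as follows. This is a Python illustration of the general techniques only; RgnTX itself is an R package, and the function names and toy numbers here are assumptions, not its API or results.

```python
import numpy as np

def empirical_p_value(observed, null_stats):
    """One-sided permutation p-value with the usual +1 correction.

    `null_stats` holds the test statistic recomputed on each permuted
    (randomized) region set; larger values mean stronger colocalization.
    """
    null_stats = np.asarray(null_stats, dtype=float)
    exceed = np.sum(null_stats >= observed)
    return (1 + exceed) / (1 + null_stats.size)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working from the largest p-value downward.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

# Toy example: one observed overlap count against four permutations,
# then correction across four hypothetical tests.
p_single = empirical_p_value(5, [1, 2, 6, 3])
p_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(p_single, p_adjusted)
```

In practice the null statistics would come from many more permutations than shown here; four are used only to keep the arithmetic easy to check by hand.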
Affiliation(s)
- Yue Wang
- Department of Mathematical Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
- Department of Computer Science, University of Liverpool, L69 7ZB Liverpool, United Kingdom
| | - Zhen Wei
- Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, L69 7ZB Liverpool, United Kingdom
| | - Jionglong Su
- School of AI and Advanced Computing, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
| | - Frans Coenen
- Department of Computer Science, University of Liverpool, L69 7ZB Liverpool, United Kingdom
| | - Jia Meng
- Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
- AI University Research Centre, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, L69 7ZB Liverpool, United Kingdom
3
Gafurov A, Brejová B, Medvedev P. OUP accepted manuscript. Bioinformatics 2022; 38:i203-i211. [PMID: 35758770 PMCID: PMC9235476 DOI: 10.1093/bioinformatics/btac255] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022] Open
Abstract
Motivation: Genome annotations are a common way to represent genomic features such as genes, regulatory elements or epigenetic modifications. The amount of overlap between two annotations is often used to ascertain if there is an underlying biological connection between them. In order to distinguish between true biological association and overlap by pure chance, a robust measure of significance is required. One common way to do this is to determine if the number of intervals in the reference annotation that intersect the query annotation is statistically significant. However, currently employed statistical frameworks are often either inefficient or inaccurate when computing P-values on the scale of the whole human genome. Results: We show that finding the P-values under the typically used ‘gold’ null hypothesis is NP-hard. This motivates us to reformulate the null hypothesis using Markov chains. To be able to measure the fidelity of our Markovian null hypothesis, we develop a fast direct sampling algorithm to estimate the P-value under the gold null hypothesis. We then present an open-source software tool MCDP that computes the P-values under the Markovian null hypothesis in O(m^2 + n) time and O(m) memory, where m and n are the numbers of intervals in the reference and query annotations, respectively. Notably, MCDP runtime and memory usage are independent of the genome length, allowing it to outperform previous approaches in runtime and memory usage by orders of magnitude on human genome annotations, while maintaining the same level of accuracy. Availability and implementation: The software is available at https://github.com/fmfi-compbio/mc-overlaps. All data for reproducibility are available at https://github.com/fmfi-compbio/mc-overlaps-reproducibility. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
| | - Broňa Brejová
- Department of Computer Science, Comenius University, Bratislava 84248, Slovakia
| | - Paul Medvedev
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, USA
4
Stilianoudakis SC, Marshall MA, Dozmorov MG. preciseTAD: a transfer learning framework for 3D domain boundary prediction at base-pair resolution. Bioinformatics 2021; 38:621-630. [PMID: 34741515 PMCID: PMC8756196 DOI: 10.1093/bioinformatics/btab743] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/10/2021] [Revised: 10/07/2021] [Accepted: 11/02/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION: Chromosome conformation capture technologies (Hi-C) revealed extensive DNA folding into discrete 3D domains, such as Topologically Associating Domains and chromatin loops. The correct binding of CTCF and cohesin at domain boundaries is integral in maintaining the proper structure and function of these 3D domains. 3D domains have been mapped at the resolutions of 1 kilobase and above. However, it has not been possible to define their boundaries at the resolution of boundary-forming proteins. RESULTS: To predict domain boundaries at base-pair resolution, we developed preciseTAD, an optimized transfer learning framework trained on high-resolution genome annotation data. In contrast to current TAD/loop callers, preciseTAD-predicted boundaries are strongly supported by experimental evidence. Importantly, this approach can accurately delineate boundaries in cells without Hi-C data. preciseTAD provides a powerful framework to improve our understanding of how genomic regulators are shaping the 3D structure of the genome at base-pair resolution. AVAILABILITY AND IMPLEMENTATION: preciseTAD is an R/Bioconductor package available at https://bioconductor.org/packages/preciseTAD/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
5
Mendieta JP, Marand AP, Ricci WA, Zhang X, Schmitz RJ. Leveraging histone modifications to improve genome annotations. G3 (Bethesda) 2021; 11:jkab263. [PMID: 34568920 PMCID: PMC8473982 DOI: 10.1093/g3journal/jkab263] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/21/2021] [Accepted: 07/15/2021] [Indexed: 12/27/2022]
Abstract
Accurate genome annotations are essential to modern biology; however, they remain challenging to produce. Variation in gene structure and expression across species, as well as within an organism, makes correctly annotating genes arduous, an issue exacerbated by pitfalls in current in silico methods. These issues necessitate complementary approaches to add confidence and rectify potential misannotations. Integration of epigenomic data into genome annotation is one such approach. In this study, we utilized sets of histone modification data, which are precisely distributed at either gene bodies or promoters, to evaluate the annotation of the Zea mays genome. We leveraged these data genome wide, allowing for identification of annotations discordant with empirical data. In total, 13,159 annotation discrepancies were found in Z. mays upon integrating data across three different tissues, which were corroborated using RNA-based approaches. Upon correction, genes were extended by an average of 2128 base pairs, and we identified 2529 novel genes. Application of this method to five additional plant genomes identified a series of misannotations, as well as novel genes, including 13,836 in Asparagus officinalis, 2724 in Setaria viridis, 2446 in Sorghum bicolor, 8631 in Glycine max, and 2585 in Phaseolus vulgaris. This study demonstrates that histone modification data can be leveraged to rapidly improve current genome annotations across diverse plant lineages.
Affiliation(s)
| | | | - William A Ricci
- Department of Plant Biology, University of Georgia, Athens, GA 30602, USA
| | - Xuan Zhang
- Department of Genetics, University of Georgia, Athens, GA 30602, USA
| | - Robert J Schmitz
- Department of Genetics, University of Georgia, Athens, GA 30602, USA
6
Wu T, Hu E, Xu S, Chen M, Guo P, Dai Z, Feng T, Zhou L, Tang W, Zhan L, Fu X, Liu S, Bo X, Yu G. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. Innovation (N Y) 2021; 2:100141. [PMID: 34557778 PMCID: PMC8454663 DOI: 10.1016/j.xinn.2021.100141] [Citation(s) in RCA: 2611] [Impact Index Per Article: 870.3] [Received: 05/08/2021] [Accepted: 06/29/2021] [Indexed: 12/15/2022] Open
Abstract
Functional enrichment analysis is pivotal for interpreting high-throughput omics data in life science. It is crucial for this type of tool to use the latest annotation databases for as many organisms as possible. To meet these requirements, we present here an updated version of our popular Bioconductor package, clusterProfiler 4.0. This package has been enhanced considerably compared with its original version published 9 years ago. The new version provides a universal interface for functional enrichment analysis in thousands of organisms based on internally supported ontologies and pathways as well as annotation data provided by users or derived from online databases. It also extends the dplyr and ggplot2 packages to offer tidy interfaces for data operation and visualization. Other new features include gene set enrichment analysis and comparison of enrichment results from multiple gene lists. We anticipate that clusterProfiler 4.0 will be applied to a wide range of scenarios across diverse organisms. clusterProfiler supports exploring functional characteristics of both coding and non-coding genomics data for thousands of species with up-to-date gene annotation. It provides a universal interface for gene functional annotation from a variety of sources and thus can be applied in diverse scenarios. It provides a tidy interface to access, manipulate, and visualize enrichment results to help users achieve efficient data interpretation. Datasets obtained from multiple treatments and time points can be analyzed and compared in a single run, easily revealing functional consensus and differences among distinct conditions.
Affiliation(s)
- Tianzhi Wu
- Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China
| | - Erqiang Hu
- Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China
| | - Shuangbin Xu
- Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China
| | - Meijun Chen
- Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China
| | - Pingfan Guo
- Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China
| | - Zehan Dai
- Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China
| | - Tingze Feng
- Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China
| | - Lang Zhou
- Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China
| | - Wenli Tang
- Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China
| | - Li Zhan
- Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China
| | - Xiaocong Fu
- Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China
| | - Shanshan Liu
- Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China
| | - Xiaochen Bo
- Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing 100850, China
| | - Guangchuang Yu
- Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China
- Guangdong Provincial Key Laboratory of Proteomics, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510515, China
7
Cade BE, Lee J, Sofer T, Wang H, Zhang M, Chen H, Gharib SA, Gottlieb DJ, Guo X, Lane JM, Liang J, Lin X, Mei H, Patel SR, Purcell SM, Saxena R, Shah NA, Evans DS, Hanis CL, Hillman DR, Mukherjee S, Palmer LJ, Stone KL, Tranah GJ, Abecasis GR, Boerwinkle EA, Correa A, Cupples LA, Kaplan RC, Nickerson DA, North KE, Psaty BM, Rotter JI, Rich SS, Tracy RP, Vasan RS, Wilson JG, Zhu X, Redline S. Whole-genome association analyses of sleep-disordered breathing phenotypes in the NHLBI TOPMed program. Genome Med 2021; 13:136. [PMID: 34446064 PMCID: PMC8394596 DOI: 10.1186/s13073-021-00917-8] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 01/07/2020] [Accepted: 05/28/2021] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND: Sleep-disordered breathing is a common disorder associated with significant morbidity. The genetic architecture of sleep-disordered breathing remains poorly understood. Through the NHLBI Trans-Omics for Precision Medicine (TOPMed) program, we performed the first whole-genome sequence analysis of sleep-disordered breathing. METHODS: The study sample comprised 7988 individuals of diverse ancestry. Common-variant and pathway analyses included an additional 13,257 individuals. We examined five complementary traits describing different aspects of sleep-disordered breathing: the apnea-hypopnea index, average oxyhemoglobin desaturation per event, average and minimum oxyhemoglobin saturation across the sleep episode, and the percentage of sleep with oxyhemoglobin saturation < 90%. We adjusted for age, sex, BMI, study, and family structure using MMSKAT and EMMAX mixed linear model approaches. Additional bioinformatics analyses were performed with MetaXcan, GIGSEA, and ReMap. RESULTS: We identified a multi-ethnic set-based rare-variant association (p = 3.48 × 10^-8) on chromosome X with ARMCX3. Additional rare-variant associations include ARMCX3-AS1, MRPS33, and C16orf90. Novel common-variant loci were identified in the NRG1 and SLC45A2 regions, and previously associated loci in the IL18RAP and ATP2B4 regions were associated with novel phenotypes. Transcription factor binding site enrichment identified associations with genes implicated in respiratory and craniofacial traits. Additional analyses identified significantly associated pathways. CONCLUSIONS: We have identified the first gene-based rare-variant associations with objectively measured sleep-disordered breathing traits. Our results increase the understanding of the genetic architecture of sleep-disordered breathing and highlight associations in genes that modulate lung development, inflammation, respiratory rhythmogenesis, and HIF1A-mediated hypoxic response.
Affiliation(s)
- Brian E. Cade
- Division of Sleep and Circadian Disorders, Brigham and Women’s Hospital, Harvard Medical School, 221 Longwood Avenue, Boston, MA 02115 USA
- Division of Sleep Medicine, Harvard Medical School, Boston, MA 02115 USA
- Program in Medical and Population Genetics, Broad Institute, Cambridge, MA 02142 USA
| | - Jiwon Lee
- Division of Sleep and Circadian Disorders, Brigham and Women’s Hospital, Harvard Medical School, 221 Longwood Avenue, Boston, MA 02115 USA
| | - Tamar Sofer
- Division of Sleep and Circadian Disorders, Brigham and Women’s Hospital, Harvard Medical School, 221 Longwood Avenue, Boston, MA 02115 USA
- Division of Sleep Medicine, Harvard Medical School, Boston, MA 02115 USA
| | - Heming Wang
- Division of Sleep and Circadian Disorders, Brigham and Women’s Hospital, Harvard Medical School, 221 Longwood Avenue, Boston, MA 02115 USA
- Division of Sleep Medicine, Harvard Medical School, Boston, MA 02115 USA
- Program in Medical and Population Genetics, Broad Institute, Cambridge, MA 02142 USA
| | - Man Zhang
- Department of Medicine, University of Maryland School of Medicine, Baltimore, MD 21201 USA
| | - Han Chen
- Human Genetics Center, Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX 77030 USA
- Center for Precision Health, School of Public Health and School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030 USA
| | - Sina A. Gharib
- Computational Medicine Core, Center for Lung Biology, UW Medicine Sleep Center, Division of Pulmonary, Critical Care and Sleep Medicine, University of Washington, Seattle, WA 98195 USA
| | - Daniel J. Gottlieb
- Division of Sleep and Circadian Disorders, Brigham and Women’s Hospital, Harvard Medical School, 221 Longwood Avenue, Boston, MA 02115 USA
- Division of Sleep Medicine, Harvard Medical School, Boston, MA 02115 USA
- VA Boston Healthcare System, Boston, MA 02132 USA
| | - Xiuqing Guo
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA 90502 USA
| | - Jacqueline M. Lane
- Division of Sleep and Circadian Disorders, Brigham and Women’s Hospital, Harvard Medical School, 221 Longwood Avenue, Boston, MA 02115 USA
- Division of Sleep Medicine, Harvard Medical School, Boston, MA 02115 USA
- Program in Medical and Population Genetics, Broad Institute, Cambridge, MA 02142 USA
- Center for Genomic Medicine and Department of Anesthesia, Pain, and Critical Care Medicine, Massachusetts General Hospital, Boston, MA 02114 USA
| | - Jingjing Liang
- Department of Population and Quantitative Health Sciences, School of Medicine, Case Western Reserve University, Cleveland, OH 44106 USA
| | - Xihong Lin
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115 USA
| | - Hao Mei
- Department of Data Science, University of Mississippi Medical Center, Jackson, MS 29216 USA
| | - Sanjay R. Patel
- Division of Pulmonary, Allergy, and Critical Care Medicine, University of Pittsburgh, Pittsburgh, PA 15213 USA
| | - Shaun M. Purcell
- Division of Sleep and Circadian Disorders, Brigham and Women’s Hospital, Harvard Medical School, 221 Longwood Avenue, Boston, MA 02115 USA
- Division of Sleep Medicine, Harvard Medical School, Boston, MA 02115 USA
- Program in Medical and Population Genetics, Broad Institute, Cambridge, MA 02142 USA
| | - Richa Saxena
- Division of Sleep and Circadian Disorders, Brigham and Women’s Hospital, Harvard Medical School, 221 Longwood Avenue, Boston, MA 02115 USA
- Division of Sleep Medicine, Harvard Medical School, Boston, MA 02115 USA
- Program in Medical and Population Genetics, Broad Institute, Cambridge, MA 02142 USA
- Center for Genomic Medicine and Department of Anesthesia, Pain, and Critical Care Medicine, Massachusetts General Hospital, Boston, MA 02114 USA
| | - Neomi A. Shah
- Division of Pulmonary, Critical Care and Sleep Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029 USA
| | - Daniel S. Evans
- California Pacific Medical Center Research Institute, San Francisco, CA 94107 USA
| | - Craig L. Hanis
- Human Genetics Center, Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX 77030 USA
| | - David R. Hillman
- Department of Pulmonary Physiology and Sleep Medicine, Sir Charles Gairdner Hospital, Perth, Western Australia 6009 Australia
| | - Sutapa Mukherjee
- Sleep Health Service, Respiratory and Sleep Services, Southern Adelaide Local Health Network, Adelaide, South Australia Australia
- Adelaide Institute for Sleep Health, Flinders University, Adelaide, South Australia Australia
| | - Lyle J. Palmer
- School of Public Health, University of Adelaide, Adelaide, South Australia 5000 Australia
| | - Katie L. Stone
- California Pacific Medical Center Research Institute, San Francisco, CA 94107 USA
| | - Gregory J. Tranah
- California Pacific Medical Center Research Institute, San Francisco, CA 94107 USA
| | | | - Gonçalo R. Abecasis
- Department of Biostatistics and Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI 48109 USA
| | - Eric A. Boerwinkle
- Human Genetics Center, Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX 77030 USA
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030 USA
| | - Adolfo Correa
- Department of Medicine, University of Mississippi Medical Center, Jackson, MS 39216 USA
- Jackson Heart Study, Jackson, MS 39216 USA
| | - L. Adrienne Cupples
- Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118 USA
- Framingham Heart Study, Framingham, MA 01702 USA
| | - Robert C. Kaplan
- Department of Epidemiology and Population Health, Albert Einstein College of Medicine, Bronx, New York, 10461 USA
| | - Deborah A. Nickerson
- Department of Genome Sciences, University of Washington, Seattle, WA 98195 USA
- Northwest Genomics Center, Seattle, WA 98105 USA
| | - Kari E. North
- Department of Epidemiology and Carolina Center of Genome Sciences, University of North Carolina, Chapel Hill, NC 27514 USA
| | - Bruce M. Psaty
- Cardiovascular Health Study, Departments of Medicine, Epidemiology, and Health Services, University of Washington, Seattle, WA 98101 USA
- Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101 USA
| | - Jerome I. Rotter
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA 90502 USA
| | - Stephen S. Rich
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22908 USA
| | - Russell P. Tracy
- Department of Pathology, University of Vermont, Colchester, VT 05405 USA
| | - Ramachandran S. Vasan
- Framingham Heart Study, Framingham, MA 01702 USA
- Sections of Preventive Medicine and Epidemiology and Cardiology, Department of Medicine, Boston University School of Medicine, Boston, MA 02118 USA
- Department of Epidemiology, Boston University School of Public Health, Boston, MA 02118 USA
| | - James G. Wilson
- Department of Physiology and Biophysics, University of Mississippi Medical Center, Jackson, MS 39216 USA
| | - Xiaofeng Zhu
- Department of Population and Quantitative Health Sciences, School of Medicine, Case Western Reserve University, Cleveland, OH 44106 USA
| | - Susan Redline
- Division of Sleep and Circadian Disorders, Brigham and Women’s Hospital, Harvard Medical School, 221 Longwood Avenue, Boston, MA 02115 USA
- Division of Sleep Medicine, Harvard Medical School, Boston, MA 02115 USA
- Division of Pulmonary, Critical Care, and Sleep Medicine, Beth Israel Deaconess Medical Center, Boston, MA 02215 USA
8
Gu A, Cho HJ, Sheffield NC. Bedshift: perturbation of genomic interval sets. Genome Biol 2021; 22:238. [PMID: 34416909 PMCID: PMC8379854 DOI: 10.1186/s13059-021-02440-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/12/2020] [Accepted: 07/26/2021] [Indexed: 12/25/2022] Open
Abstract
Functional genomics experiments, like ChIP-Seq or ATAC-Seq, produce results that are summarized as a region set. There is no way to objectively evaluate the effectiveness of region set similarity metrics. We present Bedshift, a tool for perturbing BED files by randomly shifting, adding, and dropping regions from a reference file. The perturbed files can be used to benchmark similarity metrics, as well as for other applications. We highlight differences in behavior between metrics; for example, the Jaccard score is most sensitive to added or dropped regions, while the coverage score is most sensitive to shifted regions.
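To make the Jaccard score mentioned in this abstract concrete, the following minimal sketch computes a base-pair Jaccard index (shared bases divided by bases covered by either set) for two region sets on a single chromosome, using half-open `(start, end)` intervals. This is one common definition and an assumption here; it is not necessarily the exact variant benchmarked with Bedshift, and the function names and toy regions are hypothetical.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals on one chromosome."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def covered_bp(intervals):
    """Number of base pairs covered by a region set."""
    return sum(end - start for start, end in merge(intervals))

def jaccard(a, b):
    """Base-pair Jaccard index: |A intersect B| / |A union B|."""
    union = covered_bp(list(a) + list(b))
    # Inclusion-exclusion gives the intersection from the three coverages.
    intersection = covered_bp(a) + covered_bp(b) - union
    return intersection / union if union else 0.0

# Toy reference set and two perturbed copies (illustrative values only).
reference = [(0, 100), (200, 300)]
shifted = [(50, 150), (200, 300)]   # first region shifted by 50 bp
dropped = [(0, 100)]                # second region dropped

print(jaccard(reference, shifted), jaccard(reference, dropped))
```

A real comparison would run per chromosome over full BED files; the single-chromosome restriction keeps the sketch short.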
Affiliation(s)
- Aaron Gu
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
  - Department of Computer Science, University of Virginia School of Engineering, Charlottesville, VA, USA
- Hyun Jae Cho
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
  - Department of Computer Science, University of Virginia School of Engineering, Charlottesville, VA, USA
- Nathan C Sheffield
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
  - Department of Public Health Sciences, University of Virginia, Charlottesville, VA, USA
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA, USA
9
Devailly G, Joshi A. Comprehensive analysis of epigenetic signatures of human transcription control. Mol Omics 2021; 17:692-705. [PMID: 34291238 DOI: 10.1039/d0mo00130a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
Abstract
Advances in sequencing technologies have enabled exploration of epigenetic and transcriptional profiles at a genome-wide level. The epigenetic and transcriptional landscapes are now available in hundreds of mammalian cell and tissue contexts. Many studies have performed multi-omics analyses using these datasets to enhance our understanding of relationships between epigenetic modifications and transcription regulation. Nevertheless, most studies so far have focused on the promoters/enhancers and transcription start sites, and other features of transcription control including exons, introns and transcription termination remain underexplored. We investigated the interplay between epigenetic modifications and diverse transcription features using the data generated by the Roadmap Epigenomics project. A comprehensive analysis of histone modifications, DNA methylation, and RNA-seq data of thirty-three human cell lines and tissue types allowed us to confirm the generality of previously described relationships, as well as to generate new hypotheses about the interplay between epigenetic modifications and transcription features. Importantly, our analysis included previously under-explored features of transcription control, namely, transcription termination sites, exon-intron boundaries, and the exon inclusion ratio. We have made the analyses freely available to the scientific community at joshiapps.cbu.uib.no/perepigenomics_app/ for easy exploration, validation and hypothesis generation.
Affiliation(s)
- Guillaume Devailly
  - GenPhySE, Université de Toulouse, INRAE, ENVT, 31326, Castanet Tolosan, France
- Anagha Joshi
  - Computational Biology Unit, Department of Clinical Science, University of Bergen, 5021, Bergen, Norway
10
Gharavi E, Gu A, Zheng G, Smith JP, Cho HJ, Zhang A, Brown DE, Sheffield NC. Embeddings of genomic region sets capture rich biological associations in lower dimensions. Bioinformatics 2021; 37:4299-4306. [PMID: 34156475 PMCID: PMC8652032 DOI: 10.1093/bioinformatics/btab439] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/18/2020] [Revised: 06/07/2021] [Accepted: 06/15/2021] [Indexed: 11/12/2022] Open
Abstract
Motivation: Genomic region sets summarize functional genomics data and define locations of interest in the genome such as regulatory regions or transcription factor binding sites. The number of publicly available region sets has increased dramatically, leading to challenges in data analysis. Results: We propose a new method to represent genomic region sets as vectors, or embeddings, using an adapted word2vec approach. We compared our approach to two simpler methods based on interval unions or term frequency-inverse document frequency and evaluated the methods in three ways: first, by classifying the cell line, antibody, or tissue type of the region set; second, by assessing whether similarity among embeddings can reflect simulated random perturbations of genomic regions; and third, by testing robustness of the proposed representations to different signal thresholds for calling peaks. Our word2vec-based region set embeddings reduce dimensionality from more than a hundred thousand to 100 without significant loss in classification performance. The vector representation could identify cell line, antibody, and tissue type with over 90% accuracy. We also found that the vectors could quantitatively summarize simulated random perturbations to region sets and are more robust to subsampling the data derived from different peak calling thresholds. Our evaluations demonstrate that the vectors retain useful biological information in relatively lower-dimensional spaces. We propose that vector representation of region sets is a promising approach for efficient analysis of genomic region data. Availability: https://github.com/databio/regionset-embedding.
Affiliation(s)
- Erfaneh Gharavi
  - Center for Public Health Genomics, University of Virginia; School of Data Science, University of Virginia
- Aaron Gu
  - Center for Public Health Genomics, University of Virginia; Department of Computer Science, University of Virginia
- Jason P Smith
  - Center for Public Health Genomics, University of Virginia; Department of Biochemistry and Molecular Genetics, University of Virginia
- Hyun Jae Cho
  - Center for Public Health Genomics, University of Virginia; Department of Computer Science, University of Virginia
- Aidong Zhang
  - Department of Computer Science, University of Virginia
- Nathan C Sheffield
  - Center for Public Health Genomics, University of Virginia; Department of Public Health Sciences, University of Virginia; Department of Biomedical Engineering, University of Virginia; Department of Biochemistry and Molecular Genetics, University of Virginia; School of Data Science, University of Virginia
11
Gundersen S, Boddu S, Capella-Gutierrez S, Drabløs F, Fernández JM, Kompova R, Taylor K, Titov D, Zerbino D, Hovig E. Recommendations for the FAIRification of genomic track metadata. F1000Res 2021; 10. [PMID: 34249331 PMCID: PMC8226415 DOI: 10.12688/f1000research.28449.1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Accepted: 03/17/2021] [Indexed: 01/25/2023] Open
Abstract
Background: Many types of data from genomic analyses can be represented as genomic tracks, i.e. features linked to the genomic coordinates of a reference genome. Examples of such data are epigenetic DNA methylation data, ChIP-seq peaks, germline or somatic DNA variants, as well as RNA-seq expression levels. Researchers often face difficulties in locating, accessing and combining relevant tracks from external sources, as well as locating the raw data, reducing the value of the generated information. Description of work: We propose to advance the application of FAIR data principles (Findable, Accessible, Interoperable, and Reusable) to produce searchable metadata for genomic tracks. Findability and Accessibility of metadata can then be ensured by a track search service that integrates globally identifiable metadata from various track hubs in the Track Hub Registry and other relevant repositories. Interoperability and Reusability need to be ensured by the specification and implementation of a basic set of recommendations for metadata. We have tested this concept by developing such a specification in a JSON Schema, called FAIRtracks, and have integrated it into a novel track search service, called TrackFind. We demonstrate practical usage by importing datasets through TrackFind into existing examples of relevant analytical tools for genomic tracks: EPICO and the GSuite HyperBrowser. Conclusion: We here provide a first iteration of a draft standard for genomic track metadata, as well as the accompanying software ecosystem. It can easily be adapted or extended to future needs of the research community regarding data, methods and tools, balancing the requirements of both data submitters and analytical end-users.
Affiliation(s)
- Sanjay Boddu
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Finn Drabløs
  - Department of Clinical and Molecular Medicine, NTNU - Norwegian University of Science and Technology, Trondheim, Norway
- José M Fernández
  - Life Sciences Department, Barcelona Supercomputing Center (BSC), Barcelona, Spain
- Radmila Kompova
  - Center for Bioinformatics, University of Oslo (UiO), Oslo, Norway
- Kieron Taylor
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Dmytro Titov
  - Center for Bioinformatics, University of Oslo (UiO), Oslo, Norway
- Daniel Zerbino
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Eivind Hovig
  - Center for Bioinformatics, University of Oslo (UiO), Oslo, Norway; Department of Tumor Biology, Institute for Cancer Research, Oslo University Hospital (OUH), Oslo, Norway
12
Feng J, Sheffield NC. IGD: high-performance search for large-scale genomic interval datasets. Bioinformatics 2020; 37:118-120. [PMID: 33367484 DOI: 10.1093/bioinformatics/btaa1062] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/11/2020] [Revised: 10/19/2020] [Accepted: 12/15/2020] [Indexed: 01/04/2023] Open
Abstract
Summary: Databases of large-scale genome projects now contain thousands of genomic interval datasets. These data are a critical resource for understanding the function of DNA. However, our ability to examine and integrate interval data of this scale is limited. Here, we introduce the integrated genome database (IGD), a method and tool for searching genome interval datasets more than three orders of magnitude faster than existing approaches, while using only one hundredth of the memory. IGD uses a novel linear binning method that allows us to scale analysis to billions of genomic regions. Availability: https://github.com/databio/IGD.
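To make the general idea of binning concrete, here is a minimal sketch of a fixed-width bin index for interval overlap search. It is a simplified illustration, not IGD's data structure or file format: the class name `BinnedIndex`, the bin width, the in-memory dictionary, and the 0-based half-open coordinates are all assumptions for this example.

```python
from collections import defaultdict

BIN_SIZE = 16_384  # assumed bin width in base pairs


class BinnedIndex:
    """Toy fixed-width bin index over (chrom, start, end) intervals."""

    def __init__(self, bin_size=BIN_SIZE):
        self.bin_size = bin_size
        # (chrom, bin number) -> list of (start, end, set_id)
        self.bins = defaultdict(list)

    def _bin_range(self, start, end):
        # Intervals are half-open [start, end), so the last covered base is end - 1.
        return range(start // self.bin_size, (end - 1) // self.bin_size + 1)

    def add(self, chrom, start, end, set_id):
        """Store the interval in every bin it overlaps."""
        for b in self._bin_range(start, end):
            self.bins[(chrom, b)].append((start, end, set_id))

    def query(self, chrom, start, end):
        """Return the stored intervals that overlap [start, end) on `chrom`."""
        hits = set()  # a set removes duplicates found in more than one bin
        for b in self._bin_range(start, end):
            for s, e, set_id in self.bins.get((chrom, b), ()):
                if s < end and start < e:
                    hits.add((s, e, set_id))
        return hits


index = BinnedIndex(bin_size=100)
index.add("chr1", 50, 250, "set_a")
index.add("chr1", 300, 350, "set_b")
print(sorted(index.query("chr1", 240, 310)))
```

A query only scans the bins its own coordinates touch, so lookup cost depends on the local interval density rather than on the total size of the database.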
Affiliation(s)
- Jianglin Feng
  - Center for Public Health Genomics, University of Virginia
- Nathan C Sheffield
  - Center for Public Health Genomics, University of Virginia; Department of Public Health Sciences, University of Virginia; Department of Biomedical Engineering, University of Virginia; Department of Biochemistry and Molecular Genetics, University of Virginia
13
COCOA: coordinate covariation analysis of epigenetic heterogeneity. Genome Biol 2020; 21:240. [PMID: 32894181 PMCID: PMC7487606 DOI: 10.1186/s13059-020-02139-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 03/11/2020] [Accepted: 08/07/2020] [Indexed: 12/20/2022] Open
Abstract
A key challenge in epigenetics is to determine the biological significance of epigenetic variation among individuals. We present Coordinate Covariation Analysis (COCOA), a computational framework that uses covariation of epigenetic signals across individuals and a database of region sets to annotate epigenetic heterogeneity. COCOA is the first such tool for DNA methylation data and can also analyze any epigenetic signal with genomic coordinates. We demonstrate COCOA’s utility by analyzing DNA methylation, ATAC-seq, and multi-omic data in supervised and unsupervised analyses, showing that COCOA provides new understanding of inter-sample epigenetic variation. COCOA is available on Bioconductor (http://bioconductor.org/packages/COCOA).
14
Smith JP, Sheffield NC. Analytical Approaches for ATAC-seq Data Analysis. Current Protocols in Human Genetics 2020; 106:e101. [PMID: 32543102 PMCID: PMC8191135 DOI: 10.1002/cphg.101] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 12/17/2022]
Abstract
ATAC-seq, the assay for transposase-accessible chromatin using sequencing, is a quick and efficient approach to investigating the chromatin accessibility landscape. Investigating chromatin accessibility has broad utility for answering many biological questions, such as mapping nucleosomes, identifying transcription factor binding sites, and measuring differential activity of DNA regulatory elements. Because the ATAC-seq protocol is both simple and relatively inexpensive, there has been a rapid increase in the availability of chromatin accessibility data. Furthermore, advances in ATAC-seq protocols are rapidly extending its breadth to additional experimental conditions, cell types, and species. Accompanying the increase in data, there has also been an explosion of new tools and analytical approaches for analyzing it. Here, we explain the fundamentals of ATAC-seq data processing, summarize common analysis approaches, and review computational tools to provide recommendations for different research questions. This primer provides a starting point and a reference for analysis of ATAC-seq data. © 2020 Wiley Periodicals LLC.
Affiliation(s)
- Jason P. Smith
  - Center for Public Health Genomics, University of Virginia, Charlottesville, Virginia
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, Virginia
- Nathan C. Sheffield
  - Center for Public Health Genomics, University of Virginia, Charlottesville, Virginia
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, Virginia
  - Department of Public Health Sciences, University of Virginia, Charlottesville, Virginia
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, Virginia
15
Kanduri C, Bock C, Gundersen S, Hovig E, Sandve GK. Colocalization analyses of genomic elements: approaches, recommendations and challenges. Bioinformatics 2020; 35:1615-1624. [PMID: 30307532 PMCID: PMC6499241 DOI: 10.1093/bioinformatics/bty835] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Received: 05/22/2018] [Revised: 09/03/2018] [Accepted: 10/10/2018] [Indexed: 12/23/2022] Open
Abstract
Motivation: Many high-throughput methods produce sets of genomic regions as one of their main outputs. Scientists often use genomic colocalization analysis to interpret such region sets, for example to identify interesting enrichments and to understand the interplay between the underlying biological processes. Although widely used, there is little standardization in how these analyses are performed. Different practices can substantially affect the conclusions of colocalization analyses. Results: Here, we describe the different approaches and provide recommendations for performing genomic colocalization analysis, while also discussing common methodological challenges that may influence the conclusions. As illustrated by concrete example cases, careful attention to analysis details is needed in order to meet these challenges and to obtain a robust and biologically meaningful interpretation of genomic region set data. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chakravarthi Kanduri
  - Department of Informatics, University of Oslo, Oslo, Norway; K. G. Jebsen Coeliac Disease Research Centre, Oslo, Norway
- Christoph Bock
  - CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Vienna, Austria; Department of Laboratory Medicine, Medical University of Vienna, Vienna, Austria; Max Planck Institute for Informatics, Saarbrücken, Germany
- Sveinung Gundersen
  - Department of Informatics, University of Oslo, Oslo, Norway; Elixir Norway, Oslo Node, University of Oslo, Oslo, Norway
- Eivind Hovig
  - Department of Informatics, University of Oslo, Oslo, Norway; Elixir Norway, Oslo Node, University of Oslo, Oslo, Norway; Department of Tumor Biology, Institute for Cancer Research, Oslo, Norway; Institute for Cancer Genetics and Informatics, The Norwegian Radium Hospital, Oslo, Norway, UK
- Geir Kjetil Sandve
  - Department of Informatics, University of Oslo, Oslo, Norway; K. G. Jebsen Coeliac Disease Research Centre, Oslo, Norway
16
Zhou Y, Sun Y, Huang D, Li MJ. epiCOLOC: Integrating Large-Scale and Context-Dependent Epigenomics Features for Comprehensive Colocalization Analysis. Front Genet 2020; 11:53. [PMID: 32117461 PMCID: PMC7029718 DOI: 10.3389/fgene.2020.00053] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 09/08/2019] [Accepted: 01/17/2020] [Indexed: 12/18/2022] Open
Abstract
High-throughput genome-wide epigenomic assays, such as ChIP-seq, DNase-seq and ATAC-seq, have profiled a huge number of functional elements across numerous human tissues/cell types, which provide an unprecedented opportunity to interpret the human genome and disease in a context-dependent manner. Colocalization analysis determines whether genomic features are functionally related to a given search and will facilitate identifying the underlying biological functions characterizing intricate relationships with queries for genomic regions. Existing colocalization methods leveraged diverse assumptions and background models to assess the significance of enrichment; however, they only provided limited and predefined sets of epigenomic features. Here, we comprehensively collected and integrated over 44,385 bulk or single-cell epigenomic assays across 53 human tissues/cell types, such as transcription factor binding, histone modification, open chromatin and transcriptional events. By classifying these profiles into a hierarchy of tissue/cell types, we developed a web portal, epiCOLOC (http://mulinlab.org/epicoloc or http://mulinlab.tmu.edu.cn/epicoloc), for users to perform context-dependent colocalization analysis in a convenient way.
Affiliation(s)
- Yao Zhou
  - Department of Pharmacology, Tianjin Key Laboratory of Inflammation Biology, School of Basic Medical Sciences, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China
- Yongzheng Sun
  - Department of Pharmacology, Tianjin Key Laboratory of Inflammation Biology, School of Basic Medical Sciences, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China
- Dandan Huang
  - Department of Pharmacology, Tianjin Key Laboratory of Inflammation Biology, School of Basic Medical Sciences, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China
- Mulin Jun Li
  - Department of Pharmacology, Tianjin Key Laboratory of Inflammation Biology, School of Basic Medical Sciences, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin, China; Collaborative Innovation Center of Tianjin for Medical Epigenetics, Tianjin Key Laboratory of Medical Epigenetics, Tianjin Medical University, Tianjin, China
17
Wang Z, Civelek M, Miller CL, Sheffield NC, Guertin MJ, Zang C. BART: a transcription factor prediction tool with query gene sets or epigenomic profiles. Bioinformatics 2019; 34:2867-2869. [PMID: 29608647 DOI: 10.1093/bioinformatics/bty194] [Citation(s) in RCA: 71] [Impact Index Per Article: 14.2] [Received: 12/18/2017] [Accepted: 03/27/2018] [Indexed: 01/09/2023] Open
Abstract
Summary: Identification of functional transcription factors that regulate a given gene set is an important problem in gene regulation studies. Conventional approaches for identifying transcription factors, such as DNA sequence motif analysis, are unable to predict functional binding of specific factors and not sensitive enough to detect factors binding at distal enhancers. Here, we present binding analysis for regulation of transcription (BART), a novel computational method and software package for predicting functional transcription factors that regulate a query gene set or associate with a query genomic profile, based on more than 6000 existing ChIP-seq datasets for over 400 factors in human or mouse. This method demonstrates the advantage of utilizing publicly available data for functional genomics research. Availability and implementation: BART is implemented in Python and available at http://faculty.virginia.edu/zanglab/bart. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zhenjia Wang
  - Center for Public Health Genomics, Charlottesville, VA, USA
- Mete Civelek
  - Center for Public Health Genomics, Charlottesville, VA, USA; Department of Biomedical Engineering, Charlottesville, VA, USA
- Clint L Miller
  - Center for Public Health Genomics, Charlottesville, VA, USA; Department of Biomedical Engineering, Charlottesville, VA, USA; Department of Public Health Sciences, Charlottesville, VA, USA; Department of Biochemistry and Molecular Genetics, Charlottesville, VA, USA
- Nathan C Sheffield
  - Center for Public Health Genomics, Charlottesville, VA, USA; Department of Biomedical Engineering, Charlottesville, VA, USA; Department of Public Health Sciences, Charlottesville, VA, USA; Department of Biochemistry and Molecular Genetics, Charlottesville, VA, USA
- Michael J Guertin
  - Center for Public Health Genomics, Charlottesville, VA, USA; Department of Biochemistry and Molecular Genetics, Charlottesville, VA, USA
- Chongzhi Zang
  - Center for Public Health Genomics, Charlottesville, VA, USA; Department of Biomedical Engineering, Charlottesville, VA, USA; Department of Public Health Sciences, Charlottesville, VA, USA; Department of Biochemistry and Molecular Genetics, Charlottesville, VA, USA; Cancer Center, University of Virginia, Charlottesville, VA, USA
18
Nagraj VP, Magee NE, Sheffield NC. LOLAweb: a containerized web server for interactive genomic locus overlap enrichment analysis. Nucleic Acids Res 2019; 46:W194-W199. [PMID: 29878235 PMCID: PMC6030814 DOI: 10.1093/nar/gky464] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 02/09/2018] [Accepted: 05/21/2018] [Indexed: 12/21/2022] Open
Abstract
The past few years have seen an explosion of interest in understanding the role of regulatory DNA. This interest has driven large-scale production of functional genomics data and analytical methods. One popular analysis is to test for enrichment of overlaps between a query set of genomic regions and a database of region sets. In this way, new genomic data can be easily connected to annotations from external data sources. Here, we present an interactive interface for enrichment analysis of genomic locus overlaps using a web server called LOLAweb. LOLAweb accepts a set of genomic ranges from the user and tests it for enrichment against a database of region sets. LOLAweb renders results in an R Shiny application to provide interactive visualization features, enabling users to filter, sort, and explore enrichment results dynamically. LOLAweb is built and deployed in a Linux container, making it scalable to many concurrent users on our servers and also enabling users to download and run LOLAweb locally.
Affiliation(s)
- V P Nagraj
  - School of Medicine Research Computing, University of Virginia, USA
- Neal E Magee
  - School of Medicine Research Computing, University of Virginia, USA
- Nathan C Sheffield
  - Center for Public Health Genomics, University of Virginia, USA; Departments of Public Health Sciences, Biomedical Engineering, and Biochemistry and Molecular Genetics, University of Virginia, USA
19
Masser DR, Hadad N, Porter H, Stout MB, Unnikrishnan A, Stanford DR, Freeman WM. Analysis of DNA modifications in aging research. GeroScience 2018; 40:11-29. [PMID: 29327208 PMCID: PMC5832665 DOI: 10.1007/s11357-018-0005-3] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 11/27/2017] [Accepted: 01/05/2018] [Indexed: 12/22/2022] Open
Abstract
As geroscience research extends into the role of epigenetics in aging and age-related disease, researchers are being confronted with unfamiliar molecular techniques and data analysis methods that can be difficult to integrate into their work. In this review, we focus on the analysis of DNA modifications, namely cytosine methylation and hydroxymethylation, through next-generation sequencing methods. While older techniques for modification analysis performed relative quantitation across regions of the genome or examined average genome levels, these analyses lack the desired specificity, rigor, and genomic coverage to firmly establish the nature of genomic methylation patterns and their response to aging. With recent methodological advances, such as whole genome bisulfite sequencing (WGBS), bisulfite oligonucleotide capture sequencing (BOCS), and bisulfite amplicon sequencing (BSAS), cytosine modifications can now be readily analyzed with base-specific, absolute quantitation at both cytosine-guanine dinucleotide (CG) and non-CG sites throughout the genome or within specific regions of interest by next-generation sequencing. Additional advances, such as oxidative bisulfite conversion to differentiate methylation from hydroxymethylation and analysis of limited input/single-cells, have great promise for continuing to expand epigenomic capabilities. This review provides a background on DNA modifications, the current state-of-the-art for sequencing methods, bioinformatics tools for converting these large data sets into biological insights, and perspectives on future directions for the field.
Affiliation(s)
- Dustin R Masser
  - Reynolds Oklahoma Center on Aging, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA
  - Department of Physiology, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA
  - Oklahoma Nathan Shock Center for Aging, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA
- Niran Hadad
  - Reynolds Oklahoma Center on Aging, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA
  - Oklahoma Nathan Shock Center for Aging, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA
  - Oklahoma Center for Neuroscience, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA
- Hunter Porter
  - Reynolds Oklahoma Center on Aging, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA
  - Oklahoma Nathan Shock Center for Aging, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA
  - Oklahoma Center for Neuroscience, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA
- Michael B Stout
  - Reynolds Oklahoma Center on Aging, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA
  - Department of Nutritional Sciences, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA
- Archana Unnikrishnan
  - Reynolds Oklahoma Center on Aging, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA
  - Department of Geriatric Medicine, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA
- David R Stanford
  - Reynolds Oklahoma Center on Aging, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA
  - Department of Physiology, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA
  - Oklahoma Center for Neuroscience, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA
- Willard M Freeman
  - Reynolds Oklahoma Center on Aging, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA
  - Department of Physiology, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA
  - Oklahoma Nathan Shock Center for Aging, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA
  - Oklahoma Center for Neuroscience, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA
  - Department of Nutritional Sciences, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA