1
Viner C, Ishak CA, Johnson J, Walker NJ, Shi H, Sjöberg-Herrera MK, Shen SY, Lardo SM, Adams DJ, Ferguson-Smith AC, De Carvalho DD, Hainer SJ, Bailey TL, Hoffman MM. Modeling methyl-sensitive transcription factor motifs with an expanded epigenetic alphabet. Genome Biol 2024; 25:11. [PMID: 38191487 PMCID: PMC10773111 DOI: 10.1186/s13059-023-03070-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2023] [Accepted: 09/21/2023] [Indexed: 01/10/2024]
Abstract
BACKGROUND Transcription factors bind DNA in specific sequence contexts. In addition to distinguishing one nucleobase from another, some transcription factors can distinguish between unmodified and modified bases. Current models of transcription factor binding tend not to take DNA modifications into account, while the recent few that do often have limitations. This makes a comprehensive and accurate profiling of transcription factor affinities difficult. RESULTS Here, we develop methods to identify transcription factor binding sites in modified DNA. Our models expand the standard A/C/G/T DNA alphabet to include cytosine modifications. We develop Cytomod to create modified genomic sequences and we also enhance the MEME Suite, adding the capacity to handle custom alphabets. We adapt the well-established position weight matrix (PWM) model of transcription factor binding affinity to this expanded DNA alphabet. Using these methods, we identify modification-sensitive transcription factor binding motifs. We confirm established binding preferences, such as the preference of ZFP57 and C/EBPβ for methylated motifs and the preference of c-Myc for unmethylated E-box motifs. CONCLUSIONS Using known binding preferences to tune model parameters, we discover novel modified motifs for a wide array of transcription factors. Finally, we validate our binding preference predictions for OCT4 using cleavage under targets and release using nuclease (CUT&RUN) experiments across conventional, methylation-, and hydroxymethylation-enriched sequences. Our approach readily extends to other DNA modifications. As more genome-wide single-base resolution modification data becomes available, we expect that our method will yield insights into altered transcription factor binding affinities across many different modifications.
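To make the expanded-alphabet PWM idea concrete, the sketch below scores sequences against a log-odds position weight matrix whose alphabet has a fifth symbol for methylated cytosine. The symbol `m`, the toy motif, the uniform background and all function names are illustrative assumptions; they are not taken from Cytomod or the MEME Suite.

```python
import math

# Hypothetical expanded alphabet: the four standard bases plus "m" for
# 5-methylcytosine. The symbol choice is an assumption for illustration.
ALPHABET = "ACGTm"


def log_odds_pwm(freqs, background):
    """Convert per-position symbol probabilities into log2-odds scores."""
    return [
        {s: math.log2(col[s] / background[s]) for s in ALPHABET}
        for col in freqs
    ]


def score(pwm, seq):
    """Sum the per-position log-odds scores of seq (same length as pwm)."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must match motif width")
    return sum(col[s] for col, s in zip(pwm, seq))


def best_hit(pwm, seq):
    """Slide the motif along seq; return (offset, score) of the best window."""
    w = len(pwm)
    hits = [(i, score(pwm, seq[i:i + w])) for i in range(len(seq) - w + 1)]
    return max(hits, key=lambda h: h[1])


# Toy 3-column motif that prefers a methylated cytosine in the middle.
uniform = {s: 0.2 for s in ALPHABET}
toy_freqs = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.80, "m": 0.05},
    {"A": 0.05, "C": 0.10, "G": 0.05, "T": 0.05, "m": 0.75},
    {"A": 0.05, "C": 0.05, "G": 0.80, "T": 0.05, "m": 0.05},
]
pwm = log_odds_pwm(toy_freqs, uniform)

methylated = score(pwm, "TmG")
unmethylated = score(pwm, "TCG")
```

With this toy matrix, the methylated site `TmG` scores higher than the unmethylated `TCG`, which is the kind of modification-sensitive preference the abstract describes.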
Affiliation(s)
- Coby Viner
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
  - Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
- Charles A Ishak
  - Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
  - Department of Epigenetics and Molecular Carcinogenesis, University of Texas MD Anderson Cancer Center, Houston, TX, USA
- James Johnson
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
- Nicolas J Walker
  - Department of Genetics, University of Cambridge, Cambridge, England
- Hui Shi
  - Department of Genetics, University of Cambridge, Cambridge, England
- Marcela K Sjöberg-Herrera
  - Wellcome Sanger Institute, Cambridge, England
  - Faculty of Biological Sciences, Pontificia Universidad Católica de Chile, Santiago, Chile
- Shu Yi Shen
  - Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
- Santana M Lardo
  - Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA, USA
- Daniel D De Carvalho
  - Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
  - Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
- Sarah J Hainer
  - Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA, USA
- Timothy L Bailey
  - Department of Pharmacology, University of Nevada, Reno, Reno, NV, USA
- Michael M Hoffman
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
  - Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
  - Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
  - Vector Institute for Artificial Intelligence, Toronto, ON, Canada
2
Daneshpajouh H, Chen B, Shokraneh N, Masoumi S, Wiese KC, Libbrecht MW. Continuous chromatin state feature annotation of the human epigenome. Bioinformatics 2022; 38:3029-3036. [PMID: 35451453 PMCID: PMC9154241 DOI: 10.1093/bioinformatics/btac283] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2021] [Revised: 02/18/2022] [Accepted: 04/18/2022] [Indexed: 12/02/2022]
Abstract
Motivation Segmentation and genome annotation (SAGA) algorithms are widely used to understand genome activity and gene regulation. These methods take as input a set of sequencing-based assays of epigenomic activity, such as ChIP-seq measurements of histone modification and transcription factor binding. They output an annotation of the genome that assigns a chromatin state label to each genomic position. Existing SAGA methods have several limitations caused by the discrete annotation framework: such annotations cannot easily represent varying strengths of genomic elements, and they cannot easily represent combinatorial elements that simultaneously exhibit multiple types of activity. To remedy these limitations, we propose an annotation strategy that instead outputs a vector of chromatin state features at each position rather than a single discrete label. Continuous modeling is common in other fields, such as in topic modeling of text documents. We propose a method, epigenome-ssm-nonneg, that uses a non-negative state space model to efficiently annotate the genome with chromatin state features. We also propose several measures of the quality of a chromatin state feature annotation and we compare the performance of several alternative methods according to these quality measures. Results We show that chromatin state features from epigenome-ssm-nonneg are more useful for several downstream applications than both continuous and discrete alternatives, including their ability to identify expressed genes and enhancers. Therefore, we expect that these continuous chromatin state features will be valuable reference annotations to be used in visualization and downstream analysis. Availability and implementation Source code for epigenome-ssm is available at https://github.com/habibdanesh/epigenome-ssm and Zenodo (DOI: 10.5281/zenodo.6507585). Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Habib Daneshpajouh
  - School of Computing Science, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada
- Bowen Chen
  - School of Computing Science, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada
- Neda Shokraneh
  - School of Computing Science, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada
- Shohre Masoumi
  - School of Computing Science, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada
- Kay C Wiese
  - School of Computing Science, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada
- Maxwell W Libbrecht
  - School of Computing Science, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada
3
Kyoda K, Ho KHL, Tohsato Y, Itoga H, Onami S. BD5: An open HDF5-based data format to represent quantitative biological dynamics data. PLoS One 2020; 15:e0237468. [PMID: 32785254 PMCID: PMC7423140 DOI: 10.1371/journal.pone.0237468] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/27/2020] [Accepted: 07/27/2020] [Indexed: 11/18/2022]
Abstract
BD5 is a new binary data format based on HDF5 (hierarchical data format version 5). It can be used for representing quantitative biological dynamics data obtained from bioimage informatics techniques and mechanobiological simulations. Biological Dynamics Markup Language (BDML) is an XML (Extensible Markup Language)-based open format that is also used to represent such data; however, it becomes difficult to access quantitative data in BDML files when the file size is large because parsing XML-based files requires large computational resources to first read the whole file sequentially into computer memory. BD5 enables fast random (i.e., direct) access to quantitative data on disk without parsing the entire file. Therefore, it allows practical reuse of data for understanding biological mechanisms underlying the dynamics.
Affiliation(s)
- Koji Kyoda
  - Laboratory for Developmental Dynamics, RIKEN Center for Biosystems Dynamics Research, Kobe, Japan
  - Laboratory for Developmental Dynamics, RIKEN Quantitative Biology Center, Kobe, Japan
- Kenneth H. L. Ho
  - Laboratory for Developmental Dynamics, RIKEN Center for Biosystems Dynamics Research, Kobe, Japan
  - Laboratory for Developmental Dynamics, RIKEN Quantitative Biology Center, Kobe, Japan
- Yukako Tohsato
  - Laboratory for Developmental Dynamics, RIKEN Center for Biosystems Dynamics Research, Kobe, Japan
  - Laboratory for Developmental Dynamics, RIKEN Quantitative Biology Center, Kobe, Japan
  - Department of Information Science and Engineering, Ritsumeikan University, Shiga, Japan
- Hiroya Itoga
  - Laboratory for Developmental Dynamics, RIKEN Center for Biosystems Dynamics Research, Kobe, Japan
- Shuichi Onami
  - Laboratory for Developmental Dynamics, RIKEN Center for Biosystems Dynamics Research, Kobe, Japan
  - Laboratory for Developmental Dynamics, RIKEN Quantitative Biology Center, Kobe, Japan
4
Nti-Addae Y, Matthews D, Ulat VJ, Syed R, Sempéré G, Pétel A, Renner J, Larmande P, Guignon V, Jones E, Robbins K. Benchmarking database systems for Genomic Selection implementation. Database: The Journal of Biological Databases and Curation 2020; 2019:5566651. [PMID: 31508797 PMCID: PMC6737464 DOI: 10.1093/database/baz096] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 02/26/2019] [Revised: 05/29/2019] [Accepted: 07/01/2019] [Indexed: 01/07/2023]
Abstract
MOTIVATION With high-throughput genotyping systems now available, it has become feasible to fully integrate genotyping information into breeding programs. Making effective use of this information requires DNA extraction facilities and marker production facilities that can efficiently deploy the desired set of markers across samples with a turnaround time rapid enough to allow selection before crosses need to be made. In reality, breeders often have only a short window of time to make decisions once they have collected all their phenotyping data and received the corresponding genotyping data. This makes it challenging to organize the information and use it in downstream analyses that support breeders' decisions. Implementing genomic selection routinely as part of breeding programs therefore requires an efficient genotyping data storage system. We selected and benchmarked six popular open-source data storage systems, including relational database management and columnar storage systems. RESULTS We found that data extract times are greatly influenced by the orientation in which genotype data is stored in a system. HDF5 consistently performed best, in part because it can more efficiently work with both orientations of the allele matrix. AVAILABILITY http://gobiin1.bti.cornell.edu:6083/projects/GBM/repos/benchmarking/browse.
Affiliation(s)
- Victor Jun Ulat
  - Centro Internacional de Mejoramiento de Maíz y Trigo (CIMMYT)
- Raza Syed
  - Institute of Biotechnology, Cornell University
- Kelly Robbins
  - Section of Plant Breeding and Genetics, School of Integrative Plants Sciences, Cornell University
5
Dronamraju R, Jha DK, Eser U, Adams AT, Dominguez D, Choudhury R, Chiang YC, Rathmell WK, Emanuele MJ, Churchman LS, Strahl BD. Set2 methyltransferase facilitates cell cycle progression by maintaining transcriptional fidelity. Nucleic Acids Res 2019; 46:1331-1344. [PMID: 29294086 PMCID: PMC5814799 DOI: 10.1093/nar/gkx1276] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 07/12/2017] [Accepted: 12/18/2017] [Indexed: 12/14/2022]
Abstract
Methylation of histone H3 lysine 36 (H3K36me) by yeast Set2 is critical for the maintenance of chromatin structure and transcriptional fidelity. However, we do not know the full range of Set2/H3K36me functions or the scope of mechanisms that regulate Set2-dependent H3K36 methylation. Here, we show that the APC/C^CDC20 complex regulates Set2 protein abundance during the cell cycle. Significantly, absence of Set2-mediated H3K36me causes a loss of cell cycle control and pronounced defects in the transcriptional fidelity of cell cycle regulatory genes, a class of genes that are generally long, hence highly dependent on Set2/H3K36me for their transcriptional fidelity. Because APC/C also controls human SETD2, and SETD2 likewise regulates cell cycle progression, our data imply an evolutionarily conserved cell cycle function for Set2/SETD2 that may explain why recurrent mutations of SETD2 contribute to human disease.
Affiliation(s)
- Raghuvar Dronamraju
  - Department of Biochemistry & Biophysics, University of North Carolina School of Medicine, Chapel Hill, NC 27599, USA
- Deepak Kumar Jha
  - Department of Biochemistry & Biophysics, University of North Carolina School of Medicine, Chapel Hill, NC 27599, USA
- Umut Eser
  - Department of Genetics, Harvard Medical School, Harvard University, Boston, MA 02115, USA
- Alexander T Adams
  - Department of Biochemistry & Biophysics, University of North Carolina School of Medicine, Chapel Hill, NC 27599, USA
- Daniel Dominguez
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02115, USA
- Rajarshi Choudhury
  - Department of Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina School of Medicine, Chapel Hill, NC 27599, USA
- Yun-Chen Chiang
  - Lineberger Comprehensive Cancer Center, University of North Carolina School of Medicine, Chapel Hill, NC 27599, USA
- W Kimryn Rathmell
  - Lineberger Comprehensive Cancer Center, University of North Carolina School of Medicine, Chapel Hill, NC 27599, USA
- Michael J Emanuele
  - Department of Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina School of Medicine, Chapel Hill, NC 27599, USA
- L Stirling Churchman
  - Department of Genetics, Harvard Medical School, Harvard University, Boston, MA 02115, USA
- Brian D Strahl
  - Department of Biochemistry & Biophysics, University of North Carolina School of Medicine, Chapel Hill, NC 27599, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina School of Medicine, Chapel Hill, NC 27599, USA
6
Kumar R, Sobhy H, Stenberg P, Lizana L. Genome contact map explorer: a platform for the comparison, interactive visualization and analysis of genome contact maps. Nucleic Acids Res 2017; 45:e152. [PMID: 28973466 PMCID: PMC5622372 DOI: 10.1093/nar/gkx644] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 12/25/2016] [Accepted: 07/19/2017] [Indexed: 12/23/2022]
Abstract
Hi-C experiments generate data in the form of large genome contact maps (Hi-C maps). These show that chromosomes are arranged in a hierarchy of three-dimensional compartments. But to understand how these compartments form and by how much they affect genetic processes such as gene regulation, biologists and bioinformaticians need efficient tools to visualize and analyze Hi-C data. This is technically challenging because these maps are big. In this paper, we address this problem, partly by implementing an efficient file format, and develop the genome contact map explorer platform. Apart from tools to process Hi-C data, such as normalization methods and a programmable interface, we made a graphical interface that lets users browse, scroll and zoom Hi-C maps to visually search for patterns in the Hi-C data. In the software, it is also possible to browse several maps simultaneously and plot related genomic data. The software is openly accessible to the scientific community.
Affiliation(s)
- Rajendra Kumar
  - Integrated Science Lab, Umeå University, 901 87, Umeå, Sweden
  - Department of Physics, Umeå University, 901 87, Umeå, Sweden
- Haitham Sobhy
  - Department of Molecular Biology, Umeå University, 901 87, Umeå, Sweden
- Per Stenberg
  - Department of Molecular Biology, Umeå University, 901 87, Umeå, Sweden
  - Division of CBRN Security and Defence, FOI-Swedish Defence Research Agency, 906 21, Umeå, Sweden
- Ludvig Lizana
  - Integrated Science Lab, Umeå University, 901 87, Umeå, Sweden
  - Department of Physics, Umeå University, 901 87, Umeå, Sweden
7
Huang F, Shen J, Guo Q, Shi Y. eRFSVM: a hybrid classifier to predict enhancers-integrating random forests with support vector machines. Hereditas 2016; 153:6. [PMID: 28096768 PMCID: PMC5226099 DOI: 10.1186/s41065-016-0012-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 05/15/2016] [Accepted: 06/16/2016] [Indexed: 01/03/2023]
Abstract
BACKGROUND Enhancers are tissue-specific distal regulatory elements that play vital roles in gene regulation and expression. The prediction and identification of enhancers are important but challenging issues for bioinformatics studies. Existing computational methods, mostly single classifiers, can only predict enhancers based on the transcriptional coactivator EP300 and show low generalization performance. RESULTS We built a hybrid classifier called eRFSVM in this study, using random forests as base classifiers and a support vector machine as the main classifier. eRFSVM integrates two components, eRFSVM-ENCODE and eRFSVM-FANTOM5, with diverse features and labels. Each base classifier was trained with random forests on datasets from a single tissue or cell type. The main classifier made the final decision with the support vector machine algorithm, taking the predictions of the base classifiers as inputs. For eRFSVM-ENCODE, we trained on datasets from cell lines including Gm12878, Hep, H1-hesc and Huvec, using ChIP-Seq datasets as features and EP300 based enhancers as labels. We tested eRFSVM-ENCODE on the K562 dataset and obtained a prediction precision of 83.69 %, which was much better than existing classifiers. For eRFSVM-FANTOM5, with enhancers identified by RNA in the FANTOM5 project as labels, the precision, recall, F-score and accuracy of eRFSVM were 86.17 %, 36.06 %, 50.84 % and 93.38 %, relative increases of 23.24 %, 97.05 %, 76.90 % and 4.69 % over the existing algorithm (69.92 %, 18.30 %, 28.74 % and 89.20 %), respectively. CONCLUSIONS All these results demonstrated that eRFSVM was a better classifier in predicting both EP300 based and FANTOM5 RNA based enhancers.
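The FANTOM5 figures in this abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes that the F-score is the usual harmonic mean of precision and recall, and that each reported increase is relative to the existing algorithm's value given in parentheses. The function and variable names are illustrative.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)


def relative_increase(new, old):
    """Percentage increase of new over old."""
    return 100 * (new - old) / old


# Values (in percent) as reported in the abstract.
erfsvm = {"precision": 86.17, "recall": 36.06, "f_score": 50.84, "accuracy": 93.38}
existing = {"precision": 69.92, "recall": 18.30, "f_score": 28.74, "accuracy": 89.20}

# Recompute the F-score from the reported precision and recall.
recomputed_f = f_score(erfsvm["precision"], erfsvm["recall"])

# Recompute each relative increase over the existing algorithm.
increases = {
    name: round(relative_increase(erfsvm[name], existing[name]), 2)
    for name in erfsvm
}
```

Under these assumptions the recomputed F-score rounds to 50.84 %, and the four relative increases round to 23.24 %, 97.05 %, 76.90 % and 4.69 %, matching the reported values.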
Affiliation(s)
- Fang Huang
  - Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders (Ministry of Education) and the Collaborative Innovation Center for Brain Science, Shanghai Jiao Tong University, Shanghai, 200030 People’s Republic of China
- Jiawei Shen
  - Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders (Ministry of Education) and the Collaborative Innovation Center for Brain Science, Shanghai Jiao Tong University, Shanghai, 200030 People’s Republic of China
- Qingli Guo
  - Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders (Ministry of Education) and the Collaborative Innovation Center for Brain Science, Shanghai Jiao Tong University, Shanghai, 200030 People’s Republic of China
- Yongyong Shi
  - Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders (Ministry of Education) and the Collaborative Innovation Center for Brain Science, Shanghai Jiao Tong University, Shanghai, 200030 People’s Republic of China
  - Shanghai Changning Mental Health Center, Shanghai, 200042 People’s Republic of China
  - Department of Psychiatry, The First Teaching Hospital of Xinjiang Medical University, Urumqi, 830054 People’s Republic of China
  - The Bio-X Little White Building, Shanghai Jiao Tong University, No.55 Guang Yuan Xi Road, Shanghai, 200030 China
8
Abstract
MOTIVATION BigWig, a format to represent read density data, is one of the most popular data types. BigWig files can represent the peak intensity in ChIP-seq, the transcript expression in RNA-seq, the copy number variation in whole genome sequencing, etc. The UCSC ENCODE project uses the bigWig format heavily for storage and visualization. Of the 5.2 TB ENCODE hg19 database, 1.6 TB (31% of the total space) is used to store bigWig files. The bigWig format not only saves a lot of space but also supports fast queries that are crucial for interactive analysis and browsing. In our benchmark, bigWig often has a similar size to the gzipped raw data, while still supporting ~5000 random queries per second. RESULTS Although bigWig is good enough at the moment, both storage space and query time are expected to become limiting as sequencing gets cheaper. This article describes a new method to store density data named CWig. The format uses on average one-third of the size of existing bigWig files and improves random query speed up to 100 times. AVAILABILITY AND IMPLEMENTATION http://genome.ddns.comp.nus.edu.sg/~cwig.
Affiliation(s)
- Do Huy Hoang
  - Department of Computational and Systems Biology, Genome Institute of Singapore, Singapore 138672
  - Department of Computer Science, School of Computing, National University of Singapore, Singapore 117417
- Wing-Kin Sung
  - Department of Computational and Systems Biology, Genome Institute of Singapore, Singapore 138672
  - Department of Computer Science, School of Computing, National University of Singapore, Singapore 117417
9
Dale RK, Matzat LH, Lei EP. metaseq: a Python package for integrative genome-wide analysis reveals relationships between chromatin insulators and associated nuclear mRNA. Nucleic Acids Res 2014; 42:9158-70. [PMID: 25063299 DOI: 10.1093/nar/gku644] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Indexed: 11/14/2022]
Abstract
Here we introduce metaseq, a software library written in Python, which enables loading multiple genomic data formats into standard Python data structures and allows flexible, customized manipulation and visualization of data from high-throughput sequencing studies. We demonstrate its practical use by analyzing multiple datasets related to chromatin insulators, which are DNA-protein complexes proposed to organize the genome into distinct transcriptional domains. Recent studies in Drosophila and mammals have implicated RNA in the regulation of chromatin insulator activities. Moreover, the Drosophila RNA-binding protein Shep has been shown to antagonize gypsy insulator activity in a tissue-specific manner, but the precise role of RNA in this process remains unclear. Better understanding of chromatin insulator regulation requires integration of multiple datasets, including those from chromatin-binding, RNA-binding, and gene expression experiments. We use metaseq to integrate RIP- and ChIP-seq data for Shep and the core gypsy insulator protein Su(Hw) in two different cell types, along with publicly available ChIP-chip and RNA-seq data. Based on the metaseq-enabled analysis presented here, we propose a model where Shep associates with chromatin cotranscriptionally, then is recruited to insulator complexes in trans where it plays a negative role in insulator activity.
Affiliation(s)
- Ryan K Dale
  - Laboratory of Cellular and Developmental Biology, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland, 20892, USA
- Leah H Matzat
  - Laboratory of Cellular and Developmental Biology, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland, 20892, USA
- Elissa P Lei
  - Laboratory of Cellular and Developmental Biology, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland, 20892, USA
10
Hoffman MM, Buske OJ, Wang J, Weng Z, Bilmes JA, Noble WS. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat Methods 2012; 9:473-6. [PMID: 22426492 DOI: 10.1038/nmeth.1937] [Citation(s) in RCA: 395] [Impact Index Per Article: 32.9] [Received: 07/01/2011] [Accepted: 02/14/2012] [Indexed: 01/24/2023]
Abstract
We trained Segway, a dynamic Bayesian network method, simultaneously on chromatin data from multiple experiments, including positions of histone modifications, transcription-factor binding and open chromatin, all derived from a human chronic myeloid leukemia cell line. In an unsupervised fashion, we identified patterns associated with transcription start sites, gene ends, enhancers, transcriptional regulator CTCF-binding regions and repressed regions. Software and genome browser tracks are at http://noble.gs.washington.edu/proj/segway/.
Affiliation(s)
- Michael M Hoffman
  - Department of Genome Sciences, University of Washington, Seattle, Washington, USA
11
Identifying elemental genomic track types and representing them uniformly. BMC Bioinformatics 2011; 12:494. [PMID: 22208806 PMCID: PMC3315820 DOI: 10.1186/1471-2105-12-494] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Received: 05/11/2011] [Accepted: 12/30/2011] [Indexed: 11/10/2022]
Abstract
BACKGROUND With the recent advances and availability of various high-throughput sequencing technologies, data on many molecular aspects, such as gene regulation, chromatin dynamics, and the three-dimensional organization of DNA, are rapidly being generated in an increasing number of laboratories. The variation in biological context, and the increasingly dispersed mode of data generation, imply a need for precise, interoperable and flexible representations of genomic features through formats that are easy to parse. A host of alternative formats are currently available and in use, complicating analysis and tool development. The issue of whether and how the multitude of formats reflects varying underlying characteristics of data has to our knowledge not previously been systematically treated. RESULTS We here identify intrinsic distinctions between genomic features, and argue that the distinctions imply that a certain variation in the representation of features as genomic tracks is warranted. Four core informational properties of tracks are discussed: gaps, lengths, values and interconnections. From this we delineate fifteen generic track types. Based on the track type distinctions, we characterize major existing representational formats and find that the track types are not adequately supported by any single format. We also find, in contrast to the XML formats, that none of the existing tabular formats are conveniently extendable to support all track types. We thus propose two unified formats for track data, an improved XML format, BioXSD 1.1, and a new tabular format, GTrack 1.0. CONCLUSIONS The defined track types are shown to capture relevant distinctions between genomic annotation tracks, resulting in varying representational needs and analysis possibilities. The proposed formats, GTrack 1.0 and BioXSD 1.1, cater to the identified track distinctions and emphasize preciseness, flexibility and parsing convenience.
12
Steinbiss S, Kurtz S. A new efficient data structure for storage and retrieval of multiple biosequences. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2011; 9:330-344. [PMID: 22084150 DOI: 10.1109/tcbb.2011.146] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 05/31/2023]
Abstract
Today's genome analysis applications require sequence representations allowing for fast access to their contents while also being memory-efficient enough to facilitate analyses of large-scale data. While a wide variety of sequence representations exist, lack of a generic implementation of efficient sequence storage has led to a plethora of poorly reusable or programming language-specific implementations. We present a novel, space-efficient data structure (GtEncseq) for storing multiple biological sequences of variable alphabet size, with customizable character transformations, wildcard support and an assortment of internal representations optimized for different distributions of wildcards and sequence lengths. For the human genome (3.1 gigabases, including 237 million wildcard characters) our representation requires only 2 + 8 × 10^-6 bits per character. Implemented in C, our portable software implementation provides a variety of methods for random and sequential access to characters and substrings (including different reading directions) using an object-oriented interface. In addition, it includes access to metadata like sequence descriptions or character distributions. The library is extensible to be used from various scripting languages. GtEncseq is much more versatile than previous solutions, adding features that were previously unavailable. Benchmarks show that it is competitive with respect to space and time requirements.
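As a rough illustration of why about two bits per character is achievable, the sketch below packs A/C/G/T into two bits each and keeps wildcard positions in a separate dictionary. It is a simplified stand-in under stated assumptions, not GtEncseq's actual representation or API, and the function names are hypothetical.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def encode(seq):
    """Pack seq into 2 bits per character; record wildcards separately."""
    packed = bytearray((len(seq) + 3) // 4)
    wildcards = {}
    for i, ch in enumerate(seq):
        code = CODES.get(ch)
        if code is None:  # wildcard such as N: remember it, store code 0
            wildcards[i] = ch
            code = 0
        packed[i // 4] |= code << (2 * (i % 4))
    return packed, wildcards, len(seq)


def char_at(encoded, i):
    """Random access to character i without decoding the whole sequence."""
    packed, wildcards, length = encoded
    if not 0 <= i < length:
        raise IndexError(i)
    if i in wildcards:
        return wildcards[i]
    return BASES[(packed[i // 4] >> (2 * (i % 4))) & 3]


enc = encode("ACGTNNGATTACA")
decoded = "".join(char_at(enc, i) for i in range(enc[2]))

# Back-of-the-envelope size for the human genome figure quoted above:
# 3.1 gigabases at 2 + 8e-6 bits per character, converted to bytes.
bits_per_char = 2 + 8e-6
genome_bytes = 3.1e9 * bits_per_char / 8
```

At the quoted rate, 3.1 gigabases come to roughly 775 MB. A per-position dictionary would not reach the abstract's wildcard overhead for 237 million wildcards, which is presumably where the specialized internal representations the abstract mentions come in.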
13
Buske OJ, Hoffman MM, Ponts N, Le Roch KG, Noble WS. Exploratory analysis of genomic segmentations with Segtools. BMC Bioinformatics 2011; 12:415. [PMID: 22029426 PMCID: PMC3224787 DOI: 10.1186/1471-2105-12-415] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Received: 02/11/2011] [Accepted: 10/26/2011] [Indexed: 11/23/2022]
Abstract
Background As genome-wide experiments and annotations become more prevalent, researchers increasingly require tools to help interpret data at this scale. Many functional genomics experiments involve partitioning the genome into labeled segments, such that segments sharing the same label exhibit one or more biochemical or functional traits. For example, a collection of ChIP-seq experiments yields a compendium of peaks, each labeled with one or more associated DNA-binding proteins. Similarly, manually or automatically generated annotations of functional genomic elements, including cis-regulatory modules and protein-coding or RNA genes, can also be summarized as genomic segmentations. Results We present a software toolkit called Segtools that simplifies and automates the exploration of genomic segmentations. The software operates as a series of interacting tools, each of which provides one mode of summarization. These various tools can be pipelined and summarized in a single HTML page. We describe the Segtools toolkit and demonstrate its use in interpreting a collection of human histone modification data sets and Plasmodium falciparum local chromatin structure data sets. Conclusions Segtools provides a convenient, powerful means of interpreting a genomic segmentation.
Affiliation(s)
- Orion J Buske
  - Department of Genome Sciences, University of Washington, PO Box 355065, Seattle, WA 98195-5065, USA