1
Zhuang J, Huang X, Liu S, Gao W, Su R, Feng K. MulTFBS: A Spatial-Temporal Network with Multichannels for Predicting Transcription Factor Binding Sites. J Chem Inf Model 2024; 64:4322-4333. [PMID: 38733561 DOI: 10.1021/acs.jcim.3c02088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/13/2024]
Abstract
Revealing the mechanisms that influence transcription factor binding specificity is the key to understanding gene regulation. In previous studies, DNA double helix structure and one-hot embedding have been used successfully to design computational methods for predicting transcription factor binding sites (TFBSs). However, although DNA sequence can be regarded as a kind of biological language, word embedding representations from natural language processing have not been properly considered in TFBS prediction models. In our work, we integrate different types of DNA sequence features to design a multichannel deep learning framework, namely MulTFBS, in which three kinds of features are trained in separate channels: one-hot encoding; word embedding encoding, which can incorporate contextual information and extract the global features of the sequences; and three-dimensional structural features of the double helix. To extract high-level sequence information effectively, our framework uses a spatial-temporal network that combines convolutional neural networks and bidirectional long short-term memory networks with an attention mechanism. Compared with six state-of-the-art methods on 66 universal protein-binding microarray data sets of different transcription factors, MulTFBS performs best on all data sets in the regression tasks, with an average R2 of 0.698 and an average PCC of 0.833, which are 5.4% and 3.2% higher, respectively, than those of the second-best method, CRPTS. In addition, we evaluate the classification performance of MulTFBS for distinguishing bound from unbound regions on TF ChIP-seq data. The results show that our framework also performs well in the TFBS classification tasks.
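As a rough illustration of two ingredients named in this abstract, the sketch below shows a plain one-hot encoding of a DNA string and the two regression metrics (PCC and R2). It is not the MulTFBS code: the function names are hypothetical, mapping unknown symbols such as N to an all-zero row is an assumption, and R2 is assumed to be the usual coefficient of determination, which the abstract does not spell out.

```python
import math

BASES = "ACGT"  # assumed channel order for the one-hot rows


def one_hot(seq):
    """Encode a DNA string as rows of 4 indicators (A, C, G, T).

    Assumption: any symbol outside ACGT (for example N) becomes an all-zero row.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def pearson(y_true, y_pred):
    """Pearson correlation coefficient (PCC) between two equal-length lists."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in y_true))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in y_pred))
    return cov / (sd_t * sd_p)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

In a multichannel model, a matrix like the one returned by `one_hot` would feed one channel, while embedding and structural features would feed the others.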
Affiliation(s)
- Jujuan Zhuang
- The School of Science, Dalian Maritime University, Dalian 116026, China
- Xinru Huang
- The School of Science, Dalian Maritime University, Dalian 116026, China
- Shuhan Liu
- The School of Science, Dalian Maritime University, Dalian 116026, China
- Wanquan Gao
- The School of Science, Dalian Maritime University, Dalian 116026, China
- Rui Su
- The School of Science, Dalian Maritime University, Dalian 116026, China
- Kexin Feng
- The School of Science, Dalian Maritime University, Dalian 116026, China
2
Chew YH, Marucci L. Mechanistic Model-Driven Biodesign in Mammalian Synthetic Biology. Methods Mol Biol 2024; 2774:71-84. [PMID: 38441759 DOI: 10.1007/978-1-0716-3718-0_6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/07/2024]
Abstract
Mathematical modeling plays a vital role in mammalian synthetic biology by providing a framework to design and optimize design circuits and engineered bioprocesses, predict their behavior, and guide experimental design. Here, we review recent models used in the literature, considering mathematical frameworks at the molecular, cellular, and system levels. We report key challenges in the field and discuss opportunities for genome-scale models, machine learning, and cybergenetics to expand the capabilities of model-driven mammalian cell biodesign.
Affiliation(s)
- Yin Hoon Chew
- School of Mathematics, University of Birmingham, Birmingham, UK
- Lucia Marucci
- Department of Engineering Mathematics, University of Bristol, Bristol, UK.
- School of Cellular and Molecular Medicine, University of Bristol, Bristol, UK.
3
Zhuang J, Feng K, Teng X, Jia C. GNet: An integrated context-aware neural framework for transcription factor binding signal at single nucleotide resolution prediction. Mathematical Biosciences and Engineering: MBE 2023; 20:15809-15829. [PMID: 37919990 DOI: 10.3934/mbe.2023704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/04/2023]
Abstract
Transcription factors (TFs) are important factors that regulate gene expression. Revealing the mechanism affecting the binding specificity of TFs is the key to understanding gene regulation. Most previous studies focus on TF-DNA binding sites at the sequence level and seldom utilize the contextual features of DNA sequences. In this paper, we develop an integrated spatiotemporal context-aware neural network framework, named GNet, for predicting TF-DNA binding signal at single nucleotide resolution by achieving three tasks: single nucleotide resolution signal prediction, identification of binding regions at the sequence level, and TF-DNA binding motif prediction. GNet extracts implicit spatial contextual information with a gated highway neural mechanism, which captures large-context multi-level patterns using linear shortcut connections, an idea that permeates the encoder and decoder parts of GNet. An improved dual external attention mechanism learns implicit relationships both within and among samples and improves the performance of the model. Experimental results on 53 human TF ChIP-seq datasets and 6 chromatin accessibility ATAC-seq datasets show that GNet outperforms the state-of-the-art methods in the three tasks, and the results of cross-species studies on 15 human and 18 mouse TF datasets of the corresponding TF families indicate that GNet also outperforms the competing methods in cross-species prediction.
Affiliation(s)
- Jujuan Zhuang
- School of Science, Dalian Maritime University, Dalian, Liaoning 116026, China
- Kexin Feng
- School of Science, Dalian Maritime University, Dalian, Liaoning 116026, China
- Xinyang Teng
- School of Science, Dalian Maritime University, Dalian, Liaoning 116026, China
- Cangzhi Jia
- School of Science, Dalian Maritime University, Dalian, Liaoning 116026, China
4
Yu Y, Ding P, Gao H, Liu G, Zhang F, Yu B. Cooperation of local features and global representations by a dual-branch network for transcription factor binding sites prediction. Brief Bioinform 2023; 24:7030619. [PMID: 36748992 DOI: 10.1093/bib/bbad036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2022] [Revised: 01/03/2023] [Accepted: 01/18/2023] [Indexed: 02/08/2023] Open
Abstract
Interactions between DNA and transcription factors (TFs) play an essential role in understanding transcriptional regulation mechanisms and gene expression. Due to the large accumulation of training data and low expense, deep learning methods have shown huge potential in determining the specificity of TF-DNA interactions. Convolutional network-based and self-attention network-based methods have been proposed for transcription factor binding sites (TFBSs) prediction. Convolutional operations extract local features efficiently but tend to ignore global information, while self-attention mechanisms excel at capturing long-distance dependencies but struggle to attend to local feature details. To discover comprehensive features for a given sequence as far as possible, we propose a Dual-branch model combining Self-Attention and Convolution, dubbed DSAC, which fuses local features and global representations in an interactive way. In terms of features, convolution and self-attention contribute to feature extraction collaboratively, enhancing the representation learning. In terms of structure, a lightweight but efficient network architecture is designed for the prediction; in particular, the dual-branch structure allows the convolution and the self-attention mechanism to be fully utilized to improve the predictive ability of our model. The experimental results on 165 ChIP-seq datasets show that DSAC clearly outperforms five other deep learning based methods and demonstrate that our model can effectively predict TFBSs based on sequence features alone. The source code of DSAC is available at https://github.com/YuBinLab-QUST/DSAC/.
Affiliation(s)
- Yutong Yu
- College of Information Science and Technology, Qingdao University of Science and Technology, China
- Pengju Ding
- College of Information Science and Technology, Qingdao University of Science and Technology, China
- Hongli Gao
- College of Mathematics and Physics, Qingdao University of Science and Technology, China
- Guozhu Liu
- College of Information Science and Technology, Qingdao University of Science and Technology, China
- Fa Zhang
- School of Medical Technology, Beijing Institute of Technology, China
- Bin Yu
- College of Information Science and Technology, School of Data Science, Qingdao University of Science and Technology, China
5
Guo Z, Guo L, Qin J, Ye F, Sun D, Wu Q, Wang S, Crickmore N, Zhou X, Bravo A, Soberón M, Zhang Y. A single transcription factor facilitates an insect host combating Bacillus thuringiensis infection while maintaining fitness. Nat Commun 2022; 13:6024. [PMID: 36224245 PMCID: PMC9555685 DOI: 10.1038/s41467-022-33706-x] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 05/18/2022] [Accepted: 09/29/2022] [Indexed: 11/09/2022] Open
Abstract
Maintaining fitness during pathogen infection is vital for host survival as an excessive response can be as detrimental as the infection itself. Fitness costs are frequently associated with insect hosts countering the toxic effect of the entomopathogenic bacterium Bacillus thuringiensis (Bt), which delay the evolution of resistance to this pathogen. The insect pest Plutella xylostella has evolved a mechanism to resist Bt toxins without incurring significant fitness costs. Here, we reveal that non-phosphorylated and phosphorylated forms of a MAPK-modulated transcription factor fushi tarazu factor 1 (FTZ-F1) can respectively orchestrate down-regulation of Bt Cry1Ac toxin receptors and up-regulation of non-receptor paralogs via two distinct binding sites, thereby presenting Bt toxin resistance without growth penalty. Our findings reveal how host organisms can co-opt a master molecular switch to overcome pathogen invasion with low cost, and contribute to understanding the underlying mechanism of growth-defense tradeoffs during host-pathogen interactions in P. xylostella.
Affiliation(s)
- Zhaojiang Guo
- Department of Plant Protection, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, 100081, China.
- Guangdong Laboratory for Lingnan Modern Agriculture, Guangzhou, 510642, China.
- Le Guo
- Department of Plant Protection, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Jianying Qin
- Department of Plant Protection, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Fan Ye
- Department of Plant Protection, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Dan Sun
- Department of Plant Protection, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Qingjun Wu
- Department of Plant Protection, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Shaoli Wang
- Department of Plant Protection, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Neil Crickmore
- School of Life Sciences, University of Sussex, Brighton, BN1 9QE, UK
- Xuguo Zhou
- Department of Entomology, University of Kentucky, Lexington, KY, 40546-0091, USA
- Alejandra Bravo
- Departamento de Microbiología Molecular, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, 62250, México
- Mario Soberón
- Departamento de Microbiología Molecular, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, 62250, México
- Youjun Zhang
- Department of Plant Protection, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, 100081, China.
6
Liu W, Jiang Y, Peng L, Sun X, Gan W, Zhao Q, Tang H. Inferring Gene Regulatory Networks Using the Improved Markov Blanket Discovery Algorithm. Interdiscip Sci 2021; 14:168-181. [PMID: 34495484 DOI: 10.1007/s12539-021-00478-9] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Received: 05/07/2021] [Revised: 08/22/2021] [Accepted: 08/24/2021] [Indexed: 11/26/2022]
Abstract
Inferring gene regulatory networks (GRNs) from microarray data can help us understand the mechanisms of life and eventually develop effective therapies. Currently, many computational methods have been used in inferring GRNs. However, owing to high-dimensional data and small samples, these methods often tend to introduce redundant regulatory relationships. Therefore, a novel network inference method based on the improved Markov blanket discovery algorithm, IMBDANET, is proposed to infer GRNs. Specifically, for each target gene, data processing inequality was applied to the Markov blanket discovery algorithm for the accurate differentiation of direct regulatory genes from indirect regulatory genes. Finally, direct regulatory genes were used in constructing GRNs, and the network structure was optimized according to the importance degree score. Experimental results on six public network datasets show that the proposed method can be effectively used to infer GRNs.
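The data processing inequality (DPI) step can be sketched as follows. For a chain X -> Y -> Z, the DPI states that I(X;Z) <= min(I(X;Y), I(Y;Z)), so a candidate regulator whose mutual information with the target is the weakest edge of a triangle is likely indirect. The code below is a generic triangle-pruning sketch under that reading, not the IMBDANET implementation: the function name, the precomputed mutual-information table, and the omission of the Markov blanket search and importance scoring are all assumptions.

```python
def prune_indirect(target, candidates, mi):
    """Keep candidate regulators of `target` that survive a DPI triangle test.

    `mi` maps frozenset({gene_a, gene_b}) to a precomputed mutual information
    value (hypothetical input; how it is estimated is outside this sketch).
    """
    direct = []
    for a in candidates:
        indirect = False
        for b in candidates:
            if a == b:
                continue
            via_b = min(mi[frozenset((a, b))], mi[frozenset((b, target))])
            # If the a-target edge is the weakest in the triangle (a, b, target),
            # treat it as an indirect relationship mediated by b.
            if mi[frozenset((a, target))] < via_b:
                indirect = True
                break
        if not indirect:
            direct.append(a)
    return direct
```

For example, with I(A;T) = 0.2, I(B;T) = 0.8, and I(A;B) = 0.9, gene A is dropped as indirect and only B is kept as a direct regulator of T.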
Affiliation(s)
- Wei Liu
- School of Computer Science, Xiangtan University, Xiangtan, 411105, China
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, 411105, China
- Yi Jiang
- School of Computer Science, Xiangtan University, Xiangtan, 411105, China
- Li Peng
- School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan, 411201, China
- Xingen Sun
- School of Computer Science, Xiangtan University, Xiangtan, 411105, China
- Wenqing Gan
- School of Computer Science, Xiangtan University, Xiangtan, 411105, China
- Qi Zhao
- School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan, 114051, China.
- Huanrong Tang
- School of Computer Science, Xiangtan University, Xiangtan, 411105, China.
7
Zhang Q, Wang S, Chen Z, He Y, Liu Q, Huang DS. Locating transcription factor binding sites by fully convolutional neural network. Brief Bioinform 2021; 22:bbaa435. [PMID: 33498086 PMCID: PMC8425303 DOI: 10.1093/bib/bbaa435] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 11/13/2020] [Revised: 12/11/2020] [Accepted: 12/26/2020] [Indexed: 12/27/2022] Open
Abstract
Transcription factors (TFs) play an important role in regulating gene expression, thus identification of the regions bound by them has become a fundamental step for molecular and cellular biology. In recent years, an increasing number of deep learning (DL) based methods have been proposed for predicting TF binding sites (TFBSs) and achieved impressive prediction performance. However, these methods mainly focus on predicting the sequence specificity of TF-DNA binding, which is equivalent to a sequence-level binary classification task, and fail to identify motifs and TFBSs accurately. In this paper, we developed a fully convolutional network coupled with global average pooling (FCNA), which by contrast is equivalent to a nucleotide-level binary classification task, to roughly locate TFBSs and accurately identify motifs. Experimental results on human ChIP-seq datasets show that FCNA outperforms other competing methods significantly. Besides, we find that the regions located by FCNA can be used by motif discovery tools to further refine the prediction performance. Furthermore, we observe that FCNA can accurately identify TF-DNA binding motifs across different cell lines and infer indirect TF-DNA bindings.
Affiliation(s)
- Qinhu Zhang
- Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Tongji University, Shanghai, China
- Siguo Wang
- Computer Science and Technology, Tongji University, China
- Ying He
- Computer Science and Technology at Tongji University, China
- Qi Liu
- Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, China
- De-Shuang Huang
- Institute of Machines Learning and Systems Biology, Tongji University, China
8
Zhang Q, Yu W, Han K, Nandi AK, Huang DS. Multi-Scale Capsule Network for Predicting DNA-Protein Binding Sites. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:1793-1800. [PMID: 32960766 DOI: 10.1109/tcbb.2020.3025579] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 06/11/2023]
Abstract
Discovering DNA-protein binding sites, also known as motif discovery, is the foundation for further analysis of transcription factors (TFs). Deep learning algorithms such as convolutional neural networks (CNN) have been introduced to the motif discovery task and have achieved state-of-the-art performance. However, due to the limitations of CNN, motif discovery methods based on CNN do not take full advantage of large-scale sequencing data generated by high-throughput sequencing technology. Hence, in this paper we propose a multi-scale capsule network architecture (MSC) integrating multi-scale CNN, a variant of CNN able to extract motif features of different lengths, and capsule network, a novel type of artificial neural network architecture aimed at improving CNN. The proposed method is tested on real ChIP-seq datasets, and the experimental results show a considerable improvement compared with two well-tested deep learning-based sequence models, DeepBind and DeepSEA.
9
Zhang Q, Wang D, Han K, Huang DS. Predicting TF-DNA Binding Motifs from ChIP-seq Datasets Using the Bag-Based Classifier Combined With a Multi-Fold Learning Scheme. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:1743-1751. [PMID: 32946398 DOI: 10.1109/tcbb.2020.3025007] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/11/2023]
Abstract
The rapid development of high-throughput sequencing technology provides unique opportunities for studying transcription factor binding sites, but also brings new computational challenges. Recently, a series of discriminative motif discovery (DMD) methods have been proposed and offer promising solutions for addressing these challenges. However, because of the huge computation cost, most of them have to choose approximate schemes that either sacrifice the accuracy of motif representation or tune motif parameters indirectly. In this paper, we propose a bag-based classifier combined with a multi-fold learning scheme (BCMF) to discover motifs from ChIP-seq datasets. First, BCMF naturally formulates input sequences as labeled bags. Then, a bag-based classifier, combined with a bag feature extracting strategy, is applied to construct the objective function, and a multi-fold learning scheme is used to solve it. Compared with the existing DMD tools, BCMF features three improvements: 1) learning the position weight matrix (PWM) directly in a continuous space; 2) representing a positive bag with a feature fused from its k "most positive" patterns; 3) applying a more advanced learning scheme. The experimental results on 134 ChIP-seq datasets show that BCMF substantially outperforms existing DMD methods (including DREME, HOMER, XXmotif, motifRG, EDCOD and our previous work).
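To make the "k most positive patterns" idea concrete, the sketch below scores every window of a sequence (the bag) against a PWM and fuses the k highest scores. This is only an illustration under stated assumptions, not the BCMF objective: the function names are hypothetical, the PWM is given as one dict of per-base weights per motif position, and fusing by simple averaging is an assumption because the abstract does not say how the k patterns are combined.

```python
def pwm_scores(seq, pwm):
    """Score every window of len(pwm) in `seq` by summing per-position weights.

    `pwm` is a list of dicts, one per motif position, mapping base -> weight.
    """
    width = len(pwm)
    return [
        sum(pwm[j][seq[i + j]] for j in range(width))
        for i in range(len(seq) - width + 1)
    ]


def top_k_bag_feature(seq, pwm, k=2):
    """Fuse the k highest window scores of a bag (assumed fusion: their mean).

    Assumes the sequence is at least as long as the PWM.
    """
    best = sorted(pwm_scores(seq, pwm), reverse=True)[:k]
    return sum(best) / len(best)
```

With a two-column PWM that rewards "A" then "C", the sequence "ACAC" has window scores 2, 0, 2, so the top-2 feature is 2.0.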
10
DeepD2V: A Novel Deep Learning-Based Framework for Predicting Transcription Factor Binding Sites from Combined DNA Sequence. Int J Mol Sci 2021; 22:ijms22115521. [PMID: 34073774 PMCID: PMC8197256 DOI: 10.3390/ijms22115521] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 03/29/2021] [Revised: 04/29/2021] [Accepted: 05/12/2021] [Indexed: 12/13/2022] Open
Abstract
Predicting in vivo protein-DNA binding sites is a challenging but pressing task in a variety of fields like drug design and development. Most promoters contain a number of transcription factor (TF) binding sites, but only a small minority has been identified by biochemical experiments that are time-consuming and laborious. To tackle this challenge, many computational methods have been proposed to predict TF binding sites from DNA sequence. Although previous methods have achieved remarkable performance in the prediction of protein-DNA interactions, there is still considerable room for improvement. In this paper, we present a hybrid deep learning framework, termed DeepD2V, for transcription factor binding sites prediction. First, we construct the input matrix with an original DNA sequence and its three kinds of variant sequences, including its inverse, complementary, and complementary inverse sequence. A sliding window of size k with a specific stride is used to obtain its k-mer representation of input sequences. Next, we use word2vec to obtain a pre-trained k-mer word distributed representation model. Finally, the probability of protein-DNA binding is predicted by using the recurrent and convolutional neural network. The experiment results on 50 public ChIP-seq benchmark datasets demonstrate the superior performance and robustness of DeepD2V. Moreover, we verify that the performance of DeepD2V using word2vec-based k-mer distributed representation is better than one-hot encoding, and the integrated framework of both convolutional neural network (CNN) and bidirectional LSTM (bi-LSTM) outperforms CNN or the bi-LSTM model when used alone. The source code of DeepD2V is available at the github repository.
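The input construction described in this abstract (the original sequence plus its inverse, complementary, and complementary inverse variants, each tokenized into k-mers by a sliding window) can be sketched as below. This is a minimal sketch rather than the DeepD2V code: the function names and the default stride are assumptions, and the word2vec pre-training and the recurrent and convolutional network are not shown.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def sequence_variants(seq):
    """Return the sequence with its inverse, complementary, and complementary inverse forms."""
    seq = seq.upper()
    inverse = seq[::-1]
    complementary = seq.translate(COMPLEMENT)
    complementary_inverse = complementary[::-1]
    return [seq, inverse, complementary, complementary_inverse]


def kmers(seq, k, stride=1):
    """Slide a window of size k over `seq` with the given stride and collect the k-mers."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def kmer_sentences(seq, k, stride=1):
    """Tokenize all four variants; each token list could serve as a 'sentence' for an embedding model."""
    return [kmers(variant, k, stride) for variant in sequence_variants(seq)]
```

The token lists returned by `kmer_sentences` are in the shape that word-embedding trainers typically expect, which is where a pre-trained k-mer representation would come in.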
11
Abstract
Aims:
A robust and more accurate method for identifying transcription factor binding sites (TFBS) for gene expression.
Background:
Deep neural networks (DNNs) have shown promising growth in solving complex machine learning problems. Conventional techniques are comfortably replaced by DNNs in computer vision, signal processing, healthcare, and genomics. Understanding DNA sequences is always a crucial task in healthcare and regulatory genomics. For DNA motif prediction, choosing the right dataset with a sufficient number of input sequences is crucial in order to design an effective model.
Objective:
Designing a new algorithm that works on different datasets with improved performance for TFBS prediction.
Methods:
With the help of Layerwise Relevance Propagation, the proposed algorithm identifies the invariant features with adaptive noise patterns.
Results:
The performance is compared by calculating various metrics on standard as well as recent methods, and significant improvement is noted.
Conclusion:
By identifying the invariant and robust features in the DNA sequences, the classification performance can be increased.
Affiliation(s)
- Kanu Geete
- Department of Computer Science & Engineering, Maulana Azad National Institute of Technology, Bhopal, India
- Manish Pandey
- Department of Computer Science & Engineering, Maulana Azad National Institute of Technology, Bhopal, India
12
Wang Z, Luan Y, Zhou X, Cui J, Luan F, Meng J. Optimized combination methods for exploring and verifying disease-resistant transcription factors in melon. Brief Bioinform 2020; 22:6019969. [PMID: 33270815 DOI: 10.1093/bib/bbaa326] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2020] [Revised: 10/20/2020] [Accepted: 10/21/2020] [Indexed: 11/14/2022] Open
Abstract
A large amount of omics data and a large number of bioinformatics tools have been produced. However, the methods for further exploring omics data are simple, in particular for mining key regulatory genes, which are a priority concern in biological systems, and most of their specific functions are still unknown. First, raw data of two genotypes of melon (susceptible and resistant) were obtained by transcriptome analysis. Second, 391 transcription factors (TFs) were identified from the plant transcription factor database and cucurbit genomics database. Then, functional enrichment analysis indicated that these genes were mainly annotated in the process of transcription regulation. Third, 243 and 230 module-specific TFs were screened by weighted gene coexpression network analysis and short time series expression miner, respectively. Several TF genes, such as WRKYs and bHLHs, were regarded as key regulatory genes according to the values of significantly different modules. The coexpression network showed that these TF genes were significantly correlated with resistance (R) genes, such as DRP2, RGA3, DRP1 and NB-ARC. Fourth, cis-acting element analysis illustrated that these R genes may bind to WRKY and bHLH. Finally, the expression of WRKY genes was verified by quantitative reverse transcription PCR (RT-qPCR). Phylogenetic analysis was carried out to further confirm that these TFs may play a critical role in Cucurbitaceae disease resistance. This study provides a new optimized combination strategy to explore the functions of TFs in a wide spectrum of biological processes. This strategy may also effectively predict potential relationships in the interactions of essential genes.
Affiliation(s)
- Zhicheng Wang
- School of Bioengineering, Dalian University of Technology
- Yushi Luan
- School of Bioengineering, Dalian University of Technology
- Xiaoxu Zhou
- School of Bioengineering, Dalian University of Technology
- Jun Cui
- School of Bioengineering, Dalian University of Technology
- Feishi Luan
- College of Horticulture and Landscape Architecture, Northeast Agricultural University
- Jun Meng
- School of Computer Science and Technology, Dalian University of Technology
13
Osmala M, Lähdesmäki H. Enhancer prediction in the human genome by probabilistic modelling of the chromatin feature patterns. BMC Bioinformatics 2020; 21:317. [PMID: 32689977 PMCID: PMC7370432 DOI: 10.1186/s12859-020-03621-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/10/2019] [Accepted: 06/19/2020] [Indexed: 12/11/2022] Open
Abstract
Background The binding sites of transcription factors (TFs) and the localisation of histone modifications in the human genome can be quantified by the chromatin immunoprecipitation assay coupled with next-generation sequencing (ChIP-seq). The resulting chromatin feature data has been successfully adopted for genome-wide enhancer identification by several unsupervised and supervised machine learning methods. However, the current methods predict different numbers and different sets of enhancers for the same cell type and do not utilise the pattern of the ChIP-seq coverage profiles efficiently. Results In this work, we propose a PRobabilistic Enhancer PRedictIoN Tool (PREPRINT) that assumes characteristic coverage patterns of chromatin features at enhancers and employs a statistical model to account for their variability. PREPRINT defines probabilistic distance measures to quantify the similarity of the genomic query regions and the characteristic coverage patterns. The probabilistic scores of the enhancer and non-enhancer samples are utilised to train a kernel-based classifier. The performance of the method is demonstrated on ENCODE data for two cell lines. The predicted enhancers are computationally validated based on the transcriptional regulatory protein binding sites and compared to the predictions obtained by state-of-the-art methods. Conclusion PREPRINT performs favorably to the state-of-the-art methods, especially when requiring the methods to predict a larger set of enhancers. PREPRINT generalises successfully to data from cell type not utilised for training, and often the PREPRINT performs better than the previous methods. The PREPRINT enhancers are less sensitive to the choice of prediction threshold. PREPRINT identifies biologically validated enhancers not predicted by the competing methods. The enhancers predicted by PREPRINT can aid the genome interpretation in functional genomics and clinical studies.
Affiliation(s)
- Maria Osmala
- Department of Computer Science, Aalto University, Konemiehentie 2, Espoo, 02150, Finland.
- Harri Lähdesmäki
- Department of Computer Science, Aalto University, Konemiehentie 2, Espoo, 02150, Finland
14
In silico based screening of WRKY genes for identifying functional genes regulated by WRKY under salt stress. Comput Biol Chem 2019; 83:107131. [DOI: 10.1016/j.compbiolchem.2019.107131] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 05/22/2019] [Revised: 08/18/2019] [Accepted: 09/18/2019] [Indexed: 11/21/2022]
15
SDBP-Pred: Prediction of single-stranded and double-stranded DNA-binding proteins by extending consensus sequence and K-segmentation strategies into PSSM. Anal Biochem 2019; 589:113494. [PMID: 31693872 DOI: 10.1016/j.ab.2019.113494] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 01/18/2019] [Revised: 10/24/2019] [Accepted: 10/31/2019] [Indexed: 11/24/2022]
Abstract
Identification of DNA-binding proteins (DNA-BPs) is a hot issue in protein science due to its key role in various biological processes. These processes depend strongly on the type of DNA-binding protein. DNA-BPs are classified into single-stranded DNA-binding proteins (SSBs) and double-stranded DNA-binding proteins (DSBs). SSBs are mainly involved in DNA recombination, replication, and repair, while DSBs regulate the transcription process, DNA cleavage, and chromosome packaging. In spite of this significance, few methods have been proposed for discriminating SSBs from DSBs. Therefore, more predictors with favorable performance are indispensable. In this work, we present an innovative predictor, called SDBP-Pred, with a novel feature descriptor, named consensus sequence-based K-segmentation position-specific scoring matrix (CSKS-PSSM). We encoded the local discriminative features concealed in PSSM via a K-segmentation strategy and the global potential features by applying the notion of the consensus sequence. The obtained feature vector is then input to a support vector machine (SVM) with linear, polynomial and radial basis function (RBF) kernels. Our model with SVM-RBF achieved higher accuracies than the recent method on three tests, namely the jackknife, 10-fold, and independent tests. The obtained prediction results illustrate the superior prediction performance of SDBP-Pred over existing studies in the literature so far.
|
16
|
Zhang Q, Zhu L, Huang DS. High-Order Convolutional Neural Network Architecture for Predicting DNA-Protein Binding Sites. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2019; 16:1184-1192. [PMID: 29993783 DOI: 10.1109/tcbb.2018.2819660] [Citation(s) in RCA: 55] [Impact Index Per Article: 11.0] [Indexed: 05/15/2023]
Abstract
Although deep learning algorithms have outperformed conventional methods in predicting the sequence specificities of DNA-protein binding, they fail to consider the dependencies among nucleotides and the diverse binding lengths of different transcription factors (TFs). To address these two limitations simultaneously, we propose a high-order convolutional neural network architecture (HOCNN), which employs a high-order encoding method to build high-order dependencies among nucleotides, and a multi-scale convolutional layer to capture motif features of different lengths. The experimental results on real ChIP-seq datasets show that the proposed method outperforms the state-of-the-art deep learning method (DeepBind) in the motif discovery task. In addition, we provide further insights into the importance of introducing additional convolutional kernels and the degeneration problem of introducing high-order encoding in the motif discovery task.
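One common way to realise a high-order encoding of this kind is to one-hot encode overlapping k-mers instead of single nucleotides, so that each input row already carries the dependency between adjacent bases. The sketch below assumes that interpretation; the function name `kmer_one_hot` and the handling of ambiguous bases (an all-zero row) are illustrative choices, not taken from the paper.

```python
from itertools import product


def kmer_one_hot(seq, k=2, alphabet="ACGT"):
    """Encode a DNA sequence as one row per overlapping k-mer.

    Each row is a one-hot vector of length len(alphabet) ** k.
    k-mers containing symbols outside the alphabet (e.g. N) are
    encoded as all zeros.
    """
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    rows = []
    for i in range(len(seq) - k + 1):
        row = [0] * len(index)
        pos = index.get(seq[i:i + k].upper())
        if pos is not None:
            row[pos] = 1
        rows.append(row)
    return rows


encoded = kmer_one_hot("ACGT", k=2)  # 3 rows, 16 columns each
```

With k = 1 this reduces to ordinary one-hot encoding; larger k grows the column count as 4^k, which is one reason high orders become costly.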
|
17
|
Zhang Q, Shen Z, Huang DS. Modeling in-vivo protein-DNA binding by combining multiple-instance learning with a hybrid deep neural network. Sci Rep 2019; 9:8484. [PMID: 31186519 PMCID: PMC6559991 DOI: 10.1038/s41598-019-44966-x] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 10/19/2018] [Accepted: 05/15/2019] [Indexed: 01/26/2023] Open
Abstract
Modeling in-vivo protein-DNA binding is not only fundamental for further understanding of regulatory mechanisms but also a challenging task in computational biology. Deep-learning-based methods have succeeded in modeling in-vivo protein-DNA binding, but they often (1) follow the fully supervised learning framework and overlook the weakly supervised information of genomic sequences, namely that a bound DNA sequence may have multiple TFBSs, and (2) use one-hot encoding to encode DNA sequences and ignore the dependencies among nucleotides. In this paper, we propose a weakly supervised framework, which combines multiple-instance learning with a hybrid deep neural network and uses k-mer encoding to transform DNA sequences, for modeling in-vivo protein-DNA binding. First, this framework segments sequences into multiple overlapping instances using a sliding window and then encodes all instances into image-like inputs of high-order dependencies using k-mer encoding. Second, it separately computes a score for each instance in the same bag using a hybrid deep neural network that integrates convolutional and recurrent neural networks. Finally, it integrates the predicted values of all instances as the final prediction of this bag using the Noisy-and method. The experimental results on in-vivo datasets demonstrate the superior performance of the proposed framework. In addition, we explore the performance of the proposed framework when using k-mer encoding, demonstrate the performance of the Noisy-and method by comparing it with other fusion methods, and find that adding recurrent layers can improve the performance of the proposed framework.
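The first and last steps of this pipeline can be sketched in a few lines. The sketch assumes the usual form of Noisy-and pooling, in which the mean instance probability is passed through a shifted, rescaled sigmoid; the slope `a`, the threshold `b`, and the window settings are hypothetical values chosen for illustration, and the per-instance scores would come from the neural network, which is not shown.

```python
import math


def sliding_windows(seq, length, stride):
    """Cut a sequence (the bag) into overlapping instances."""
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, stride)]


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def noisy_and(instance_probs, a=10.0, b=0.5):
    """Pool instance probabilities into one bag probability.

    The output is 0 when every instance scores 0, 1 when every
    instance scores 1, and rises steeply once the mean passes b.
    """
    mean_p = sum(instance_probs) / len(instance_probs)
    low = _sigmoid(-a * b)
    high = _sigmoid(a * (1.0 - b))
    return (_sigmoid(a * (mean_p - b)) - low) / (high - low)


instances = sliding_windows("ACGTACGTAC", length=6, stride=2)
# Placeholder scores standing in for the network's per-instance outputs.
bag_probability = noisy_and([0.9, 0.2, 0.8])
```

Compared with max pooling, this pooling lets several moderately confident instances contribute to the bag score instead of relying on a single window.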
Affiliation(s)
- Qinhu Zhang
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
- Zhen Shen
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
- De-Shuang Huang
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
|
18
|
Zhang H, Zhu L, Huang DS. DiscMLA: An Efficient Discriminative Motif Learning Algorithm over High-Throughput Datasets. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:1810-1820. [PMID: 27164602 DOI: 10.1109/tcbb.2016.2561930] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 06/05/2023]
Abstract
Transcription factors (TFs) can activate or suppress gene expression by binding to specific sites and are hence crucial regulatory elements for transcription. Recently, a series of discriminative motif finders has been tailored to offer a promising strategy for harnessing the power of large quantities of accumulated high-throughput experimental data. However, in order to achieve high speed, these algorithms have to sacrifice accuracy by employing simplified statistical models during the search process. In this paper, we propose a novel approach named Discriminative Motif Learning via AUC (DiscMLA) to discover motifs on high-throughput datasets. Unlike previous approaches, DiscMLA tries to optimize a more comprehensive criterion (AUC) during motif searching. In addition, based on an experimental observation of motif identification on large-scale datasets, some novel procedures are designed to accelerate DiscMLA. The experimental results on 52 real-world datasets demonstrate that our approach substantially outperforms previous methods on discriminative motif learning problems. DiscMLA's stability, discriminability, and validity will help to exploit high-throughput datasets and answer many fundamental biological questions.
|
19
|
Salekin S, Zhang JM, Huang Y. Base-pair resolution detection of transcription factor binding site by deep deconvolutional network. Bioinformatics 2018; 34:3446-3453. [PMID: 29757349 PMCID: PMC6184544 DOI: 10.1093/bioinformatics/bty383] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 08/31/2017] [Revised: 03/05/2018] [Accepted: 05/05/2018] [Indexed: 02/01/2023] Open
Abstract
Motivation: Transcription factors (TFs) bind to the promoter region of a gene to control gene expression. Identifying precise TF binding sites (TFBSs) is essential for understanding the detailed mechanisms of TF-mediated gene regulation. However, there is a shortage of computational approaches that can deliver single-base-pair resolution prediction of TFBSs. Results: In this paper, we propose DeepSNR, a deep learning algorithm for predicting TF binding location at single nucleotide resolution de novo from DNA sequence. DeepSNR adopts a novel deconvolutional network (deconvNet) model and is inspired by the similarity to image segmentation by deconvNet. The proposed deconvNet architecture is constructed on top of 'DeepBind', and we trained the entire model using TF-specific data from ChIP-exonuclease (ChIP-exo) experiments. DeepSNR has been shown to outperform motif search-based methods on several evaluation metrics. We have also demonstrated the usefulness of DeepSNR in the regulatory analysis of TFBSs as well as in improving TFBS prediction specificity using ChIP-seq data. Availability and implementation: DeepSNR is available open source in the GitHub repository (https://github.com/sirajulsalekin/DeepSNR). Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sirajul Salekin
- Electrical and Computer Engineering Department, University of Texas at San Antonio, San Antonio, TX, USA
- Jianqiu Michelle Zhang
- Electrical and Computer Engineering Department, University of Texas at San Antonio, San Antonio, TX, USA
- Yufei Huang
- Electrical and Computer Engineering Department, University of Texas at San Antonio, San Antonio, TX, USA
- Department of Epidemiology and Biostatistics, University of Texas Health Science Center, San Antonio, TX, USA
|
20
|
Khamis AM, Motwalli O, Oliva R, Jankovic BR, Medvedeva YA, Ashoor H, Essack M, Gao X, Bajic VB. A novel method for improved accuracy of transcription factor binding site prediction. Nucleic Acids Res 2018; 46:e72. [PMID: 29617876 PMCID: PMC6037060 DOI: 10.1093/nar/gky237] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 06/19/2017] [Revised: 03/01/2018] [Accepted: 03/20/2018] [Indexed: 12/12/2022] Open
Abstract
Identifying transcription factor (TF) binding sites (TFBSs) is important in the computational inference of gene regulation. Widely used computational methods of TFBS prediction based on position weight matrices (PWMs) usually have high false positive rates. Moreover, computational studies of transcription regulation in eukaryotes frequently require numerous PWM models of TFBSs because of the large number of TFs involved. To overcome these problems we developed DRAF, a novel method for TFBS prediction that requires only 14 prediction models for 232 human TFs while significantly improving prediction accuracy. DRAF models use more features than PWM models, as they combine information from TFBS sequences and physicochemical properties of TF DNA-binding domains into machine learning models. Evaluation of DRAF on 98 human ChIP-seq datasets shows on average 1.54-, 1.96- and 5.19-fold reduction of false positives at the same sensitivities compared to models from HOCOMOCO, TRANSFAC and DeepBind, respectively. This observation suggests that one can efficiently replace the PWM models for TFBS prediction with a small number of DRAF models that significantly improve prediction accuracy. The DRAF method is implemented in a web tool and as stand-alone software freely available at http://cbrc.kaust.edu.sa/DRAF.
Affiliation(s)
- Abdullah M Khamis
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
- Olaa Motwalli
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
- Romina Oliva
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
- Department of Sciences and Technologies, University ‘Parthenope’ of Naples, Centro Direzionale Isola C4 80143, Naples, Italy
- Boris R Jankovic
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
- Yulia A Medvedeva
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
- Institute of Bioengineering, Research Centre of Biotechnology, Russian Academy of Science, 117312 Moscow, Russia
- Department of Computational Biology, Vavilov Institute of General Genetics, Russian Academy of Science, 119991 Moscow, Russia
- Department of Biological and Medical Physics, Moscow Institute of Physics and Technology, 141701, Dolgoprudny, Moscow Region, Russia
- Haitham Ashoor
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
- Magbubah Essack
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
- Xin Gao
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
- Vladimir B Bajic
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
|
21
|
Pagerols M, Richarte V, Sánchez-Mora C, Rovira P, Soler Artigas M, Garcia-Martínez I, Calvo-Sánchez E, Corrales M, da Silva BS, Mota NR, Victor MM, Rohde LA, Grevet EH, Bau CHD, Cormand B, Casas M, Ramos-Quiroga JA, Ribasés M. Integrative genomic analysis of methylphenidate response in attention-deficit/hyperactivity disorder. Sci Rep 2018; 8:1881. [PMID: 29382897 PMCID: PMC5789875 DOI: 10.1038/s41598-018-20194-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 05/31/2017] [Accepted: 01/15/2018] [Indexed: 12/24/2022] Open
Abstract
Methylphenidate (MPH) is the most frequently used pharmacological treatment in children with attention-deficit/hyperactivity disorder (ADHD). However, a considerable interindividual variability exists in clinical outcome. Thus, we performed a genome-wide association study of MPH efficacy in 173 ADHD paediatric patients. Although no variant reached genome-wide significance, the set of genes containing single-nucleotide polymorphisms (SNPs) nominally associated with MPH response (P < 0.05) was significantly enriched for candidates previously studied in ADHD or treatment outcome. We prioritised the nominally significant SNPs by functional annotation and expression quantitative trait loci (eQTL) analysis in human brain, and we identified 33 SNPs tagging cis-eQTL in 32 different loci (referred to as eSNPs and eGenes, respectively). Pathway enrichment analyses revealed an over-representation of genes involved in nervous system development and function among the eGenes. Categories related to neurological diseases, psychological disorders and behaviour were also significantly enriched. We subsequently meta-analysed the association with clinical outcome for the 33 eSNPs across the discovery sample and an independent cohort of 189 ADHD adult patients (target sample) and we detected 15 suggestive signals. Following this comprehensive strategy, our results provide a better understanding of the molecular mechanisms implicated in MPH treatment effects and suggest promising candidates that may encourage future studies.
Affiliation(s)
- Mireia Pagerols
- Psychiatric Genetics Unit, Group of Psychiatry, Mental Health and Addiction, Vall d'Hebron Research Institute (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
- Department of Psychiatry, Hospital Universitari Vall d'Hebron, Barcelona, Spain
- Vanesa Richarte
- Department of Psychiatry, Hospital Universitari Vall d'Hebron, Barcelona, Spain
- Biomedical Network Research Centre on Mental Health (CIBERSAM), Instituto de Salud Carlos III, Barcelona, Spain
- Department of Psychiatry and Legal Medicine, Universitat Autònoma de Barcelona, Barcelona, Spain
- Cristina Sánchez-Mora
- Psychiatric Genetics Unit, Group of Psychiatry, Mental Health and Addiction, Vall d'Hebron Research Institute (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
- Department of Psychiatry, Hospital Universitari Vall d'Hebron, Barcelona, Spain
- Biomedical Network Research Centre on Mental Health (CIBERSAM), Instituto de Salud Carlos III, Barcelona, Spain
- Paula Rovira
- Psychiatric Genetics Unit, Group of Psychiatry, Mental Health and Addiction, Vall d'Hebron Research Institute (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
- Department of Psychiatry, Hospital Universitari Vall d'Hebron, Barcelona, Spain
- María Soler Artigas
- Psychiatric Genetics Unit, Group of Psychiatry, Mental Health and Addiction, Vall d'Hebron Research Institute (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
- Biomedical Network Research Centre on Mental Health (CIBERSAM), Instituto de Salud Carlos III, Barcelona, Spain
- Iris Garcia-Martínez
- Psychiatric Genetics Unit, Group of Psychiatry, Mental Health and Addiction, Vall d'Hebron Research Institute (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
- Department of Psychiatry, Hospital Universitari Vall d'Hebron, Barcelona, Spain
- Eva Calvo-Sánchez
- Psychiatric Genetics Unit, Group of Psychiatry, Mental Health and Addiction, Vall d'Hebron Research Institute (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
- Department of Psychiatry, Hospital Universitari Vall d'Hebron, Barcelona, Spain
- Montse Corrales
- Department of Psychiatry, Hospital Universitari Vall d'Hebron, Barcelona, Spain
- Department of Psychiatry and Legal Medicine, Universitat Autònoma de Barcelona, Barcelona, Spain
- Bruna Santos da Silva
- Department of Genetics, Institute of Biosciences, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
- Nina Roth Mota
- Department of Human Genetics and Psychiatry, Donders Institute for Brain, Cognition and Behaviour, Radboud University Medical Centre, Nijmegen, The Netherlands
- ADHD Outpatient Program, Adult Division, Hospital de Clínicas de Porto Alegre, Porto Alegre, Brazil
- Marcelo Moraes Victor
- ADHD Outpatient Program, Adult Division, Hospital de Clínicas de Porto Alegre, Porto Alegre, Brazil
- Luis Augusto Rohde
- ADHD Outpatient Program, Adult Division, Hospital de Clínicas de Porto Alegre, Porto Alegre, Brazil
- Department of Psychiatry, Faculty of Medicine, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
- Eugenio Horacio Grevet
- ADHD Outpatient Program, Adult Division, Hospital de Clínicas de Porto Alegre, Porto Alegre, Brazil
- Department of Psychiatry, Faculty of Medicine, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
- Claiton Henrique Dotto Bau
- Department of Genetics, Institute of Biosciences, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
- ADHD Outpatient Program, Adult Division, Hospital de Clínicas de Porto Alegre, Porto Alegre, Brazil
- Bru Cormand
- Departament de Genètica, Microbiologia i Estadística, Facultat de Biologia, Universitat de Barcelona, Barcelona, Spain
- Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER), Instituto de Salud Carlos III, Barcelona, Spain
- Institut de Biomedicina de la Universitat de Barcelona (IBUB), Barcelona, Spain
- Institut de Recerca Sant Joan de Déu (IR-SJD), Esplugues de Llobregat, Spain
- Miguel Casas
- Psychiatric Genetics Unit, Group of Psychiatry, Mental Health and Addiction, Vall d'Hebron Research Institute (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
- Department of Psychiatry, Hospital Universitari Vall d'Hebron, Barcelona, Spain
- Biomedical Network Research Centre on Mental Health (CIBERSAM), Instituto de Salud Carlos III, Barcelona, Spain
- Department of Psychiatry and Legal Medicine, Universitat Autònoma de Barcelona, Barcelona, Spain
- Josep Antoni Ramos-Quiroga
- Psychiatric Genetics Unit, Group of Psychiatry, Mental Health and Addiction, Vall d'Hebron Research Institute (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
- Department of Psychiatry, Hospital Universitari Vall d'Hebron, Barcelona, Spain
- Biomedical Network Research Centre on Mental Health (CIBERSAM), Instituto de Salud Carlos III, Barcelona, Spain
- Department of Psychiatry and Legal Medicine, Universitat Autònoma de Barcelona, Barcelona, Spain
- Marta Ribasés
- Psychiatric Genetics Unit, Group of Psychiatry, Mental Health and Addiction, Vall d'Hebron Research Institute (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
- Department of Psychiatry, Hospital Universitari Vall d'Hebron, Barcelona, Spain
- Biomedical Network Research Centre on Mental Health (CIBERSAM), Instituto de Salud Carlos III, Barcelona, Spain
|
22
|
Zhang H, Zhu L, Huang DS. WSMD: weakly-supervised motif discovery in transcription factor ChIP-seq data. Sci Rep 2017; 7:3217. [PMID: 28607381 PMCID: PMC5468353 DOI: 10.1038/s41598-017-03554-7] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 12/15/2016] [Accepted: 05/02/2017] [Indexed: 01/24/2023] Open
Abstract
Although discriminative motif discovery (DMD) methods are promising for eliciting motifs from high-throughput experimental data, computational expense forces most existing DMD methods to adopt approximate schemes that greatly restrict the search space, leading to a significant loss of predictive accuracy. In this paper, we propose Weakly-Supervised Motif Discovery (WSMD) to discover motifs from ChIP-seq datasets. In contrast to the learning strategies adopted by previous DMD methods, WSMD allows a "global" optimization scheme of the motif parameters in continuous space, thereby reducing the information loss of model representation and improving the quality of the resulting motifs. Meanwhile, by exploiting the connection between the DMD framework and existing weakly supervised learning (WSL) technologies, we also present highly scalable learning strategies for the proposed method. The experimental results on both real ChIP-seq datasets and synthetic datasets show that WSMD substantially outperforms former DMD methods (including DREME, HOMER, XXmotif, motifRG and DECOD) in terms of predictive accuracy, while also achieving a competitive computational speed.
Affiliation(s)
- Hongbo Zhang
- Institute of Machine Learning and Systems Biology, College of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
- Lin Zhu
- Institute of Machine Learning and Systems Biology, College of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
- De-Shuang Huang
- Institute of Machine Learning and Systems Biology, College of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
|
23
|
Liu W, Zhu W, Liao B, Chen H, Ren S, Cai L. Improving gene regulatory network structure using redundancy reduction in the MRNET algorithm. RSC Adv 2017. [DOI: 10.1039/c7ra01557g] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 11/21/2022] Open
Abstract
Inferring gene regulatory networks from expression data is a central problem in systems biology.
Affiliation(s)
- Wei Liu
- College of Information Science and Engineering, Hunan University, Changsha, China
- Wen Zhu
- College of Information Science and Engineering, Hunan University, Changsha, China
- Bo Liao
- College of Information Science and Engineering, Hunan University, Changsha, China
- Haowen Chen
- College of Information Science and Engineering, Hunan University, Changsha, China
- Siqi Ren
- College of Information Science and Engineering, Hunan University, Changsha, China
- Lijun Cai
- College of Information Science and Engineering, Hunan University, Changsha, China
|
24
|
Jayaram N, Usvyat D, Martin ACR. Evaluating tools for transcription factor binding site prediction. BMC Bioinformatics 2016; 17:547. [PMID: 27806697 PMCID: PMC6889335 DOI: 10.1186/s12859-016-1298-9] [Citation(s) in RCA: 56] [Impact Index Per Article: 7.0] [Received: 06/19/2016] [Accepted: 10/20/2016] [Indexed: 12/21/2022] Open
Abstract
Background: Binding of transcription factors to transcription factor binding sites (TFBSs) is key to the mediation of transcriptional regulation. Information on experimentally validated functional TFBSs is limited, and consequently there is a need for accurate prediction of TFBSs for gene annotation and in applications such as evaluating the effects of single nucleotide variations in causing disease. TFBSs are generally recognized by scanning a position weight matrix (PWM) against DNA using one of a number of available computer programs. Thus we set out to evaluate the best tools that can be used locally (and are therefore suitable for large-scale analyses) for creating PWMs from high-throughput ChIP-Seq data and for scanning them against DNA. Results: We evaluated a set of de novo motif discovery tools that could be downloaded and installed locally using ENCODE-ChIP-Seq data and showed that rGADEM was the best-performing tool. TFBS prediction tools used to scan PWMs against DNA fall into two classes: those that predict individual TFBSs and those that identify clusters. Our evaluation showed that FIMO and MCAST performed best in these two classes, respectively. Conclusions: Selection of the best-performing tools for generating PWMs from ChIP-Seq data and for scanning PWMs against DNA has the potential to improve prediction of precise transcription factor binding sites within regions identified by ChIP-Seq experiments for gene finding, understanding regulation and in evaluating the effects of single nucleotide variations in causing disease. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1298-9) contains supplementary material, which is available to authorized users.
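Scanning a PWM against DNA, the operation the evaluated tools have in common, can be illustrated with a short sketch. It builds a log-odds matrix from a few aligned sites and scores every window of a sequence. The pseudocount, the uniform background of 0.25, and the function names `build_pwm` and `scan` are assumptions for illustration; the sketch does not reproduce the scoring or p-value calculations of FIMO, MCAST, or rGADEM.

```python
import math


def build_pwm(sites, alphabet="ACGT", pseudocount=1.0, background=0.25):
    """Build a log2-odds position weight matrix from aligned sites.

    Returns one dict per motif position mapping base -> score.
    """
    pwm = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in alphabet}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: math.log2(counts[base] / total / background)
                    for base in alphabet})
    return pwm


def scan(seq, pwm):
    """Score every window of the sequence; returns (start, score) pairs."""
    width = len(pwm)
    return [(i, sum(pwm[j][seq[i + j]] for j in range(width)))
            for i in range(len(seq) - width + 1)]


# Toy motif and sequence, not real binding data.
pwm = build_pwm(["ACG", "ACG", "ACG"])
hits = scan("TTACGTT", pwm)
best_start, best_score = max(hits, key=lambda hit: hit[1])
```

A practical scanner would also score the reverse strand and convert scores to p-values against a background model before calling a site.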
Affiliation(s)
- Narayan Jayaram
- Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Darwin Building, Gower Street, London, WC1E 6BT, UK
- Daniel Usvyat
- Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Darwin Building, Gower Street, London, WC1E 6BT, UK
- Andrew C R Martin
- Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Darwin Building, Gower Street, London, WC1E 6BT, UK
|
25
|
Guillen-Ahlers H, Rao PK, Levenstein ME, Kennedy-Darling J, Perumalla DS, Jadhav AYL, Glenn JP, Ludwig-Kubinski A, Drigalenko E, Montoya MJ, Göring HH, Anderson CD, Scalf M, Gildersleeve HIS, Cole R, Greene AM, Oduro AK, Lazarova K, Cesnik AJ, Barfknecht J, Cirillo LA, Gasch AP, Shortreed MR, Smith LM, Olivier M. HyCCAPP as a tool to characterize promoter DNA-protein interactions in Saccharomyces cerevisiae. Genomics 2016; 107:267-73. [PMID: 27184763 DOI: 10.1016/j.ygeno.2016.05.002] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 04/05/2016] [Revised: 05/11/2016] [Accepted: 05/12/2016] [Indexed: 11/30/2022]
Abstract
Currently available methods for interrogating DNA-protein interactions at individual genomic loci have significant limitations and make it difficult to work with unmodified cells or examine single-copy regions without specific antibodies. In this study, we describe a physiological application of the Hybridization Capture of Chromatin-Associated Proteins for Proteomics (HyCCAPP) methodology we have developed. Both novel and known locus-specific DNA-protein interactions were identified at the ENO2 and GAL1 promoter regions of Saccharomyces cerevisiae, revealing subgroups of proteins present at significantly different levels at the loci in cells grown on glucose versus galactose as the carbon source. Results were validated using chromatin immunoprecipitation. Overall, our analysis demonstrates that HyCCAPP is an effective and flexible technology that requires neither specific antibodies nor prior knowledge of locally occurring DNA-protein interactions and can now be used to identify changes in protein interactions at target regions in the genome in response to physiological challenges.
Affiliation(s)
- Hector Guillen-Ahlers
- Department of Genetics, Texas Biomedical Research Institute, San Antonio, TX 78227, USA
- Biotechnology and Bioengineering Center, Medical College of Wisconsin, Milwaukee, WI 53226, USA
- Prahlad K Rao
- Department of Genetics, Texas Biomedical Research Institute, San Antonio, TX 78227, USA
- Mark E Levenstein
- Department of Chemistry, University of Wisconsin, Madison, WI 53706, USA
- Danu S Perumalla
- Department of Genetics, Texas Biomedical Research Institute, San Antonio, TX 78227, USA
- Avinash Y L Jadhav
- Department of Genetics, Texas Biomedical Research Institute, San Antonio, TX 78227, USA
- Jeremy P Glenn
- Department of Genetics, Texas Biomedical Research Institute, San Antonio, TX 78227, USA
- Amy Ludwig-Kubinski
- Biotechnology and Bioengineering Center, Medical College of Wisconsin, Milwaukee, WI 53226, USA
- Eugene Drigalenko
- Department of Genetics, Texas Biomedical Research Institute, San Antonio, TX 78227, USA
- Maria J Montoya
- Department of Genetics, Texas Biomedical Research Institute, San Antonio, TX 78227, USA
- Harald H Göring
- Department of Genetics, Texas Biomedical Research Institute, San Antonio, TX 78227, USA
- Corianna D Anderson
- Biotechnology and Bioengineering Center, Medical College of Wisconsin, Milwaukee, WI 53226, USA
- Mark Scalf
- Department of Chemistry, University of Wisconsin, Madison, WI 53706, USA
- Regina Cole
- Biotechnology and Bioengineering Center, Medical College of Wisconsin, Milwaukee, WI 53226, USA
- Alexandra M Greene
- Biotechnology and Bioengineering Center, Medical College of Wisconsin, Milwaukee, WI 53226, USA
- Akua K Oduro
- Department of Cell Biology, Medical College of Wisconsin, Milwaukee, WI 53226, USA
- Katarina Lazarova
- Biotechnology and Bioengineering Center, Medical College of Wisconsin, Milwaukee, WI 53226, USA
- Anthony J Cesnik
- Department of Chemistry, University of Wisconsin, Madison, WI 53706, USA
- Jared Barfknecht
- Biotechnology and Bioengineering Center, Medical College of Wisconsin, Milwaukee, WI 53226, USA
- Lisa A Cirillo
- Department of Cell Biology, Medical College of Wisconsin, Milwaukee, WI 53226, USA
- Audrey P Gasch
- Department of Genetics, University of Wisconsin, Madison, WI 53706, USA
- Lloyd M Smith
- Department of Chemistry, University of Wisconsin, Madison, WI 53706, USA
- Michael Olivier
- Department of Genetics, Texas Biomedical Research Institute, San Antonio, TX 78227, USA
- Biotechnology and Bioengineering Center, Medical College of Wisconsin, Milwaukee, WI 53226, USA
|
26
|
Syeda SS, Rice D, Hook DJ, Heckert LL, Georg GI. Synthesis of Arylazide- and Diazirine-Containing CrAsH-EDT2 Photoaffinity Probes. Arch Pharm (Weinheim) 2016; 349:233-41. [PMID: 26948688 PMCID: PMC5069617 DOI: 10.1002/ardp.201500440] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 12/16/2015] [Revised: 02/03/2016] [Accepted: 02/10/2016] [Indexed: 11/08/2022]
Abstract
Two photo-crosslinking biarsenical (CrAsH-EDT2)-modified probes were synthesized that are expected to be useful tools for tetracysteine-labeled proteins to facilitate the co-affinity purification of their DNA binding sequences and interacting proteins. In addition, improvements for the synthesis of CrAsH-EDT2 and N(1)-(4-azido-2-nitrophenyl)hexane-1,6-diamine are reported. Both photoprobes effectively entered HeLa cells (and the nucleus) and were dependent on the tetracysteine motif in recombinant DMRT1 (doublesex and Mab3-related transcription factor) to induce fluorescence, suggesting that their crosslinking abilities can be exploited for the identification of nucleic acids and proteins associated with a protein of interest.
Affiliation(s)
- Shameem S Syeda
- Department of Medicinal Chemistry and Institute for Therapeutics Discovery and Development, University of Minnesota, Minneapolis, MN, USA; Interdisciplinary Center for Male Contraceptive Research and Drug Development, University of Kansas Medical Center, Kansas City, KS, USA
| | - Daren Rice
- Interdisciplinary Center for Male Contraceptive Research and Drug Development, University of Kansas Medical Center, Kansas City, KS, USA; Department of Molecular and Integrative Physiology, University of Kansas Medical Center, Kansas City, KS, USA
| | - Derek J Hook
- Department of Medicinal Chemistry and Institute for Therapeutics Discovery and Development, University of Minnesota, Minneapolis, MN, USA; Interdisciplinary Center for Male Contraceptive Research and Drug Development, University of Kansas Medical Center, Kansas City, KS, USA
| | - Leslie L Heckert
- Interdisciplinary Center for Male Contraceptive Research and Drug Development, University of Kansas Medical Center, Kansas City, KS, USA; Department of Molecular and Integrative Physiology, University of Kansas Medical Center, Kansas City, KS, USA
| | - Gunda I Georg
- Department of Medicinal Chemistry and Institute for Therapeutics Discovery and Development, University of Minnesota, Minneapolis, MN, USA; Interdisciplinary Center for Male Contraceptive Research and Drug Development, University of Kansas Medical Center, Kansas City, KS, USA
| |
|
27
|
Nettling M, Treutler H, Grau J, Keilwagen J, Posch S, Grosse I. DiffLogo: a comparative visualization of sequence motifs. BMC Bioinformatics 2015; 16:387. [PMID: 26577052 PMCID: PMC4650857 DOI: 10.1186/s12859-015-0767-x] [Citation(s) in RCA: 50] [Impact Index Per Article: 5.6] [Received: 04/10/2015] [Accepted: 10/08/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: For three decades, sequence logos have been the de facto standard for the visualization of sequence motifs in biology and bioinformatics. Reasons for this success story are their simplicity and clarity. The number of inferred and published motifs grows with the number of data sets and motif extraction algorithms. Hence, it becomes more and more important to perceive differences between motifs. However, motif differences are hard to detect from individual sequence logos in the case of multiple motifs for one transcription factor, highly similar binding motifs of different transcription factors, or multiple motifs for one protein domain. RESULTS: Here, we present DiffLogo, a freely available, extensible, and user-friendly R package for visualizing motif differences. DiffLogo is capable of showing differences between DNA motifs as well as protein motifs in a pair-wise manner, resulting in publication-ready figures. In the case of more than two motifs, DiffLogo is capable of visualizing pair-wise differences in a tabular form. Here, the motifs are ordered by similarity, and the difference logos are colored for clarity. We demonstrate the benefit of DiffLogo on CTCF motifs from different human cell lines, on E-box motifs of three basic helix-loop-helix transcription factors as examples for comparison of DNA motifs, and on F-box domains from three different families as an example for comparison of protein motifs. CONCLUSIONS: DiffLogo provides an intuitive visualization of motif differences. It enables the illustration and investigation of differences between highly similar motifs such as binding patterns of transcription factors for different cell types, treatments, and algorithmic approaches.
Affiliation(s)
- Martin Nettling
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany.
| | - Hendrik Treutler
- Leibniz Institute of Plant Biochemistry, Halle (Saale), Germany.
| | - Jan Grau
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany.
| | - Jens Keilwagen
- Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI), Federal Research Centre for Cultivated Plants, Quedlinburg, Germany.
| | - Stefan Posch
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany.
| | - Ivo Grosse
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany.
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany.
| |
|
28
|
Pundhir S, Poirazi P, Gorodkin J. Emerging applications of read profiles towards the functional annotation of the genome. Front Genet 2015; 6:188. [PMID: 26042150 PMCID: PMC4437211 DOI: 10.3389/fgene.2015.00188] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 03/31/2015] [Accepted: 05/06/2015] [Indexed: 12/21/2022] Open
Abstract
Functional annotation of the genome is important to understand the phenotypic complexity of various species. The road toward functional annotation involves several challenges ranging from experiments on individual molecules to large-scale analysis of high-throughput sequencing (HTS) data. HTS data is typically a result of the protocol designed to address specific research questions. The sequencing results in reads, which, when mapped to a reference genome, often lead to the formation of distinct patterns (read profiles). Interpretation of these read profiles is essential for their analysis in relation to the research question addressed. Several strategies have been employed at varying levels of abstraction, ranging from a somewhat ad hoc to a more systematic analysis of read profiles. These include methods which can compare read profiles, e.g., from direct (non-sequence based) alignments to classification of patterns into functional groups. In this review, we highlight the emerging applications of read profiles for the annotation of non-coding RNA and cis-regulatory elements (CREs) such as enhancers and promoters. We also discuss the biological rationale behind their formation.
Affiliation(s)
- Sachin Pundhir
- Center for non-coding RNA in Technology and Health, Department of Veterinary Clinical and Animal Sciences (IKVH), University of Copenhagen, Frederiksberg C, Denmark
| | - Panayiota Poirazi
- Computational Biology Lab, Institute of Molecular Biology and Biotechnology, Foundation for Research and Technology-Hellas, Heraklion, Greece
| | - Jan Gorodkin
- Center for non-coding RNA in Technology and Health, Department of Veterinary Clinical and Animal Sciences (IKVH), University of Copenhagen, Frederiksberg C, Denmark
| |
|
29
|
Smita S, Katiyar A, Chinnusamy V, Pandey DM, Bansal KC. Transcriptional Regulatory Network Analysis of MYB Transcription Factor Family Genes in Rice. Front Plant Sci 2015; 6:1157. [PMID: 26734052 PMCID: PMC4689866 DOI: 10.3389/fpls.2015.01157] [Citation(s) in RCA: 53] [Impact Index Per Article: 5.9] [Received: 09/30/2015] [Accepted: 12/07/2015] [Indexed: 05/18/2023]
Abstract
The MYB transcription factor (TF) family is one of the largest TF families and regulates defense responses to various stresses, hormone signaling, and many metabolic and developmental processes in plants. Understanding these regulatory hierarchies of gene expression networks in response to developmental and environmental cues is a major challenge due to the complex interactions between the genetic elements. Correlation analyses are useful to unravel co-regulated gene pairs governing biological processes as well as to identify new candidate hub genes in response to these complex processes. High-throughput expression profiling data are highly useful for the construction of co-expression networks. In the present study, we utilized transcriptome data for comprehensive regulatory network studies of MYB TFs by "top-down" and "guide-gene" approaches. More than 50% of OsMYBs were strongly correlated under 50 experimental conditions with 51 hub genes via the "top-down" approach. Further, clusters were identified using Markov Clustering (MCL). To maximize the clustering performance, parameter evaluation of the MCL inflation score (I) was performed in terms of enriched GO categories by measuring F-score. Comparison of co-expressed clusters and clades from phylogenetic analysis signifies their evolutionarily conserved co-regulatory role. We utilized a compendium of known interactions and biological roles with Gene Ontology enrichment analysis to hypothesize the function of co-expressed OsMYBs. In the other part, the transcriptional regulatory network analysis by the "guide-gene" approach revealed 40 putative targets of 26 OsMYB TF hubs with high correlation values utilizing 815 microarray data. The enrichment of MYB-binding cis-elements in the promoter regions of the putative targets, their functional co-occurrence, and their nuclear localization support our finding. In particular, the enrichment of MYB-binding regions involved in drought inducibility implies their regulatory role in drought response in rice. Thus, the co-regulatory network analysis facilitated the identification of complex OsMYB regulatory networks, and candidate target regulon genes of selected guide MYB genes. The results contribute to candidate gene screening and to experimentally testable hypotheses for potential regulatory MYB TFs and their targets under stress conditions.
Affiliation(s)
- Shuchi Smita
- ICAR-National Bureau of Plant Genetic Resources, Indian Agricultural Research Institute, New Delhi, India
- Department of Biotechnology, Birla Institute of Technology, Mesra, Ranchi, India
| | - Amit Katiyar
- ICAR-National Bureau of Plant Genetic Resources, Indian Agricultural Research Institute, New Delhi, India
- Department of Biotechnology, Birla Institute of Technology, Mesra, Ranchi, India
| | - Viswanathan Chinnusamy
- Division of Plant Physiology, ICAR-Indian Agricultural Research Institute, New Delhi, India
| | - Dev M. Pandey
- Department of Biotechnology, Birla Institute of Technology, Mesra, Ranchi, India
| | - Kailash C. Bansal
- ICAR-National Bureau of Plant Genetic Resources, Indian Agricultural Research Institute, New Delhi, India
- *Correspondence: Kailash C. Bansal
| |
|
30
|
Nie W, Gu J, Wang Z, Li D, Guan X. The regulatory loop of COMP1 and HNF-4-miR-150-p27 in various signaling pathways. Oncol Lett 2014; 9:195-200. [PMID: 25435958 PMCID: PMC4247106 DOI: 10.3892/ol.2014.2643] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2014] [Accepted: 09/30/2014] [Indexed: 11/06/2022] Open
Abstract
MicroRNAs (miRNAs) are short regulatory RNAs that negatively modulate protein expression at the post-transcriptional level. Additionally, they have been associated with the pathogenesis of a number of types of cancer. In the current study, two target sites for miR-150 were determined within the 3'-untranslated region of p27Kip1 (hereafter referred to as p27) mRNA, and it was determined that ectopic overexpression of miR-150 led directly to p27 downregulation in cancer cells. These findings indicate that miR-150 may be a novel regulator of p27 expression. In the databases of the University of California, Santa Cruz (UCSC) and Match online, two common transcription factors were identified for miR-150 and p27: Cooperates with myogenic proteins 1 (COMP1) and hepatocyte nuclear factor-4 (HNF-4). Using the Database for Annotation, Visualization, and Integrated Discovery (DAVID), it was determined that p27 is involved in pathways regulated by the target genes of miR-150. Therefore, these results suggest that there may be a regulatory loop between COMP1 and HNF-4-miR-150-p27. Additional functional studies are required to understand the molecular basis for the formation of this circuit loop, and provide an insight into the development of innovative therapies targeting specific tumor markers.
Affiliation(s)
- Weiwei Nie
- Department of Medical Oncology, Jinling Hospital, Southern Medical University, Guangzhou, Guangdong 510282, P.R. China
| | - Jun Gu
- Department of Extramammary, Jinling Hospital, Medical School of Nanjing University, Nanjing, Jiangsu 210002, P.R. China
| | - Zexing Wang
- Department of Medical Oncology, Jinling Hospital, Medical School of Nanjing University, Nanjing, Jiangsu 210002, P.R. China
| | - Donghai Li
- Jiangsu Engineering Research Center for MicroRNA Biology and Biotechnology, State Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, Nanjing, Jiangsu 210093, P.R. China
| | - Xiaoxiang Guan
- Department of Medical Oncology, Jinling Hospital, Southern Medical University, Guangzhou, Guangdong 510282, P.R. China; Department of Medical Oncology, Jinling Hospital, Medical School of Nanjing University, Nanjing, Jiangsu 210002, P.R. China
| |
|
31
|
Guerrero-Bosagna C, Weeks S, Skinner MK. Identification of genomic features in environmentally induced epigenetic transgenerational inherited sperm epimutations. PLoS One 2014; 9:e100194. [PMID: 24937757 PMCID: PMC4061094 DOI: 10.1371/journal.pone.0100194] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.9] [Received: 01/02/2014] [Accepted: 05/22/2014] [Indexed: 11/19/2022] Open
Abstract
A variety of environmental toxicants have been shown to induce the epigenetic transgenerational inheritance of disease and phenotypic variation. The process involves exposure of a gestating female and the developing fetus to environmental factors that promote permanent alterations in the epigenetic programming of the germline. The molecular aspects of the phenomenon involve epigenetic modifications (epimutations) in the germline (e.g. sperm) that are transmitted to subsequent generations. The current study integrates previously described experimental epigenomic transgenerational data and web-based bioinformatic analyses to identify genomic features associated with these transgenerationally transmitted epimutations. A previously identified genomic feature associated with these epimutations is a low CpG density (<12/100bp). The current observations suggest the transgenerational differential DNA methylation regions (DMR) in sperm contain unique consensus DNA sequence motifs, zinc finger motifs and G-quadruplex sequences. Interaction of molecular factors with these sequences could alter chromatin structure and accessibility of proteins with DNA methyltransferases to alter de novo DNA methylation patterns. G-quadruplex regions can promote the opening of the chromatin that may influence the action of DNA methyltransferases, or factors interacting with them, for the establishment of epigenetic marks. Zinc finger binding factors can also promote this chromatin remodeling and influence the expression of non-coding RNA. The current study identified genomic features associated with sperm epimutations that may explain in part how these sites become susceptible for transgenerational programming.
Affiliation(s)
- Carlos Guerrero-Bosagna
- Center for Reproductive Biology, School of Biological Sciences, Washington State University, Pullman, Washington, United States of America
- Department of Physics, Biology and Chemistry, Linköping University, Linköping, Sweden
| | - Shelby Weeks
- Center for Reproductive Biology, School of Biological Sciences, Washington State University, Pullman, Washington, United States of America
| | - Michael K. Skinner
- Center for Reproductive Biology, School of Biological Sciences, Washington State University, Pullman, Washington, United States of America
| |
|
32
|
Chuang TJ, Chiang TW. Impacts of pretranscriptional DNA methylation, transcriptional transcription factor, and posttranscriptional microRNA regulations on protein evolutionary rate. Genome Biol Evol 2014; 6:1530-41. [PMID: 24923326 PMCID: PMC4080426 DOI: 10.1093/gbe/evu124] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Indexed: 12/24/2022] Open
Abstract
Gene expression is largely regulated by DNA methylation, transcription factors (TFs), and microRNAs (miRNAs) before, during, and after transcription, respectively. Although the evolutionary effects of TF/miRNA regulation have been widely studied, evolutionary analyses that simultaneously account for DNA methylation, TF, and miRNA regulation, and the question of whether promoter methylation and gene body (coding region) methylation have different effects on the rate of gene evolution, remain uninvestigated. Here, we compared human–macaque and human–mouse protein evolutionary rates against experimentally determined single base-resolution DNA methylation data, revealing that promoter methylation level is positively correlated with protein evolutionary rates but negatively correlated with TF/miRNA regulations, whereas the opposite was observed for gene body methylation level. Our results showed that the relative importance of these regulatory factors in determining the rate of mammalian protein evolution is as follows: promoter methylation ≈ miRNA regulation > gene body methylation > TF regulation, and further indicated that promoter methylation and miRNA regulation have a significant dependent effect on protein evolutionary rates. Although the mechanisms underlying cooperation between DNA methylation and TFs/miRNAs in gene regulation remain unclear, our study helps to illuminate not only the impact of these regulatory factors on mammalian protein evolution but also their intricate interaction within gene regulatory networks.
Affiliation(s)
- Trees-Juen Chuang
- Division of Physical & Computational Genomics, Genomics Research Center, Academia Sinica, Taipei, Taiwan
| | - Tai-Wei Chiang
- Division of Physical & Computational Genomics, Genomics Research Center, Academia Sinica, Taipei, Taiwan
| |
|
33
|
Guillen-Ahlers H, Shortreed MR, Smith LM, Olivier M. Advanced methods for the analysis of chromatin-associated proteins. Physiol Genomics 2014; 46:441-7. [PMID: 24803678 DOI: 10.1152/physiolgenomics.00041.2014] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/22/2022] Open
Abstract
DNA-protein interactions are central to gene expression and chromatin regulation and have become one of the main focus areas of the ENCODE consortium. Advances in mass spectrometry and associated technologies have facilitated studies of these interactions, revealing many novel DNA-interacting proteins and histone posttranslational modifications. Proteins interacting at a single locus or at multiple loci have been targeted in these recent studies, each requiring a separate analytical strategy for isolation and analysis of DNA-protein interactions. The enrichment of target chromatin fractions occurs via a number of methods including immunoprecipitation, affinity purification, and hybridization, with the shared goal of using proteomics approaches as the final readout. The result of this is a number of exciting new tools, with distinct strengths and limitations that can enable highly robust and novel chromatin studies when applied appropriately. The present review compares and contrasts these methods to help the reader distinguish the advantages of each approach.
Affiliation(s)
- Hector Guillen-Ahlers
- Department of Genetics, Texas Biomedical Research Institute, San Antonio, Texas
| | | | - Lloyd M Smith
- Department of Chemistry, University of Wisconsin, Madison, Wisconsin
| | - Michael Olivier
- Department of Genetics, Texas Biomedical Research Institute, San Antonio, Texas
| |
|
34
|
Rouault H, Santolini M, Schweisguth F, Hakim V. Imogene: identification of motifs and cis-regulatory modules underlying gene co-regulation. Nucleic Acids Res 2014; 42:6128-45. [PMID: 24682824 PMCID: PMC4041412 DOI: 10.1093/nar/gku209] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Indexed: 11/23/2022] Open
Abstract
Cis-regulatory modules (CRMs) and motifs play a central role in tissue and condition-specific gene expression. Here we present Imogene, an ensemble of statistical tools that we have developed to facilitate their identification and implemented in publicly available software. Starting from a small training set of mammalian or fly CRMs that drive similar gene expression profiles, Imogene determines de novo cis-regulatory motifs that underlie this co-expression. It can then predict on a genome-wide scale other CRMs with a regulatory potential similar to the training set. Imogene bypasses the need for large datasets for statistical analyses by making central use of the information provided by the sequenced genomes of multiple species, based on the developed statistical tools and explicit models for transcription factor binding site evolution. We test Imogene on characterized tissue-specific mouse developmental CRMs. Its ability to identify CRMs with the same specificity based on its de novo created motifs is comparable to that of previously evaluated ‘motif-blind’ methods. We further show, both in flies and in mammals, that Imogene de novo generated motifs are sufficient to discriminate CRMs related to different developmental programs. Notably, purely relying on sequence data, Imogene performs as well in this discrimination task as a previously reported learning algorithm based on Chromatin Immunoprecipitation (ChIP) data for multiple transcription factors at multiple developmental stages.
Affiliation(s)
- Hervé Rouault
- Developmental and Stem Cell Biology Department, Institut Pasteur, F-75015 Paris, France; CNRS, URA2578, F-75015 Paris, France
| | - Marc Santolini
- Laboratoire de Physique Statistique, CNRS, École Normale Supérieure, Université P. et M. Curie, Université Paris-Diderot
| | - François Schweisguth
- Developmental and Stem Cell Biology Department, Institut Pasteur, F-75015 Paris, France; CNRS, URA2578, F-75015 Paris, France
| | - Vincent Hakim
- Laboratoire de Physique Statistique, CNRS, École Normale Supérieure, Université P. et M. Curie, Université Paris-Diderot
| |
|
35
|
Application of experimentally verified transcription factor binding sites models for computational analysis of ChIP-Seq data. BMC Genomics 2014; 15:80. [PMID: 24472686 PMCID: PMC4234207 DOI: 10.1186/1471-2164-15-80] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Received: 06/07/2013] [Accepted: 01/25/2014] [Indexed: 02/07/2023] Open
Abstract
Background: ChIP-Seq is widely used to detect genomic segments bound by transcription factors (TF), either directly at DNA binding sites (BSs) or indirectly via other proteins. Currently, there are many software tools implementing different approaches to identify TFBSs within ChIP-Seq peaks. However, their use for the interpretation of ChIP-Seq data is usually complicated by the absence of direct experimental verification, making it difficult both to set a threshold to avoid recognition of too many false-positive BSs, and to compare the actual performance of different models. Results: Using ChIP-Seq data for FoxA2 binding loci in mouse adult liver and human HepG2 cells we compared FoxA binding-site predictions for four computational models of two fundamental classes: pattern matching based on existing training set of experimentally confirmed TFBSs (oPWM and SiteGA) and de novo motif discovery (ChIPMunk and diChIPMunk). To properly select prediction thresholds for the models, we experimentally evaluated affinity of 64 predicted FoxA BSs using EMSA that allows safely distinguishing sequences able to bind TF. As a result we identified thousands of reliable FoxA BSs within ChIP-Seq loci from mouse liver and human HepG2 cells. It was found that the performance of conventional position weight matrix (PWM) models was inferior with the highest false positive rate. On the contrary, the best recognition efficiency was achieved by the combination of SiteGA & diChIPMunk/ChIPMunk models, properly identifying FoxA BSs in up to 90% of loci for both mouse and human ChIP-Seq datasets. Conclusions: The experimental study of TF binding to oligonucleotides corresponding to predicted sites increases the reliability of computational methods for TFBS-recognition in ChIP-Seq data analysis. Regarding ChIP-Seq data interpretation, basic PWMs have inferior TFBS recognition quality compared to the more sophisticated SiteGA and de novo motif discovery methods. A combination of models from different principles allowed identification of proper TFBSs.
|
36
|
Bryzgalov LO, Antontseva EV, Matveeva MY, Shilov AG, Kashina EV, Mordvinov VA, Merkulova TI. Detection of regulatory SNPs in human genome using ChIP-seq ENCODE data. PLoS One 2013; 8:e78833. [PMID: 24205329 PMCID: PMC3812152 DOI: 10.1371/journal.pone.0078833] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.5] [Received: 12/24/2012] [Accepted: 09/17/2013] [Indexed: 11/18/2022] Open
Abstract
A vast number of SNPs derived from genome-wide association studies are non-coding, which heightens the need for effective identification of regulatory SNPs (rSNPs) among them. However, this task remains challenging since the regulatory part of the human genome is much more poorly annotated than the coding regions. Here we describe an approach aggregating the whole set of ENCODE ChIP-seq data in order to search for rSNPs, and provide experimental evidence of its efficiency. Its algorithm is based on the assumption that the enrichment of a genomic region with transcription factor binding loci (ChIP-seq peaks) indicates its regulatory function, and thereby SNPs located in this region are more likely to influence transcription regulation. To ensure that the approach preferentially selects functionally meaningful SNPs, we performed enrichment analysis of several human SNP datasets associated with phenotypic manifestations. It was shown that all samples are significantly enriched with SNPs falling into the regions of multiple ChIP-seq peaks as compared with randomly selected SNPs. For experimental verification, 40 SNPs falling into overlapping regions of at least 7 TF binding loci were selected from OMIM. The effect of SNPs on the binding of the DNA fragments containing them to the nuclear proteins from four human cell lines (HepG2, HeLaS3, HCT-116, and K562) has been tested by EMSA. A radical change in the binding pattern has been observed for 29 SNPs, and 6 more SNPs demonstrated less pronounced changes. Taken together, the results demonstrate an effective way to search for potential rSNPs with the aid of ChIP-seq data provided by the ENCODE project.
Affiliation(s)
| | - Elena V. Antontseva
- Institute of Cytology and Genetics SD RAS, Novosibirsk, Russian Federation
| | | | | | - Elena V. Kashina
- Institute of Cytology and Genetics SD RAS, Novosibirsk, Russian Federation
| | | | - Tatyana I. Merkulova
- Institute of Cytology and Genetics SD RAS, Novosibirsk, Russian Federation
- Novosibirsk State University, Novosibirsk, Russian Federation
| |
|
37
|
Disclosing the crosstalk among DNA methylation, transcription factors, and histone marks in human pluripotent cells through discovery of DNA methylation motifs. Genome Res 2013; 23:2013-29. [PMID: 24149073 PMCID: PMC3847772 DOI: 10.1101/gr.155960.113] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Indexed: 12/17/2022]
Abstract
Gene expression regulation is gated by promoter methylation states modulating transcription factor binding. The known DNA methylation/unmethylation mechanisms are sequence unspecific, but different cells with the same genome have different methylomes. Thus, additional processes bringing specificity to the methylation/unmethylation mechanisms are required. Searching for such processes, we demonstrated that CpG methylation states are influenced by the sequence context surrounding the CpGs. We used such a property to develop a CpG methylation motif discovery algorithm. The newly discovered motifs reveal “methylation/unmethylation factors” that could recruit the “methylation/unmethylation machinery” to the loci specified by the motifs. Our methylation motif discovery algorithm provides a synergistic approach to the differently methylated region algorithms. Since our algorithm searches for commonly methylated regions inside the same sample, it requires only a single sample to operate. The motifs that were found discriminate between hypomethylated and hypermethylated regions. The hypomethylation-associated motifs have a high CG content, their targets appear in conserved regions near transcription start sites, they tend to co-occur within transcription factor binding sites, they are involved in breaking the H3K4me3/H3K27me3 bivalent balance, and they transit the enhancers from repressive H3K27me3 to active H3K27ac during ES cell differentiation. The new methylation motifs characterize the pluripotent state shared between ES and iPS cells. Additionally, we found a collection of motifs associated with the somatic memory inherited by the iPS from the initial fibroblast cells, thus revealing the existence of epigenetic somatic memory on a fine methylation scale.
|
38
|
Wang H, Guan S, Zhu Z, Wang Y, Lu Y. A valid strategy for precise identifications of transcription factor binding sites in combinatorial regulation using bioinformatic and experimental approaches. Plant Methods 2013; 9:34. [PMID: 23971995 PMCID: PMC3847620 DOI: 10.1186/1746-4811-9-34] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 04/01/2013] [Accepted: 08/13/2013] [Indexed: 05/04/2023]
Abstract
BACKGROUND Transcription factor (TF) binding sites (cis elements) play a central role in gene regulation, and eukaryotic organisms frequently adopt combinatorial regulation to produce sophisticated local gene expression patterns. Knowing the precise cis element on a distal promoter is a prerequisite for studying a typical transcription process; however, identification of cis elements has lagged behind that of their associated trans-acting TFs because of technical difficulties. Consequently, gene regulation via combinatorial TFs, as widely observed across biological processes, has remained vague in many cases. RESULTS We present here a valid strategy for identifying cis elements in combinatorial TF regulation. It consists of bioinformatic searches of available databases to generate candidate cis elements and tests of the candidates using improved experimental assays. Taking the MYB and the bHLH that collaboratively regulate the anthocyanin pathway genes as examples, we demonstrate how candidate cis motifs for the TFs are found on multi-specific promoters of chalcone synthase (CHS) genes, and how to experimentally test the candidate sites by designing DNA fragments hosting the candidate motifs based on a known promoter (us1 allele of Ipomoea purpurea CHS-D in our case) and applying site-directed mutagenesis at the motifs. TF-DNA interactions could be analyzed unambiguously by electrophoretic mobility shift assays (EMSA) and dual-luciferase transient expression assays, and the resulting evidence precisely delineated a cis element. The cis element for R2R3 MYBs including Ipomoea MYB1 and Magnolia MYB1, for instance, was found to be ANCNACC, and that for bHLHs (exemplified by Ipomoea bHLH2 and petunia AN1) was CACNNG.
A re-analysis was conducted on previously reported promoter segments recognized by maize C1 and apple MYB10, which indicated that cis elements similar to ANCNACC were indeed present on these segments and tested positive for binding to Ipomoea MYB1. CONCLUSION Identification of cis elements in combinatorial regulation is now feasible with the strategy outlined. The working pipeline integrates the existing databases with experimental techniques, providing an open framework for precisely identifying cis elements. This strategy is widely applicable to various biological systems, and may enhance future analyses of gene regulation.
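As an illustration of the bioinformatic candidate search described in this abstract, the reported consensus elements can be located in a sequence with a simple pattern scan. The following is a minimal sketch, not the authors' pipeline: it assumes that N stands for any of A, C, G or T, it scans both strands, and the function names and the example fragment are hypothetical.

```python
import re

# Consensus elements reported in the abstract; N is assumed to mean any base.
MYB_MOTIF = "ANCNACC"
BHLH_MOTIF = "CACNNG"


def motif_to_regex(motif):
    """Compile a consensus with N wildcards into a regex.

    The pattern is wrapped in a lookahead so overlapping sites are reported.
    """
    body = "".join("[ACGT]" if base == "N" else base for base in motif)
    return re.compile("(?=(" + body + "))")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sites(seq, motif):
    """Return sorted (start, strand, match) tuples.

    start is 0-based on the forward strand; match is read 5'->3' on its strand.
    """
    seq = seq.upper()
    pattern = motif_to_regex(motif)
    hits = [(m.start(), "+", m.group(1)) for m in pattern.finditer(seq)]
    for m in pattern.finditer(reverse_complement(seq)):
        # Map the reverse-strand coordinate back onto the forward strand.
        start = len(seq) - m.start() - len(motif)
        hits.append((start, "-", m.group(1)))
    return sorted(hits)


# Hypothetical promoter fragment, for illustration only.
fragment = "TTAACCAACCGGCACGTGTT"
print(find_sites(fragment, MYB_MOTIF))
print(find_sites(fragment, BHLH_MOTIF))
```

A palindromic site such as CACGTG is reported once per strand, which is why candidates from a scan like this still need the experimental tests (EMSA, dual-luciferase assays) that the abstract describes.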
Affiliation(s)
- Hailong Wang
- State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, 20 Nan Xin Cun, Beijing 100093, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Shan Guan
- State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, 20 Nan Xin Cun, Beijing 100093, China
- Zhixin Zhu
- State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, 20 Nan Xin Cun, Beijing 100093, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Yan Wang
- State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, 20 Nan Xin Cun, Beijing 100093, China
- Yingqing Lu
- State Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, 20 Nan Xin Cun, Beijing 100093, China
|
39
|
Jia C, Carson MB, Yu J. A fast weak motif-finding algorithm based on community detection in graphs. BMC Bioinformatics 2013; 14:227. [PMID: 23865838 PMCID: PMC3726413 DOI: 10.1186/1471-2105-14-227] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Received: 10/22/2012] [Accepted: 07/12/2013] [Indexed: 12/02/2022] Open
Abstract
BACKGROUND Identification of transcription factor binding sites (also called 'motif discovery') in DNA sequences is a basic step in understanding genetic regulation. Although many successful programs have been developed, the problem is far from being solved on account of diversity in gene expression/regulation and the low specificity of binding sites. State-of-the-art algorithms have their own constraints (e.g., high time or space complexity for finding long motifs, low precision in identification of weak motifs, or the OOPS constraint: one occurrence of the motif instance per sequence) which limit their scope of application. RESULTS In this paper, we present a novel and fast algorithm we call TFBSGroup. It is based on community detection from a graph and is used to discover long and weak (l,d) motifs under the ZOMOPS constraint (zero, one or multiple occurrence(s) of the motif instance(s) per sequence), where l is the length of a motif and d is the maximum number of mutations between a motif instance and the motif itself. Firstly, TFBSGroup transforms the (l, d) motif search in sequences to focus on the discovery of dense subgraphs within a graph. It identifies these subgraphs using a fast community detection method for obtaining coarse-grained candidate motifs. Next, it greedily refines these candidate motifs towards the true motif within their own communities. Empirical studies on synthetic (l, d) samples have shown that TFBSGroup is very efficient (e.g., it can find true (18, 6), (24, 8) motifs within 30 seconds). More importantly, the algorithm has succeeded in rapidly identifying motifs in a large data set of prokaryotic promoters generated from the Escherichia coli database RegulonDB. The algorithm has also accurately identified motifs in ChIP-seq data sets for 12 mouse transcription factors involved in ES cell pluripotency and self-renewal. 
CONCLUSIONS Our novel heuristic algorithm, TFBSGroup, is able to quickly identify nearly exact matches for long and weak (l, d) motifs in DNA sequences under the ZOMOPS constraint. It is also capable of finding motifs in real applications. The source code for TFBSGroup can be obtained from http://bioinformatics.bioengr.uic.edu/TFBSGroup/.
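The (l, d) motif definition used in this abstract can be made concrete with a brute-force check. The sketch below is not TFBSGroup and does no community detection; it only assumes that "mutations" are counted as Hamming mismatches and lists, for a known motif, every window within d mismatches, so that a sequence may contribute zero, one or several instances (the ZOMOPS setting). The function names and toy sequences are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def ld_instances(sequences, motif, d):
    """For each sequence, list (start, window) pairs within d mismatches of motif.

    The motif length l is len(motif). A sequence may yield zero, one or
    multiple instances, which corresponds to the ZOMOPS constraint.
    """
    l = len(motif)
    result = []
    for seq in sequences:
        hits = [
            (i, seq[i:i + l])
            for i in range(len(seq) - l + 1)
            if hamming(seq[i:i + l], motif) <= d
        ]
        result.append(hits)
    return result


# Toy (6, 1) example with made-up sequences.
seqs = ["TTACGTGATT", "GGGGGGGG", "ACGTGAACCTGA"]
hits = ld_instances(seqs, "ACGTGA", 1)
print(hits)
```

In real motif discovery the motif itself is unknown, which is what makes long and weak (l, d) instances hard to find and motivates heuristics such as the graph-based approach described above.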
Affiliation(s)
- Caiyan Jia
- School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China.
|
40
|
Maetschke SR, Madhamshettiwar PB, Davis MJ, Ragan MA. Supervised, semi-supervised and unsupervised inference of gene regulatory networks. Brief Bioinform 2013; 15:195-211. [PMID: 23698722 PMCID: PMC3956069 DOI: 10.1093/bib/bbt034] [Citation(s) in RCA: 88] [Impact Index Per Article: 8.0] [Indexed: 11/22/2022] Open
Abstract
Inference of gene regulatory networks from expression data is a challenging task. Many methods have been developed for this purpose, but a comprehensive evaluation that covers unsupervised, semi-supervised and supervised methods, and provides guidelines for their practical application, is lacking. We performed an extensive evaluation of inference methods on simulated and experimental expression data. The results reveal low prediction accuracies for unsupervised techniques, with the notable exception of the Z-SCORE method on knockout data. In all other cases, the supervised approach achieved the highest accuracies and, even in a semi-supervised setting with small numbers of only positive samples, outperformed the unsupervised techniques.
Affiliation(s)
- Stefan R Maetschke
- Institute for Molecular Bioscience and ARC Centre of Excellence in Bioinformatics, Brisbane, QLD 4072, Australia, Tel.: 61 7 3346 2616; Fax: 61 7 3346 2101;
|
41
|
Thompson JA, Congdon CB. An Exploration Into Improving DNA Motif Inference by Looking for Highly Conserved Core Regions. IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology Proceedings. IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology 2013; 2013:60-67. [PMID: 31008453 PMCID: PMC6474685 DOI: 10.1109/cibcb.2013.6595389] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Although most verified functional elements in noncoding DNA contain a highly conserved core region, this concept is not generally incorporated into de novo motif inference systems. In this work, we explore the utility of adding the notion of conserved core regions into a comparative genomics approach for the search for putative functional elements in noncoding DNA. By modifying the scoring function for GAMI, Genetic Algorithms for Motif Inference, we investigate tradeoffs between the strength of conservation of the full motif vs. the strength of conservation of a core region. This work illustrates that incorporating information about the structure of transcription factor binding sites can be helpful in identifying biologically functional elements.
Affiliation(s)
- Jeffrey A Thompson
- Department of Computer Science, University of Southern Maine, Portland, Maine 04104
- Clare Bates Congdon
- Department of Computer Science, University of Southern Maine, Portland, Maine 04104
|
42
|
Mukherjee R, Evans P, Singh LN, Hannenhalli S. Correlated evolution of positions within mammalian cis elements. PLoS One 2013; 8:e55521. [PMID: 23408994 PMCID: PMC3568137 DOI: 10.1371/journal.pone.0055521] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 05/09/2012] [Accepted: 12/27/2012] [Indexed: 12/26/2022] Open
Abstract
Transcriptional regulation critically depends on proper interactions between transcription factors (TF) and their cognate DNA binding sites. The widely used model of TF-DNA binding – the Positional Weight Matrix (PWM) – presumes independence between positions within the binding site. However, there is evidence to show that the independence assumption may not always hold, and the extent of interposition dependence is not completely known. We hypothesize that the interposition dependence should partly be manifested as correlated evolution at the positions. We report a Maximum-Likelihood (ML) approach to infer correlated evolution at any two positions within a PWM, based on a multiple alignment of 5 mammalian genomes. Application to a genome-wide set of putative cis elements in human promoters reveals a prevalence of correlated evolution within cis elements. We found that the interdependence between two positions decreases with increasing distance between the positions. The interdependent positions tend to be evolutionarily more constrained and moreover, the dependence patterns are relatively similar across structurally related transcription factors. Although some of the detected mutational dependencies may be due to context-dependent genomic hyper-mutation, notably CG to TG, the majority is likely due to context-dependent preferences for specific nucleotide combinations within the cis elements. Patterns of evolution at individual nucleotide positions within mammalian TF binding sites are often significantly correlated, suggesting interposition dependence. The proposed methodology is also applicable to other classes of non-coding functional elements. A detailed investigation of mutational dependencies within specific motifs could reveal preferred nucleotide combinations that may help refine the DNA binding models.
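The independence assumption mentioned in this abstract is easiest to see in how a PWM score is computed: the score of a site is a sum of per-position terms, so no term can depend on a neighbouring base. The following is a minimal sketch with a hypothetical three-position matrix and an assumed uniform background of 0.25; it is not the maximum-likelihood method of the paper.

```python
import math

BACKGROUND = 0.25  # assumed uniform background frequency

# Hypothetical 3-position probability matrix (one dict per position).
PROBS = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]


def pwm_score(site, probs=PROBS, background=BACKGROUND):
    """Log-odds score of a site as a sum of independent per-position terms."""
    if len(site) != len(probs):
        raise ValueError("site length must match the matrix width")
    return sum(math.log2(col[base] / background) for base, col in zip(site, probs))


# Swapping bases between two sites leaves the summed score unchanged,
# because each position contributes independently of the others.
left = pwm_score("ACG") + pwm_score("CAG")
right = pwm_score("AAG") + pwm_score("CCG")
print(left, right)
```

A model with interposition dependence would break this equality, for example by rewarding the pair A-C at positions 1 and 2 more than the sum of its single-base terms.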
Affiliation(s)
- Rithun Mukherjee
- Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- * E-mail: (RM); (SH)
- Perry Evans
- Department of Pathology, School of Medicine, Yale University, New Haven, Connecticut, United States of America
- Larry N. Singh
- Genetic Diseases Research Branch, NHGRI, NIH, Bethesda, Maryland, United States of America
- Sridhar Hannenhalli
- Center for Bioinformatics and Computational Biology, Department of Cell and Molecular Biology, University of Maryland, College Park, Maryland, United States of America
- * E-mail: (RM); (SH)
|
43
|
Xu B, Schones DE, Wang Y, Liang H, Li G. A structural-based strategy for recognition of transcription factor binding sites. PLoS One 2013; 8:e52460. [PMID: 23320072 PMCID: PMC3540023 DOI: 10.1371/journal.pone.0052460] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 09/08/2012] [Accepted: 11/19/2012] [Indexed: 12/30/2022] Open
Abstract
Scanning through genomes for potential transcription factor binding sites (TFBSs) is becoming increasingly important in this post-genomic era. The position weight matrix (PWM) is the standard representation of TFBSs utilized when scanning through sequences for potential binding sites. However, many transcription factor (TF) motifs are short and highly degenerate, and methods utilizing PWMs to scan for sites are plagued by false positives. Furthermore, many important TFs do not have well-characterized PWMs, making identification of potential binding sites even more difficult. One approach to the identification of sites for these TFs has been to use the 3D structure of the TF to predict the DNA structure around the TF and then to generate a PWM from the predicted 3D complex structure. However, this approach is dependent on the similarity of the predicted structure to the native structure. We introduce here a novel approach to identify TFBSs utilizing structure information that can be applied to TFs without characterized PWMs, as long as a 3D complex structure (TF/DNA) exists. This approach utilizes an energy function that is uniquely trained on each structure. Our approach leads to increased prediction accuracy and robustness compared with those using a more general energy function. The software is freely available upon request.
Affiliation(s)
- Beisi Xu
- Laboratory of Molecular Modeling and Design, State Key Laboratory of Molecular Reaction Dynamics, Dalian Institute of Chemical Physics, The Chinese Academy of Sciences, Dalian, Liaoning, China
- Department of Microbiology, Immunology and Biochemistry, University of Tennessee Health Science Center, Memphis, Tennessee, United States of America
- Center for Integrative and Translational Genomics, University of Tennessee Health Science Center, Memphis, Tennessee, United States of America
- Dustin E. Schones
- Department of Cancer Biology, Beckman Research Institute, City of Hope, Duarte, California, United States of America
- Yongmei Wang
- Department of Chemistry, University of Memphis, Memphis, Tennessee, United States of America
- Haojun Liang
- Department of Polymer Science and Engineering, University of Science and Technology of China, Hefei, Anhui, China
- Guohui Li
- Laboratory of Molecular Modeling and Design, State Key Laboratory of Molecular Reaction Dynamics, Dalian Institute of Chemical Physics, The Chinese Academy of Sciences, Dalian, Liaoning, China
|
44
|
MacKenzie A, Hing B, Davidson S. Exploring the effects of polymorphisms on cis-regulatory signal transduction response. Trends Mol Med 2012; 19:99-107. [PMID: 23265842 PMCID: PMC3569712 DOI: 10.1016/j.molmed.2012.11.003] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Received: 06/27/2012] [Revised: 10/11/2012] [Accepted: 11/09/2012] [Indexed: 12/20/2022]
Abstract
cis-Regulatory sequences (CRSs) direct cell-specific and inducible gene expression in response to signal transduction networks, and it is becoming apparent that many cases of disease susceptibility and drug response stratification are due to polymorphisms that alter CRS responses in a context-dependent manner. In the current review, we describe successful methods for identifying CRSs and analyzing the effects of allelic variation on their responses to signal transduction. The technologies described build on the successes of ENCODE (ENCyclopedia Of DNA Elements) by exploring the effects of polymorphisms on CRS context dependency. This understanding is essential to uncover the genomic basis of disease susceptibility and will play a major role in delivering on the promise of personalized medicine.
Affiliation(s)
- Alasdair MacKenzie
- Gene Regulatory Systems Laboratory, School of Medical Sciences, Institute of Medical Sciences, University of Aberdeen, Aberdeen, Scotland AB25 2ZD, UK.
|
45
|
Wang D, Tapan S. MISCORE: a new scoring function for characterizing DNA regulatory motifs in promoter sequences. BMC Systems Biology 2012; 6 Suppl 2:S4. [PMID: 23282090 PMCID: PMC3521183 DOI: 10.1186/1752-0509-6-s2-s4] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 12/31/2022]
Abstract
Background Computational approaches for finding DNA regulatory motifs in promoter sequences are useful to biologists because they reduce experimental costs and speed up the de novo discovery of binding sites. It is important for rule-based or clustering-based motif searching schemes to effectively and efficiently evaluate the similarity between a k-mer (a k-length subsequence) and a motif model, without assuming the independence of nucleotides in motif models and without employing computationally expensive Markov chain models to estimate the background probabilities of k-mers. Also, it is interesting and beneficial to use a priori knowledge in developing advanced searching tools. Results This paper presents a new scoring function, termed MISCORE, for functional motif characterization and evaluation. MISCORE is free from: (i) any assumption on model dependency; and (ii) the use of a Markov chain model for background modeling. It integrates the compositional complexity of motif instances into the function. Performance evaluations with comparison to the well-known Maximum a Posteriori (MAP) score and Information Content (IC) have shown that MISCORE has promising capabilities to separate and recognize functional DNA motifs and their instances from non-functional ones. Conclusions MISCORE is a fast computational tool for candidate motif characterization, evaluation and selection. It allows a priori known motif models to be embedded for computing motif-to-motif similarity, which is an advantage over the IC and MAP scores. In addition to these merits, MISCORE can automatically filter out some repetitive k-mers from a motif model because compositional complexity is built into the function. Consequently, the merits of MISCORE in terms of both motif signal modeling power and computational efficiency will make it more applicable in the development of computational motif discovery tools.
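For orientation, the Information Content (IC) baseline that MISCORE is compared against can be computed from a set of aligned motif instances. The sketch below uses the common textbook form, IC per column = 2 minus the Shannon entropy in bits, and assumes a uniform background with no small-sample correction; it is not the MISCORE function, whose formula is not given in the abstract. The function names and toy instances are hypothetical.

```python
import math
from collections import Counter


def column_ic(column):
    """Information content in bits of one alignment column (uniform background)."""
    counts = Counter(column)
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return 2.0 - entropy


def motif_ic(instances):
    """Total information content of equal-length motif instances, summed over columns."""
    if not instances or len({len(s) for s in instances}) != 1:
        raise ValueError("instances must be non-empty and of equal length")
    return sum(column_ic(col) for col in zip(*instances))


# Toy aligned instances, for illustration only.
sites = ["ACGT", "ACGA", "ACCT", "ACGG"]
print(motif_ic(sites))
```

A fully conserved column contributes 2 bits and a column with all four bases at equal frequency contributes 0 bits, so higher totals indicate a more sharply defined motif.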
Affiliation(s)
- Dianhui Wang
- Department of Computer Science and Computer Engineering, La Trobe University, Melbourne, Victoria 3086, Australia.
|
46
|
Blanco E, Corominas M. CBS: an open platform that integrates predictive methods and epigenetics information to characterize conserved regulatory features in multiple Drosophila genomes. BMC Genomics 2012; 13:688. [PMID: 23228284 PMCID: PMC3564944 DOI: 10.1186/1471-2164-13-688] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/24/2012] [Accepted: 11/28/2012] [Indexed: 12/11/2022] Open
Abstract
Background Information about the composition of regulatory regions is of great value for designing experiments to functionally characterize gene expression. The multiplicity of available applications to predict transcription factor binding sites in a particular locus contrasts with the substantial computational expertise that is demanded to manipulate them, which may constitute a barrier for the experimental community. Results CBS (Conserved regulatory Binding Sites, http://compfly.bio.ub.es/CBS) is a public platform of evolutionarily conserved binding sites and enhancers predicted in multiple Drosophila genomes that is furnished with published chromatin signatures associated with transcriptionally active regions and other experimental sources of information. Rapid access to this novel body of knowledge through a user-friendly web interface enables non-expert users to identify the binding sequences available for any particular gene, transcription factor, or genome region. Conclusions The CBS platform is a powerful resource that provides tools for data mining individual sequences and groups of co-expressed genes with epigenomics information to conduct regulatory screenings in Drosophila.
Affiliation(s)
- Enrique Blanco
- Departament de Genètica and Institut de Biomedicina (IBUB), Universitat de Barcelona, Av. Diagonal 643, 08028, Barcelona, Spain.
|
47
|
Müller-Molina AJ, Schöler HR, Araúzo-Bravo MJ. Comprehensive human transcription factor binding site map for combinatory binding motifs discovery. PLoS One 2012; 7:e49086. [PMID: 23209563 PMCID: PMC3509107 DOI: 10.1371/journal.pone.0049086] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 07/03/2012] [Accepted: 10/08/2012] [Indexed: 11/18/2022] Open
Abstract
Knowing the map between transcription factors (TFs) and their binding sites is essential to reverse engineer the regulation process. Only about 10%-20% of the transcription factor binding motifs (TFBMs) have been reported. This lack of data hinders the understanding of gene regulation. To address this drawback, we propose a computational method that exploits previously unused TF properties to discover the missing TFBMs and their sites in all human gene promoters. The method starts by predicting a dictionary of regulatory "DNA words." From this dictionary, it distills 4098 novel predictions. To disclose the crosstalk between motifs, an additional algorithm extracts TF combinatorial binding patterns, creating a collection of TF regulatory syntactic rules. Using these rules, we narrowed down a list of 504 novel motifs that appear frequently in syntax patterns. We tested the predictions against 509 known motifs, confirming that our system can reliably predict ab initio motifs with an accuracy of 81%, far higher than previous approaches. We found that, on average, 90% of the discovered combinatorial binding patterns target at least 10 genes, suggesting that supplementary regulatory mechanisms are required to control smaller gene sets independently. Additionally, we discovered that the new TFBMs and their combinatorial patterns convey biological meaning, targeting TFs and genes related to developmental functions. Thus, among all the possible available targets in the genome, the TFs tend to regulate other TFs and genes involved in developmental functions. We provide a comprehensive resource for regulation analysis that includes a dictionary of "DNA words," newly predicted motifs and their corresponding combinatorial patterns. Combinatorial patterns are a useful filter to discover TFBMs that play a major role in orchestrating other factors and thus are likely to lock/unlock cellular functional clusters.
Affiliation(s)
- Arnoldo J. Müller-Molina
- Computational Biology and Bioinformatics Group, Max Planck Institute for Molecular Biomedicine, Münster, Germany
- Hans R. Schöler
- Department of Cell and Developmental Biology, Max Planck Institute for Molecular Biomedicine, Münster, Germany
- Medical Faculty, University of Münster, Münster, Germany
- Marcos J. Araúzo-Bravo
- Computational Biology and Bioinformatics Group, Max Planck Institute for Molecular Biomedicine, Münster, Germany
|
48
|
Abstract
The control of gene transcription is a critical level of gene expression regulation. The interactions between transcription factors (TFs) and their DNA binding sites (TFBSs) play a key role at this level. In order to decipher the molecular mechanism of the interactions of TFs with TFBSs and construct transcription regulatory networks, it is necessary to systematically collect, store, and analyze the information on discovered TFs and their TFBSs. In recent years, multiple TF- and TFBS-related databases have been established. These databases have significantly promoted TF-related studies in the fields of molecular biology, bioinformatics, and systems biology. This paper summarizes the contents, characteristics, access, and advances of the main TF- and TFBS-related databases, including TRANSFAC, JASPAR, TFDB, TRRD, TRED, PAZAR, MAPPER and others.
|
49
|
Kulakovskiy IV, Medvedeva YA, Schaefer U, Kasianov AS, Vorontsov IE, Bajic VB, Makeev VJ. HOCOMOCO: a comprehensive collection of human transcription factor binding sites models. Nucleic Acids Res 2012; 41:D195-202. [PMID: 23175603 PMCID: PMC3531053 DOI: 10.1093/nar/gks1089] [Citation(s) in RCA: 156] [Impact Index Per Article: 13.0] [Indexed: 12/26/2022] Open
Abstract
Transcription factor (TF) binding site (TFBS) models are crucial for computational reconstruction of transcription regulatory networks. In existing repositories, a TF often has several models (also called binding profiles or motifs), obtained from different experimental data. Having a single TFBS model for a TF is more pragmatic for practical applications. We show that integration of TFBS data from various types of experiments into a single model typically results in improved model quality, probably due to partial correction of source-specific technique bias. We present the Homo sapiens comprehensive model collection (HOCOMOCO, http://autosome.ru/HOCOMOCO/, http://cbrc.kaust.edu.sa/hocomoco/) containing carefully hand-curated TFBS models constructed by integration of binding sequences obtained by both low- and high-throughput methods. To construct position weight matrices to represent these TFBS models, we used ChIPMunk software in four computational modes, including a newly developed periodic positional prior mode associated with DNA helix pitch. We selected only one TFBS model per TF, unless there was clear experimental evidence for two rather distinct TFBS models. We assigned a quality rating to each model. HOCOMOCO contains 426 systematically curated TFBS models for 401 human TFs, where 172 models are based on more than one data source.
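To make the notion of a position weight matrix built from binding sequences concrete, the sketch below converts a set of already aligned sites into log-odds weights. It is a generic textbook construction, not ChIPMunk, which also has to find the alignment and supports the positional priors mentioned above; the pseudocount of 1, the uniform background of 0.25, the function name and the toy sites are all assumptions made for illustration.

```python
import math

ALPHABET = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build log2-odds weights (one dict per position) from aligned sites.

    Counts are smoothed with a pseudocount so unseen bases get a finite weight.
    """
    if not sites or len({len(s) for s in sites}) != 1:
        raise ValueError("sites must be non-empty and of equal length")
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for column in zip(*sites):
        weights = {}
        for base in ALPHABET:
            freq = (column.count(base) + pseudocount) / total
            weights[base] = math.log2(freq / background)
        pwm.append(weights)
    return pwm


# Toy aligned binding sites, for illustration only.
aligned = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGAATCA"]
pwm = build_pwm(aligned)
for position, weights in enumerate(pwm):
    print(position, {base: round(w, 2) for base, w in weights.items()})
```

Pooling sites from several experiments into one `sites` list before building the matrix is the simplest reading of the data integration described in the abstract, although the actual curation procedure is more involved.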
Affiliation(s)
- Ivan V Kulakovskiy
- Laboratory of Bioinformatics and Systems Biology, Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Vavilov Street 32, Moscow 119991, GSP-1, Russia.
|
50
|
Abstract
Understanding regulation of gene transcription is central to molecular biology as well as being of great interest in medicine. The molecular syntax of the concerted transcriptional activation/repression of gene networks in mammalian cells, which shapes the physiological response to molecular signals, is often unknown or not completely understood. Combining genome-wide experiments with in silico approaches opens the way to a more systematic comprehension of the molecular mechanisms of transcription regulation. Diverse bioinformatics tools have been developed to help unravel these mechanisms, by handling and processing data at different stages: from data collection and storage to the identification of molecular targets and from the detection of DNA motif signatures in the regulatory sequences of functionally related genes to the identification of relevant regulatory networks. Moreover, the large amount of genome-wide scale data recently produced has attracted professionals from diverse backgrounds to this cutting-edge realm of molecular biology. This mini-review is intended as an orientation for multidisciplinary professionals, introducing a streamlined workflow in gene transcription regulation with emphasis on sequence analysis. It provides an outlook on tools and methods, selected from a host of bioinformatics resources available today. It has been designed for the benefit of students, investigators, and professionals who seek a coherent yet quick introduction to in silico approaches to analyzing regulation of gene transcription in the post-genomic era.
Affiliation(s)
- Gioia Altobelli
- Department of Endocrinology, William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, Charterhouse Square, London EC1M 6BQ, UK.
|