1
Rahman JF, Hoque H, Jubayer AA, Jewel NA, Hasan MN, Chowdhury AT, Prodhan SH. Alfin-like (AL) transcription factor family in Oryza sativa L.: Genome-wide analysis and expression profiling under different stresses. Biotechnology Reports (Amsterdam, Netherlands) 2024; 43:e00845. [PMID: 38962072 PMCID: PMC11217604 DOI: 10.1016/j.btre.2024.e00845] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2024] [Revised: 04/24/2024] [Accepted: 05/29/2024] [Indexed: 07/05/2024]
Abstract
Oryza sativa L. is the world's most essential and economically important food crop. Climate change and ecological imbalances make rice plants vulnerable to abiotic and biotic stresses, threatening global food security. The Alfin-like (AL) transcription factor family plays a crucial role in plant development and stress responses. This study comprehensively analyzed this gene family and their expression profiles in rice, revealing nine AL genes, classifying them into three distinct groups based on phylogenetic analysis and identifying four segmental duplication events. RNA-seq data analysis revealed high expression levels of OsALs in different tissues, growth stages, and their responsiveness to stresses. RT-qPCR data showed significant expression of OsALs in different abiotic stresses. Identification of potential cis-regulatory elements in promoter regions has also unveiled their involvement. Tertiary structures of the proteins were predicted. These findings would lay the groundwork for future research to reveal their molecular mechanism in stress tolerance and plant development.
Affiliation(s)
- Jeba Faizah Rahman
  - Department of Genetic Engineering and Biotechnology, Shahjalal University of Science and Technology, Sylhet, 3114, Bangladesh
- Hammadul Hoque
  - Department of Genetic Engineering and Biotechnology, Shahjalal University of Science and Technology, Sylhet, 3114, Bangladesh
- Abdullah -Al- Jubayer
  - Department of Biotechnology and Genetic Engineering, Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, 8100, Bangladesh
- Nurnabi Azad Jewel
  - Department of Genetic Engineering and Biotechnology, Shahjalal University of Science and Technology, Sylhet, 3114, Bangladesh
- Md. Nazmul Hasan
  - Department of Genetic Engineering and Biotechnology, Shahjalal University of Science and Technology, Sylhet, 3114, Bangladesh
- Aniqua Tasnim Chowdhury
  - Department of Genetic Engineering and Biotechnology, Shahjalal University of Science and Technology, Sylhet, 3114, Bangladesh
- Shamsul H. Prodhan
  - Department of Genetic Engineering and Biotechnology, Shahjalal University of Science and Technology, Sylhet, 3114, Bangladesh
2
Chowdhury AT, Hasan MN, Bhuiyan FH, Islam MQ, Nayon MRW, Rahaman MM, Hoque H, Jewel NA, Ashrafuzzaman M, Prodhan SH. Identification, characterization of Apyrase (APY) gene family in rice (Oryza sativa) and analysis of the expression pattern under various stress conditions. PLoS One 2023; 18:e0273592. [PMID: 37163561 PMCID: PMC10171694 DOI: 10.1371/journal.pone.0273592] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2022] [Accepted: 02/27/2023] [Indexed: 05/12/2023]
Abstract
Apyrase (APY) is a nucleoside triphosphate (NTP) diphosphohydrolase (NTPDase) which is a member of the superfamily of guanosine diphosphatase 1 (GDA1)-cluster of differentiation 39 (CD39) nucleoside phosphatase. Under various circumstances like stress, cell growth, the extracellular adenosine triphosphate (eATP) level increases, causing a detrimental influence on cells such as cell growth retardation, ROS production, NO burst, and apoptosis. Apyrase hydrolyses eATP accumulated in the extracellular membrane during stress, wounds, into adenosine diphosphate (ADP) and adenosine monophosphate (AMP) and regulates the stress-responsive pathway in plants. This study was designed for the identification, characterization, and for analysis of APY gene expression in Oryza sativa. This investigation discovered nine APYs in rice, including both endo- and ecto-apyrase. According to duplication event analysis, in the evolution of OsAPYs, a significant role is performed by segmental duplication. Their role in stress control, hormonal responsiveness, and the development of cells is supported by the corresponding cis-elements present in their promoter regions. According to expression profiling by RNA-seq data, the genes were expressed in various tissues. Upon exposure to a variety of biotic as well as abiotic stimuli, including anoxia, drought, submergence, alkali, heat, dehydration, salt, and cold, they showed a differential expression pattern. The expression analysis from the RT-qPCR data also showed expression under various abiotic stress conditions, comprising cold, salinity, cadmium, drought, submergence, and especially heat stress. This finding will pave the way for future in-vivo analysis, unveil the molecular mechanisms of APY genes in stress response, and contribute to the development of stress-tolerant rice varieties.
Affiliation(s)
- Aniqua Tasnim Chowdhury
  - Department of Genetic Engineering and Biotechnology, School of Life Sciences, Shahjalal University of Science and Technology, Sylhet, Bangladesh
- Md Nazmul Hasan
  - Department of Genetic Engineering and Biotechnology, School of Life Sciences, Shahjalal University of Science and Technology, Sylhet, Bangladesh
- Fahmid H Bhuiyan
  - Plant Biotechnology Division, National Institute of Biotechnology, Ganakbari, Ashulia, Savar, Dhaka, Bangladesh
- Md Qamrul Islam
  - Department of Genetic Engineering and Biotechnology, School of Life Sciences, Shahjalal University of Science and Technology, Sylhet, Bangladesh
- Md Rakib Wazed Nayon
  - Department of Genetic Engineering and Biotechnology, School of Life Sciences, Shahjalal University of Science and Technology, Sylhet, Bangladesh
- Md Mashiur Rahaman
  - Department of Genetic Engineering and Biotechnology, School of Life Sciences, Shahjalal University of Science and Technology, Sylhet, Bangladesh
  - Institute of Epidemiology, Disease Control and Research (IEDCR), Dhaka, Bangladesh
- Hammadul Hoque
  - Department of Genetic Engineering and Biotechnology, School of Life Sciences, Shahjalal University of Science and Technology, Sylhet, Bangladesh
- Nurnabi Azad Jewel
  - Department of Genetic Engineering and Biotechnology, School of Life Sciences, Shahjalal University of Science and Technology, Sylhet, Bangladesh
- Md Ashrafuzzaman
  - Department of Genetic Engineering and Biotechnology, School of Life Sciences, Shahjalal University of Science and Technology, Sylhet, Bangladesh
- Shamsul H Prodhan
  - Department of Genetic Engineering and Biotechnology, School of Life Sciences, Shahjalal University of Science and Technology, Sylhet, Bangladesh
3
Yang Y, Lee JH, Poindexter MR, Shao Y, Liu W, Lenaghan SC, Ahkami AH, Blumwald E, Stewart CN. Rational design and testing of abiotic stress-inducible synthetic promoters from poplar cis-regulatory elements. Plant Biotechnology Journal 2021; 19:1354-1369. [PMID: 33471413 PMCID: PMC8313130 DOI: 10.1111/pbi.13550] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 07/14/2020] [Revised: 12/31/2020] [Accepted: 01/09/2021] [Indexed: 05/27/2023]
Abstract
Abiotic stress resistance traits may be especially crucial for sustainable production of bioenergy tree crops. Here, we show the performance of a set of rationally designed osmotic-related and salt stress-inducible synthetic promoters for use in hybrid poplar. De novo motif-detecting algorithms yielded 30 water-deficit (SD) and 34 salt stress (SS) candidate DNA motifs from relevant poplar transcriptomes. We selected three conserved water-deficit stress motifs (SD18, SD13 and SD9) found in 16 co-expressed gene promoters, and we discovered a well-conserved motif for salt response (SS16). We characterized several native poplar stress-inducible promoters to enable comparisons with our synthetic promoters. Fifteen synthetic promoters were designed using various SD and SS subdomains, in which heptameric repeats of five-to-eight subdomain bases were fused to a common core promoter downstream, which, in turn, drove a green fluorescent protein (GFP) gene for reporter assays. These 15 synthetic promoters were screened by transient expression assays in poplar leaf mesophyll protoplasts and agroinfiltrated Nicotiana benthamiana leaves under osmotic stress conditions. Twelve synthetic promoters were induced in transient expression assays with a GFP readout. Of these, five promoters (SD18-1, SD9-2, SS16-1, SS16-2 and SS16-3) endowed higher inducibility under osmotic stress conditions than native promoters. These five synthetic promoters were stably transformed into Arabidopsis thaliana to study inducibility in whole plants. Herein, SD18-1 and SD9-2 were induced by water-deficit stress, whereas SS16-1, SS16-2 and SS16-3 were induced by salt stress. The synthetic biology design pipeline resulted in five synthetic promoters that outperformed endogenous promoters in transgenic plants.
Affiliation(s)
- Yongil Yang
  - Center for Agricultural Synthetic Biology, University of Tennessee Institute of Agriculture, Knoxville, TN, USA
  - Department of Plant Sciences, University of Tennessee, Knoxville, TN, USA
- Jun Hyung Lee
  - Center for Agricultural Synthetic Biology, University of Tennessee Institute of Agriculture, Knoxville, TN, USA
  - Department of Plant Sciences, University of Tennessee, Knoxville, TN, USA
  - Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA
- Magen R. Poindexter
  - Center for Agricultural Synthetic Biology, University of Tennessee Institute of Agriculture, Knoxville, TN, USA
  - Department of Plant Sciences, University of Tennessee, Knoxville, TN, USA
- Yuanhua Shao
  - Center for Agricultural Synthetic Biology, University of Tennessee Institute of Agriculture, Knoxville, TN, USA
  - Department of Plant Sciences, University of Tennessee, Knoxville, TN, USA
- Wusheng Liu
  - Department of Plant Sciences, University of Tennessee, Knoxville, TN, USA
  - Department of Horticultural Science, North Carolina State University, Raleigh, NC, USA
- Scott C. Lenaghan
  - Center for Agricultural Synthetic Biology, University of Tennessee Institute of Agriculture, Knoxville, TN, USA
  - Department of Food Science, University of Tennessee, Knoxville, TN, USA
- Amir H. Ahkami
  - Environmental Molecular Sciences Laboratory (EMSL), Pacific Northwest National Laboratory (PNNL), Richland, WA, USA
- Charles Neal Stewart
  - Center for Agricultural Synthetic Biology, University of Tennessee Institute of Agriculture, Knoxville, TN, USA
  - Department of Plant Sciences, University of Tennessee, Knoxville, TN, USA
4
Mahood EH, Kruse LH, Moghe GD. Machine learning: A powerful tool for gene function prediction in plants. Applications in Plant Sciences 2020; 8:e11376. [PMID: 32765975 PMCID: PMC7394712 DOI: 10.1002/aps3.11376] [Citation(s) in RCA: 51] [Impact Index Per Article: 12.8] [Received: 10/01/2019] [Accepted: 03/19/2020] [Indexed: 05/06/2023]
Abstract
Recent advances in sequencing and informatic technologies have led to a deluge of publicly available genomic data. While it is now relatively easy to sequence, assemble, and identify genic regions in diploid plant genomes, functional annotation of these genes is still a challenge. Over the past decade, there has been a steady increase in studies utilizing machine learning algorithms for various aspects of functional prediction, because these algorithms are able to integrate large amounts of heterogeneous data and detect patterns inconspicuous through rule-based approaches. The goal of this review is to introduce experimental plant biologists to machine learning, by describing how it is currently being used in gene function prediction to gain novel biological insights. In this review, we discuss specific applications of machine learning in identifying structural features in sequenced genomes, predicting interactions between different cellular components, and predicting gene function and organismal phenotypes. Finally, we also propose strategies for stimulating functional discovery using machine learning-based approaches in plants.
Affiliation(s)
- Elizabeth H. Mahood
  - Plant Biology Section, School of Integrative Plant Sciences, Cornell University, Ithaca, New York 14853, USA
- Lars H. Kruse
  - Plant Biology Section, School of Integrative Plant Sciences, Cornell University, Ithaca, New York 14853, USA
- Gaurav D. Moghe
  - Plant Biology Section, School of Integrative Plant Sciences, Cornell University, Ithaca, New York 14853, USA
5
Clustering genomic words in human DNA using peaks and trends of distributions. Adv Data Anal Classi 2019. [DOI: 10.1007/s11634-019-00362-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
6
Bioinformatics Approaches to Gain Insights into cis-Regulatory Motifs Involved in mRNA Localization. Advances in Experimental Medicine and Biology 2019; 1203:165-194. [PMID: 31811635 DOI: 10.1007/978-3-030-31434-7_7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/21/2022]
Abstract
Messenger RNA (mRNA) is a fundamental intermediate in the expression of proteins. As an integral part of this important process, protein production can be localized by the targeting of mRNA to a specific subcellular compartment. The subcellular destination of mRNA is suggested to be governed by a region of its primary sequence or secondary structure, which consequently dictates the recruitment of trans-acting factors, such as RNA-binding proteins or regulatory RNAs, to form a messenger ribonucleoprotein particle. This molecular ensemble is requisite for precise and spatiotemporal control of gene expression. In the context of RNA localization, the description of the binding preferences of an RNA-binding protein defines a motif, and one, or more, instance of a given motif is defined as a localization element (zip code). In this chapter, we first discuss the cis-regulatory motifs previously identified as mRNA localization elements. We then describe motif representation in terms of entropy and information content and offer an overview of motif databases and search algorithms. Finally, we provide an outline of the motif topology of asymmetrically localized mRNA molecules.
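The entropy-based motif representation mentioned in this abstract can be illustrated with a minimal sketch. It assumes a four-letter nucleotide alphabet, a uniform background, and a motif given as per-position probability columns; the function names and the example matrix are hypothetical and are not taken from the chapter. Small-sample corrections are deliberately left out.

```python
import math


def column_entropy(probs):
    """Shannon entropy (in bits) of one motif column; zero-probability letters contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def column_information(probs, alphabet_size=4):
    """Information content of a column relative to a uniform background:
    log2(alphabet_size) minus the column entropy."""
    return math.log2(alphabet_size) - column_entropy(probs)


def motif_information(matrix):
    """Total information content of a motif, summed over its columns."""
    return sum(column_information(column) for column in matrix)


# Hypothetical 3-position motif; each column lists P(A), P(C), P(G), P(U).
example_motif = [
    [1.0, 0.0, 0.0, 0.0],      # fully conserved position
    [0.5, 0.5, 0.0, 0.0],      # two equally likely letters
    [0.25, 0.25, 0.25, 0.25],  # uninformative position
]

if __name__ == "__main__":
    for index, column in enumerate(example_motif, start=1):
        print(f"position {index}: {column_information(column):.2f} bits")
    print(f"motif total: {motif_information(example_motif):.2f} bits")
```

A fully conserved column carries the maximum of 2 bits for a four-letter alphabet, while a column matching the background carries none, which is why sequence logos scale letter heights by this quantity.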
7
Bai W, Geng W, Wang S, Zhang F. Biosynthesis, regulation, and engineering of microbially produced branched biofuels. Biotechnology for Biofuels 2019; 12:84. [PMID: 31011367 PMCID: PMC6461809 DOI: 10.1186/s13068-019-1424-9] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 11/01/2018] [Accepted: 04/03/2019] [Indexed: 05/13/2023]
Abstract
The steadily increasing demand on transportation fuels calls for renewable fuel replacements. This has attracted a growing amount of research to develop advanced biofuels that have similar physical, chemical, and combustion properties with petroleum-derived fossil fuels. Early generations of biofuels, such as ethanol, butanol, and straight-chain fatty acid-derived esters or hydrocarbons suffer from various undesirable properties and can only be blended in limited amounts. Recent research has shifted to the production of branched-chain biofuels that, compared to straight-chain fuels, have higher octane values, better cold flow, and lower cloud points, making them more suitable for existing engines, particularly for diesel and jet engines. This review focuses on several types of branched-chain biofuels and their immediate precursors, including branched short-chain (C4-C8) and long-chain (C15-C19)-alcohols, alkanes, and esters. We discuss their biosynthesis, regulation, and recent efforts in their overproduction by engineered microbes.
Affiliation(s)
- Wenqin Bai
  - Department of Energy, Environmental and Chemical Engineering, Washington University in St. Louis, Saint Louis, MO 63130 USA
- Weitao Geng
  - Department of Energy, Environmental and Chemical Engineering, Washington University in St. Louis, Saint Louis, MO 63130 USA
- Shaojie Wang
  - Department of Energy, Environmental and Chemical Engineering, Washington University in St. Louis, Saint Louis, MO 63130 USA
- Fuzhong Zhang
  - Department of Energy, Environmental and Chemical Engineering, Washington University in St. Louis, Saint Louis, MO 63130 USA
  - Division of Biological & Biomedical Sciences, Washington University in St. Louis, Saint Louis, MO 63130 USA
  - Institute of Materials Science & Engineering, Washington University in St. Louis, Saint Louis, MO 63130 USA
8
Tran NTL, Huang CH. MODSIDE: a motif discovery pipeline and similarity detector. BMC Genomics 2018; 19:755. [PMID: 30340511 PMCID: PMC6194616 DOI: 10.1186/s12864-018-5148-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 06/24/2018] [Accepted: 10/08/2018] [Indexed: 01/06/2023]
Abstract
Background: Previous studies demonstrate the usefulness of using multiple tools and methods for improving the accuracy of motif detection. Over the past years, numerous motif discovery pipelines have been developed. However, they typically report only the top ranked results either from individual motif finders or from a combination of multiple tools and algorithms. Results: Here we present MODSIDE, a motif discovery pipeline and similarity detector. The pipeline integrated four de novo motif finders: ChIPMunk, MEME, Weeder, and XXmotif. It also incorporated a motif similarity detection tool MOTIFSIM. MODSIDE was designed for delivering not only the predictive results from individual motif finders but also the comparison results for multiple tools. The results include the common significant motifs from multiple tools, the motifs detected by some tools but not by others, and the best matches for each motif in the motif collection of multiple tools. MODSIDE also possesses other useful features for merging similar motifs and clustering motifs into motif trees. Conclusions: We evaluated MODSIDE and its adopted motif finders on 16 benchmark datasets. The statistical results demonstrate MODSIDE achieves better accuracy than individual motif finders. We also compared MODSIDE with two popular motif discovery pipelines: MEME-ChIP and RSAT peak-motifs. The comparison results reveal MODSIDE attains similar performance as RSAT peak-motifs but better accuracy than MEME-ChIP. In addition, MODSIDE is able to deliver various comparison results that are not offered by MEME-ChIP, RSAT peak-motifs, and other existing motif discovery pipelines.
Affiliation(s)
- Ngoc Tam L Tran
  - Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, 06269, USA
- Chun-Hsi Huang
  - Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, 06269, USA
9
Guo Y, Tian K, Zeng H, Guo X, Gifford DK. A novel k-mer set memory (KSM) motif representation improves regulatory variant prediction. Genome Res 2018; 28:891-900. [PMID: 29654070 PMCID: PMC5991515 DOI: 10.1101/gr.226852.117] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 06/29/2017] [Accepted: 04/04/2018] [Indexed: 12/15/2022]
Abstract
The representation and discovery of transcription factor (TF) sequence binding specificities is critical for understanding gene regulatory networks and interpreting the impact of disease-associated noncoding genetic variants. We present a novel TF binding motif representation, the k-mer set memory (KSM), which consists of a set of aligned k-mers that are overrepresented at TF binding sites, and a new method called KMAC for de novo discovery of KSMs. We find that KSMs more accurately predict in vivo binding sites than position weight matrix (PWM) models and other more complex motif models across a large set of ChIP-seq experiments. Furthermore, KSMs outperform PWMs and more complex motif models in predicting in vitro binding sites. KMAC also identifies correct motifs in more experiments than five state-of-the-art motif discovery methods. In addition, KSM-derived features outperform both PWM and deep learning model derived sequence features in predicting differential regulatory activities of expression quantitative trait loci (eQTL) alleles. Finally, we have applied KMAC to 1600 ENCODE TF ChIP-seq data sets and created a public resource of KSM and PWM motifs. We expect that the KSM representation and KMAC method will be valuable in characterizing TF binding specificities and in interpreting the effects of noncoding genetic variations.
Affiliation(s)
- Yuchun Guo
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Kevin Tian
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Haoyang Zeng
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Xiaoyun Guo
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- David Kenneth Gifford
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
10
Triska M, Ivliev A, Nikolsky Y, Tatarinova TV. Analysis of cis-Regulatory Elements in Gene Co-expression Networks in Cancer. Methods Mol Biol 2017; 1613:291-310. [PMID: 28849565 DOI: 10.1007/978-1-4939-7027-8_11] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 01/22/2023]
Abstract
Analysis of gene co-expression networks is a powerful "data-driven" tool, invaluable for understanding cancer biology and mechanisms of tumor development. Yet, despite the completion of thousands of studies on cancer gene expression, there have been few attempts to normalize and integrate co-expression data from scattered sources in a concise "meta-analysis" framework. Here we describe an integrated approach to cancer expression meta-analysis, which combines generation of "data-driven" co-expression networks with detailed statistical detection of promoter sequence motifs within the co-expression clusters. First, we applied the Weighted Gene Co-Expression Network Analysis (WGCNA) workflow and Pearson's correlation to generate a comprehensive set of over 3000 co-expression clusters in 82 normalized microarray datasets from nine cancers of different origin. Next, we designed a genome-wide statistical approach to the detection of specific DNA sequence motifs based on similarities between the promoters of similarly expressed genes. The approach, realized as the cisExpress software module, was specifically designed for analysis of very large data sets such as those generated by publicly accessible whole genome and transcriptome projects. cisExpress uses a task farming algorithm to exploit all available computational cores within a shared memory node. We discovered that although co-expression modules are populated with different sets of genes, they share distinct stable patterns of co-regulation based on promoter sequence analysis. The number of motifs per co-expression cluster varies widely in accordance with cancer tissue of origin, with the largest number in colon (68 motifs) and the lowest in ovary (18 motifs). The top scored motifs are typically shared between several tissues; they define sets of target genes responsible for certain functionality of cancerogenesis. Both the co-expression modules and a database of precalculated motifs are publicly available and accessible for further studies.
Affiliation(s)
- Martin Triska
  - Spatial Sciences Institute, University of Southern California, Los Angeles, CA, USA
- Yuri Nikolsky
  - Prosapia Genetics, Solana Beach, CA, USA
  - School of Systems Biology, George Mason University, Fairfax, VA, USA
- Tatiana V Tatarinova
  - Spatial Sciences Institute, University of Southern California, Los Angeles, CA, USA
  - Center for Personalized Medicine, Children's Hospital Los Angeles, 4640 Hollywood Blvd, Los Angeles, CA, 90027, USA
  - A.A. Kharkevich Institute for Information Transmission Problems RAS, Moscow, Russia
11
Czeizler E, Hirvola T, Karhu K. A graph-theoretical approach for motif discovery in protein sequences. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 14:121-130. [PMID: 28055896 DOI: 10.1109/tcbb.2015.2511750] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/06/2023]
Abstract
Motif recognition is a challenging problem in bioinformatics due to the diversity of protein motifs. Many existing algorithms identify motifs of a given length, thus being either not applicable or not efficient when searching simultaneously for motifs of various lengths. Searching for gapped motifs, although very important, is a highly time-consuming task due to the combinatorial explosion of possible combinations implied by the consideration of long gaps. We introduce a new graph theoretical approach to identify motifs of various lengths, both with and without gaps. We compare our approach with two widely used methods: MEME and GLAM2 analyzing both the quality of the results and the required computational time. Our method provides results of a slightly higher level of quality than MEME but at a much faster rate, i.e., one eighth of MEME's query time. By using similarity indexing, we drop the query times down to an average of approximately one sixth of the ones required by GLAM2, while achieving a slightly higher level of quality of the results. More precisely, for sequence collections smaller than 50000 bytes GLAM2 is 13 times slower, while being at least as fast as our method on larger ones. The source code of our C++ implementation is freely available in GitHub: https://github.com/hirvolt1/debruijn-motif.
12
Ren C, Chen H, Yang B, Liu F, Ouyang Z, Bo X, Shu W. iFORM: Incorporating Find Occurrence of Regulatory Motifs. PLoS One 2016; 11:e0168607. [PMID: 27992540 PMCID: PMC5167396 DOI: 10.1371/journal.pone.0168607] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 08/03/2016] [Accepted: 12/02/2016] [Indexed: 11/18/2022]
Abstract
Accurately identifying the binding sites of transcription factors (TFs) is crucial to understanding the mechanisms of transcriptional regulation and human disease. We present incorporating Find Occurrence of Regulatory Motifs (iFORM), an easy-to-use and efficient tool for scanning DNA sequences with TF motifs described as position weight matrices (PWMs). Both performance assessment with a receiver operating characteristic (ROC) curve and a correlation-based approach demonstrated that iFORM achieves higher accuracy and sensitivity by integrating five classical motif discovery programs using Fisher’s combined probability test. We have used iFORM to provide accurate results on a variety of data in the ENCODE Project and the NIH Roadmap Epigenomics Project, and the tool has demonstrated its utility in further elucidating individual roles of functional elements. Both the source and binary codes for iFORM can be freely accessed at https://github.com/wenjiegroup/iFORM. The identified TF binding sites across human cell and tissue types using iFORM have been deposited in the Gene Expression Omnibus under the accession ID GSE53962.
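Fisher's combined probability test, which this abstract names as the integration step, can be sketched generically as follows. This is not iFORM's code: the function name is hypothetical, the p-values are assumed to be independent and to lie in (0, 1], and the chi-square tail is computed with the closed form that holds for even degrees of freedom so that only the standard library is needed.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis. For even degrees
    of freedom the upper tail has the closed form
        exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    """
    if not pvalues:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")

    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # equals X / 2

    # Accumulate sum_{i=0}^{k-1} (X/2)^i / i! term by term.
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term

    return min(1.0, math.exp(-half_stat) * total)


if __name__ == "__main__":
    # Hypothetical p-values reported for one candidate site by five scanners.
    print(fisher_combined_pvalue([0.04, 0.10, 0.03, 0.20, 0.08]))
```

Several moderately small p-values can combine into a much smaller one, which is the motivation for pooling evidence across tools; the independence assumption rarely holds exactly when the tools scan the same sequence, so the combined value is best read as a ranking score.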
Affiliation(s)
- Chao Ren
  - Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing, China
- Hebing Chen
  - Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing, China
- Bite Yang
  - Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing, China
- Feng Liu
  - Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing, China
- Zhangyi Ouyang
  - Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing, China
- Xiaochen Bo
  - Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing, China
- Wenjie Shu
  - Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing, China
13
Van de Velde J, Van Bel M, Vaneechoutte D, Vandepoele K. A Collection of Conserved Noncoding Sequences to Study Gene Regulation in Flowering Plants. Plant Physiology 2016; 171:2586-98. [PMID: 27261064 PMCID: PMC4972296 DOI: 10.1104/pp.16.00821] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 05/18/2016] [Accepted: 05/31/2016] [Indexed: 05/03/2023]
Abstract
Transcription factors (TFs) regulate gene expression by binding cis-regulatory elements, of which the identification remains an ongoing challenge owing to the prevalence of large numbers of nonfunctional TF binding sites. Powerful comparative genomics methods, such as phylogenetic footprinting, can be used for the detection of conserved noncoding sequences (CNSs), which are functionally constrained and can greatly help in reducing the number of false-positive elements. In this study, we applied a phylogenetic footprinting approach for the identification of CNSs in 10 dicot plants, yielding 1,032,291 CNSs associated with 243,187 genes. To annotate CNSs with TF binding sites, we made use of binding site information for 642 TFs originating from 35 TF families in Arabidopsis (Arabidopsis thaliana). In three species, the identified CNSs were evaluated using TF chromatin immunoprecipitation sequencing data, resulting in significant overlap for the majority of data sets. To identify ultraconserved CNSs, we included genomes of additional plant families and identified 715 binding sites for 501 genes conserved in dicots, monocots, mosses, and green algae. Additionally, we found that genes that are part of conserved mini-regulons have a higher coherence in their expression profile than other divergent gene pairs. All identified CNSs were integrated in the PLAZA 3.0 Dicots comparative genomics platform (http://bioinformatics.psb.ugent.be/plaza/versions/plaza_v3_dicots/) together with new functionalities facilitating the exploration of conserved cis-regulatory elements and their associated genes. The availability of this data set in a user-friendly platform enables the exploration of functional noncoding DNA to study gene regulation in a variety of plant species, including crops.
Affiliation(s)
- Jan Van de Velde
- Department of Plant Systems Biology, Vlaams Instituut voor Biotechnologie, B-9052 Ghent, Belgium (J.V.d.V., M.V.B., D.V., K.V.); and Department of Plant Biotechnology and Bioinformatics, Ghent University, B-9052 Ghent, Belgium (J.V.d.V., M.V.B., D.V., K.V.)
| | - Michiel Van Bel
- Department of Plant Systems Biology, Vlaams Instituut voor Biotechnologie, B-9052 Ghent, Belgium (J.V.d.V., M.V.B., D.V., K.V.); and Department of Plant Biotechnology and Bioinformatics, Ghent University, B-9052 Ghent, Belgium (J.V.d.V., M.V.B., D.V., K.V.)
| | - Dries Vaneechoutte
- Department of Plant Systems Biology, Vlaams Instituut voor Biotechnologie, B-9052 Ghent, Belgium (J.V.d.V., M.V.B., D.V., K.V.); and Department of Plant Biotechnology and Bioinformatics, Ghent University, B-9052 Ghent, Belgium (J.V.d.V., M.V.B., D.V., K.V.)
| | - Klaas Vandepoele
- Department of Plant Systems Biology, Vlaams Instituut voor Biotechnologie, B-9052 Ghent, Belgium (J.V.d.V., M.V.B., D.V., K.V.); and Department of Plant Biotechnology and Bioinformatics, Ghent University, B-9052 Ghent, Belgium (J.V.d.V., M.V.B., D.V., K.V.)
| |
|
14
|
Knox DA, Dowell RD. A Modeling Framework for Generation of Positional and Temporal Simulations of Transcriptional Regulation. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2016; 13:459-471. [PMID: 27295631 DOI: 10.1109/tcbb.2015.2459708] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/06/2023]
Abstract
We present a modeling framework aimed at capturing both the positional and temporal behavior of transcriptional regulatory proteins in eukaryotic cells. There is growing evidence that transcriptional regulation is the complex behavior that emerges not solely from the individual components, but rather from their collective behavior, including competition and cooperation. Our framework describes individual regulatory components using generic action oriented descriptions of their biochemical interactions with a DNA sequence. All the possible actions are based on the current state of factors bound to the DNA. We developed a rule builder to automatically generate the complete set of biochemical interaction rules for any given DNA sequence. Off-the-shelf stochastic simulation engines can model the behavior of a system of rules and the resulting changes in the configuration of bound factors can be visualized. We compared our model to experimental data at well-studied loci in yeast, confirming that our model captures both the positional and temporal behavior of transcriptional regulation.
Affiliation(s)
- David A Knox
- Computational Bioscience Program, University of Colorado, School of Medicine, Anschutz Medical Campus, Aurora, CO
| | - Robin D Dowell
- Molecular, Cellular, Developmental Biology Department, BioFrontiers Institute, University of Colorado, Boulder, CO
| |
|
15
|
Tapan S, Wang D. A Further Study on Mining DNA Motifs Using Fuzzy Self-Organizing Maps. IEEE Transactions on Neural Networks and Learning Systems 2016; 27:113-124. [PMID: 26068877 DOI: 10.1109/tnnls.2015.2435155] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/04/2023]
Abstract
Self-organizing map (SOM)-based motif mining, despite being a promising approach for problem solving, mostly fails to offer a consistent interpretation of clusters with respect to the mixed composition of signal and noise in the nodes. The main reason behind this shortcoming comes from the similarity metrics used in data assignment, specially designed with the biological interpretation for this domain, which are not meant to consider the inevitable noise mixture in the clusters. This limits the explicability of the majority of clusters that are supposedly noise dominated, degrading the overall system clarity in motif discovery. This paper aims to improve the explicability aspect of the learning process by introducing a composite similarity function (CSF) that is specially designed for the k-mer-to-cluster similarity measure with respect to the degree of motif properties and embedded noise in the cluster. Our proposed motif finding algorithm in this paper is built on our previous work robust elicitation algorithms for discovering (READ) [1] and termed READ Deoxyribonucleic acid motifs using CSFs (READ(csf)), which performs slightly better than READ and shows some remarkable improvements over SOM-based SOMBRERO and SOMEA tools in terms of F-measure on the testing data sets. A real data set containing multiple motifs is used to explore the potential of the READ(csf) for more challenging biological data mining tasks. Visual comparisons with the verified logos extracted from JASPAR database demonstrate that our algorithm is promising to discover multiple motifs simultaneously.
|
16
|
Pandey B, Sharma P, Tyagi C, Goyal S, Grover A, Sharma I. Structural modeling and molecular simulation analysis of HvAP2/EREBP from barley. J Biomol Struct Dyn 2015. [DOI: 10.1080/07391102.2015.1073630] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Indexed: 10/23/2022]
|
17
|
Lihu A, Holban T. A review of ensemble methods for de novo motif discovery in ChIP-Seq data. Brief Bioinform 2015; 16:964-73. [DOI: 10.1093/bib/bbv022] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 01/15/2015] [Indexed: 01/17/2023] Open
|
18
|
Orsi GA, Kasinathan S, Zentner GE, Henikoff S, Ahmad K. Mapping regulatory factors by immunoprecipitation from native chromatin. Current Protocols in Molecular Biology 2015; 110:21.31.1-21.31.25. [PMID: 25827087 DOI: 10.1002/0471142727.mb2131s110] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Indexed: 11/08/2022]
Abstract
Occupied Regions of Genomes from Affinity-purified Naturally Isolated Chromatin (ORGANIC) is a high-resolution method that can be used to quantitatively map protein-DNA interactions with high specificity and sensitivity. This method uses micrococcal nuclease (MNase) digestion of chromatin and low-salt solubilization to preserve protein-DNA complexes, followed by immunoprecipitation and paired-end sequencing for genome-wide mapping of binding sites. In this unit, we describe methods for isolation of nuclei and MNase digestion of unfixed chromatin, immunoprecipitation of protein-DNA complexes, and high-throughput sequencing to map sites of bound factors.
Affiliation(s)
- Guillermo A Orsi
- Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, Massachusetts; CNRS-UMR3664/Institut Curie-Centre de Recherche, Paris, France; these authors contributed equally to this work
| | - Sivakanthan Kasinathan
- Basic Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington; Medical Scientist Training Program, University of Washington School of Medicine, Seattle, Washington; these authors contributed equally to this work
| | - Gabriel E Zentner
- Basic Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington
| | - Steven Henikoff
- Basic Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington; Howard Hughes Medical Institute, Seattle, Washington
| | - Kami Ahmad
- Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, Massachusetts
| |
|
19
|
Suryamohan K, Halfon MS. Identifying transcriptional cis-regulatory modules in animal genomes. Wiley Interdisciplinary Reviews. Developmental Biology 2015; 4:59-84. [PMID: 25704908 PMCID: PMC4339228 DOI: 10.1002/wdev.168] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.2] [Received: 09/24/2014] [Revised: 11/04/2014] [Accepted: 11/16/2014] [Indexed: 11/08/2022]
Abstract
Gene expression is regulated through the activity of transcription factors (TFs) and chromatin-modifying proteins acting on specific DNA sequences, referred to as cis-regulatory elements. These include promoters, located at the transcription initiation sites of genes, and a variety of distal cis-regulatory modules (CRMs), the most common of which are transcriptional enhancers. Because regulated gene expression is fundamental to cell differentiation and acquisition of new cell fates, identifying, characterizing, and understanding the mechanisms of action of CRMs is critical for understanding development. CRM discovery has historically been challenging, as CRMs can be located far from the genes they regulate, have few readily identifiable sequence characteristics, and for many years were not amenable to high-throughput discovery methods. However, the recent availability of complete genome sequences and the development of next-generation sequencing methods have led to an explosion of both computational and empirical methods for CRM discovery in model and nonmodel organisms alike. Experimentally, CRMs can be identified through chromatin immunoprecipitation directed against TFs or histone post-translational modifications, identification of nucleosome-depleted 'open' chromatin regions, or sequencing-based high-throughput functional screening. Computational methods include comparative genomics, clustering of known or predicted TF-binding sites, and supervised machine-learning approaches trained on known CRMs. All of these methods have proven effective for CRM discovery, but each has its own considerations and limitations, and each is subject to a greater or lesser number of false-positive identifications. Experimental confirmation of predictions is essential, although shortcomings in current methods suggest that additional means of validation need to be developed.
Conflict of interest: The authors have declared no conflicts of interest for this article.
Affiliation(s)
- Kushal Suryamohan
- Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- NY State Center of Excellence in Bioinformatics and Life Sciences, Buffalo, NY 14203, USA
| | - Marc S. Halfon
- Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- Department of Biological Sciences, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- Department of Biomedical Informatics, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- NY State Center of Excellence in Bioinformatics and Life Sciences, Buffalo, NY 14203, USA
- Molecular and Cellular Biology Department and Program in Cancer Genetics, Roswell Park Cancer Institute, Buffalo, NY 14263, USA
| |
|
20
|
Wong MH, Sze-To HYA, Lo LYP, Chan TMC, Leung KS. Discovering Binding Cores in Protein-DNA Binding Using Association Rule Mining with Statistical Measures. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2015; 12:142-154. [PMID: 26357085 DOI: 10.1109/tcbb.2014.2343952] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Abstract
Understanding binding cores is of fundamental importance in deciphering Protein-DNA (TF-TFBS) binding and for the deep understanding of gene regulation. Traditionally, binding cores are identified in resolved high-resolution 3D structures. However, it is expensive, labor-intensive and time-consuming to obtain these structures. Hence, it is promising to discover binding cores computationally on a large scale. Previous studies successfully applied association rule mining to discover binding cores from TF-TFBS binding sequence data only. Despite the successful results, there are limitations such as the use of tight support and confidence thresholds, the distortion by statistical bias in counting pattern occurrences, and the lack of a unified scheme to rank TF-TFBS associated patterns. In this study, we proposed an association rule mining algorithm incorporating statistical measures and ranking to address these limitations. Experimental results demonstrated that, even when the threshold on support was lowered to one-tenth of the value used in previous studies, a satisfactory verification ratio was consistently observed under different confidence levels. Moreover, we proposed a novel ranking scheme for TF-TFBS associated patterns based on p-values and co-support values. By comparing with other discovery approaches, the effectiveness of our algorithm was demonstrated. Eighty-four binding cores with PDB support are uniquely identified.
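The support and confidence thresholds mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the function names `kmers` and `rule_stats`, the k-mer lengths, and the toy sequences are all hypothetical, and the statistical measures and ranking scheme described in the abstract are not reproduced. The sketch only counts how often a protein k-mer and a DNA k-mer co-occur across paired TF and TFBS sequences.

```python
from collections import Counter
from itertools import product


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rule_stats(pairs, k_tf=3, k_dna=4):
    """Compute (support, confidence) for rules 'TF k-mer -> TFBS k-mer'.

    pairs: list of (tf_protein_sequence, tfbs_dna_sequence) records.
    support    = fraction of records containing both k-mers
    confidence = records containing both / records containing the TF k-mer
    """
    n = len(pairs)
    tf_count = Counter()
    pair_count = Counter()
    for protein, dna in pairs:
        p_set = kmers(protein, k_tf)
        d_set = kmers(dna, k_dna)
        tf_count.update(p_set)
        pair_count.update(product(p_set, d_set))
    return {
        rule: (count / n, count / tf_count[rule[0]])
        for rule, count in pair_count.items()
    }


# Toy, made-up records used only to exercise the function.
toy_pairs = [("AKRC", "TTGACC"), ("QKRC", "TTGACT"), ("AKRM", "GGGCCC")]
stats = rule_stats(toy_pairs)
support, confidence = stats[("KRC", "TGAC")]
print(f"KRC -> TGAC: support={support:.2f}, confidence={confidence:.2f}")
```

In practice a rule would be kept only if both values exceed chosen thresholds; the abstract reports that the proposed method tolerates a support threshold one-tenth of that used in earlier studies.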
|
21
|
Beadell AV, Haag ES. Evolutionary Dynamics of GLD-1-mRNA complexes in Caenorhabditis nematodes. Genome Biol Evol 2014; 7:314-35. [PMID: 25502909 PMCID: PMC4316625 DOI: 10.1093/gbe/evu272] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Accepted: 12/04/2014] [Indexed: 12/17/2022] Open
Abstract
Given the large number of RNA-binding proteins and regulatory RNAs within genomes, posttranscriptional regulation may be an underappreciated aspect of cis-regulatory evolution. Here, we focus on nematode germ cells, which are known to rely heavily upon translational control to regulate meiosis and gametogenesis. GLD-1 belongs to the STAR-domain family of RNA-binding proteins, conserved throughout eukaryotes, and functions in Caenorhabditis elegans as a germline-specific translational repressor. A phylogenetic analysis across opisthokonts shows that GLD-1 is most closely related to Drosophila How and deuterostome Quaking, both implicated in alternative splicing. We identify messenger RNAs associated with C. briggsae GLD-1 on a genome-wide scale and provide evidence that many participate in aspects of germline development. By comparing our results with published C. elegans GLD-1 targets, we detect nearly 100 that are conserved between the two species. We also detected several hundred Cbr-GLD-1 targets whose homologs have not been reported to be associated with C. elegans GLD-1 in either of two independent studies. Low expression in C. elegans may explain the failure to detect most of them, but a highly expressed subset are strong candidates for Cbr-GLD-1-specific targets. We examine GLD-1-binding motifs among targets conserved in C. elegans and C. briggsae and find that most, but not all, display evidence of shared ancestral binding sites. Our work illustrates both the conservative and the dynamic character of evolution at the posttranslational level of gene regulation, even between congeners.
Affiliation(s)
- Alana V Beadell
- Program in Behavior, Evolution, Ecology, and Systematics, University of Maryland, College Park; present address: Department of Organismal Biology and Anatomy, University of Chicago, Chicago, IL
| | - Eric S Haag
- Program in Behavior, Evolution, Ecology, and Systematics, University of Maryland, College Park; Department of Biology, University of Maryland, College Park
| |
|
22
|
Van de Velde J, Heyndrickx KS, Vandepoele K. Inference of transcriptional networks in Arabidopsis through conserved noncoding sequence analysis. The Plant Cell 2014; 26:2729-45. [PMID: 24989046 PMCID: PMC4145110 DOI: 10.1105/tpc.114.127001] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.2] [Indexed: 05/05/2023]
Abstract
Transcriptional regulation plays an important role in establishing gene expression profiles during development or in response to (a)biotic stimuli. Transcription factor binding sites (TFBSs) are the functional elements that determine transcriptional activity, and the identification of individual TFBS in genome sequences is a major goal to inferring regulatory networks. We have developed a phylogenetic footprinting approach for the identification of conserved noncoding sequences (CNSs) across 12 dicot plants. Whereas both alignment and non-alignment-based techniques were applied to identify functional motifs in a multispecies context, our method accounts for incomplete motif conservation as well as high sequence divergence between related species. We identified 69,361 footprints associated with 17,895 genes. Through the integration of known TFBS obtained from the literature and experimental studies, we used the CNSs to compile a gene regulatory network in Arabidopsis thaliana containing 40,758 interactions, of which two-thirds act through binding events located in DNase I hypersensitive sites. This network shows significant enrichment toward in vivo targets of known regulators, and its overall quality was confirmed using five different biological validation metrics. Finally, through the integration of detailed expression and function information, we demonstrate how static CNSs can be converted into condition-dependent regulatory networks, offering opportunities for regulatory gene annotation.
Affiliation(s)
- Jan Van de Velde
- Department of Plant Systems Biology, VIB, B-9052 Ghent, Belgium; Department of Plant Biotechnology and Bioinformatics, Ghent University, B-9052 Ghent, Belgium
| | - Ken S Heyndrickx
- Department of Plant Systems Biology, VIB, B-9052 Ghent, Belgium; Department of Plant Biotechnology and Bioinformatics, Ghent University, B-9052 Ghent, Belgium
| | - Klaas Vandepoele
- Department of Plant Systems Biology, VIB, B-9052 Ghent, Belgium; Department of Plant Biotechnology and Bioinformatics, Ghent University, B-9052 Ghent, Belgium
| |
|
23
|
Affiliation(s)
- Fran Lewitter
- Bioinformatics and Research Computing, Whitehead Institute, Cambridge, Massachusetts, United States of America
- * E-mail:
| |
|
24
|
Affiliation(s)
- Joanne A. Fox
- Michael Smith Laboratories, University of British Columbia, Vancouver, British Columbia, Canada
- Department of Microbiology & Immunology, University of British Columbia, Vancouver, British Columbia, Canada
- * E-mail: (JAF); (BFFO)
| | - B. F. Francis Ouellette
- Ontario Institute for Cancer Research, Toronto, Canada
- Department of Cell and Systems Biology, University of Toronto, Toronto, Canada
- * E-mail: (JAF); (BFFO)
| |
|
25
|
Carvalho L. Bayesian centroid estimation for motif discovery. PLoS One 2013; 8:e80511. [PMID: 24324603 PMCID: PMC3855595 DOI: 10.1371/journal.pone.0080511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2013] [Accepted: 10/03/2013] [Indexed: 11/29/2022] Open
Abstract
Biological sequences may contain patterns that signal important biomolecular functions; a classical example is regulation of gene expression by transcription factors that bind to specific patterns in genomic promoter regions. In motif discovery we are given a set of sequences that share a common motif and aim to identify not only the motif composition, but also the binding sites in each sequence of the set. We propose a new centroid estimator that arises from a refined and meaningful loss function for binding site inference. We discuss the main advantages of centroid estimation for motif discovery, including computational convenience, and how its principled derivation offers further insights about the posterior distribution of binding site configurations. We also illustrate, using simulated and real datasets, that the centroid estimator can differ from the traditional maximum a posteriori or maximum likelihood estimators.
Affiliation(s)
- Luis Carvalho
- Department of Mathematics and Statistics, Boston University, Boston, Massachusetts, United States of America
| |
|
26
|
Soares MPM, Barchuk AR, Simões ACQ, Dos Santos Cristino A, de Paula Freitas FC, Canhos LL, Bitondi MMG. Genes involved in thoracic exoskeleton formation during the pupal-to-adult molt in a social insect model, Apis mellifera. BMC Genomics 2013; 14:576. [PMID: 23981317 PMCID: PMC3766229 DOI: 10.1186/1471-2164-14-576] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Received: 03/12/2013] [Accepted: 08/23/2013] [Indexed: 12/04/2022] Open
Abstract
Background: The insect exoskeleton provides shape, waterproofing, and locomotion via attached somatic muscles. The exoskeleton is renewed during molting, a process regulated by ecdysteroid hormones. The holometabolous pupa transforms into an adult during the imaginal molt, when the epidermis synthesizes the definitive exoskeleton that then differentiates progressively. An important issue in insect development concerns how the exoskeletal regions are constructed to provide their morphological, physiological and mechanical functions. We used whole-genome oligonucleotide microarrays to screen for genes involved in exoskeletal formation in the honeybee thoracic dorsum. Our analysis included three sampling times during the pupal-to-adult molt, i.e., before, during and after the ecdysteroid-induced apolysis that triggers synthesis of the adult exoskeleton. Results: Gene ontology annotation based on orthologous relationships with Drosophila melanogaster genes placed the honeybee differentially expressed genes (DEGs) into distinct categories of Biological Process and Molecular Function, depending on developmental time, revealing the functional elements required for adult exoskeleton formation. Of the 1,253 unique DEGs, 547 were upregulated in the thoracic dorsum after apolysis, suggesting induction by the ecdysteroid pulse. The upregulated gene set included 20 of the 47 cuticular protein (CP) genes that were previously identified in the honeybee genome, and three novel putative CP genes that do not belong to a known CP family. In situ hybridization showed that two of the novel genes were abundantly expressed in the epidermis during adult exoskeleton formation, strongly implicating them as genuine CP genes. Conserved sequence motifs identified the CP genes as members of the CPR, Tweedle, Apidermin, CPF, CPLCP1 and Analogous-to-Peritrophins families.
Furthermore, 28 of the 36 muscle-related DEGs were upregulated during the de novo formation of striated fibers attached to the exoskeleton. A search for cis-regulatory motifs in the 5′-untranslated region of the DEGs revealed potential binding sites for known transcription factors. Construction of a regulatory network showed that various upregulated CP- and muscle-related genes (15 and 21 genes, respectively) share common elements, suggesting co-regulation during thoracic exoskeleton formation. Conclusions: These findings help reveal molecular aspects of rigid thoracic exoskeleton formation during the ecdysteroid-coordinated pupal-to-adult molt in the honeybee.
Affiliation(s)
- Michelle Prioli Miranda Soares
- Departamento de Biologia, Faculdade de Filosofia, Ciências e Letras de Ribeirão Preto, Universidade de São Paulo, Ribeirão Preto, SP, Brasil.
| |
|
27
|
A temperature-responsive network links cell shape and virulence traits in a primary fungal pathogen. PLoS Biol 2013; 11:e1001614. [PMID: 23935449 PMCID: PMC3720256 DOI: 10.1371/journal.pbio.1001614] [Citation(s) in RCA: 86] [Impact Index Per Article: 7.8] [Received: 02/01/2013] [Accepted: 06/12/2013] [Indexed: 11/19/2022] Open
Abstract
Analysis of a transcriptional regulatory network in a fungal pathogen reveals that four interdependent transcription factors respond to human body temperature to trigger changes in cell shape and virulence gene expression. Survival at host temperature is a critical trait for pathogenic microbes of humans. Thermally dimorphic fungal pathogens, including Histoplasma capsulatum, are soil fungi that undergo dramatic changes in cell shape and virulence gene expression in response to host temperature. How these organisms link changes in temperature to both morphologic development and expression of virulence traits is unknown. Here we elucidate a temperature-responsive transcriptional network in H. capsulatum, which switches from a filamentous form in the environment to a pathogenic yeast form at body temperature. The circuit is driven by three highly conserved factors, Ryp1, Ryp2, and Ryp3, that are required for yeast-phase growth at 37°C. Ryp factors belong to distinct families of proteins that control developmental transitions in fungi: Ryp1 is a member of the WOPR family of transcription factors, and Ryp2 and Ryp3 are both members of the Velvet family of proteins whose molecular function is unknown. Here we provide the first evidence that these WOPR and Velvet proteins interact, and that Velvet proteins associate with DNA to drive gene expression. Using genome-wide chromatin immunoprecipitation studies, we determine that Ryp1, Ryp2, and Ryp3 associate with a large common set of genomic loci that includes known virulence genes, indicating that the Ryp factors directly control genes required for pathogenicity in addition to their role in regulating cell morphology. We further dissect the Ryp regulatory circuit by determining that a fourth transcription factor, which we name Ryp4, is required for yeast-phase growth and gene expression, associates with DNA, and displays interdependent regulation with Ryp1, Ryp2, and Ryp3. 
Finally, we define cis-acting motifs that recruit the Ryp factors to their interwoven network of temperature-responsive target genes. Taken together, our results reveal a positive feedback circuit that directs a broad transcriptional switch between environmental and pathogenic states in response to temperature. Microbial pathogens of humans display the ability to thrive at host temperature. So-called “thermally dimorphic” fungal pathogens, which include Histoplasma capsulatum, are a class of soil fungi that upon being inhaled into the human lung, undergo dramatic changes in cell shape and virulence gene expression in response to host temperature. The ability of these pathogens to cause disease is exquisitely coupled to temperature response. Here we elucidate the regulatory network that governs the ability of H. capsulatum to switch from a filamentous form in the soil environment to a pathogenic yeast form at body temperature. The circuit is driven by three transcription regulators (Ryp1, Ryp2, and Ryp3) that control yeast-phase growth. We show that these factors, which include two highly conserved proteins of the Velvet family of unknown function, bind to specific regulatory DNA elements and directly regulate expression of virulence genes. We identify and characterize Ryp4, a fourth regulator of this pathway, and define DNA motifs that recruit these transcription factors to their temperature-responsive target genes. Our results provide a molecular understanding of how changes in cell shape are linked to expression of virulence genes in thermally dimorphic fungi.
|
28
|
Wong KC, Chan TM, Peng C, Li Y, Zhang Z. DNA motif elucidation using belief propagation. Nucleic Acids Res 2013; 41:e153. [PMID: 23814189 PMCID: PMC3763557 DOI: 10.1093/nar/gkt574] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.2] [Indexed: 12/20/2022] Open
Abstract
Protein-binding microarray (PBM) is a high-throughput platform that can measure the DNA-binding preference of a protein in a comprehensive and unbiased manner. A typical PBM experiment can measure binding signal intensities of a protein to all the possible DNA k-mers (k = 8 to 10); such comprehensive binding affinity data usually need to be reduced and represented as motif models before they can be further analyzed and applied. Since proteins can often bind to DNA in multiple modes, one of the major challenges is to decompose the comprehensive affinity data into multimodal motif representations. Here, we describe a new algorithm that uses Hidden Markov Models (HMMs) and can derive precise and multimodal motifs using belief propagations. We describe an HMM-based approach using belief propagations (kmerHMM), which accepts and preprocesses PBM probe raw data into median-binding intensities of individual k-mers. The k-mers are ranked and aligned for training an HMM as the underlying motif representation. Multiple motifs are then extracted from the HMM using belief propagations. Comparisons of kmerHMM with other leading methods on several data sets demonstrated its effectiveness and uniqueness. Especially, it achieved the best performance on more than half of the data sets. In addition, the multiple binding modes derived by kmerHMM are biologically meaningful and will be useful in interpreting other genome-wide data such as those generated from ChIP-seq. The executables and source codes are available at the authors’ websites: e.g. http://www.cs.toronto.edu/~wkc/kmerHMM.
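The preprocessing step named in this abstract, reducing probe-level PBM signal to a median intensity per k-mer, might look roughly like the sketch below. This is not kmerHMM code: the function names, the toy probes, and the choice to merge each k-mer with its reverse complement are assumptions made for illustration, and the HMM training and belief-propagation steps are not shown.

```python
from collections import defaultdict
from statistics import median

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_median_intensity(probes, k=8):
    """Map each canonical k-mer to the median intensity of probes containing it.

    probes: iterable of (probe_sequence, signal_intensity).
    Assumption: a k-mer and its reverse complement are treated as one
    canonical k-mer (the lexicographically smaller of the two).
    """
    hits = defaultdict(list)
    for seq, intensity in probes:
        seen = set()  # count each canonical k-mer once per probe
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            canonical = min(kmer, reverse_complement(kmer))
            if canonical not in seen:
                seen.add(canonical)
                hits[canonical].append(intensity)
    return {kmer: median(values) for kmer, values in hits.items()}


# Toy, made-up probes with k=4 so the example stays readable.
toy_probes = [("ACGTA", 10.0), ("CGTAC", 30.0), ("ACGTT", 20.0)]
medians = kmer_median_intensity(toy_probes, k=4)
ranked = sorted(medians.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

Sorting by median intensity, as in the last lines, corresponds to the ranking of k-mers that the abstract says precedes alignment and HMM training.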
Affiliation(s)
- Ka-Chun Wong
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada, Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada, Department of Integrative Biology and Physiology, University of California Los Angeles, Los Angeles, CA, USA, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Jeddah, KSA, Banting and Best Department of Medical Research, University of Toronto, Toronto, Ontario, Canada and Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
| |
|
29
|
Chan TM, Lo LY, Sze-To HY, Leung KS, Xiao X, Wong MH. Modeling associated protein-DNA pattern discovery with unified scores. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2013; 10:696-707. [PMID: 24091402 DOI: 10.1109/tcbb.2013.60] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/02/2023]
Abstract
Understanding protein-DNA interactions, specifically transcription factor (TF) and transcription factor binding site (TFBS) bindings, is crucial in deciphering gene regulation. The recent associated TF-TFBS pattern discovery combines one-sided motif discovery on both the TF and the TFBS sides. Using sequences only, it identifies the short protein-DNA binding cores available only in high-resolution 3D structures. The discovered patterns lead to promising subtype and disease analysis applications. While the related studies use either association rule mining or existing TFBS annotations, none has proposed any formal unified (both-sided) model to prioritize the top verifiable associated patterns. We propose the unified scores and develop an effective pipeline for associated TF-TFBS pattern discovery. Our stringent instance-level evaluations show that the patterns with the top unified scores match with the binding cores in 3D structures considerably better than the previous works, where up to 90 percent of the top 20 scored patterns are verified. We also introduce extended verification from literature surveys, where the high unified scores correspond to even higher verification percentage. The top scored patterns are confirmed to match the known WRKY binding cores with no available 3D structures and agree well with the top binding affinities of in vivo experiments.
|
30
|
Elati M, Nicolle R, Junier I, Fernández D, Fekih R, Font J, Képès F. PreCisIon: PREdiction of CIS-regulatory elements improved by gene's positION. Nucleic Acids Res 2012; 41:1406-15. [PMID: 23241390 PMCID: PMC3561985 DOI: 10.1093/nar/gks1286] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 01/28/2023] Open
Abstract
Conventional approaches to predict transcriptional regulatory interactions usually rely on the definition of a shared motif sequence on the target genes of a transcription factor (TF). These efforts have been frustrated by the limited availability and accuracy of TF binding site motifs, usually represented as position-specific scoring matrices, which may match large numbers of sites and produce an unreliable list of target genes. To improve the prediction of binding sites, we propose to additionally use the unrelated knowledge of the genome layout. Indeed, it has been shown that co-regulated genes tend to be either neighbors or periodically spaced along the whole chromosome. This study demonstrates that respective gene positioning carries significant information. This novel type of information is combined with traditional sequence information by a machine learning algorithm called PreCisIon. To optimize this combination, PreCisIon builds a strong gene target classifier by adaptively combining weak classifiers based on either local binding sequence or global gene position. This strategy generically paves the way to the optimized incorporation of any future advances in gene target prediction based on local sequence, genome layout or on novel criteria. With the current state of the art, PreCisIon consistently improves methods based on sequence information only. This is shown by implementing a cross-validation analysis of the 20 major TFs from two phylogenetically remote model organisms. For Bacillus subtilis and Escherichia coli, respectively, PreCisIon achieves on average an area under the receiver operating characteristic curve of 70 and 60%, a sensitivity of 80 and 70% and a specificity of 60 and 56%. The newly predicted gene targets are demonstrated to be functionally consistent with previously known targets, as assessed by analysis of Gene Ontology enrichment or of the relevant literature and databases.
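The idea of adaptively combining weak classifiers can be illustrated with a small AdaBoost-style sketch. This is not the PreCisIon implementation: the abstract does not state which boosting variant is used, and the two features (a sequence score and a position score), the toy data and all function names below are assumptions made for illustration.

```python
import math

# Hypothetical toy data: each sample is (sequence_score, position_score);
# label +1 = target gene, -1 = non-target. Values are invented for illustration.
samples = [(0.9, 0.2), (0.2, 0.9), (0.2, 0.2), (0.3, 0.1)]
labels = [1, 1, -1, -1]

def stump_predict(sample, feature, threshold, polarity):
    """Weak classifier: threshold a single feature."""
    return polarity if sample[feature] >= threshold else -polarity

def best_stump(samples, labels, weights):
    """Return (error, feature, threshold, polarity) with the lowest weighted error."""
    best = None
    for feature in range(len(samples[0])):
        for threshold in sorted({s[feature] for s in samples}):
            for polarity in (1, -1):
                error = sum(
                    w for w, s, y in zip(weights, samples, labels)
                    if stump_predict(s, feature, threshold, polarity) != y
                )
                if best is None or error < best[0]:
                    best = (error, feature, threshold, polarity)
    return best

def adaboost(samples, labels, rounds):
    """Combine stumps, re-weighting misclassified samples after each round."""
    n = len(samples)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, feature, threshold, polarity = best_stump(samples, labels, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, feature, threshold, polarity))
        updated = [
            w * math.exp(-alpha * y * stump_predict(s, feature, threshold, polarity))
            for w, s, y in zip(weights, samples, labels)
        ]
        total = sum(updated)
        weights = [w / total for w in updated]
    return ensemble

def predict(ensemble, sample):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(sample, f, t, p) for a, f, t, p in ensemble)
    return 1 if score >= 0 else -1

ensemble = adaboost(samples, labels, rounds=3)
predictions = [predict(ensemble, s) for s in samples]
```

On this toy data no single threshold on either feature separates the two classes, whereas three boosted stumps classify all four samples correctly, which is the sense in which weak sequence-based and position-based classifiers can complement each other.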
Affiliation(s)
- Mohamed Elati
- Institute of Systems and Synthetic Biology, CNRS, University of Evry, Genopole, 91030 Evry, France.
|
31
|
Deciphering the transcriptional cis-regulatory code. Trends Genet 2012; 29:11-22. [PMID: 23102583 DOI: 10.1016/j.tig.2012.09.007] [Citation(s) in RCA: 85] [Impact Index Per Article: 7.1] [Received: 07/09/2012] [Revised: 09/24/2012] [Accepted: 09/25/2012] [Indexed: 02/07/2023]
Abstract
Information about developmental gene expression resides in defined regulatory elements, called enhancers, in the non-coding part of the genome. Although cells reliably utilize enhancers to orchestrate gene expression, a cis-regulatory code that would allow their interpretation has remained one of the greatest challenges of modern biology. In this review, we summarize studies from the past three decades that describe progress towards revealing the properties of enhancers and discuss how recent approaches are providing unprecedented insights into regulatory elements in animal genomes. Over the next years, we believe that the functional characterization of regulatory sequences in entire genomes, combined with recent computational methods, will provide a comprehensive view of genomic regulatory elements and their building blocks and will enable researchers to begin to understand the sequence basis of the cis-regulatory code.
|
32
|
Katara P, Grover A, Sharma V. Phylogenetic footprinting: a boost for microbial regulatory genomics. PROTOPLASMA 2012; 249:901-907. [PMID: 22113593 DOI: 10.1007/s00709-011-0351-9] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Received: 07/27/2011] [Accepted: 11/09/2011] [Indexed: 05/31/2023]
Abstract
Phylogenetic footprinting is a method for the discovery of regulatory elements in a set of homologous regulatory regions, usually collected from multiple species. It does so by identifying the best conserved motifs in those homologous regions. Two popular sets of methods, alignment-based and motif-based, are generally employed for phylogenetic footprinting. However, there has been little serious effort to develop a tool exclusively for phylogenetic footprinting based on either of these methods. Nevertheless, a number of software tools exist that can be applied to phylogenetic footprinting with variable degrees of success. The output from these tools may be affected by a number of factors associated with the current state of knowledge, techniques and other resources available. We here present a critical appraisal of various phylogenetic footprinting approaches with reference to prokaryotes, outlining the available resources and discussing the factors that affect footprinting, in order to clarify the proper use of this approach in prokaryotes.
Affiliation(s)
- Pramod Katara
- Department of Bioscience and Biotechnology, Banasthali University, Banasthali, 304022, India.
|
33
|
Bi C. Memetic algorithms for de novo motif-finding in biomedical sequences. Artif Intell Med 2012; 56:1-17. [DOI: 10.1016/j.artmed.2012.04.002] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 08/22/2011] [Revised: 04/03/2012] [Accepted: 04/10/2012] [Indexed: 11/26/2022]
|
34
|
Chan TM, Leung KS, Lee KH, Wong MH, Lau TCK, Tsui SKW. Subtypes of associated protein-DNA (Transcription Factor-Transcription Factor Binding Site) patterns. Nucleic Acids Res 2012; 40:9392-403. [PMID: 22904079 PMCID: PMC3479201 DOI: 10.1093/nar/gks749] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 12/03/2022] Open
Abstract
In protein–DNA interactions, particularly transcription factor (TF) and transcription factor binding site (TFBS) bindings, associated residue variations form patterns denoted as subtypes. Subtypes may lead to changed binding preferences, distinguish conserved from flexible binding residues and reveal novel binding mechanisms. However, subtypes must be studied in the context of core bindings. While solving 3D structures would require huge experimental efforts, recent sequence-based associated TF-TFBS pattern discovery has been shown to be promising, making a large-scale subtype study both possible and desirable. In this article, we investigate residue-varying subtypes based on associated TF-TFBS patterns. By re-categorizing the patterns with respect to varying TF amino acids, statistically significant (P values ≤ 0.005) subtypes leading to varying TFBS patterns are discovered without using TF family or domain annotations. The resulting subtypes have various biological meanings. The subtypes reflect familial and functional properties and exhibit changed binding preferences supported by 3D structures. Conserved residues critical for maintaining TF-TFBS bindings are revealed by analyzing the subtypes. In-depth analysis of the subtype pair PKVVIL-CACGTG versus PKVEIL-CAGCTG shows that the V/E variation is indicative for distinguishing Myc from MRF families. Discovered from sequences only, the TF-TFBS subtypes are informative and promising for more biological findings, complementing and extending recent one-sided subtype and familial studies with comprehensive evidence.
Affiliation(s)
- Tak-Ming Chan
- Department of Computer Science & Engineering, The Chinese University of Hong Kong, Shatin, N T, Hong Kong.
|
35
|
Wang S, Yin Y, Ma Q, Tang X, Hao D, Xu Y. Genome-scale identification of cell-wall related genes in Arabidopsis based on co-expression network analysis. BMC PLANT BIOLOGY 2012; 12:138. [PMID: 22877077 PMCID: PMC3463447 DOI: 10.1186/1471-2229-12-138] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.3] [Received: 05/14/2012] [Accepted: 07/30/2012] [Indexed: 05/21/2023]
Abstract
BACKGROUND: Identification of the novel genes relevant to plant cell-wall (PCW) synthesis represents a highly important and challenging problem. Although substantial efforts have been invested into studying this problem, the vast majority of the PCW related genes remain unknown. RESULTS: Here we present a computational study focused on identification of the novel PCW genes in Arabidopsis based on the co-expression analyses of transcriptomic data collected under 351 conditions, using a bi-clustering technique. Our analysis identified 217 highly co-expressed gene clusters (modules) under some experimental conditions, each containing at least one gene annotated as PCW related according to the Purdue Cell Wall Gene Families database. These co-expression modules cover 349 known/annotated PCW genes and 2,438 new candidates. For each candidate gene, we annotated the specific PCW synthesis stages in which it is involved and predicted the detailed function. In addition, for the co-expressed genes in each module, we predicted and analyzed their cis regulatory motifs in the promoters using our motif discovery pipeline, providing strong evidence that the genes in each co-expression module are transcriptionally co-regulated. From all the co-expression modules, we infer that 108 modules are related to four major PCW synthesis components, using three complementary methods. CONCLUSIONS: We believe our approach and data presented here will be useful for further identification and characterization of PCW genes. All the predicted PCW genes, co-expression modules, motifs and their annotations are available at a web-based database: http://csbl.bmb.uga.edu/publications/materials/shanwang/CWRPdb/index.html.
Affiliation(s)
- Shan Wang
- Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, Athens, GA, USA
- Key Lab for Molecular Enzymology and Engineering of the Ministry of Education, Jilin University, Changchun, China
- Biotechnology Research Centre, Jilin Academy of Agricultural Sciences (JAAS), Changchun, China
| | - Yanbin Yin
- Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, Athens, GA, USA
- BESC BioEnergy Science Center, University of Georgia, Athens, GA, USA
| | - Qin Ma
- Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, Athens, GA, USA
- BESC BioEnergy Science Center, University of Georgia, Athens, GA, USA
| | - Xiaojia Tang
- Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, Athens, GA, USA
| | - Dongyun Hao
- Key Lab for Molecular Enzymology and Engineering of the Ministry of Education, Jilin University, Changchun, China
- Biotechnology Research Centre, Jilin Academy of Agricultural Sciences (JAAS), Changchun, China
| | - Ying Xu
- Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, Athens, GA, USA
- BESC BioEnergy Science Center, University of Georgia, Athens, GA, USA
- College of Computer Science and Technology, Jilin University, Changchun, China
|
36
|
Guo Y, Mahony S, Gifford DK. High resolution genome wide binding event finding and motif discovery reveals transcription factor spatial binding constraints. PLoS Comput Biol 2012; 8:e1002638. [PMID: 22912568 PMCID: PMC3415389 DOI: 10.1371/journal.pcbi.1002638] [Citation(s) in RCA: 188] [Impact Index Per Article: 15.7] [Received: 03/19/2012] [Accepted: 06/15/2012] [Indexed: 12/27/2022] Open
Abstract
An essential component of genome function is the syntax of genomic regulatory elements that determine how diverse transcription factors interact to orchestrate a program of regulatory control. A precise characterization of in vivo spacing constraints between key transcription factors would reveal key aspects of this genomic regulatory language. To discover novel transcription factor spatial binding constraints in vivo, we developed a new integrative computational method, genome wide event finding and motif discovery (GEM). GEM resolves ChIP data into explanatory motifs and binding events at high spatial resolution by linking binding event discovery and motif discovery with positional priors in the context of a generative probabilistic model of ChIP data and genome sequence. GEM analysis of 63 transcription factors in 214 ENCODE human ChIP-Seq experiments recovers more known factor motifs than other contemporary methods, and discovers six new motifs for factors with unknown binding specificity. GEM's adaptive learning of binding-event read distributions allows it to further improve upon previous methods for processing ChIP-Seq and ChIP-exo data to yield unsurpassed spatial resolution and discovery of closely spaced binding events of the same factor. In a systematic analysis of in vivo sequence-specific transcription factor binding using GEM, we have found hundreds of spatial binding constraints between factors. GEM found 37 examples of factor binding constraints in mouse ES cells, including strong distance-specific constraints between Klf4 and other key regulatory factors. In human ENCODE data, GEM found 390 examples of spatially constrained pair-wise binding, including such novel pairs as c-Fos:c-Jun/USF1, CTCF/Egr1, and HNF4A/FOXA1. 
The discovery of new factor-factor spatial constraints in ChIP data is significant because it proposes testable models for regulatory factor interactions that will help elucidate genome function and the implementation of combinatorial control. The letters in our genome spell words and phrases that control when each gene is activated. To understand how these words and phrases function in health and disease, we have developed a new computational method to determine what word positions in our genomic text are used by each genome regulatory protein, and how these active words are spaced relative to one another. Our method achieves exceptional spatial accuracy by integrating experimental data with the text of our genome to find the precise words that are regulated by each protein factor. Using this analysis we have discovered novel word spacings in the experimental data that suggest novel genome grammatical control constructs.
Affiliation(s)
- Yuchun Guo
- Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
| | - Shaun Mahony
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
| | - David K. Gifford
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
|
37
|
Mahdevar G, Sadeghi M, Nowzari-Dalini A. Transcription factor binding sites detection by using alignment-based approach. J Theor Biol 2012; 304:96-102. [PMID: 22504445 DOI: 10.1016/j.jtbi.2012.03.039] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 09/06/2011] [Revised: 03/27/2012] [Accepted: 03/29/2012] [Indexed: 11/25/2022]
Abstract
Gene expression is the main cause of the existence of various phenotypes. Through this process, the information stored in DNA gives rise to the phenotype. Essentially, gene expression depends on the successful binding of transcription factors (TFs), a specific type of protein, to specific positions upstream of a gene, the TF binding sites (TFBSs). Unfortunately, finding these TFBSs experimentally is costly and laborious; therefore, discovering TFBSs computationally is a significant problem that many researchers endeavor to solve. In this paper, a new TFBS discovery method is presented that takes known biological facts about TFBSs into account. The input to this method comprises sequences of arbitrary length, and the output comprises positions that are likely to be TFBSs. By comparing this method with previous methods on biological and simulated datasets, it is shown that this method achieves higher accuracy in discovering TFBSs.
Affiliation(s)
- Ghasem Mahdevar
- Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran.
|
38
|
Pirino D, Rigosa J, Ledda A, Ferretti L. Detecting correlations among functional-sequence motifs. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2012; 85:066124. [PMID: 23005179 DOI: 10.1103/physreve.85.066124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2012] [Indexed: 06/01/2023]
Abstract
Sequence motifs are words of nucleotides in DNA with biological functions, e.g., gene regulation. Identification of such words proceeds through rejection of Markov models on the expected motif frequency along the genome. Additional biological information can be extracted from the correlation structure among patterns of motif occurrences. In this paper a log-linear multivariate intensity Poisson model is estimated via expectation maximization on a set of motifs along the genome of E. coli K12. The proposed approach allows for excitatory as well as inhibitory interactions among motifs and between motifs and other genomic features like gene occurrences. Our findings confirm previous stylized facts about such types of interactions and shed new light on genome-maintenance functions of some particular motifs. We expect these methods to be applicable to a wider set of genomic features.
|
39
|
Swimming upstream: identifying proteomic signals that drive transcriptional changes using the interactome and multiple "-omics" datasets. Methods Cell Biol 2012; 110:57-80. [PMID: 22482945 PMCID: PMC3870464 DOI: 10.1016/b978-0-12-388403-9.00003-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 01/23/2023]
Abstract
Signaling and transcription are tightly integrated processes that underlie many cellular responses to the environment. A network of signaling events, often mediated by post-translational modification on proteins, can lead to long-term changes in cellular behavior by altering the activity of specific transcriptional regulators and consequently the expression level of their downstream targets. As many high-throughput, "-omics" methods are now available that can simultaneously measure changes in hundreds of proteins and thousands of transcripts, it should be possible to systematically reconstruct cellular responses to perturbations in order to discover previously unrecognized signaling pathways. This chapter describes a computational method for discovering such pathways that aims to compensate for the varying levels of noise present in these diverse data sources. Based on the concept of constraint optimization on networks, the method seeks to achieve two conflicting aims: (1) to link together many of the signaling proteins and differentially expressed transcripts identified in the experiments (the "constraints") using previously reported protein-protein and protein-DNA interactions, while (2) keeping the resulting network small and ensuring it is composed of the highest confidence interactions (the "optimization"). A further distinctive feature of this approach is the use of transcriptional data as evidence of upstream signaling events that drive changes in gene expression, rather than as proxies for downstream changes in the levels of the encoded proteins. We recently demonstrated that by applying this method to phosphoproteomic and transcriptional data from the pheromone response in yeast, we were able to recover functionally coherent pathways and to reveal many components of the cellular response that are not readily apparent in the original data.
Here, we provide a more detailed description of the method, explore the robustness of the solution to the noise level of input data and discuss the effect of parameter values.
|
40
|
Harris EY, Ponts N, Le Roch KG, Lonardi S. Chromatin-driven de novo discovery of DNA binding motifs in the human malaria parasite. BMC Genomics 2011; 12:601. [PMID: 22165844 PMCID: PMC3282892 DOI: 10.1186/1471-2164-12-601] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 06/30/2011] [Accepted: 12/13/2011] [Indexed: 11/10/2022] Open
Abstract
Background: Despite extensive efforts to discover transcription factors and their binding sites in the human malaria parasite Plasmodium falciparum, only a few transcription factor binding motifs have been experimentally validated to date. As a consequence, gene regulation in P. falciparum is still poorly understood. There is now evidence that the chromatin architecture plays an important role in transcriptional control in malaria. Results: We propose a methodology for discovering cis-regulatory elements that uses for the first time exclusively dynamic chromatin remodeling data. Our method employs nucleosome positioning data collected at seven time points during the erythrocytic cycle of P. falciparum to discover putative DNA binding motifs and their transcription factor binding sites along with their associated clusters of target genes. Our approach results in 129 putative binding motifs within the promoter region of known genes. About 75% of those are novel, the remaining being highly similar to experimentally validated binding motifs. About half of the binding motifs reported show statistically significant enrichment in functional gene sets and strong positional bias in the promoter region. Conclusion: Experimental results establish the principle that dynamic chromatin remodeling data can be used in lieu of gene expression data to discover binding motifs and their transcription factor binding sites. Our approach can be applied using only dynamic nucleosome positioning data, independent from any knowledge of gene function or expression.
Affiliation(s)
- Elena Y Harris
- Department of Cell Biology and Neuroscience, University of California, Riverside, CA 92521, USA
|
41
|
Bi C. SEAM: a stochastic EM-type algorithm for motif-finding in biopolymer sequences. J Bioinform Comput Biol 2011; 5:47-77. [PMID: 17477491 DOI: 10.1142/s0219720007002527] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Received: 06/09/2006] [Revised: 08/22/2006] [Accepted: 10/14/2006] [Indexed: 12/21/2022]
Abstract
Position weight matrix-based statistical modeling for the identification and characterization of motif sites in a set of unaligned biopolymer sequences is presented. This paper describes and implements a new algorithm, the Stochastic EM-type Algorithm for Motif-finding (SEAM), and redesigns and implements the EM-based motif-finding algorithm called deterministic EM (DEM) for comparison with SEAM, its stochastic counterpart. The gold standard example, cyclic adenosine monophosphate receptor protein (CRP) binding sequences, together with other biological sequences, is used to illustrate the performance of the new algorithm and compare it with other popular motif-finding programs. The convergence of the new algorithm is shown by simulation. The in silico experiments using simulated and biological examples illustrate the power and robustness of the new algorithm SEAM in de novo motif discovery.
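As a rough illustration of position weight matrix modeling in general, the sketch below builds a log-odds matrix from a handful of aligned sites and scans a sequence for its best-scoring window. It is not the SEAM or DEM algorithm; the example sites, the uniform background, the pseudocount and the function names are all assumptions chosen for illustration.

```python
import math

def build_pwm(sites, alphabet="ACGT", pseudocount=1.0):
    """Return one dict per motif column mapping base -> log2-odds score.

    Assumes equal-length aligned sites and a uniform background (an
    illustrative simplification).
    """
    background = 1.0 / len(alphabet)
    pwm = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in alphabet}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in alphabet})
    return pwm

def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    best = None
    for start in range(len(sequence) - width + 1):
        score = sum(pwm[i][sequence[start + i]] for i in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best

# Hypothetical aligned sites, invented for this example.
sites = ["TGTGA", "TGTGA", "TGCGA", "TGTGA"]
pwm = build_pwm(sites)
score, start = best_hit(pwm, "AACCTGTGACC")
```

Here `start` is the 0-based offset of the window that best matches the matrix; a positive `score` means the window looks more like the motif than like the uniform background.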
Affiliation(s)
- Chengpeng Bi
- Children's Mercy Hospitals and Clinics, 2401 Gillham Road, Pediatrics Research Building, Third Floor, Kansas City, Missouri 64108, USA.
|
42
|
Abstract
MEME and many other popular motif finders use the expectation-maximization (EM) algorithm to optimize their parameters. Unfortunately, the running time of EM is linear in the length of the input sequences. This can prohibit its application to data sets of the size commonly generated by high-throughput biological techniques. A suffix tree is a data structure that can efficiently index a set of sequences. We describe an algorithm, Suffix Tree EM for Motif Elicitation (STEME), that approximates EM using suffix trees. To the best of our knowledge, this is the first application of suffix trees to EM. We provide an analysis of the expected running time of the algorithm and demonstrate that STEME runs an order of magnitude more quickly than the implementation of EM used by MEME. We give theoretical bounds for the quality of the approximation and show that, in practice, the approximation has a negligible effect on the outcome. We provide an open source implementation of the algorithm that we hope will be used to speed up existing and future motif search algorithms.
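To make the cost argument concrete, here is a minimal sketch of one plain EM iteration for a motif model that assumes exactly one occurrence per sequence. It is neither MEME's implementation nor the STEME suffix-tree approximation; the model simplifications, the pseudocount, the toy sequences and the function name are assumptions. The inner loop visits every window of every sequence, which is where the linear dependence on input length comes from.

```python
def em_iteration(sequences, pwm, background=0.25, pseudocount=0.1):
    """One EM iteration; pwm is a list of dicts mapping base -> probability."""
    width = len(pwm)
    alphabet = "ACGT"
    counts = [{b: pseudocount for b in alphabet} for _ in range(width)]
    posteriors = []
    for seq in sequences:
        # E-step: motif-vs-background likelihood ratio for every window.
        ratios = []
        for start in range(len(seq) - width + 1):
            ratio = 1.0
            for i in range(width):
                ratio *= pwm[i][seq[start + i]] / background
            ratios.append(ratio)
        total = sum(ratios)
        z = [r / total for r in ratios]  # posterior over start positions
        posteriors.append(z)
        # M-step accumulation: expected base counts per motif column.
        for start, p in enumerate(z):
            for i in range(width):
                counts[i][seq[start + i]] += p
    new_pwm = []
    for column in counts:
        column_total = sum(column.values())
        new_pwm.append({b: column[b] / column_total for b in alphabet})
    return new_pwm, posteriors

# Hypothetical toy input: the motif ACGT is planted once in each sequence.
sequences = ["TTACGTTT", "GGGACGTG", "ACGTCCCC"]
seed = "ACGT"
pwm = [{b: (0.7 if b == s else 0.1) for b in "ACGT"} for s in seed]
for _ in range(5):
    pwm, posteriors = em_iteration(sequences, pwm)
best_starts = [max(range(len(z)), key=z.__getitem__) for z in posteriors]
```

Each iteration touches all windows of all sequences, so any speed-up has to come from avoiding or grouping those window evaluations, which is the role the abstract assigns to the suffix tree.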
Affiliation(s)
- John E Reid
- MRC Biostatistics Unit, Institute of Public Health, Forvie Site, Robinson Way, Cambridge CB2 0SR, UK.
|
43
|
Zhang S, Li S, Niu M, Pham PT, Su Z. MotifClick: prediction of cis-regulatory binding sites via merging cliques. BMC Bioinformatics 2011; 12:238. [PMID: 21679436 PMCID: PMC3225181 DOI: 10.1186/1471-2105-12-238] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 12/23/2010] [Accepted: 06/16/2011] [Indexed: 11/21/2022] Open
Abstract
Background: Although dozens of algorithms and tools have been developed to find a set of cis-regulatory binding sites called a motif in a set of intergenic sequences using various approaches, most of these tools focus on identifying binding sites that are significantly different from their background sequences. However, some motifs may have a similar nucleotide distribution to that of their background sequences. Therefore, such binding sites can be missed by these tools. Results: Here, we present a graph-based polynomial-time algorithm, MotifClick, for the prediction of cis-regulatory binding sites, in particular, those that have a similar nucleotide distribution to that of their background sequences. To find binding sites with length k, we construct a graph using some 2(k-1)-mers in the input sequences as the vertices, and connect two vertices by an edge if the maximum number of matches of the local gapless alignments between the two 2(k-1)-mers is greater than a cutoff value. We identify a motif as a set of similar k-mers from a merged group of maximum cliques associated with some vertices. Conclusions: When evaluated on both synthetic and real datasets of prokaryotes and eukaryotes, MotifClick outperforms existing leading motif-finding tools for prediction accuracy and balancing the prediction sensitivity and specificity in general. In particular, when the distribution of nucleotides of binding sites is similar to that of their background sequences, MotifClick is more likely to identify the binding sites than the other tools.
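The graph construction step described in this abstract can be sketched roughly as follows. This is a simplified reading, not the MotifClick code: it takes every 2(k-1)-mer as a vertex (the abstract says only "some"), interprets "local gapless alignment" as sliding one string over the other without gaps, and connects only windows from different sequences. All three choices, and the function names, are assumptions. Clique finding and merging are not shown.

```python
from itertools import combinations

def max_gapless_matches(a, b):
    """Largest number of identical positions over all ungapped offsets of b against a."""
    best = 0
    for offset in range(-(len(b) - 1), len(a)):
        matches = 0
        for j, ch in enumerate(b):
            i = offset + j
            if 0 <= i < len(a) and a[i] == ch:
                matches += 1
        best = max(best, matches)
    return best

def build_graph(sequences, k, cutoff):
    """Vertices are (sequence index, start, 2(k-1)-mer); edges join similar windows."""
    span = 2 * (k - 1)
    vertices = []
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq) - span + 1):
            vertices.append((seq_id, start, seq[start:start + span]))
    edges = set()
    for (u, a), (v, b) in combinations(enumerate(vertices), 2):
        # Assumption: skip pairs from the same sequence to avoid trivial overlaps.
        if a[0] != b[0] and max_gapless_matches(a[2], b[2]) > cutoff:
            edges.add((u, v))
    return vertices, edges

# Hypothetical toy input: both sequences contain ACGT.
vertices, edges = build_graph(["AAACGTAA", "CCACGTCC"], k=3, cutoff=3)
```

With `k=3` the windows are 4-mers, and a cutoff of 3 keeps only pairs that match at all four positions, so the single edge links the two ACGT windows.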
Affiliation(s)
- Shaoqiang Zhang
- Department of Bioinformatics and Genomics, Center for Bioinformatics Research, the University of North Carolina at Charlotte, 28223, USA
|
44
|
Bais AS, Kaminski N, Benos PV. Finding subtypes of transcription factor motif pairs with distinct regulatory roles. Nucleic Acids Res 2011; 39:e76. [PMID: 21486752 PMCID: PMC3113591 DOI: 10.1093/nar/gkr205] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Indexed: 01/27/2023] Open
Abstract
DNA sequences bound by a transcription factor (TF) are presumed to contain sequence elements that reflect its DNA binding preferences and its downstream-regulatory effects. Experimentally identified TF binding sites (TFBSs) are usually similar enough to be summarized by a ‘consensus’ motif, representative of the TF DNA binding specificity. Studies have shown that groups of nucleotide TFBS variants (subtypes) can contribute to distinct modes of downstream regulation by the TF via differential recruitment of cofactors. For example, a TF A may bind to TFBS subtypes a1 or a2 depending on whether it associates with cofactors TF B or TF C, respectively. While some approaches can discover motif pairs (dyads), none address the problem of identifying ‘variants’ of dyads. TFs are key components of multiple regulatory pathways targeting different sets of genes, perhaps with different binding preferences. Identifying the discriminating TF–DNA associations that lead to the differential downstream regulation is thus essential. We present DiSCo (Discovery of Subtypes and Cofactors), a novel approach for identifying variants of dyad motifs (and their respective target sequence sets) that are instrumental for differential downstream regulation. Using both simulated and experimental datasets, we demonstrate how current motif discovery can be successfully leveraged to address this question.
Affiliation(s)
- Abha Singh Bais
- Department of Computational and Systems Biology, Dorothy P. and Richard P. Simmons Center for Interstitial Lung Disease, Division of Pulmonary, Allergy and Critical Care Medicine and Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15260, USA
|
45
|
Wong KC, Peng C, Wong MH, Leung KS. Generalizing and learning protein-DNA binding sequence representations by an evolutionary algorithm. Soft comput 2011. [DOI: 10.1007/s00500-011-0692-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 10/18/2022]
|
46
|
Sánchez-Cabo F, Rainer J, Dopazo A, Trajanoski Z, Hackl H. Insights into global mechanisms and disease by gene expression profiling. Methods Mol Biol 2011; 719:269-98. [PMID: 21370089 DOI: 10.1007/978-1-61779-027-0_13] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 01/28/2023]
Abstract
Transcriptomics has played an essential role as proof of concept in the development of experimental and bioinformatics approaches for the generation and analysis of Omics data. We give an introduction to how large-scale technologies for gene expression profiling, especially microarrays, have changed the view from studying single molecular events to a systems-level view of global mechanisms in a cell, the biological processes, and their pathological mutations. The main platforms available for gene expression profiling (from microarrays to RNA-seq) are presented, and the general concepts that need to be taken into account for proper data analysis, in order to extract objective and general conclusions from transcriptomics experiments, are introduced. We also describe the main bioinformatics resources available for this purpose.
Affiliation(s)
- Fátima Sánchez-Cabo
- Genomics Unit, Centro Nacional de Investigaciones Cardiovasculares, Madrid, Spain
|
47
|
Oshchepkov DY, Levitsky VG. In silico prediction of transcriptional factor-binding sites. Methods Mol Biol 2011; 760:251-67. [PMID: 21780002 DOI: 10.1007/978-1-61779-176-5_16] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 01/17/2023]
Abstract
The recognition of transcription factor binding sites (TFBSs) is the first step on the way to deciphering the DNA regulatory code. A large variety of computational approaches and corresponding in silico tools for TFBS recognition are available, each having their own advantages and shortcomings. This chapter provides a brief tutorial to assist end users in the application of these tools for functional characterization of genes.
Affiliation(s)
- Dmitry Y Oshchepkov
- Laboratory of Theoretical Genetics, Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia.
|
48
|
Chan TM, Wong KC, Lee KH, Wong MH, Lau CK, Tsui SKW, Leung KS. Discovering approximate-associated sequence patterns for protein-DNA interactions. Bioinformatics 2010; 27:471-8. [PMID: 21193520 DOI: 10.1093/bioinformatics/btq682] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Indexed: 01/13/2023]
Abstract
MOTIVATION: The bindings between transcription factors (TFs) and transcription factor binding sites (TFBSs) are fundamental protein-DNA interactions in transcriptional regulation. Extensive efforts have been made to better understand the protein-DNA interactions. Recent mining on exact TF-TFBS-associated sequence patterns (rules) has shown great potentials and achieved very promising results. However, exact rules cannot handle variations in real data, resulting in limited informative rules. In this article, we generalize the exact rules to approximate ones for both TFs and TFBSs, which are essential for biological variations. RESULTS: A progressive approach is proposed to address the approximation to alleviate the computational requirements. Firstly, similar TFBSs are grouped from the available TF-TFBS data (TRANSFAC database). Secondly, approximate and highly conserved binding cores are discovered from TF sequences corresponding to each TFBS group. A customized algorithm is developed for the specific objective. We discover the approximate TF-TFBS rules by associating the grouped TFBS consensuses and TF cores. The rules discovered are evaluated by matching (verifying with) the actual protein-DNA binding pairs from Protein Data Bank (PDB) 3D structures. The approximate results exhibit many more verified rules and up to 300% better verification ratios than the exact ones. The customized algorithm achieves over 73% better verification ratios than traditional methods. Approximate rules (64-79%) are shown statistically significant. Detailed variation analysis and conservation verification on NCBI records demonstrate that the approximate rules reveal both the flexible and specific protein-DNA interactions accurately. The approximate TF-TFBS rules discovered show great generalized capability of exploring more informative binding rules.
Affiliation(s)
- Tak-Ming Chan
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, N. T., Hong Kong
|
49
|
When needles look like hay: how to find tissue-specific enhancers in model organism genomes. Dev Biol 2010; 350:239-54. [PMID: 21130761 DOI: 10.1016/j.ydbio.2010.11.026] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Received: 04/14/2010] [Revised: 11/11/2010] [Accepted: 11/22/2010] [Indexed: 01/22/2023]
Abstract
A major prerequisite for the investigation of tissue-specific processes is the identification of cis-regulatory elements. No generally applicable technique is available to distinguish them from any other type of genomic non-coding sequence. Therefore, researchers often have to identify these elements by elaborate in vivo screens, testing individual regions until the right one is found. Here, based on many examples from the literature, we summarize how functional enhancers have been isolated from other elements in the genome and how they have been characterized in transgenic animals. Covering computational and experimental studies, we provide an overview of the global properties of cis-regulatory elements, like their specific interactions with promoters and target gene distances. We describe conserved non-coding elements (CNEs) and their internal structure, nucleotide composition, binding site clustering and overlap, with a special focus on developmental enhancers. Conflicting data and unresolved questions on the nature of these elements are highlighted. Our comprehensive overview of the experimental shortcuts that have been found in the different model organism communities and the new field of high-throughput assays should help during the preparation phase of a screen for enhancers. The review is accompanied by a list of general guidelines for such a project.
|
50
|
Jaimovich A, Friedman N. From large-scale assays to mechanistic insights: computational analysis of interactions. Curr Opin Biotechnol 2010; 22:87-93. [PMID: 21109421 DOI: 10.1016/j.copbio.2010.10.017] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 10/26/2010] [Accepted: 10/27/2010] [Indexed: 01/17/2023]
Abstract
The activity in the living cell is carried out by a myriad network of interactions between macromolecules. These include interactions between proteins that form a functional complex, a protein modifying another protein in a transient interaction, a transcription factor that binds a specific DNA locus triggering a change in chromatin or transcription, and so on. Characterization of these interactions in terms of timing, context, and function is crucial for understanding how cells carry out basic biological processes. The recent years have led to the introduction of many assays for probing these interactions in a systematic and large-scale manner. However, there is a large gap between assay results and understanding of biological systems. The challenge for computational methods is to bridge this gap by combining results of different assays and introducing statistical methodologies. In this review we discuss recent advances in approaches dealing with these challenges, and key directions for the future.
Affiliation(s)
- Ariel Jaimovich
- School of Computer Science & Engineering, Hebrew University of Jerusalem, Jerusalem, Israel
|