1
Gill JK, Chetty M, Lim S, Hallinan J. Large language model based framework for automated extraction of genetic interactions from unstructured data. PLoS One 2024; 19:e0303231. [PMID: 38771886 PMCID: PMC11108146 DOI: 10.1371/journal.pone.0303231] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2023] [Accepted: 04/23/2024] [Indexed: 05/23/2024] Open
Abstract
Extracting biological interactions from published literature helps us understand complex biological systems, accelerate research, and support decision-making in drug or treatment development. Despite efforts to automate the extraction of biological relations using text mining tools and machine learning pipelines, manual curation continues to serve as the gold standard. However, the rapidly increasing volume of literature pertaining to biological relations poses challenges in its manual curation and refinement. These challenges are further compounded because only a small fraction of the published literature is relevant to biological relation extraction, and the embedded sentences of relevant sections have complex structures, which can lead to incorrect inference of relationships. To overcome these challenges, we propose GIX, an automated and robust Gene Interaction Extraction framework, based on pre-trained Large Language models fine-tuned through extensive evaluations on various gene/protein interaction corpora including LLL and RegulonDB. GIX identifies relevant publications with minimal keywords, optimises sentence selection to reduce computational overhead, simplifies sentence structure while preserving meaning, and provides a confidence factor indicating the reliability of extracted relations. GIX's Stage-2 relation extraction method performed well on benchmark protein/gene interaction datasets, assessed using 10-fold cross-validation, surpassing state-of-the-art approaches. We demonstrated that the proposed method, although fully automated, performs as well as manual relation extraction, with enhanced robustness. We also observed GIX's capability to augment existing datasets with new sentences, incorporating newly discovered biological terms and processes. Further, we demonstrated GIX's real-world applicability in inferring E. coli gene circuits.
Affiliation(s)
- Jaskaran Kaur Gill
- Health Innovation and Transformation Centre, Federation University, Ballarat, Victoria, Australia
- Madhu Chetty
- Health Innovation and Transformation Centre, Federation University, Ballarat, Victoria, Australia
- Suryani Lim
- Health Innovation and Transformation Centre, Federation University, Ballarat, Victoria, Australia
- Jennifer Hallinan
- Health Innovation and Transformation Centre, Federation University, Ballarat, Victoria, Australia
- BioThink, Brisbane, Queensland, Australia
2
Su L, Chen S, Zheng C, Wei H, Song X. Meta-Analysis of Gene Expression and Identification of Biological Regulatory Mechanisms in Alzheimer's Disease. Front Neurosci 2019; 13:633. [PMID: 31333395 PMCID: PMC6616202 DOI: 10.3389/fnins.2019.00633] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Received: 02/15/2019] [Accepted: 05/31/2019] [Indexed: 12/12/2022] Open
Abstract
Alzheimer's disease (AD), also known as senile dementia, is a progressive neurodegenerative disease. The etiology and pathogenesis of AD have not yet been elucidated. We examined common differentially expressed genes (DEGs) from different AD tissue microarray datasets by meta-analysis and screened the AD-associated genes from the common DEGs using GCBI. Then we studied the gene expression network using the STRING database and identified the hub genes using Cytoscape. Furthermore, we analyzed the microRNAs (miRNAs), long non-coding RNAs (lncRNAs), and single nucleotide polymorphisms (SNPs) associated with the AD-associated genes, and then identified feed-forward loops. Finally, we performed SNP analysis of the AD-associated genes. Our results identified 207 common DEGs, of which 57 have previously been reported to be associated with AD. The common DEG expression network identified eight hub genes, all of which were previously known to be associated with AD. Further study of the regulatory miRNAs associated with the AD-associated genes and other genes specific to neurodegenerative diseases revealed 65 AD-associated miRNAs. Analysis of the miRNA associated transcription factor-miRNA-gene-gene associated TF (mTF-miRNA-gene-gTF) network around the AD-associated genes revealed 131 feed-forward loops (FFLs). Among them, one important FFL was found between the gene SERPINA3, hsa-miR-27a, and the transcription factor MYC. Furthermore, SNP analysis of the AD-associated genes identified 173 SNPs, and also found a role in AD for miRNAs specific to other neurodegenerative diseases, including hsa-miR-34c, hsa-miR-212, hsa-miR-34a, and hsa-miR-7. The regulatory network constructed in this study describes the mechanism of cell regulation in AD, in which miRNAs and lncRNAs can be considered AD regulatory factors.
Affiliation(s)
- Lining Su
- Department of Basic Medicine, Hebei North University, Zhangjiakou, China
- Sufen Chen
- Institute of Educational Science, Zhangjiakou, China
- Huiping Wei
- Department of Basic Medicine, Hebei North University, Zhangjiakou, China
- Xiaoqing Song
- Department of Basic Medicine, Hebei North University, Zhangjiakou, China
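The feed-forward loops (FFLs) mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the common definition in which a transcription factor regulates both a miRNA and a gene while the miRNA also targets that gene. The function name, the dictionary layout, and the placeholder node names are hypothetical and are not taken from the paper; the paper's mTF-miRNA-gene-gTF network is richer than this toy example.

```python
def find_feed_forward_loops(tf_to_mirna, tf_to_gene, mirna_to_gene):
    """Return sorted (tf, mirna, gene) triples forming a feed-forward loop.

    A triple qualifies when the TF regulates both the miRNA and the gene,
    and the miRNA also targets the same gene (assumed definition).
    """
    loops = []
    for tf, mirnas in tf_to_mirna.items():
        genes = tf_to_gene.get(tf, set())
        for mirna in mirnas:
            # Genes regulated by the TF that are also targeted by the miRNA.
            shared = genes & mirna_to_gene.get(mirna, set())
            loops.extend((tf, mirna, gene) for gene in shared)
    return sorted(loops)


# Toy edges with placeholder names (hypothetical, not results from the paper).
tf_to_mirna = {"TF_A": {"miR_X", "miR_Y"}, "TF_B": {"miR_Y"}}
tf_to_gene = {"TF_A": {"GENE_1", "GENE_2"}, "TF_B": {"GENE_3"}}
mirna_to_gene = {"miR_X": {"GENE_1"}, "miR_Y": {"GENE_2", "GENE_3"}}

ffls = find_feed_forward_loops(tf_to_mirna, tf_to_gene, mirna_to_gene)
for tf, mirna, gene in ffls:
    print(f"{tf} -> {mirna} -> {gene} (and {tf} -> {gene})")
```

Storing each edge type as a dictionary of sets keeps the loop check to a single set intersection per TF-miRNA pair, which is usually sufficient for networks of a few hundred nodes.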
3
Hu J, Wang J, Lin J, Liu T, Zhong Y, Liu J, Zheng Y, Gao Y, He J, Shang X. MD-SVM: a novel SVM-based algorithm for the motif discovery of transcription factor binding sites. BMC Bioinformatics 2019; 20:200. [PMID: 31074373 PMCID: PMC6509868 DOI: 10.1186/s12859-019-2735-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 12/28/2022] Open
Abstract
BACKGROUND Transcription factors (TFs) play important roles in the regulation of gene expression. They can activate or block transcription of downstream genes by binding to specific genomic sequences. Therefore, motif discovery of these binding preference patterns is of central significance in understanding molecular regulation mechanisms. Many algorithms have been proposed for the identification of transcription factor binding sites. However, it remains a challenging problem. RESULTS Here, we proposed a novel motif discovery algorithm based on support vector machine (MD-SVM) to learn a discriminative model for TF binding sites. MD-SVM first obtains a position weight matrix (PWM) from a set of training datasets. Then it translates the motif discovery (MD) problem into a computational framework of multiple instance learning (MIL). It was applied to several real biological datasets. Results show that our algorithm outperforms MI-SVM in terms of both accuracy and specificity. CONCLUSIONS In this paper, we modeled the TF motif discovery problem as a MIL optimization problem. The SVM algorithm was adapted to discriminate positive and negative bags of instances. Compared with other SVM-based algorithms, MD-SVM shows its superiority over its competitors in terms of ROC AUC. Hopefully, it could be of benefit to the research community in understanding the molecular functions of DNA functional elements and transcription factors.
Affiliation(s)
- Jialu Hu
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
- Centre of Multidisciplinary Convergence Computing, School of Computer Science, Northwestern Polytechnical University, 1 Dong Xiang Road, Xi’an, 710129 China
- Jingru Wang
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
- Jianan Lin
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
- Tianwei Liu
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
- Yuanke Zhong
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
- Jie Liu
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
- Yan Zheng
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
- Yiqun Gao
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
- Junhao He
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
- Xuequn Shang
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
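The position weight matrix (PWM) step named in the MD-SVM abstract above can be sketched as follows. This is a generic log-odds PWM built from aligned binding sites with a pseudocount and a uniform background, which is an assumption; the abstract does not specify how MD-SVM computes its PWM, and the function names and toy sites are hypothetical.

```python
import math


def position_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM from equal-length aligned DNA sites.

    Returns one dict per position mapping each base to its log-odds score.
    The pseudocount and uniform background are illustrative assumptions.
    """
    alphabet = "ACGT"
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in alphabet
        })
    return pwm


def score(pwm, seq):
    """Sum the per-position log-odds of a sequence the same length as the PWM."""
    return sum(column[base] for column, base in zip(pwm, seq))


# Toy aligned sites (hypothetical, not data from the paper).
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
pwm = position_weight_matrix(sites)
print(score(pwm, "TATAAT") > score(pwm, "GGGCCC"))
```

The pseudocount keeps unseen bases from producing a log of zero, so every candidate sequence receives a finite score.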
4
Galley JC, Durgin BG, Miller MP, Hahn SA, Yuan S, Wood KC, Straub AC. Antagonism of Forkhead Box Subclass O Transcription Factors Elicits Loss of Soluble Guanylyl Cyclase Expression. Mol Pharmacol 2019; 95:629-637. [PMID: 30988014 PMCID: PMC6527398 DOI: 10.1124/mol.118.115386] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 11/29/2018] [Accepted: 03/31/2019] [Indexed: 01/12/2023] Open
Abstract
Nitric oxide (NO) stimulates soluble guanylyl cyclase (sGC) activity, leading to elevated intracellular cyclic guanosine 3',5'-monophosphate (cGMP) and subsequent vascular smooth muscle relaxation. It is known that downregulation of sGC expression attenuates vascular dilation and contributes to the pathogenesis of cardiovascular disease. However, it is not well understood how sGC transcription is regulated. Here, we demonstrate that pharmacological inhibition of Forkhead box subclass O (FoxO) transcription factors using the small-molecule inhibitor AS1842856 significantly blunts sGC α and β mRNA expression by more than 90%. These effects are concentration-dependent and concomitant with greater than 90% reduced expression of the known FoxO transcriptional targets, glucose-6-phosphatase and growth arrest and DNA damage protein 45 α (Gadd45α). Similarly, sGC α and sGC β protein expression showed a concentration-dependent downregulation. Consistent with the loss of sGC α and β mRNA and protein expression, pretreatment of vascular smooth muscle cells with the FoxO inhibitor decreased sGC activity measured by cGMP production following stimulation with an NO donor. To determine if FoxO inhibition resulted in a functional impairment in vascular relaxation, we cultured mouse thoracic aortas with the FoxO inhibitor and conducted ex vivo two-pin myography studies. Results showed that aortas have significantly blunted sodium nitroprusside-induced (NO-dependent) vasorelaxation and a 42% decrease in sGC expression after 48-hour FoxO inhibitor treatment. Taken together, these data are the first to identify that FoxO transcription factor activity is necessary for sGC expression and NO-dependent relaxation.
Affiliation(s)
- Joseph C Galley
- Heart, Lung, Blood and Vascular Medicine Institute (J.C.G., B.G.D., M.P.M., S.A.H., S.Y., K.C.W., A.C.S.) and Department of Pharmacology and Chemical Biology (J.C.G., A.C.S.), University of Pittsburgh, Pittsburgh, Pennsylvania
- Brittany G Durgin
- Heart, Lung, Blood and Vascular Medicine Institute (J.C.G., B.G.D., M.P.M., S.A.H., S.Y., K.C.W., A.C.S.) and Department of Pharmacology and Chemical Biology (J.C.G., A.C.S.), University of Pittsburgh, Pittsburgh, Pennsylvania
- Megan P Miller
- Heart, Lung, Blood and Vascular Medicine Institute (J.C.G., B.G.D., M.P.M., S.A.H., S.Y., K.C.W., A.C.S.) and Department of Pharmacology and Chemical Biology (J.C.G., A.C.S.), University of Pittsburgh, Pittsburgh, Pennsylvania
- Scott A Hahn
- Heart, Lung, Blood and Vascular Medicine Institute (J.C.G., B.G.D., M.P.M., S.A.H., S.Y., K.C.W., A.C.S.) and Department of Pharmacology and Chemical Biology (J.C.G., A.C.S.), University of Pittsburgh, Pittsburgh, Pennsylvania
- Shuai Yuan
- Heart, Lung, Blood and Vascular Medicine Institute (J.C.G., B.G.D., M.P.M., S.A.H., S.Y., K.C.W., A.C.S.) and Department of Pharmacology and Chemical Biology (J.C.G., A.C.S.), University of Pittsburgh, Pittsburgh, Pennsylvania
- Katherine C Wood
- Heart, Lung, Blood and Vascular Medicine Institute (J.C.G., B.G.D., M.P.M., S.A.H., S.Y., K.C.W., A.C.S.) and Department of Pharmacology and Chemical Biology (J.C.G., A.C.S.), University of Pittsburgh, Pittsburgh, Pennsylvania
- Adam C Straub
- Heart, Lung, Blood and Vascular Medicine Institute (J.C.G., B.G.D., M.P.M., S.A.H., S.Y., K.C.W., A.C.S.) and Department of Pharmacology and Chemical Biology (J.C.G., A.C.S.), University of Pittsburgh, Pittsburgh, Pennsylvania
5
Su L, Wang C, Zheng C, Wei H, Song X. A meta-analysis of public microarray data identifies biological regulatory networks in Parkinson's disease. BMC Med Genomics 2018; 11:40. [PMID: 29653596 PMCID: PMC5899355 DOI: 10.1186/s12920-018-0357-7] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 12/19/2017] [Accepted: 03/26/2018] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Parkinson's disease (PD) is a long-term degenerative disease that is caused by environmental and genetic factors. The networks of genes and their regulators that control the progression and development of PD require further elucidation. METHODS We examine common differentially expressed genes (DEGs) from several PD blood and substantia nigra (SN) microarray datasets by meta-analysis. Further we screen the PD-specific genes from common DEGs using GCBI. Next, we used a series of bioinformatics software to analyze the miRNAs, lncRNAs and SNPs associated with the common PD-specific genes, and then identify the mTF-miRNA-gene-gTF network. RESULT Our results identified 36 common DEGs in PD blood studies and 17 common DEGs in PD SN studies, and five of the genes were previously known to be associated with PD. Further study of the regulatory miRNAs associated with the common PD-specific genes revealed 14 PD-specific miRNAs in our study. Analysis of the mTF-miRNA-gene-gTF network about PD-specific genes revealed two feed-forward loops: one involving the SPRK2 gene, hsa-miR-19a-3p and SPI1, and the second involving the SPRK2 gene, hsa-miR-17-3p and SPI. The long non-coding RNA (lncRNA)-mediated regulatory network identified lncRNAs associated with PD-specific genes and PD-specific miRNAs. Moreover, single nucleotide polymorphism (SNP) analysis of the PD-specific genes identified two significant SNPs, and SNP analysis of the neurodegenerative disease-specific genes identified seven significant SNPs. Most of these SNPs are present in the 3'-untranslated region of genes and are controlled by several miRNAs. CONCLUSION Our study identified a total of 53 common DEGs in PD patients compared with healthy controls in blood and brain datasets and five of these genes were previously linked with PD. 
Regulatory network analysis identified PD-specific miRNAs, associated long non-coding RNA and feed-forward loops, which contribute to our understanding of the mechanisms underlying PD. The SNPs identified in our study can determine whether a genetic variant is associated with PD. Overall, these findings will help guide our study of the complex molecular mechanism of PD.
Affiliation(s)
- Lining Su
- Department of Biology of Basic Medical Science College, Hebei North University, Zhangjiakou, 075000, Hebei, China
- Chunjie Wang
- Department of Basic Medicine, Zhangjiakou University, Zhangjiakou, 75000, Hebei, China
- Chenqing Zheng
- Shenzhen RealOmics (Biotech) Co., Ltd, Shenzhen, 518081, Guangdong, China
- Huiping Wei
- Department of Biology of Basic Medical Science College, Hebei North University, Zhangjiakou, 075000, Hebei, China
- Xiaoqing Song
- Department of Biology of Basic Medical Science College, Hebei North University, Zhangjiakou, 075000, Hebei, China
6
Plasticity of the MFS1 Promoter Leads to Multidrug Resistance in the Wheat Pathogen Zymoseptoria tritici. mSphere 2017; 2:mSphere00393-17. [PMID: 29085913 PMCID: PMC5656749 DOI: 10.1128/msphere.00393-17] [Citation(s) in RCA: 47] [Impact Index Per Article: 6.7] [Received: 09/01/2017] [Accepted: 09/21/2017] [Indexed: 11/20/2022] Open
Abstract
The ascomycete Zymoseptoria tritici is the causal agent of Septoria leaf blotch on wheat. Disease control relies mainly on resistant wheat cultivars and on fungicide applications. The fungus displays a high potential to circumvent both methods. Resistance against all unisite fungicides has been observed over decades. A different type of resistance has emerged among wild populations with multidrug-resistant (MDR) strains. Active fungicide efflux through overexpression of the major facilitator gene MFS1 explains this emerging resistance mechanism. Applying a bulk-progeny sequencing approach, we identified in this study a 519-bp long terminal repeat (LTR) insert in the MFS1 promoter, a relic of a retrotransposon cosegregating with the MDR phenotype. Through gene replacement, we show the insert as a mutation responsible for MFS1 overexpression and the MDR phenotype. Besides this type I insert, we found two different types of promoter inserts in more recent MDR strains. Type I and type II inserts harbor potential transcription factor binding sites, but not the type III insert. Interestingly, all three inserts correspond to repeated elements present at different genomic locations in either IPO323 or other Z. tritici strains. These results underline the plasticity of repeated elements leading to fungicide resistance in Z. tritici and which contribute to its adaptive potential. IMPORTANCE Disease control through fungicides remains an important means to protect crops from fungal diseases and to secure the harvest. Plant-pathogenic fungi, especially Zymoseptoria tritici, have developed resistance against most currently used active ingredients, reducing or abolishing their efficacy. While target site modification is the most common resistance mechanism against single modes of action, active efflux of multiple drugs is an emerging phenomenon in fungal populations reducing additionally fungicides' efficacy in multidrug-resistant strains. 
We have investigated the mutations responsible for increased drug efflux in Z. tritici field strains. Our study reveals that three different insertions of repeated elements in the same promoter lead to multidrug resistance in Z. tritici. The target gene encodes the membrane transporter MFS1 responsible for drug efflux, with the promoter inserts inducing its overexpression. These results underline the plasticity of repeated elements leading to fungicide resistance in Z. tritici.
7
Chen J, Zhang N, Wen J, Zhang Z. Silencing TAK1 alters gene expression signatures in bladder cancer cells. Oncol Lett 2017; 13:2975-2981. [PMID: 28521404 PMCID: PMC5431247 DOI: 10.3892/ol.2017.5819] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 05/12/2015] [Accepted: 09/22/2016] [Indexed: 02/06/2023] Open
Abstract
The aim of the present study was to identify the differentially expressed genes (DEGs) that are induced by the silencing of transforming growth factor-β-activated kinase 1 (TAK1) in bladder cancer cells and to analyze the potential biological effects. Dataset GSE52452 from mutant fibroblast growth factor receptor 3 (FGFR3) bladder cancer cells transfected with control siRNA or TAK1-specific siRNA was downloaded from Gene Expression Omnibus. The DEGs between the two groups were identified using Limma package following data pre-processing by Affy in Bioconductor. Enrichment analysis of DEGs was performed using the Database for Annotation, Visualization and Integrated Discovery, followed by functional annotation using TRANSFAC, TSGene and TAG databases. Integrated networks were constructed by Cytoscape and sub-networks were extracted employing BioNet, followed by enrichment analysis of DEGs in the sub-network. A total of 43 downregulated and 21 upregulated genes were obtained. The downregulated genes were enriched in five pathways, including NOD-like receptor signaling pathway and functions related to cellular response. The upregulated genes were associated with cellular developmental processes. Transcription factor EGR1 and 9 tumor-associated genes were screened from the DEGs. Among the DEGs, 10 hub nodes may represent important roles in the complex metabolic network, including EGFR, CYP3A5, MAP3K7, GSTA1, PTHLH, ALDH1A1, KCND2, EGR1, ARRB1 and ITPR1. Additionally, EGFR was correlated with ERBB2, GRB2 and PIK3R1, and these were enriched in ErbB signaling pathway and various cancer-associated pathways. Silencing TAK1 may decrease cellular response to chemical stimulus via downregulating CYP3A5, MAP3K7, GSTA1, ALDH1A1, ARRB1 and ITPR1; increase cancer cell development via upregulating EGFR, EGR1 and PTHLH; and regulate cancer metastasis through EGFR, ERBB2, GRB2 and PIK3R1.
Affiliation(s)
- Jimin Chen
- Department of Urology, Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, Zhejiang 310009, P.R. China
- Nan Zhang
- Department of Urology, Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, Zhejiang 310009, P.R. China
- Jiaming Wen
- Department of Urology, Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, Zhejiang 310009, P.R. China
- Zhewei Zhang
- Department of Urology, Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, Zhejiang 310009, P.R. China
8
Dai X, Li J, Liu T, Zhao PX. HRGRN: A Graph Search-Empowered Integrative Database of Arabidopsis Signaling Transduction, Metabolism and Gene Regulation Networks. Plant & Cell Physiology 2016; 57:e12. [PMID: 26657893 PMCID: PMC4722177 DOI: 10.1093/pcp/pcv200] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 09/01/2015] [Accepted: 12/07/2015] [Indexed: 05/10/2023]
Abstract
The biological networks controlling plant signal transduction, metabolism and gene regulation are composed of not only tens of thousands of genes, compounds, proteins and RNAs but also the complicated interactions and co-ordination among them. These networks play critical roles in many fundamental mechanisms, such as plant growth, development and environmental response. Although much is known about these complex interactions, the knowledge and data are currently scattered throughout the published literature, publicly available high-throughput data sets and third-party databases. Many 'unknown' yet important interactions among genes need to be mined and established through extensive computational analysis. However, exploring these complex biological interactions at the network level from existing heterogeneous resources remains challenging and time-consuming for biologists. Here, we introduce HRGRN, a graph search-empowered integrative database of Arabidopsis signal transduction, metabolism and gene regulatory networks. HRGRN utilizes Neo4j, which is a highly scalable graph database management system, to host large-scale biological interactions among genes, proteins, compounds and small RNAs that were either validated experimentally or predicted computationally. The associated biological pathway information was also specially marked for the interactions that are involved in the pathway to facilitate the investigation of cross-talk between pathways. Furthermore, HRGRN integrates a series of graph path search algorithms to discover novel relationships among genes, compounds, RNAs and even pathways from heterogeneous biological interaction data that could be missed by traditional SQL database search methods. Users can also build subnetworks based on known interactions. The outcomes are visualized with rich text, figures and interactive network graphs on web pages. The HRGRN database is freely available at http://plantgrn.noble.org/hrgrn/.
Affiliation(s)
- Xinbin Dai
- Plant Biology Division, The Samuel Roberts Noble Foundation, 2510 Sam Noble Parkway, Ardmore, OK 73401, USA
- Jun Li
- Plant Biology Division, The Samuel Roberts Noble Foundation, 2510 Sam Noble Parkway, Ardmore, OK 73401, USA
- Tingsong Liu
- Plant Biology Division, The Samuel Roberts Noble Foundation, 2510 Sam Noble Parkway, Ardmore, OK 73401, USA
- Patrick Xuechun Zhao
- Plant Biology Division, The Samuel Roberts Noble Foundation, 2510 Sam Noble Parkway, Ardmore, OK 73401, USA
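The graph path search described in the HRGRN abstract above can be illustrated with a plain breadth-first search over a small heterogeneous interaction graph. HRGRN itself runs on Neo4j; the in-memory Python stand-in below, the function name, the undirected treatment of edges, and the placeholder node names are all assumptions made for illustration.

```python
from collections import deque


def shortest_path(edges, source, target):
    """Breadth-first search for a shortest path in an undirected graph.

    `edges` is an iterable of (node_a, node_b) pairs. Returns the list of
    nodes from source to target, or None when no path exists.
    """
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    previous = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the predecessor links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# Toy interactions between placeholder entities (hypothetical).
edges = [
    ("gene_A", "protein_B"),
    ("protein_B", "compound_C"),
    ("compound_C", "gene_D"),
    ("gene_A", "miRNA_E"),
]
path = shortest_path(edges, "gene_A", "gene_D")
print(" - ".join(path))
```

A path such as this one links two genes through a protein and a compound, the kind of indirect relationship that a single-table SQL lookup on direct interactions would not return.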
9
Broin PÓ, Smith TJ, Golden AA. Alignment-free clustering of transcription factor binding motifs using a genetic-k-medoids approach. BMC Bioinformatics 2015; 16:22. [PMID: 25627106 PMCID: PMC4384390 DOI: 10.1186/s12859-015-0450-2] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 08/04/2014] [Accepted: 01/02/2015] [Indexed: 11/10/2022] Open
Abstract
Background Familial binding profiles (FBPs) represent the average binding specificity for a group of structurally related DNA-binding proteins. The construction of such profiles allows the classification of novel motifs based on similarity to known families, can help to reduce redundancy in motif databases and de novo prediction algorithms, and can provide valuable insights into the evolution of binding sites. Many current approaches to automated motif clustering rely on progressive tree-based techniques, and can suffer from so-called frozen sub-alignments, where motifs which are clustered early on in the process remain ‘locked’ in place despite the potential for better placement at a later stage. In order to avoid this scenario, we have developed a genetic-k-medoids approach which allows motifs to move freely between clusters at any point in the clustering process. Results We demonstrate the performance of our algorithm, GMACS, on multiple benchmark motif datasets, comparing results obtained with current leading approaches. The first dataset includes 355 position weight matrices from the TRANSFAC database and indicates that the k-mer frequency vector approach used in GMACS outperforms other motif comparison techniques. We then cluster a set of 79 motifs from the JASPAR database previously used in several motif clustering studies and demonstrate that GMACS can produce a higher number of structurally homogeneous clusters than other methods without the need for a large number of singletons. Finally, we show the robustness of our algorithm to noise on multiple synthetic datasets consisting of known motifs convolved with varying degrees of noise. Conclusions Our proposed algorithm is generally applicable to any DNA or protein motifs, can produce highly stable and biologically meaningful clusters, and, by avoiding the problem of frozen sub-alignments, can provide improved results when compared with existing techniques on benchmark datasets. 
Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0450-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Pilib Ó Broin
- Department of Genetics, Albert Einstein College of Medicine, 1301 Morris Park Avenue, Bronx, New York, 10461, USA
- National Centre for Biomedical Engineering Science, National University of Ireland, University Road, Galway, Ireland
- Terry J Smith
- Department of Genetics, Albert Einstein College of Medicine, 1301 Morris Park Avenue, Bronx, New York, 10461, USA
- Aaron Aj Golden
- Department of Genetics, Albert Einstein College of Medicine, 1301 Morris Park Avenue, Bronx, New York, 10461, USA
- Department of Mathematical Sciences, Yeshiva University, New York, 10033, NY, USA
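The alignment-free comparison via k-mer frequency vectors mentioned in the GMACS abstract above can be sketched roughly as follows. As a simplifying assumption, each motif is represented here by a few example binding-site sequences and compared with cosine similarity; how GMACS actually derives its vectors and which distance it uses are not stated in the abstract, and all names and sequences below are hypothetical.

```python
from collections import Counter
from itertools import product
import math


def kmer_frequency_vector(sequences, k=3):
    """Return normalised k-mer frequencies over all 4**k DNA k-mers."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    # Only k-mers over the ACGT alphabet contribute to the total.
    total = sum(counts[kmer] for kmer in kmers)
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy motifs given as example sites (hypothetical, not benchmark data).
motif_a = kmer_frequency_vector(["TATAAT", "TATGAT"])
motif_b = kmer_frequency_vector(["TATAAT", "TACAAT"])
motif_c = kmer_frequency_vector(["GGGCGG", "GGGCGC"])

sim_ab = cosine_similarity(motif_a, motif_b)
sim_ac = cosine_similarity(motif_a, motif_c)
print(sim_ab > sim_ac)
```

Because every motif maps to a vector of the same fixed length, motifs of different widths can be compared without aligning them first, and a medoid-based clustering step can work directly on the pairwise similarities.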
10
Dissecting neural differentiation regulatory networks through epigenetic footprinting. Nature 2014; 518:355-359. [PMID: 25533951 PMCID: PMC4336237 DOI: 10.1038/nature13990] [Citation(s) in RCA: 138] [Impact Index Per Article: 13.8] [Received: 11/19/2013] [Accepted: 10/21/2014] [Indexed: 12/16/2022]
Abstract
Models derived from human pluripotent stem cells that accurately recapitulate neural development in vitro and allow for the generation of specific neuronal subtypes are of major interest to the stem cell and biomedical community. Notch signalling, particularly through the Notch effector HES5, is a major pathway critical for the onset and maintenance of neural progenitor cells in the embryonic and adult nervous system. Here we report the transcriptional and epigenomic analysis of six consecutive neural progenitor cell stages derived from a HES5::eGFP reporter human embryonic stem cell line. Using this system, we aimed to model cell-fate decisions including specification, expansion and patterning during the ontogeny of cortical neural stem and progenitor cells. In order to dissect regulatory mechanisms that orchestrate the stage-specific differentiation process, we developed a computational framework to infer key regulators of each cell-state transition based on the progressive remodelling of the epigenetic landscape and then validated these through a pooled short hairpin RNA screen. We were also able to refine our previous observations on epigenetic priming at transcription factor binding sites and suggest here that they are mediated by combinations of core and stage-specific factors. Taken together, we demonstrate the utility of our system and outline a general framework, not limited to the context of the neural lineage, to dissect regulatory circuits of differentiation.
11
Sebastian A, Contreras-Moreira B. footprintDB: a database of transcription factors with annotated cis elements and binding interfaces. Bioinformatics 2013; 30:258-65. [PMID: 24234003 DOI: 10.1093/bioinformatics/btt663] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.5] [Indexed: 12/23/2022]
Abstract
MOTIVATION Traditional and high-throughput techniques for determining transcription factor (TF) binding specificities are generating large volumes of data of uneven quality, which are scattered across individual databases. RESULTS FootprintDB integrates some of the most comprehensive freely available libraries of curated DNA binding sites and systematically annotates the binding interfaces of the corresponding TFs. The first release contains 2422 unique TF sequences, 10 112 DNA binding sites and 3662 DNA motifs. A survey of the included data sources, organisms and TF families was performed together with the proprietary database TRANSFAC, finding that footprintDB has a similar coverage of multicellular organisms, while also containing bacterial regulatory data. A search engine has been designed that drives the prediction of DNA motifs for input TFs, or conversely of TF sequences that might recognize input regulatory sequences, by comparison with database entries. Such predictions can also be extended to a single proteome chosen by the user, and results are ranked in terms of interface similarity. Benchmark experiments with bacterial, plant and human data were performed to measure the predictive power of footprintDB searches, which were able to correctly recover 10, 55 and 90% of the tested sequences, respectively. Correctly predicted TFs had a higher interface similarity than the average, confirming its diagnostic value. AVAILABILITY AND IMPLEMENTATION Web site implemented in PHP, Perl, MySQL and Apache. Freely available from http://floresta.eead.csic.es/footprintdb.
Affiliation(s)
- Alvaro Sebastian
- Laboratory of Computational Biology, Department of Genetics and Plant Production, Estación Experimental de Aula Dei/CSIC, Av. Montañana 1005, Zaragoza (http://www.eead.csic.es/compbio) and Fundación ARAID, Paseo María Agustín 36, Zaragoza, Spain
12
|
Nagore LI, Nadeau RJ, Guo Q, Jadhav YLA, Jarrett HW, Haskins WE. Purification and characterization of transcription factors. Mass Spectrometry Reviews 2013; 32:386-398. [PMID: 23832591 PMCID: PMC3758410 DOI: 10.1002/mas.21369] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 01/31/2012] [Revised: 11/19/2012] [Accepted: 11/19/2012] [Indexed: 06/02/2023]
Abstract
Transcription factors (TFs) are essential for the expression of all proteins, including those involved in human health and disease. However, TFs are resistant to proteomic characterization because they are frequently masked by more abundant proteins due to the limited dynamic range of capillary liquid chromatography-tandem mass spectrometry and protein database searching. Purification methods, particularly strategies that exploit the high affinity of TFs for DNA response elements (REs) on gene promoters, can enrich TFs prior to proteomic analysis to improve dynamic range and penetrance of the TF proteome. For example, trapping of TF complexes specific for particular REs has been achieved by recovering the element DNA-protein complex on solid supports. Additional methods for improving dynamic range include two- and three-dimensional gel electrophoresis incorporating electrophoretic mobility shift assays and Southwestern blotting for detection. Here we review methods for TF purification and characterization. We fully expect that future investigations will apply these and other methods to illuminate this important but challenging proteome.
Affiliation(s)
- LI Nagore
- Department of Chemistry, University of Texas at San Antonio, San Antonio, TX, 78249
- RJ Nadeau
- Department of Chemistry, University of Texas at San Antonio, San Antonio, TX, 78249
- Protein Biomarkers Cores, University of Texas at San Antonio, San Antonio, TX, 78249
- Center for Interdisciplinary Health Research, University of Texas at San Antonio, San Antonio, TX, 78249
- Center for Research & Training in the Sciences, University of Texas at San Antonio, San Antonio, TX, 78249
- Q Guo
- Department of Chemistry, University of Texas at San Antonio, San Antonio, TX, 78249
- Protein Biomarkers Cores, University of Texas at San Antonio, San Antonio, TX, 78249
- Center for Interdisciplinary Health Research, University of Texas at San Antonio, San Antonio, TX, 78249
- Center for Research & Training in the Sciences, University of Texas at San Antonio, San Antonio, TX, 78249
- YLA Jadhav
- Pediatric Biochemistry Laboratory, University of Texas at San Antonio, San Antonio, TX, 78249
- RCMI Proteomics, University of Texas at San Antonio, San Antonio, TX, 78249
- Protein Biomarkers Cores, University of Texas at San Antonio, San Antonio, TX, 78249
- Center for Interdisciplinary Health Research, University of Texas at San Antonio, San Antonio, TX, 78249
- Center for Research & Training in the Sciences, University of Texas at San Antonio, San Antonio, TX, 78249
- HW Jarrett
- Department of Chemistry, University of Texas at San Antonio, San Antonio, TX, 78249
- Protein Biomarkers Cores, University of Texas at San Antonio, San Antonio, TX, 78249
- Center for Interdisciplinary Health Research, University of Texas at San Antonio, San Antonio, TX, 78249
- WE Haskins
- Pediatric Biochemistry Laboratory, University of Texas at San Antonio, San Antonio, TX, 78249
- Department of Chemistry, University of Texas at San Antonio, San Antonio, TX, 78249
- Departments of Biology, University of Texas at San Antonio, San Antonio, TX, 78249
- RCMI Proteomics, University of Texas at San Antonio, San Antonio, TX, 78249
- Protein Biomarkers Cores, University of Texas at San Antonio, San Antonio, TX, 78249
- Center for Interdisciplinary Health Research, University of Texas at San Antonio, San Antonio, TX, 78249
- Center for Research & Training in the Sciences, University of Texas at San Antonio, San Antonio, TX, 78249
- Departments of Medicine, Division of Hematology & Medical Oncology, University of Texas Health Science Center at San Antonio, San Antonio, TX, 78229
- Cancer Therapy & Research Center, University of Texas Health Science Center at San Antonio, San Antonio, TX, 78229
13
|
Thompson JA, Congdon CB. An Exploration Into Improving DNA Motif Inference by Looking for Highly Conserved Core Regions. IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology Proceedings 2013; 2013:60-67. [PMID: 31008453 PMCID: PMC6474685 DOI: 10.1109/cibcb.2013.6595389] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Although most verified functional elements in noncoding DNA contain a highly conserved core region, this concept is not generally incorporated into de novo motif inference systems. In this work, we explore the utility of adding the notion of conserved core regions into a comparative genomics approach for the search for putative functional elements in noncoding DNA. By modifying the scoring function for GAMI, Genetic Algorithms for Motif Inference, we investigate tradeoffs between the strength of conservation of the full motif vs. the strength of conservation of a core region. This work illustrates that incorporating information about the structure of transcription factor binding sites can be helpful in identifying biologically functional elements.
Affiliation(s)
- Jeffrey A Thompson
- Department of Computer Science, University of Southern Maine, Portland, Maine 04104
- Clare Bates Congdon
- Department of Computer Science, University of Southern Maine, Portland, Maine 04104
14
|
Hu J, Dang N, Menu E, De Bruyne E, Xu D, Van Camp B, Van Valckenborgh E, Vanderkerken K. Activation of ATF4 mediates unwanted Mcl-1 accumulation by proteasome inhibition. Blood 2012; 119:826-37. [PMID: 22128141 DOI: 10.1182/blood-2011-07-366492] [Citation(s) in RCA: 66] [Impact Index Per Article: 5.5] [Indexed: 12/16/2022] Open
Abstract
Myeloid cell leukemia-1 (Mcl-1) protein is an anti-apoptotic Bcl-2 family protein that plays essential roles in multiple myeloma (MM) survival and drug resistance. In MM, it has been demonstrated that proteasome inhibition can trigger the accumulation of Mcl-1, which has been shown to confer MM cell resistance to bortezomib-induced lethality. However, the mechanisms involved in this unwanted Mcl-1 accumulation are still unclear. The aim of the present study was to determine whether the unwanted Mcl-1 accumulation could be induced by the unfolded protein response (UPR) and to elucidate the role of the endoplasmic reticulum stress response in regulating Mcl-1 expression. Using quantitative RT-PCR and Western blot, we found that the translation of activating transcription factor-4 (ATF4), an important effector of the UPR, was also greatly enhanced by proteasome inhibition. ChIP analysis further revealed that bortezomib stimulated binding of ATF4 to a regulatory site (at position -332 to -324) at the promoter of the Mcl-1 gene. Knocking down ATF4 was paralleled by down-regulation of Mcl-1 induction by bortezomib and significantly increased bortezomib-induced apoptosis. These data identify the UPR and, more specifically, its ATF4 branch as an important mechanism mediating up-regulation of Mcl-1 by proteasome inhibition.
Affiliation(s)
- Jinsong Hu
- Department of Genetics and Molecular Biology, Medical School of Xi'an Jiaotong University, China
15
|
Bischoff E, Vaquero C. In silico and biological survey of transcription-associated proteins implicated in the transcriptional machinery during the erythrocytic development of Plasmodium falciparum. BMC Genomics 2010; 11:34. [PMID: 20078850 PMCID: PMC2821373 DOI: 10.1186/1471-2164-11-34] [Citation(s) in RCA: 71] [Impact Index Per Article: 5.1] [Received: 07/01/2009] [Accepted: 01/15/2010] [Indexed: 11/12/2022] Open
Abstract
Background: Malaria is the most important parasitic disease in the world with approximately two million people dying every year, mostly due to Plasmodium falciparum infection. During its complex life cycle in the Anopheles vector and human host, the parasite requires the coordinated and modulated expression of diverse sets of genes involved in epigenetic, transcriptional and post-transcriptional regulation. However, despite the availability of the complete sequence of the Plasmodium falciparum genome, we are still quite ignorant about Plasmodium mechanisms of transcriptional gene regulation. This is due to the poor prediction of nuclear proteins, cognate DNA motifs and structures involved in transcription. Results: A comprehensive directory of proteins reported to be potentially involved in Plasmodium transcriptional machinery was built from all in silico reports and databanks. The transcription-associated proteins were clustered in three main sets of factors: general transcription factors, chromatin-related proteins (structuring, remodelling and histone modifying enzymes), and specific transcription factors. Only a few of these factors have been molecularly analysed. Furthermore, from transcriptome and proteome data we modelled expression patterns of transcripts and corresponding proteins during the intra-erythrocytic cycle. Finally, an interactome of these proteins based either on in silico or on yeast two-hybrid experimental approaches is discussed. Conclusion: This is the first attempt to build a comprehensive directory of potential transcription-associated proteins in Plasmodium. In addition, all complete transcriptome, proteome and interactome raw data were re-analysed, compared and discussed for a better comprehension of the complex biological processes of Plasmodium falciparum transcriptional regulation during the erythrocytic development.
Affiliation(s)
- Emmanuel Bischoff
- Institut Pasteur, Unité d'Immunologie Moléculaire des Parasites, CNRS URA 2581, 25-28 rue du Dr Roux, 75724, Paris cedex 15, France.
16
|
FISim: a new similarity measure between transcription factor binding sites based on the fuzzy integral. BMC Bioinformatics 2009; 10:224. [PMID: 19615102 PMCID: PMC2722654 DOI: 10.1186/1471-2105-10-224] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 12/04/2008] [Accepted: 07/20/2009] [Indexed: 01/22/2023] Open
Abstract
BACKGROUND: Regulatory motifs describe sets of related transcription factor binding sites (TFBSs) and can be represented as position frequency matrices (PFMs). De novo identification of TFBSs is a crucial problem in computational biology which includes the issue of comparing putative motifs with one another and with motifs that are already known. The relative importance of each nucleotide within a given position in the PFMs should be considered in order to compute PFM similarities. Furthermore, biological data are inherently noisy and imprecise. Fuzzy set theory is particularly suitable for modeling imprecise data, whereas fuzzy integrals are highly appropriate for representing the interaction among different information sources. RESULTS: We propose FISim, a new similarity measure between PFMs, based on the fuzzy integral of the distance of the nucleotides with respect to the information content of the positions. Unlike existing methods, FISim is designed to consider the higher contribution of better conserved positions to the binding affinity. FISim provides excellent results when dealing with sets of randomly generated motifs, and outperforms the remaining methods when handling real datasets of related motifs. Furthermore, we propose a new cluster methodology based on kernel theory together with FISim to obtain groups of related motifs potentially bound by the same TFs, providing more robust results than existing approaches. CONCLUSION: FISim corrects a design flaw of the most popular methods, whose measures favour similarity of low information content positions. We use our measure to successfully identify motifs that describe binding sites for the same TF and to solve real-life problems. In this study the reliability of fuzzy technology for motif comparison tasks is proven.
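The abstract above weights positions by their information content. As a minimal sketch of that one ingredient only (not the FISim measure itself, whose fuzzy-integral formulation is not given here), the per-position information content of a PFM can be computed as below. The uniform background, the dict-per-column layout, and the function names are assumptions made for illustration.

```python
import math

BASES = "ACGT"

def column_information(counts):
    """Information content (bits) of one PFM column, assuming a uniform background."""
    total = sum(counts[b] for b in BASES)
    ic = 2.0  # log2(4): the maximum for a four-letter alphabet
    for b in BASES:
        p = counts[b] / total
        if p > 0:  # treat 0 * log2(0) as 0
            ic += p * math.log2(p)
    return ic

def pfm_information(pfm):
    """Information content of every column of a PFM (list of count dicts)."""
    return [column_information(col) for col in pfm]

# Hypothetical three-column motif, used only to exercise the functions.
example_pfm = [
    {"A": 10, "C": 0, "G": 0, "T": 0},  # fully conserved position
    {"A": 5, "C": 5, "G": 0, "T": 0},   # two equally likely bases
    {"A": 3, "C": 3, "G": 2, "T": 2},   # nearly uniform position
]
print(pfm_information(example_pfm))
```

A fully conserved column scores 2 bits and a near-uniform column scores close to 0, which is the sense in which well-conserved positions can be given more weight when comparing motifs.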
17
|
Meier S, Gehring C. A guide to the integrated application of on-line data mining tools for the inference of gene functions at the systems level. Biotechnol J 2009; 3:1375-87. [PMID: 18830970 DOI: 10.1002/biot.200800142] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.5] [Indexed: 01/13/2023]
Abstract
Genes function in networks to achieve a common biological response. Thus, inferences into the biological role of individual genes can be gained by analyzing their association with other genes with more precisely defined functions. Here, we present a guide, using the well-characterized Arabidopsis thaliana pathogenesis-related protein 2 gene (PR-2) as an example, to document how the sequential use of web-based tools can be applied to integrate information from different databases and associate the function of an individual gene with a network of genes and additionally identify specific biological processes in which they collectively function. The analysis begins by performing a global expression correlation analysis to build a functionally associated gene network. The network is subsequently analyzed for Gene Ontology enrichment, stimuli and mutant-specific transcriptional responses and enriched putative promoter regulatory elements that may be responsible for their correlated relationships. The results for the PR-2 gene are entirely consistent with the published literature documenting the accuracy of this type of analysis. Furthermore, this type of analysis can also be performed on other organisms with the appropriate data available and will greatly assist in understanding individual gene functions in a systems context.
Affiliation(s)
- Stuart Meier
- South African National Bioinformatics Institute, University of the Western Cape, Cape Town, South Africa
18
|
de Vooght KMK, van Wijk R, van Solinge WW. Management of gene promoter mutations in molecular diagnostics. Clin Chem 2009; 55:698-708. [PMID: 19246615 DOI: 10.1373/clinchem.2008.120931] [Citation(s) in RCA: 58] [Impact Index Per Article: 3.9] [Indexed: 12/20/2022]
Abstract
BACKGROUND: Although promoter mutations are known to cause functionally important consequences for gene expression, promoter analysis is not a regular part of DNA diagnostics. CONTENT: This review covers different important aspects of promoter mutation analysis and includes a proposed model procedure for studying promoter mutations. Characterization of a promoter sequence variation includes a comprehensive study of the literature and databases of human mutations and transcription factors. Phylogenetic footprinting is also used to evaluate the putative importance of the promoter region of interest. This in silico analysis is, in general, followed by in vitro functional assays, of which transient and stable transfection assays are considered the gold-standard methods. Electrophoretic mobility shift and supershift assays are used to identify trans-acting proteins that putatively interact with the promoter region of interest. Finally, chromatin immunoprecipitation assays are essential to confirm in vivo binding of these proteins to the promoter. SUMMARY: Although promoter mutation analysis is complex, often laborious, and difficult to perform, it is an essential part of the diagnosis of disease-causing promoter mutations and improves our understanding of the role of transcriptional regulation in human disease. We recommend that routine laboratories and research groups specialized in gene promoter research cooperate to expand general knowledge and diagnosis of gene-promoter defects.
Affiliation(s)
- Karen M K de Vooght
- Department of Clinical Chemistry and Haematology, Laboratory for Red Blood Cell Research, University Medical Center Utrecht, Utrecht, the Netherlands.
19
|
Cai Y, He J, Li X, Lu L, Yang X, Feng K, Lu W, Kong X. A Novel Computational Approach To Predict Transcription Factor DNA Binding Preference. J Proteome Res 2008; 8:999-1003. [DOI: 10.1021/pr800717y] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.2] [Indexed: 11/29/2022]
Affiliation(s)
- Yudong Cai, JianFeng He, XinLei Li, Lin Lu, XinYi Yang, KaiYan Feng, WenCong Lu, XiangYin Kong
- CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai 200031, China, Department of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200040, People’s Republic of China, Institute of Health Sciences, Shanghai Jiao Tong University School of Medicine and Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200025, China, Division of Imaging Science & Biomedical
20
|
Fogel GB, Porto VW, Varga G, Dow ER, Craven AM, Powers DM, Harlow HB, Su EW, Onyia JE, Su C. Evolutionary computation for discovery of composite transcription factor binding sites. Nucleic Acids Res 2008; 36:e142. [PMID: 18927103 PMCID: PMC2588514 DOI: 10.1093/nar/gkn738] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.6] [Received: 07/07/2008] [Revised: 09/05/2008] [Accepted: 10/02/2008] [Indexed: 12/02/2022] Open
Abstract
Previous research demonstrated the use of evolutionary computation for the discovery of transcription factor binding sites (TFBS) in promoter regions upstream of coexpressed genes. However, it remained unclear whether or not composite TFBS elements, commonly found in higher organisms where two or more TFBSs form functional complexes, could also be identified by using this approach. Here, we present an important refinement of our previous algorithm and test the identification of composite elements using NFAT/AP-1 as an example. We demonstrate that by using appropriate existing parameters such as window size, novel-scoring methods such as central bonusing and methods of self-adaptation to automatically adjust the variation operators during the evolutionary search, TFBSs of different sizes and complexity can be identified as top solutions. Some of these solutions have known experimental relationships with NFAT/AP-1. We also indicate that even after properly tuning the model parameters, the choice of the appropriate window size has a significant effect on algorithm performance. We believe that this improved algorithm will greatly augment TFBS discovery.
Affiliation(s)
- Gary B. Fogel, V. William Porto, Gabor Varga, Ernst R. Dow, Andrew M. Craven, David M. Powers, Harry B. Harlow, Eric W. Su, Jude E. Onyia, Chen Su
- Natural Selection, Inc., 9330 Scranton Rd., Suite 150, San Diego, CA 92121 and Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285, USA
21
|
Maston GA, Evans SK, Green MR. Transcriptional regulatory elements in the human genome. Annu Rev Genomics Hum Genet 2008; 7:29-59. [PMID: 16719718 DOI: 10.1146/annurev.genom.7.080505.115623] [Citation(s) in RCA: 551] [Impact Index Per Article: 34.4] [Indexed: 11/09/2022]
Abstract
The faithful execution of biological processes requires a precise and carefully orchestrated set of steps that depend on the proper spatial and temporal expression of genes. Here we review the various classes of transcriptional regulatory elements (core promoters, proximal promoters, distal enhancers, silencers, insulators/boundary elements, and locus control regions) and the molecular machinery (general transcription factors, activators, and coactivators) that interacts with the regulatory elements to mediate precisely controlled patterns of gene expression. The biological importance of transcriptional regulation is highlighted by examples of how alterations in these transcriptional components can lead to disease. Finally, we discuss the methods currently used to identify transcriptional regulatory elements, and the ability of these methods to be scaled up for the purpose of annotating the entire human genome.
Affiliation(s)
- Glenn A Maston
- Howard Hughes Medical Institute, Programs in Gene Function and Expression and Molecular Medicine, University of Massachusetts Medical School, Worcester, Massachusetts 01605, USA.
22
|
Sandve GK, Abul O, Walseng V, Drabløs F. Improved benchmarks for computational motif discovery. BMC Bioinformatics 2007; 8:193. [PMID: 17559676 PMCID: PMC1903367 DOI: 10.1186/1471-2105-8-193] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.0] [Received: 11/09/2006] [Accepted: 06/08/2007] [Indexed: 12/03/2022] Open
Abstract
Background: An important step in annotation of sequenced genomes is the identification of transcription factor binding sites. More than a hundred different computational methods have been proposed, and it is difficult to make an informed choice. Therefore, robust assessment of motif discovery methods becomes important, both for validation of existing tools and for identification of promising directions for future research. Results: We use a machine learning perspective to analyze collections of transcription factors with known binding sites. Algorithms are presented for finding position weight matrices (PWMs), IUPAC-type motifs and mismatch motifs with optimal discrimination of binding sites from remaining sequence. We show that for many data sets in a recently proposed benchmark suite for motif discovery, none of the common motif models can accurately discriminate the binding sites from remaining sequence. This may obscure the distinction between the potential performance of the motif discovery tool itself versus the intrinsic complexity of the problem we are trying to solve. Synthetic data sets may avoid this problem, but we show on some previously proposed benchmarks that there may be a strong bias towards a presupposed motif model. We also propose a new approach to benchmark data set construction. This approach is based on collections of binding site fragments that are ranked according to the optimal level of discrimination achieved with our algorithms. This allows us to select subsets with specific properties. We present one benchmark suite with data sets that allow good discrimination between positive and negative instances with the common motif models. These data sets are suitable for evaluating algorithms for motif discovery that rely on these models. We present another benchmark suite where PWM, IUPAC and mismatch motif models are not able to discriminate reliably between positive and negative instances. This suite could be used for evaluating more powerful motif models. Conclusion: Our improved benchmark suites have been designed to differentiate between the performance of motif discovery algorithms and the power of motif models. We provide a web server where users can download our benchmark suites, submit predictions and visualize scores on the benchmarks.
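To make concrete what it means for a PWM to discriminate a binding site from the remaining sequence, the following is a minimal sketch of log-odds PWM scoring over a sliding window. It is not the optimisation algorithm described in the abstract; the pseudocount, the uniform background, the example sites and all function names are assumptions chosen for illustration, and the input is assumed to contain only A, C, G and T.

```python
import math

BASES = "ACGT"

def counts_from_sites(sites):
    """Build a position frequency matrix (list of count dicts) from aligned sites."""
    width = len(sites[0])
    pfm = [{b: 0 for b in BASES} for _ in range(width)]
    for site in sites:
        for i, base in enumerate(site):
            pfm[i][base] += 1
    return pfm

def pwm_from_counts(pfm, pseudocount=1.0, background=0.25):
    """Convert counts to log2-odds weights against a uniform background (assumed)."""
    pwm = []
    for col in pfm:
        total = sum(col[b] for b in BASES) + 4 * pseudocount
        pwm.append({b: math.log2(((col[b] + pseudocount) / total) / background)
                    for b in BASES})
    return pwm

def best_hit(pwm, sequence):
    """Return (start, score) of the highest-scoring window in the sequence."""
    width = len(pwm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if best is None or score > best[1]:
            best = (start, score)
    return best

# Hypothetical aligned sites and test sequence, for illustration only.
example_sites = ["TGACA", "TGACT", "TGTCA", "TGACA"]
example_pwm = pwm_from_counts(counts_from_sites(example_sites))
print(best_hit(example_pwm, "CCTGACAGG"))
```

When the top-scoring window coincides with the annotated site and scores clearly above the other windows, the motif model discriminates well on that data set; the benchmark suites described above separate data sets where this holds from those where it does not.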
Affiliation(s)
- Geir Kjetil Sandve
- Department of Computer and Information Science, Norwegian University of Science and Technology (NTNU), Trondheim, Norway
- Osman Abul
- Department of Computer Engineering, TOBB University of Economics and Technology, Ankara, Turkey
- Vegard Walseng
- Department of Computer and Information Science, Norwegian University of Science and Technology (NTNU), Trondheim, Norway
- Finn Drabløs
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology (NTNU), Trondheim, Norway
23
|
Yan B, Lovley DR, Krushkal J. Genome-wide similarity search for transcription factors and their binding sites in a metal-reducing prokaryote Geobacter sulfurreducens. Biosystems 2006; 90:421-41. [PMID: 17184904 DOI: 10.1016/j.biosystems.2006.10.006] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Received: 07/27/2006] [Revised: 09/21/2006] [Accepted: 10/20/2006] [Indexed: 12/26/2022]
Abstract
The knowledge obtained from understanding individual elements involved in gene regulation is important for reconstructing gene regulatory networks, a key for understanding cellular behavior. To study gene regulatory interactions in a model microorganism, Geobacter sulfurreducens, which participates in metal reduction and energy harvesting, we investigated the presence of 59 known Escherichia coli transcription factors and predicted transcription regulatory sites in its genome. The supplementary material, available at http://www.geobacter.org/research/genomescan/, provides the results of similarity comparisons that identified regulatory proteins of G. sulfurreducens and the genome locations of the predicted regulatory sites, including the list of putative regulatory elements in the upstream regions of every predicted operon and singleton open reading frame. Regulatory sequence elements, predicted using genome similarity searches to matrices of established transcription regulatory elements from E. coli, provide an initial insight into regulation of genes and operons in G. sulfurreducens. The predicted regulatory elements were predominantly located in the upstream regions of operons and singleton open reading frames. The validity of the predictions was examined using a permutation approach. Sequence similarity searches indicate that E. coli transcription factors ArgR, CytR, DeoR, FlhCD (both FlhC and FlhD subunits), FruR, GalR, GlpR, H-NS, LacI, MetJ, PurR, TrpR, and Tus are likely missing from G. sulfurreducens. Phylogenetic analysis suggests that one HU subunit is present in G. sulfurreducens as compared to two subunits in E. coli, while each of the two E. coli IHF subunits, HimA and HimD, have two homologs in G. sulfurreducens. The closest homolog of E. coli RpoE in G. sulfurreducens may be more similar to FecI than to RpoE. These findings represent the first step in the understanding of the regulatory relationships in G. sulfurreducens on the genome scale.
Affiliation(s)
- Bin Yan
- Department of Preventive Medicine, University of Tennessee Health Science Center, 66 N. Pauline St., Ste. 633, Memphis, TN 38163, USA
|
24
|
Liu CC, Lin CC, Chen WSE, Chen HY, Chang PC, Chen JJ, Yang PC. CRSD: a comprehensive web server for composite regulatory signature discovery. Nucleic Acids Res 2006; 34:W571-7. [PMID: 16845073 PMCID: PMC1538777 DOI: 10.1093/nar/gkl279] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.8] [Indexed: 01/05/2023] Open
Abstract
Transcription factors (TFs) and microRNAs play important roles in the regulation of human gene expression, and the study of their combinatory regulations of gene expression is a new research field. We constructed a comprehensive web server, the composite regulatory signature database (CRSD), that can be applied in investigating complex regulatory behaviors involving gene expression signatures (GESs), microRNA regulatory signatures (MRSs) and TF regulatory signatures (TRSs). Six well-known and large-scale databases, including the human UniGene, mature microRNAs, putative promoter, TRANSFAC, pathway and Gene Ontology (GO) databases, were integrated to provide the comprehensive analysis in CRSD. Two new genome-wide databases, of MRSs and TRSs, were also constructed and further integrated into CRSD. To accomplish the microarray data analysis at one go, several methods, including microarray data pretreatment, statistical and clustering analysis, iterative enrichment analysis and motif discovery, were closely integrated in the web server, which has not been the case in previous studies. Our implementation showed that the published literature could demonstrate the results of genome-wide enrichment analysis. We conclude that CRSD is a powerful and useful bioinformatic web server and may provide new insights into gene regulation networks. CRSD and the online tutorial are publicly available at .
Affiliation(s)
- Chun-Chi Liu
- Department of Computer Science, National Chung-Hsing University, Taichung, Taiwan, ROC
- Institutes of Biomedical Sciences and Molecular Biology, National Chung-Hsing University, Taichung, Taiwan, ROC
- Chin-Chung Lin
- Institutes of Biomedical Sciences and Molecular Biology, National Chung-Hsing University, Taichung, Taiwan, ROC
- Wen-Shyen E. Chen
- Department of Computer Science, National Chung-Hsing University, Taichung, Taiwan, ROC
- Hsuan-Yu Chen
- Graduate Institute of Epidemiology, National Taiwan University, Taipei, Taiwan, ROC
- Pei-Chun Chang
- Departments of Biotechnology and Bioinformatics, Asia University, Taichung, Taiwan, ROC
- Jeremy J.W. Chen
- Institutes of Biomedical Sciences and Molecular Biology, National Chung-Hsing University, Taichung, Taiwan, ROC
- NTU Center for Genomic Medicine, National Taiwan University College of Medicine, Taipei, Taiwan, ROC
- To whom correspondence should be addressed. Tel: 886 4 22840485, ext. 226; Fax: 886 4 22853469
- Pan-Chyr Yang
- NTU Center for Genomic Medicine, National Taiwan University College of Medicine, Taipei, Taiwan, ROC
|
25
|
Perco P, Rapberger R, Siehs C, Lukas A, Oberbauer R, Mayer G, Mayer B. Transforming omics data into context: Bioinformatics on genomics and proteomics raw data. Electrophoresis 2006; 27:2659-75. [PMID: 16739231 DOI: 10.1002/elps.200600064] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.8] [Indexed: 11/09/2022]
Abstract
Differential gene expression analysis and proteomics have exerted significant impact on the elucidation of concerted cellular processes, as simultaneous measurement of hundreds to thousands of individual objects on the level of RNA and protein ensembles became technically feasible. The availability of such data sets has promised a profound understanding of phenomena on an aggregate level, expressed as the phenotypic response (observables) of cells, e.g., in the presence of drugs, or characterization of cells and tissue displaying distinct patho-physiological states. However, the step of transforming these data into context, i.e., linking distinct expression or abundance patterns with phenotypic observables - and furthermore enabling a sound biological interpretation on the level of reaction networks and concerted pathways, is still a major shortcoming. This finding is certainly based on the enormous complexity embedded in cellular reaction networks, but a variety of computational approaches have been developed over the last few years to overcome these issues. This review provides an overview on computational procedures for analysis of genomic and proteomic data introducing a sequential analysis workflow: Explorative statistics for deriving a first, from the purely statistical viewpoint, relevant candidate gene/protein list, followed by co-regulation and network analysis to biologically expand this core list toward functional networks and pathways. The review on these procedures is complemented by example applications tailored at identification of disease-associated proteins. Optimization of computational procedures involved, in conjunction with the continuous increase in additional biological data, clearly has the potential of boosting our understanding of processes on a cell-wide level.
Affiliation(s)
- Paul Perco
- Department of Nephrology, Medical University of Vienna, Austria
|
26
|
Perco P, Kainz A, Mayer G, Lukas A, Oberbauer R, Mayer B. Detection of coregulation in differential gene expression profiles. Biosystems 2005; 82:235-47. [PMID: 16181729 DOI: 10.1016/j.biosystems.2005.08.001] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Received: 03/22/2005] [Revised: 08/02/2005] [Accepted: 08/02/2005] [Indexed: 01/04/2023]
Abstract
Genomics and proteomics approaches generate distinct gene expression and protein profiles, listing individual genes embedded in broad functional terms as gene ontologies. However, interpretation of gene profiles in a regulatory and functional context remains a major issue. Elucidation of regulatory mechanisms at the gene expression level via analysis of promoter regions is a prominent procedure to decipher such gene regulatory networks. We propose a novel genetic algorithm (GA) to extract joint promoter modules in a set of coexpressed genes as resulting from differential gene expression experiments. Algorithm design has focused on the following constraints: (I) identification of the major promoter modules, which are (II) characterized by a maximum number of joint motifs and (III) are found in a maximum number of coexpressed genes. The capability of the GA in detecting multiple modules was evaluated on various test data sets, analyzing the impact of the number of motifs per promoter module, the number of genes associated with a module, as well as the total number of distinct promoter modules encoded in a sequence set. In addition to the test data sets, the GA was evaluated on two biological examples, namely a muscle-specific data set and the upstream sequences of the beta-actin gene (ACTB) derived from different species, complemented by a comparison to alternative promoter module identification routines.
Affiliation(s)
- Paul Perco
- Institute for Biomolecular Structural Chemistry, University of Vienna, Campus Vienna Biocenter 6, 1030 Vienna, Austria
|