1
|
Hesami M, Alizadeh M, Jones AMP, Torkamaneh D. Machine learning: its challenges and opportunities in plant system biology. Appl Microbiol Biotechnol 2022; 106:3507-3530. [PMID: 35575915 DOI: 10.1007/s00253-022-11963-6] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/14/2022] [Revised: 03/14/2022] [Accepted: 05/07/2022] [Indexed: 12/25/2022]
Abstract
Sequencing technologies are evolving at a rapid pace, enabling the generation of massive amounts of data in multiple dimensions (e.g., genomics, epigenomics, transcriptomic, metabolomics, proteomics, and single-cell omics) in plants. To provide comprehensive insights into the complexity of plant biological systems, it is important to integrate different omics datasets. Although recent advances in computational analytical pipelines have enabled efficient and high-quality exploration and exploitation of single omics data, the integration of multidimensional, heterogenous, and large datasets (i.e., multi-omics) remains a challenge. In this regard, machine learning (ML) offers promising approaches to integrate large datasets and to recognize fine-grained patterns and relationships. Nevertheless, they require rigorous optimizations to process multi-omics-derived datasets. In this review, we discuss the main concepts of machine learning as well as the key challenges and solutions related to the big data derived from plant system biology. We also provide in-depth insight into the principles of data integration using ML, as well as challenges and opportunities in different contexts including multi-omics, single-cell omics, protein function, and protein-protein interaction. KEY POINTS: • The key challenges and solutions related to the big data derived from plant system biology have been highlighted. • Different methods of data integration have been discussed. • Challenges and opportunities of the application of machine learning in plant system biology have been highlighted and discussed.
Collapse
Affiliation(s)
- Mohsen Hesami
- Department of Plant Agriculture, University of Guelph, Guelph, ON, N1G 2W1, Canada
| | - Milad Alizadeh
- Department of Botany, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | | | - Davoud Torkamaneh
- Département de Phytologie, Université Laval, Québec City, QC, G1V 0A6, Canada. .,Institut de Biologie Intégrative Et Des Systèmes (IBIS), Université Laval, Québec City, QC, G1V 0A6, Canada.
| |
Collapse
|
2
|
Silva JCF, Teixeira RM, Silva FF, Brommonschenkel SH, Fontes EPB. Machine learning approaches and their current application in plant molecular biology: A systematic review. PLANT SCIENCE : AN INTERNATIONAL JOURNAL OF EXPERIMENTAL PLANT BIOLOGY 2019; 284:37-47. [PMID: 31084877 DOI: 10.1016/j.plantsci.2019.03.020] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/18/2018] [Revised: 02/28/2019] [Accepted: 03/26/2019] [Indexed: 05/19/2023]
Abstract
Machine learning (ML) is a field of artificial intelligence that has rapidly emerged in molecular biology, thus allowing the exploitation of Big Data concepts in plant genomics. In this context, the main challenges are given in terms of how to analyze massive datasets and extract new knowledge in all levels of cellular systems research. In summary, ML techniques allow complex interactions to be inferred in several biological systems. Despite its potential, ML has been underused due to complex computational algorithms and definition terms. Therefore, a systematic review to disentangle ML approaches is relevant for plant scientists and has been considered in this study. We presented the main steps for ML development (from data selection to evaluation of classification/prediction models) with a respective discussion approaching functional genomics mainly in terms of pathogen effector genes in plant immunity. Additionally, we also considered how to access public source databases under an ML framework towards advancing plant molecular biology and introduced novel powerful tools, such as deep learning.
Collapse
Affiliation(s)
- Jose Cleydson F Silva
- National Institute of Science and Technology in Plant-Pest Interactions, Bioagro, Universidade Federal de Viçosa, Av. PH Rolfs s/n, Centro, Viçosa, MG, 36570-000, Brazil; Department of Biochemistry and Molecular Biology/Bioagro, Universidade Federal de Viçosa, Viçosa, MG, Brazil
| | - Ruan M Teixeira
- National Institute of Science and Technology in Plant-Pest Interactions, Bioagro, Universidade Federal de Viçosa, Av. PH Rolfs s/n, Centro, Viçosa, MG, 36570-000, Brazil; Department of Biochemistry and Molecular Biology/Bioagro, Universidade Federal de Viçosa, Viçosa, MG, Brazil
| | - Fabyano F Silva
- Department of Animal Science, Universidade Federal de Viçosa, Viçosa, MG, Brazil
| | - Sergio H Brommonschenkel
- National Institute of Science and Technology in Plant-Pest Interactions, Bioagro, Universidade Federal de Viçosa, Av. PH Rolfs s/n, Centro, Viçosa, MG, 36570-000, Brazil; Plant Pathology Department /Bioagro, Universidade Federal de Viçosa, Viçosa, MG, Brazil
| | - Elizabeth P B Fontes
- National Institute of Science and Technology in Plant-Pest Interactions, Bioagro, Universidade Federal de Viçosa, Av. PH Rolfs s/n, Centro, Viçosa, MG, 36570-000, Brazil; Department of Biochemistry and Molecular Biology/Bioagro, Universidade Federal de Viçosa, Viçosa, MG, Brazil.
| |
Collapse
|
3
|
Image-based promoter prediction: a promoter prediction method based on evolutionarily generated patterns. Sci Rep 2018; 8:17695. [PMID: 30523308 PMCID: PMC6283834 DOI: 10.1038/s41598-018-36308-0] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2017] [Accepted: 11/12/2018] [Indexed: 11/18/2022] Open
Abstract
Prediction of promoter regions is crucial for studying gene function and regulation. The well-accepted position weight matrix method for this purpose relies on predefined motifs, which would hinder application across different species. Here, we introduce image-based promoter prediction (IBPP) as a method that creates an “image” from training promoter sequences using an evolutionary approach and predicts promoters by matching with the “image”. We used Escherichia coli σ70 promoter sequences to test the performance of IBPP and the combination of IBPP and a support vector machine algorithm (IBPP-SVM). The “images” generated with IBPP could effectively distinguish promoter from non-promoter sequences. Compared with IBPP, IBPP-SVM showed a substantial improvement in sensitivity. Furthermore, both methods showed good performance for sequences of up to 2,000 nt in length. The performances of IBPP and IBPP-SVM were largely affected by the threshold and dimension of vectors, respectively. The source code and documentation are freely available at https://github.com/hahatcdg/IBPP.
Collapse
|
4
|
Zhang S, Li M, Ji H, Fang Z. Landscape of transcriptional deregulation in lung cancer. BMC Genomics 2018; 19:435. [PMID: 29866045 PMCID: PMC5987572 DOI: 10.1186/s12864-018-4828-1] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2017] [Accepted: 05/25/2018] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND Lung cancer is a very heterogeneous disease that can be pathologically classified into different subtypes including small-cell lung carcinoma (SCLC), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC) and large-cell carcinoma (LCC). Although much progress has been made towards the oncogenic mechanism of each subtype, transcriptional circuits mediating the upstream signaling pathways and downstream functional consequences remain to be systematically studied. RESULTS Here we trained a one-class support vector machine (OC-SVM) model to establish a general transcription factor (TF) regulatory network containing 325 TFs and 18724 target genes. We then applied this network to lung cancer subtypes and identified those deregulated TFs and downstream targets. We found that the TP63/SOX2/DMRT3 module was specific to LUSC, corresponding to squamous epithelial differentiation and/or survival. Moreover, the LEF1/MSC module was specifically activated in LUAD and likely to confer epithelial-to-mesenchymal transition, known important for cancer malignant progression and metastasis. The proneural factor, ASCL1, was specifically up-regulated in SCLC which is known to have a neuroendocrine phenotype. Also, ID2 was differentially regulated between SCLC and LUSC, with its up-regulation in SCLC linking to energy supply for fast mitosis and its down-regulation in LUSC linking to the attenuation of immune response. We further described the landscape of TF regulation among the three major subtypes of lung cancer, highlighting their functional commonalities and specificities. CONCLUSIONS Our approach uncovered the landscape of transcriptional deregulation in lung cancer, and provided a useful resource of TF regulatory network for future studies.
Collapse
Affiliation(s)
- Shu Zhang
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240 People’s Republic of China
- State Key Laboratory of Cell Biology, Shanghai, China
- CAS Center for Excellence in Molecular Cell Science, Shanghai, China
- Innovation Center for Cell Signaling Network, Institute of Biochemistry and Cell Biology, Shanghai, 200031 China
| | - Mingfa Li
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240 People’s Republic of China
| | - Hongbin Ji
- State Key Laboratory of Cell Biology, Shanghai, China
- CAS Center for Excellence in Molecular Cell Science, Shanghai, China
- Innovation Center for Cell Signaling Network, Institute of Biochemistry and Cell Biology, Shanghai, 200031 China
- School of Life Science and Technology, Shanghai Tech University, Shanghai, 200120 China
| | - Zhaoyuan Fang
- State Key Laboratory of Cell Biology, Shanghai, China
- CAS Center for Excellence in Molecular Cell Science, Shanghai, China
- Innovation Center for Cell Signaling Network, Institute of Biochemistry and Cell Biology, Shanghai, 200031 China
- Shanghai Institutes for Biological Sciences, Chinese Academy of Science, Shanghai, 200031 China
| |
Collapse
|
5
|
Meher PK, Sahu TK, Rao AR, Wahi SD. Identification of donor splice sites using support vector machine: a computational approach based on positional, compositional and dependency features. Algorithms Mol Biol 2016; 11:16. [PMID: 27252772 PMCID: PMC4888255 DOI: 10.1186/s13015-016-0078-4] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/28/2015] [Accepted: 05/17/2016] [Indexed: 11/16/2022] Open
Abstract
Background Identification of splice sites is essential for annotation of genes. Though existing approaches have achieved an acceptable level of accuracy, still there is a need for further improvement. Besides, most of the approaches are species-specific and hence it is required to develop approaches compatible across species. Results Each splice site sequence was transformed into a numeric vector of length 49, out of which four were positional, four were dependency and 41 were compositional features. Using the transformed vectors as input, prediction was made through support vector machine. Using balanced training set, the proposed approach achieved area under ROC curve (AUC-ROC) of 96.05, 96.96, 96.95, 96.24 % and area under PR curve (AUC-PR) of 97.64, 97.89, 97.91, 97.90 %, while tested on human, cattle, fish and worm datasets respectively. On the other hand, AUC-ROC of 97.21, 97.45, 97.41, 98.06 % and AUC-PR of 93.24, 93.34, 93.38, 92.29 % were obtained, while imbalanced training datasets were used. The proposed approach was found comparable with state-of-art splice site prediction approaches, while compared using the bench mark NN269 dataset and other datasets. Conclusions The proposed approach achieved consistent accuracy across different species as well as found comparable with the existing approaches. Thus, we believe that the proposed approach can be used as a complementary method to the existing methods for the prediction of splice sites. A web server named as ‘HSplice’ has also been developed based on the proposed approach for easy prediction of 5′ splice sites by the users and is freely available at http://cabgrid.res.in:8080/HSplice.
Collapse
|
6
|
Boeva V. Analysis of Genomic Sequence Motifs for Deciphering Transcription Factor Binding and Transcriptional Regulation in Eukaryotic Cells. Front Genet 2016; 7:24. [PMID: 26941778 PMCID: PMC4763482 DOI: 10.3389/fgene.2016.00024] [Citation(s) in RCA: 87] [Impact Index Per Article: 10.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2015] [Accepted: 02/05/2016] [Indexed: 12/27/2022] Open
Abstract
Eukaryotic genomes contain a variety of structured patterns: repetitive elements, binding sites of DNA and RNA associated proteins, splice sites, and so on. Often, these structured patterns can be formalized as motifs and described using a proper mathematical model such as position weight matrix and IUPAC consensus. Two key tasks are typically carried out for motifs in the context of the analysis of genomic sequences. These are: identification in a set of DNA regions of over-represented motifs from a particular motif database, and de novo discovery of over-represented motifs. Here we describe existing methodology to perform these two tasks for motifs characterizing transcription factor binding. When applied to the output of ChIP-seq and ChIP-exo experiments, or to promoter regions of co-modulated genes, motif analysis techniques allow for the prediction of transcription factor binding events and enable identification of transcriptional regulators and co-regulators. The usefulness of motif analysis is further exemplified in this review by how motif discovery improves peak calling in ChIP-seq and ChIP-exo experiments and, when coupled with information on gene expression, allows insights into physical mechanisms of transcriptional modulation.
Collapse
Affiliation(s)
- Valentina Boeva
- Centre de Recherche, Institut CurieParis, France; INSERM, U900Paris, France; Mines ParisTechFontainebleau, France; PSL Research UniversityParis, France; Department of Development, Reproduction and Cancer, Institut CochinParis, France; INSERM, U1016Paris, France; Centre National de la Recherche Scientifique UMR 8104Paris, France; Université Paris Descartes UMR-S1016Paris, France
| |
Collapse
|
7
|
Kamath U, De Jong K, Shehu A. Effective automated feature construction and selection for classification of biological sequences. PLoS One 2014; 9:e99982. [PMID: 25033270 PMCID: PMC4102475 DOI: 10.1371/journal.pone.0099982] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2013] [Accepted: 05/21/2014] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND Many open problems in bioinformatics involve elucidating underlying functional signals in biological sequences. DNA sequences, in particular, are characterized by rich architectures in which functional signals are increasingly found to combine local and distal interactions at the nucleotide level. Problems of interest include detection of regulatory regions, splice sites, exons, hypersensitive sites, and more. These problems naturally lend themselves to formulation as classification problems in machine learning. When classification is based on features extracted from the sequences under investigation, success is critically dependent on the chosen set of features. METHODOLOGY We present an algorithmic framework (EFFECT) for automated detection of functional signals in biological sequences. We focus here on classification problems involving DNA sequences which state-of-the-art work in machine learning shows to be challenging and involve complex combinations of local and distal features. EFFECT uses a two-stage process to first construct a set of candidate sequence-based features and then select a most effective subset for the classification task at hand. Both stages make heavy use of evolutionary algorithms to efficiently guide the search towards informative features capable of discriminating between sequences that contain a particular functional signal and those that do not. RESULTS To demonstrate its generality, EFFECT is applied to three separate problems of importance in DNA research: the recognition of hypersensitive sites, splice sites, and ALU sites. Comparisons with state-of-the-art algorithms show that the framework is both general and powerful. In addition, a detailed analysis of the constructed features shows that they contain valuable biological information about DNA architecture, allowing biologists and other researchers to directly inspect the features and potentially use the insights obtained to assist wet-laboratory studies on retainment or modification of a specific signal. Code, documentation, and all data for the applications presented here are provided for the community at http://www.cs.gmu.edu/~ashehu/?q=OurTools.
Collapse
Affiliation(s)
- Uday Kamath
- Computer Science, George Mason University, Fairfax, Virginia, United States of America
| | - Kenneth De Jong
- Computer Science, George Mason University, Fairfax, Virginia, United States of America
- Krasnow Institute, George Mason University, Fairfax, Virginia, United States of America
| | - Amarda Shehu
- Computer Science, George Mason University, Fairfax, Virginia, United States of America
- Bioengineering, George Mason University, Fairfax, Virginia, United States of America
- School of Systems Biology, George Mason University, Fairfax, Virginia, United States of America
| |
Collapse
|
8
|
Lim JH, Iggo RD, Barker D. Models incorporating chromatin modification data identify functionally important p53 binding sites. Nucleic Acids Res 2013; 41:5582-93. [PMID: 23599002 PMCID: PMC3675478 DOI: 10.1093/nar/gkt260] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022] Open
Abstract
Genome-wide prediction of transcription factor binding sites is notoriously difficult. We have developed and applied a logistic regression approach for prediction of binding sites for the p53 transcription factor that incorporates sequence information and chromatin modification data. We tested this by comparison of predicted sites with known binding sites defined by chromatin immunoprecipitation (ChIP), by the location of predictions relative to genes, by the function of nearby genes and by analysis of gene expression data after p53 activation. We compared the predictions made by our novel model with predictions based only on matches to a sequence position weight matrix (PWM). In whole genome assays, the fraction of known sites identified by the two models was similar, suggesting that there was little to be gained from including chromatin modification data. In contrast, there were highly significant and biologically relevant differences between the two models in the location of the predicted binding sites relative to genes, in the function of nearby genes and in the responsiveness of nearby genes to p53 activation. We propose that these contradictory results can be explained by PWM and ChIP data reflecting primarily biophysical properties of protein–DNA interactions, whereas chromatin modification data capture biologically important functional information.
Collapse
Affiliation(s)
- Ji-Hyun Lim
- Sir Harold Mitchell Building, School of Biology, University of St Andrews, St Andrews, Fife, KY16 9TH, UK
| | | | | |
Collapse
|
9
|
Genome-wide identification of Polycomb target genes in human embryonic stem cells. Gene 2013; 518:425-30. [PMID: 23313299 DOI: 10.1016/j.gene.2012.12.022] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2012] [Revised: 10/22/2012] [Accepted: 12/02/2012] [Indexed: 11/22/2022]
Abstract
Polycomb group (PcG) proteins are epigenetic regulators that are essential for stem cell differentiation. Identifying PcG binding profiles is important for understanding the mechanisms of PcG-mediated repression in mammals. We used a mapping-convergence (M-C) algorithm using support vector machine (SVM) technology for genome-wide identification of PcG target genes in human embryonic stem cells. The method combined histone modifications and transcription factor binding motifs, eliminating the need for negative training samples as in traditional SVM. Good prediction accuracy comprising 3-fold cross-validation was obtained. In the analysis of 3133 PcG target genes identified by the model, PcG proteins were observed to suppress gene expression during differentiation. The results suggested that PcG and DNA methylation non-redundantly repress gene expression during differentiation. The genome-wide identification of PcG target genes will aid the further analysis of PcG mechanisms.
Collapse
|
10
|
He Y, Zhang Y, Zheng G, Wei C. CTF: a CRF-based transcription factor binding sites finding system. BMC Genomics 2012; 13 Suppl 8:S18. [PMID: 23282203 PMCID: PMC3535700 DOI: 10.1186/1471-2164-13-s8-s18] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022] Open
Abstract
Background Identifying the location of transcription factor bindings is crucial to understand transcriptional regulation. Currently, Chromatin Immunoprecipitation followed with high-throughput Sequencing (ChIP-seq) is able to locate the transcription factor binding sites (TFBSs) accurately in high throughput and it has become the gold-standard method for TFBS finding experimentally. However, due to its high cost, it is impractical to apply the method in a very large scale. Considering the large number of transcription factors, numerous cell types and various conditions, computational methods are still very valuable to accurate TFBS identification. Results In this paper, we proposed a novel integrated TFBS prediction system, CTF, based on Conditional Random Fields (CRFs). Integrating information from different sources, CTF was able to capture patterns of TFBSs contained in different features (sequence, chromatin and etc) and predicted the TFBS locations with a high accuracy. We compared CTF with several existing tools as well as the PWM baseline method on a dataset generated by ChIP-seq experiments (TFBSs of 13 transcription factors in mouse genome). Results showed that CTF performed significantly better than existing methods tested. Conclusions CTF is a powerful tool to predict TFBSs by integrating high throughput data and different features. It can be a useful complement to ChIP-seq and other experimental methods for TFBS identification and thus improve our ability to investigate functional elements in post-genomic era. Availability: CTF is freely available to academic users at: http://cbb.sjtu.edu.cn/~ccwei/pub/software/CTF/CTF.php
Collapse
Affiliation(s)
- Yupeng He
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China
| | | | | | | |
Collapse
|
11
|
Zhu X, Ding W, Yu PS, Zhang C. One-class learning and concept summarization for data streams. Knowl Inf Syst 2010. [DOI: 10.1007/s10115-010-0331-y] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
|
12
|
Daenen F, van Roy F, De Bleser PJ. Low nucleosome occupancy is encoded around functional human transcription factor binding sites. BMC Genomics 2008; 9:332. [PMID: 18627598 PMCID: PMC2490708 DOI: 10.1186/1471-2164-9-332] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2008] [Accepted: 07/15/2008] [Indexed: 12/04/2022] Open
Abstract
Background Transcriptional regulation of genes in eukaryotes is achieved by the interactions of multiple transcription factors with arrays of transcription factor binding sites (TFBSs) on DNA and with each other. Identification of these TFBSs is an essential step in our understanding of gene regulatory networks, but computational prediction of TFBSs with either consensus or commonly used stochastic models such as Position-Specific Scoring Matrices (PSSMs) results in an unacceptably high number of hits consisting of a few true functional binding sites and numerous false non-functional binding sites. This is due to the inability of the models to incorporate higher order properties of sequences including sequences surrounding TFBSs and influencing the positioning of nucleosomes and/or the interactions that might occur between transcription factors. Results Significant improvement can be expected through the development of a new framework for the modeling and prediction of TFBSs that considers explicitly these higher order sequence properties. It would be particularly interesting to include in the new modeling framework the information present in the nucleosome positioning sequences (NPSs) surrounding TFBSs, as it can be hypothesized that genomes use this information to encode the formation of stable nucleosomes over non-functional sites, while functional sites have a more open chromatin configuration. In this report we evaluate the usefulness of the latter feature by comparing the nucleosome occupancy probabilities around experimentally verified human TFBSs with the nucleosome occupancy probabilities around false positive TFBSs and in random sequences. Conclusion We present evidence that nucleosome occupancy is remarkably lower around true functional human TFBSs as compared to non-functional human TFBSs, which supports the use of this feature to improve current TFBS prediction approaches in higher eukaryotes.
Collapse
|
13
|
Hannenhalli S. Eukaryotic transcription factor binding sites--modeling and integrative search methods. Bioinformatics 2008; 24:1325-31. [PMID: 18426806 DOI: 10.1093/bioinformatics/btn198] [Citation(s) in RCA: 76] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/17/2023] Open
Abstract
A comprehensive knowledge of transcription factor binding sites (TFBS) is important for a mechanistic understanding of transcriptional regulation as well as for inferring gene regulatory networks. Because the DNA motif recognized by a transcription factor is typically short and degenerate, computational approaches for identifying binding sites based only on the sequence motif inevitably suffer from high error rates. Current state-of-the-art techniques for improving computational identification of binding sites can be broadly categorized into two classes: (1) approaches that aim to improve binding motif models by extracting maximal sequence information from experimentally determined binding sites and (2) approaches that supplement binding motif models with additional genomic or other attributes (such as evolutionary conservation). In this review we will discuss recent attempts to improve computational identification of TFBS through these two types of approaches and conclude with thoughts on future development.
Collapse
Affiliation(s)
- Sridhar Hannenhalli
- Penn Center for Bioinformatics and Department of Genetics, University of Pennsylvania, Philadelphia, USA.
| |
Collapse
|