1
Development of a Tree Growth Difference Equation and Its Application in Forecasting the Biomass Carbon Stocks of Chinese Forests in 2050. FORESTS 2019. [DOI: 10.3390/f10070582] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 11/17/2022]
Abstract
Global climate change has raised concerns about the relationship between ecosystems and forests, which is a core component of the carbon cycle and a critical factor in understanding and mitigating the effects of climate change. Forest models and sufficient information for predictions are important for ensuring efficient afforestation activities and sustainable forest development. Based on the theory of difference equations and the general rules of tree growth, this study established a difference equation for the relationship between the ratio of tree diameter at breast height (DBH) to the tree height and age of China’s main arbor species. A comparison with equations that represent the traditional tree growth models, i.e., Logistic and Richards equations, showed that the difference equations exhibited higher precision for both fitting and verification data. Moreover, the biomass carbon stocks (BCS) of Chinese forests from 2013 to 2050 were predicted by combining the 8th Chinese Ministry of Forestry and partial continuous forest inventory (CFI) data sets. The results showed that the BCS of Chinese forests would increase from 7342 to 11,030 teragrams of carbon (Tg C) in 2013–2050, with an annual biomass C (carbon) sink of 99.68 Tg C year⁻¹, and they indicated that the Chinese land-surface forest vegetation has an important carbon sequestration capability.
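The annual sink quoted in the abstract can be checked with simple arithmetic. The sketch below assumes the sink is the total stock change averaged over the 37 years from 2013 to 2050; the variable names are illustrative and not taken from the paper.

```python
# Stock figures quoted in the abstract, in Tg C.
bcs_2013 = 7342.0
bcs_2050 = 11030.0

# Assumption: the sink is the mean yearly change over 2013-2050.
years = 2050 - 2013

annual_sink = (bcs_2050 - bcs_2013) / years
print(round(annual_sink, 2))
```

Under that assumption the mean yearly change agrees with the 99.68 Tg C per year reported above.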
2
Pashaei E, Yilmaz A, Ozen M, Aydin N. A novel method for splice sites prediction using sequence component and hidden Markov model. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2017; 2016:3076-3079. [PMID: 28268961 DOI: 10.1109/embc.2016.7591379] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 11/07/2022]
Abstract
With increasing growth of DNA sequence data, it has become an urgent demand to develop new methods to accurately predict the genes. The performance of gene detection methods mainly depends on the efficiency of splice site prediction methods. In this paper, a novel method for detecting splice sites is proposed by using a new effective DNA encoding method and AdaBoost.M1 classifier. Our proposed DNA encoding method is based on multi-scale component (MSC) and first order Markov model (MM1). It has been applied to the HS3D dataset with repeated 10-fold cross validation. The experimental results indicate that the new method has increased the classification accuracy and outperformed some current methods such as MM1-SVM, Reduced MM1-SVM, SVM-B, LVMM, DM-SVM, DM2-AdaBoost and MSC+Pos(+APR)-SVM.
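One common way to turn a first order Markov model into sequence features is to estimate position-specific transition probabilities from aligned training sequences and encode each sequence as the probabilities of its own transitions. The sketch below illustrates that general idea only; the function names, the pseudocount, and the exact encoding are assumptions and are not taken from the paper.

```python
from collections import defaultdict

ALPHABET = "ACGT"

def fit_mm1(seqs, pseudo=1.0):
    """Estimate position-specific P(x_i | x_{i-1}) from equal-length sequences."""
    length = len(seqs[0])
    counts = [defaultdict(float) for _ in range(length - 1)]
    for s in seqs:
        for i in range(1, length):
            counts[i - 1][(s[i - 1], s[i])] += 1.0
    probs = []
    for c in counts:
        table = {}
        for a in ALPHABET:
            # Pseudocounts (an assumption) keep unseen transitions non-zero.
            total = sum(c[(a, b)] for b in ALPHABET) + pseudo * len(ALPHABET)
            for b in ALPHABET:
                table[(a, b)] = (c[(a, b)] + pseudo) / total
        probs.append(table)
    return probs

def encode_mm1(seq, probs):
    """Encode a sequence as the probability of each of its transitions."""
    return [probs[i - 1][(seq[i - 1], seq[i])] for i in range(1, len(seq))]

train = ["AGGT", "AGGT", "CGGT"]  # toy data, not from HS3D
probs = fit_mm1(train)
features = encode_mm1("AGGT", probs)
```

The resulting fixed-length vector could then be passed to any classifier; the paper pairs its encoding with AdaBoost.M1.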
3
4
Herndon N, Caragea D. A Study of Domain Adaptation Classifiers Derived From Logistic Regression for the Task of Splice Site Prediction. IEEE Trans Nanobioscience 2016; 15:75-83. [PMID: 26849871 DOI: 10.1109/tnb.2016.2522400] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Indexed: 11/06/2022]
Abstract
Supervised classifiers are highly dependent on abundant labeled training data. Alternatives for addressing the lack of labeled data include: labeling data (but this is costly and time consuming); training classifiers with abundant data from another domain (however, the classification accuracy usually decreases as the distance between domains increases); or complementing the limited labeled data with abundant unlabeled data from the same domain and learning semi-supervised classifiers (but the unlabeled data can mislead the classifier). A better alternative is to use the abundant labeled data from a source domain together with the limited labeled data and, optionally, the unlabeled data from the target domain to train classifiers in a domain adaptation setting. We propose two such classifiers, based on logistic regression, and evaluate them for the task of splice site prediction, a difficult and essential step in gene prediction. Our classifiers achieved high accuracy, with highest areas under the precision-recall curve between 50.83% and 82.61%.
5
Random Forest in Splice Site Prediction of Human Genome. XIV MEDITERRANEAN CONFERENCE ON MEDICAL AND BIOLOGICAL ENGINEERING AND COMPUTING 2016 2016. [DOI: 10.1007/978-3-319-32703-7_100] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Indexed: 01/09/2023]
6
Kamath U, De Jong K, Shehu A. Effective automated feature construction and selection for classification of biological sequences. PLoS One 2014; 9:e99982. [PMID: 25033270 PMCID: PMC4102475 DOI: 10.1371/journal.pone.0099982] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.1] [Received: 09/12/2013] [Accepted: 05/21/2014] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND Many open problems in bioinformatics involve elucidating underlying functional signals in biological sequences. DNA sequences, in particular, are characterized by rich architectures in which functional signals are increasingly found to combine local and distal interactions at the nucleotide level. Problems of interest include detection of regulatory regions, splice sites, exons, hypersensitive sites, and more. These problems naturally lend themselves to formulation as classification problems in machine learning. When classification is based on features extracted from the sequences under investigation, success is critically dependent on the chosen set of features. METHODOLOGY We present an algorithmic framework (EFFECT) for automated detection of functional signals in biological sequences. We focus here on classification problems involving DNA sequences which state-of-the-art work in machine learning shows to be challenging and involve complex combinations of local and distal features. EFFECT uses a two-stage process to first construct a set of candidate sequence-based features and then select a most effective subset for the classification task at hand. Both stages make heavy use of evolutionary algorithms to efficiently guide the search towards informative features capable of discriminating between sequences that contain a particular functional signal and those that do not. RESULTS To demonstrate its generality, EFFECT is applied to three separate problems of importance in DNA research: the recognition of hypersensitive sites, splice sites, and ALU sites. Comparisons with state-of-the-art algorithms show that the framework is both general and powerful. 
In addition, a detailed analysis of the constructed features shows that they contain valuable biological information about DNA architecture, allowing biologists and other researchers to directly inspect the features and potentially use the insights obtained to assist wet-laboratory studies on retainment or modification of a specific signal. Code, documentation, and all data for the applications presented here are provided for the community at http://www.cs.gmu.edu/~ashehu/?q=OurTools.
Affiliation(s)
- Uday Kamath
- Computer Science, George Mason University, Fairfax, Virginia, United States of America
- Kenneth De Jong
- Computer Science, George Mason University, Fairfax, Virginia, United States of America
- Krasnow Institute, George Mason University, Fairfax, Virginia, United States of America
- Amarda Shehu
- Computer Science, George Mason University, Fairfax, Virginia, United States of America
- Bioengineering, George Mason University, Fairfax, Virginia, United States of America
- School of Systems Biology, George Mason University, Fairfax, Virginia, United States of America
7
Li JL, Wang LF, Wang HY, Bai LY, Yuan ZM. High-accuracy splice site prediction based on sequence component and position features. GENETICS AND MOLECULAR RESEARCH 2012; 11:3432-51. [PMID: 23079837 DOI: 10.4238/2012.september.25.12] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Indexed: 11/03/2022]
Abstract
Identification of splice sites plays a key role in the annotation of genes. Consequently, improvement of computational prediction of splice sites would be very useful. We examined the effect of the window size and the number and position of the consensus bases with a chi-square test, and then extracted the sequence multi-scale component features and the position and adjacent position relationship features of consensus sites. Then, we constructed a novel classification model using a support vector machine with the previously selected features and applied it to the Homo sapiens splice site dataset. This method greatly improved cross-validation accuracies for training sets with true and spurious splice sites of both equal and different proportions. This method was also applied to the NN269 dataset for further evaluation and independent testing. The results were superior to those obtained with previous methods, and demonstrate the stability and superiority of this method for prediction of splice sites.
Affiliation(s)
- J L Li
- Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Hunan Agricultural University, Changsha, China
8
Abstract
MOTIVATION Quantification of sequence abundance in RNA-Seq experiments is often conflated by protocol-specific sequence bias. The exact sources of the bias are unknown, but may be influenced by polymerase chain reaction amplification, or differing primer affinities and mixtures, for example. The result is decreased accuracy in many applications, such as de novo gene annotation and transcript quantification. RESULTS We present a new method to measure and correct for these influences using a simple graphical model. Our model does not rely on existing gene annotations, and model selection is performed automatically making it applicable with few assumptions. We evaluate our method on several datasets, and by multiple criteria, demonstrating that it effectively decreases bias and increases uniformity. Additionally, we provide theoretical and empirical results showing that the method is unlikely to have any effect on unbiased data, suggesting it can be applied with little risk of spurious adjustment. AVAILABILITY The method is implemented in the seqbias R/Bioconductor package, available freely under the LGPL license from http://bioconductor.org CONTACT dcjones@cs.washington.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Daniel C Jones
- Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195-2350, USA.
9
Nasibov E, Tunaboylu S. Classification of splice-junction sequences via weighted position specific scoring approach. Comput Biol Chem 2010; 34:293-9. [PMID: 21056007 DOI: 10.1016/j.compbiolchem.2010.10.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 10/04/2010] [Accepted: 10/06/2010] [Indexed: 11/30/2022]
Abstract
The prediction of the complete structure of genes is one of the very important tasks of bioinformatics, especially in eukaryotes. A crucial part in the gene structure prediction is to determine the splice sites in the coding region. Identification of splice sites depends on the precise recognition of the boundaries between exons and introns of a given DNA sequence. This problem can be formulated as a classification of sequence elements into 'exon-intron' (EI), 'intron-exon' (IE) or 'None' (N) boundary classes. In this study we propose a new Weighted Position Specific Scoring Method (WPSSM) to recognize splice sites, which uses a position-specific scoring matrix constructed from nucleotide base frequencies. A genetic algorithm is used to tune the weight and threshold parameters of the positions. This method consists of two phases: a learning phase and an identification phase. The proposed WPSS method yields efficient results compared with the performance of many methods proposed in the literature. Computational experiments are performed on the DNA sequence datasets from 'UCI Repository of machine learning databases'.
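The idea of a position-specific scoring matrix built from base frequencies, combined with per-position weights, can be sketched as follows. This is a generic log-odds construction under stated assumptions (uniform background of 0.25, a pseudocount of 1, hand-picked weights); it is not the authors' WPSSM, and the genetic algorithm that tunes the weights and threshold is omitted.

```python
import math

ALPHABET = "ACGT"

def build_pssm(seqs, pseudo=1.0, background=0.25):
    """Log-odds score per position and base, from aligned training sequences."""
    pssm = []
    for i in range(len(seqs[0])):
        column = {}
        for base in ALPHABET:
            count = sum(1 for s in seqs if s[i] == base)
            freq = (count + pseudo) / (len(seqs) + pseudo * len(ALPHABET))
            column[base] = math.log2(freq / background)
        pssm.append(column)
    return pssm

def weighted_score(seq, pssm, weights):
    """Sum of per-position scores, each scaled by a position weight."""
    return sum(w * col[ch] for ch, col, w in zip(seq, pssm, weights))

donors = ["GTAAGT", "GTGAGT", "GTAAGA"]  # toy examples, not the UCI data
pssm = build_pssm(donors)
weights = [2.0, 2.0, 1.0, 1.0, 1.0, 1.0]  # hypothetical; the paper tunes these
site_score = weighted_score("GTAAGT", pssm, weights)
decoy_score = weighted_score("CCCCCC", pssm, weights)
```

A candidate would be called a splice site when its score exceeds a threshold, which in the paper is also a tuned parameter.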
Affiliation(s)
- Efendi Nasibov
- Department of Computer Science, Dokuz Eylul University, Izmir, Turkey.
10
Assi SA, Tanaka T, Rabbitts TH, Fernandez-Fuentes N. PCRPi: Presaging Critical Residues in Protein interfaces, a new computational tool to chart hot spots in protein interfaces. Nucleic Acids Res 2010; 38:e86. [PMID: 20008102 PMCID: PMC2847225 DOI: 10.1093/nar/gkp1158] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.1] [Received: 07/09/2009] [Revised: 11/13/2009] [Accepted: 11/24/2009] [Indexed: 11/30/2022] Open
Abstract
Protein-protein interactions (PPIs) are ubiquitous in Biology, and thus offer an enormous potential for the discovery of novel therapeutics. Although protein interfaces are large and lack defining physicochemical traits, it is well established that only a small portion of interface residues, the so-called hot spot residues, contribute the most to the binding energy of the protein complex. Moreover, recent successes in development of novel drugs aimed at disrupting PPIs rely on targeting such residues. Experimental methods for describing critical residues are lengthy and costly; therefore, there is a need for computational tools that can complement experimental efforts. Here, we describe a new computational approach to predict hot spot residues in protein interfaces. The method, called Presaging Critical Residues in Protein interfaces (PCRPi), depends on the integration of diverse metrics into a unique probabilistic measure by using Bayesian Networks. We have benchmarked our method using a large set of experimentally verified hot spot residues and on a blind prediction on the protein complex formed by HRAS protein and a single domain antibody. Under both scenarios, PCRPi delivered consistent and accurate predictions. Finally, PCRPi is able to handle cases where some of the input data is either missing or not reliable (e.g. evolutionary information).
Affiliation(s)
- Narcis Fernandez-Fuentes
- Leeds Institute of Molecular Medicine, Section of Experimental Therapeutics, St James’s University Hospital, University of Leeds, Leeds, LS9 7TF, UK
11
Keilwagen J, Grau J, Posch S, Grosse I. Apples and oranges: avoiding different priors in Bayesian DNA sequence analysis. BMC Bioinformatics 2010; 11:149. [PMID: 20307305 PMCID: PMC2859755 DOI: 10.1186/1471-2105-11-149] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 08/21/2009] [Accepted: 03/22/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND One of the challenges of bioinformatics remains the recognition of short signal sequences in genomic DNA such as donor or acceptor splice sites, splicing enhancers or silencers, translation initiation sites, transcription start sites, transcription factor binding sites, nucleosome binding sites, miRNA binding sites, or insulator binding sites. During the last decade, a wealth of algorithms for the recognition of such DNA sequences has been developed and compared with the goal of improving their performance and deepening our understanding of the underlying cellular processes. Most of these algorithms are based on statistical models belonging to the family of Markov random fields such as position weight matrix models, weight array matrix models, Markov models of higher order, or moral Bayesian networks. While in many comparative studies different learning principles or different statistical models have been compared, the influence of choosing different prior distributions for the model parameters when using different learning principles has been overlooked, and possibly led to questionable conclusions. RESULTS With the goal of allowing direct comparisons of different learning principles for models from the family of Markov random fields based on the same a-priori information, we derive a generalization of the commonly-used product-Dirichlet prior. We find that the derived prior behaves like a Gaussian prior close to the maximum and like a Laplace prior in the far tails. In two case studies, we illustrate the utility of the derived prior for a direct comparison of different learning principles with different models for the recognition of binding sites of the transcription factor Sp1 and human donor splice sites. CONCLUSIONS We find that comparisons of different learning principles using the same a-priori information can lead to conclusions different from those of previous studies in which the effect resulting from different priors has been neglected.
The derived prior is implemented in the open-source library Jstacs to enable an easy application to comparative studies of different learning principles in the field of sequence analysis.
Affiliation(s)
- Jens Keilwagen
- Molecular Genetics, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany.
12
Keilwagen J, Grau J, Posch S, Strickert M, Grosse I. Unifying generative and discriminative learning principles. BMC Bioinformatics 2010; 11:98. [PMID: 20175896 PMCID: PMC2848239 DOI: 10.1186/1471-2105-11-98] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 11/05/2009] [Accepted: 02/22/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The recognition of functional binding sites in genomic DNA remains one of the fundamental challenges of genome research. During the last decades, a plethora of different and well-adapted models has been developed, but only little attention has been paid to the development of different and similarly well-adapted learning principles. Only recently it was noticed that discriminative learning principles can be superior to generative ones in diverse bioinformatics applications, too. RESULTS Here, we propose a generalization of generative and discriminative learning principles containing the maximum likelihood, maximum a posteriori, maximum conditional likelihood, maximum supervised posterior, generative-discriminative trade-off, and penalized generative-discriminative trade-off learning principles as special cases, and we illustrate its efficacy for the recognition of vertebrate transcription factor binding sites. CONCLUSIONS We find that the proposed learning principle helps to improve the recognition of transcription factor binding sites, enabling better computational approaches for extracting as much information as possible from valuable wet-lab data. We make all implementations available in the open-source library Jstacs so that this learning principle can be easily applied to other classification problems in the field of genome and epigenome analysis.
Affiliation(s)
- Jens Keilwagen
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany
- Jan Grau
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Germany
- Stefan Posch
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Germany
- Marc Strickert
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany
- Ivo Grosse
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Germany
13
Buckley KM, Florea LD, Smith LC. A method for identifying alternative or cryptic donor splice sites within gene and mRNA sequences. Comparisons among sequences from vertebrates, echinoderms and other groups. BMC Genomics 2009; 10:318. [PMID: 19607703 PMCID: PMC2721852 DOI: 10.1186/1471-2164-10-318] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 12/05/2008] [Accepted: 07/16/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND As the amount of genome sequencing data grows, so does the problem of computational gene identification, and in particular, the splicing signals that flank exon borders. Traditional methods for identifying splicing signals have been created and optimized using sequences from model organisms, mostly vertebrate and yeast species. However, as genome sequencing extends across the animal kingdom and includes various invertebrate species, the need for mechanisms to recognize splice signals in these organisms increases as well. With that aim in mind, we generated a model for identifying donor and acceptor splice sites that was optimized using sequences from the purple sea urchin, Strongylocentrotus purpuratus. This model was then used to assess the possibility of alternative or cryptic splicing within the highly variable immune response gene family known as 185/333. RESULTS A donor splice site model was generated from S. purpuratus sequences that incorporates non-adjacent dependences among positions within the 9 nt splice signal and uses position weight matrices to determine the probability that the site is used for splicing. The Purpuratus model was shown to predict splice signals better than a similar model created from vertebrate sequences. Although the Purpuratus model was able to correctly predict the true splice sites within the 185/333 genes, no evidence for alternative or trans-gene splicing was observed. CONCLUSION The data presented herein describe the first published analyses of echinoderm splice sites and suggest that the previous methods of identifying splice signals that are based largely on vertebrate sequences may be insufficient. Furthermore, alternative or trans-gene splicing does not appear to be acting as a diversification mechanism in the 185/333 gene family.
Affiliation(s)
- Katherine M Buckley
- The Department of Biological Sciences, Washington University, Washington, DC 20052, USA.
14
Baten AKMA, Halgamuge SK, Chang BCH. Fast splice site detection using information content and feature reduction. BMC Bioinformatics 2008; 9 Suppl 12:S8. [PMID: 19091031 PMCID: PMC2638148 DOI: 10.1186/1471-2105-9-s12-s8] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.3] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Accurate identification of splice sites in DNA sequences plays a key role in the prediction of gene structure in eukaryotes. Already many computational methods have been proposed for the detection of splice sites and some of them showed high prediction accuracy. However, most of these methods are limited in terms of their long computation time when applied to whole genome sequence data. RESULTS In this paper we propose a hybrid algorithm which combines several effective and informative input features with the state of the art support vector machine (SVM). To obtain the input features we employ information content method based on Shannon's information theory, Shapiro's score scheme, and Markovian probabilities. We also use a feature elimination scheme to reduce the less informative features from the input data. CONCLUSION In this study we propose a new feature based splice site detection method that shows improved acceptor and donor splice site detection in DNA sequences when the performance is compared with various state of the art and well known methods.
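Shannon information content per alignment position is a standard quantity and can be sketched briefly. The version below computes 2 minus the column entropy in bits for DNA, without a small-sample correction; how the paper turns this into classifier features is not stated in the abstract, so the function name and the toy data are assumptions.

```python
import math

def information_content(seqs):
    """Per-position information content in bits (maximum 2.0 for DNA)."""
    ic = []
    for i in range(len(seqs[0])):
        column = [s[i] for s in seqs]
        entropy = 0.0
        for base in "ACGT":
            p = column.count(base) / len(column)
            if p > 0:
                entropy -= p * math.log2(p)
        ic.append(2.0 - entropy)
    return ic

# Toy alignment: first two positions fully conserved, last two uniform.
aligned = ["GTAC", "GTCA", "GTGT", "GTTG"]
ic = information_content(aligned)
```

Conserved positions score close to 2 bits and uninformative positions close to 0, which is why such values are a natural basis for ranking or pruning positions.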
Affiliation(s)
- AKMA Baten
- Biomechanical Engineering Research Group, Department of Mechanical Engineering, Melbourne School of Engineering, The University of Melbourne, Victoria 3010, Australia
- SK Halgamuge
- Biomechanical Engineering Research Group, Department of Mechanical Engineering, Melbourne School of Engineering, The University of Melbourne, Victoria 3010, Australia
- BCH Chang
- Institute of Plant and Microbial Biology, Academia Sinica, Taiwan
15
Sinha R, Hiller M, Pudimat R, Gausmann U, Platzer M, Backofen R. Improved identification of conserved cassette exons using Bayesian networks. BMC Bioinformatics 2008; 9:477. [PMID: 19014490 PMCID: PMC2621368 DOI: 10.1186/1471-2105-9-477] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 05/28/2008] [Accepted: 11/12/2008] [Indexed: 12/14/2022] Open
Abstract
Background Alternative splicing is a major contributor to the diversity of eukaryotic transcriptomes and proteomes. Currently, large scale detection of alternative splicing using expressed sequence tags (ESTs) or microarrays does not capture all alternative splicing events. Moreover, for many species genomic data is being produced at a far greater rate than corresponding transcript data, hence in silico methods of predicting alternative splicing have to be improved. Results Here, we show that the use of Bayesian networks (BNs) allows accurate prediction of evolutionarily conserved exon skipping events. At a stringent false positive rate of 0.5%, our BN achieves an improved true positive rate of 61%, compared to a previously reported 50% on the same dataset using support vector machines (SVMs). Incorporating several novel discriminative features such as intronic splicing regulatory elements leads to the improvement. Features related to mRNA secondary structure increase the prediction performance, corroborating previous findings that secondary structures are important for exon recognition. Random labelling tests rule out overfitting. Cross-validation on another dataset confirms the increased performance. When using the same dataset and the same set of features, the BN matches the performance of an SVM in earlier literature. Remarkably, we could show that about half of the exons which are labelled constitutive but receive a high probability of being alternative by the BN, are in fact alternative exons according to the latest EST data. Finally, we predict exon skipping without using conservation-based features, and achieve a true positive rate of 29% at a false positive rate of 0.5%. Conclusion BNs can be used to achieve accurate identification of alternative exons and provide clues about possible dependencies between relevant features.
The near-identical performance of the BN and SVM when using the same features shows that good classification depends more on features than on the choice of classifier. Conservation based features continue to be the most informative, and hence distinguishing alternative exons from constitutive ones without using conservation based features remains a challenging problem.
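Reporting a true positive rate at a fixed false positive rate means choosing the score threshold from the negative examples and then counting the positives above it. The sketch below shows one straightforward way to do this; the function name and the toy scores are hypothetical and unrelated to the paper's data.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """True positive rate when at most max_fpr of negatives score above the threshold."""
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    # Number of negatives that may be called positive without exceeding max_fpr.
    allowed = int(max_fpr * len(negatives))
    # Calls are made for scores strictly above the next-highest negative score.
    threshold = negatives[allowed] if allowed < len(negatives) else float("-inf")
    return sum(1 for s in positives if s > threshold) / len(positives)

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
tpr_strict = tpr_at_fpr(scores, labels, max_fpr=0.0)
tpr_loose = tpr_at_fpr(scores, labels, max_fpr=0.25)
```

With only four negatives in the toy data the achievable false positive rates are coarse; a rate as low as 0.5% needs at least a few hundred negatives before it allows even one false call.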
Affiliation(s)
- Rileen Sinha
- Genome Analysis, Leibniz Institute for Age Research, Fritz Lipmann Institute, Jena, Germany.
16
Sonnenburg S, Schweikert G, Philips P, Behr J, Rätsch G. Accurate splice site prediction using support vector machines. BMC Bioinformatics 2007; 8 Suppl 10:S7. [PMID: 18269701 PMCID: PMC2230508 DOI: 10.1186/1471-2105-8-s10-s7] [Citation(s) in RCA: 118] [Impact Index Per Article: 6.6] [Indexed: 01/12/2023] Open
Abstract
BACKGROUND For splice site recognition, one has to solve two classification problems: discriminating true from decoy splice sites for both acceptor and donor sites. Gene finding systems typically rely on Markov Chains to solve these tasks. RESULTS In this work we consider Support Vector Machines for splice site recognition. We employ the so-called weighted degree kernel which turns out well suited for this task, as we will illustrate in several experiments where we compare its prediction accuracy with that of recently proposed systems. We apply our method to the genome-wide recognition of splice sites in Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, and Homo sapiens. Our performance estimates indicate that splice sites can be recognized very accurately in these genomes and that our method outperforms many other methods including Markov Chains, GeneSplicer and SpliceMachine. We provide genome-wide predictions of splice sites and a stand-alone prediction tool ready to be used for incorporation in a gene finder. AVAILABILITY Data, splits, additional information on the model selection, the whole genome predictions, as well as the stand-alone prediction tool are available for download at http://www.fml.mpg.de/raetsch/projects/splice.
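The weighted degree kernel compares two equal-length sequences by counting substrings of length 1 up to a maximum degree that match at the same position, with shorter matches weighted more heavily. The sketch below is a plain, unoptimized version using the weighting 2(D - d + 1) / (D(D + 1)), which is a commonly used choice and is assumed here; the abstract does not spell out the weights, and the function name is illustrative.

```python
def wd_kernel(x, y, degree=3):
    """Weighted degree kernel between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    length = len(x)
    total = 0.0
    for d in range(1, degree + 1):
        # Assumed weighting: decreases linearly with substring length d.
        beta = 2.0 * (degree - d + 1) / (degree * (degree + 1))
        matches = sum(1 for i in range(length - d + 1) if x[i:i + d] == y[i:i + d])
        total += beta * matches
    return total

k_same = wd_kernel("ACGT", "ACGT", degree=2)
k_diff = wd_kernel("AAAA", "CCCC", degree=2)
```

Because matches must occur at identical positions, the kernel suits signals such as splice sites, where sequences are aligned around a fixed consensus dinucleotide.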
Affiliation(s)
- Gabriele Schweikert
- Friedrich Miescher Laboratory of the Max Planck Society, Spemannstr. 39, 72076 Tübingen, Germany; Max Planck Institute for Biological Cybernetics, Spemannstr. 38, 72076 Tübingen, Germany; Max Planck Institute for Developmental Biology, Spemannstr. 35, 72076 Tübingen, Germany
- Petra Philips
- Friedrich Miescher Laboratory of the Max Planck Society, Spemannstr. 39, 72076 Tübingen, Germany
- Jonas Behr
- Friedrich Miescher Laboratory of the Max Planck Society, Spemannstr. 39, 72076 Tübingen, Germany
- Gunnar Rätsch
- Friedrich Miescher Laboratory of the Max Planck Society, Spemannstr. 39, 72076 Tübingen, Germany
17
Levitsky VG, Ignatieva EV, Ananko EA, Turnaev II, Merkulova TI, Kolchanov NA, Hodgman TC. Effective transcription factor binding site prediction using a combination of optimization, a genetic algorithm and discriminant analysis to capture distant interactions. BMC Bioinformatics 2007; 8:481. [PMID: 18093302 PMCID: PMC2265442 DOI: 10.1186/1471-2105-8-481] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.6] [Received: 07/22/2007] [Accepted: 12/19/2007] [Indexed: 12/22/2022] Open
Abstract
Background Reliable transcription factor binding site (TFBS) prediction methods are essential for computer annotation of large amounts of genome sequence data. However, current methods to predict TFBSs are hampered by the high false-positive rates that occur when only sequence conservation at the core binding-sites is considered. Results To improve this situation, we have quantified the performance of several Position Weight Matrix (PWM) algorithms, using exhaustive approaches to find their optimal length and position. We applied these approaches to bio-medically important TFBSs involved in the regulation of cell growth and proliferation as well as in inflammatory, immune, and antiviral responses (NF-κB, ISGF3, IRF1, STAT1), obesity and lipid metabolism (PPAR, SREBP, HNF4), regulation of the steroidogenic (SF-1) and cell cycle (E2F) genes expression. We have also gained extra specificity using a method, entitled SiteGA, which takes into account structural interactions within TFBS core and flanking regions, using a genetic algorithm (GA) with a discriminant function of locally positioned dinucleotide (LPD) frequencies. To ensure a higher confidence in our approach, we applied resampling-jackknife and bootstrap tests for the comparison; optimized PWM and SiteGA appear to show similar recognition performance. Then we applied SiteGA and optimized PWMs (both separately and together) to sequences in the Eukaryotic Promoter Database (EPD). The resulting SiteGA recognition models can now be used to search sequences for BSs using the web tool, SiteGA. Analysis of dependencies between close and distant LPDs revealed by SiteGA models has shown that the most significant correlations are between close LPDs, and are generally located in the core (footprint) region. A greater number of less significant correlations are mainly between distant LPDs, which spanned both core and flanking regions.
When SiteGA and optimized PWM models were applied together, this substantially reduced false positives at least at higher stringencies. Conclusion Based on this analysis, SiteGA adds substantial specificity even to optimized PWMs and may be considered for large-scale genome analysis. It adds to the range of techniques available for TFBS prediction, and EPD analysis has led to a list of genes which appear to be regulated by the above TFs.
Affiliation(s)
- Victor G Levitsky
- Institute of Cytology and Genetics SB RAS, Novosibirsk, 630090, Russia.
|
18
|
Posch S, Grau J, Gohr A, Ben-Gal I, Kel AE, Grosse I. Recognition of cis-regulatory elements with vombat. J Bioinform Comput Biol 2007; 5:561-77. [PMID: 17636862 DOI: 10.1142/s0219720007002886] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 10/23/2006] [Revised: 02/14/2007] [Accepted: 02/15/2007] [Indexed: 11/18/2022]
Abstract
Variable order Markov models and variable order Bayesian trees have been proposed for the recognition of cis-regulatory elements, and it has been demonstrated that they outperform traditional models such as position weight matrices, Markov models, and Bayesian trees for the recognition of binding sites in prokaryotes. Here, we study to what degree variable order models can improve the recognition of eukaryotic cis-regulatory elements. We find that variable order models can improve the recognition of binding sites of all the studied transcription factors. To ease a systematic evaluation of different model combinations based on problem-specific data sets and allow genomic scans of cis-regulatory elements based on fixed and variable order Markov models and Bayesian trees, we provide the VOMBAT server to the public community.
Affiliation(s)
- Stefan Posch
- Institute of Computer Science, University Halle, 06099 Halle (Saale), Germany
|
19
|
Nikolajewa S, Pudimat R, Hiller M, Platzer M, Backofen R. BioBayesNet: a web server for feature extraction and Bayesian network modeling of biological sequence data. Nucleic Acids Res 2007; 35:W688-93. [PMID: 17537825 PMCID: PMC1933181 DOI: 10.1093/nar/gkm292] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Indexed: 02/02/2023]
Abstract
BioBayesNet is a new web application that allows the easy modeling and classification of biological data using Bayesian networks. To learn Bayesian networks, the user can upload either a set of annotated FASTA sequences or a set of pre-computed feature vectors. In the case of FASTA sequences, the server is able to generate a wide range of sequence and structural features from the sequences. These features are used to learn Bayesian networks. An automatic feature selection procedure assists in selecting discriminative features, providing a (locally) optimal set of features. The output includes several quality measures of the overall network and individual features as well as a graphical representation of the network structure, which allows users to explore dependencies between features. Finally, the learned Bayesian network or another uploaded network can be used to classify new data. BioBayesNet facilitates the use of Bayesian networks in biological sequence analysis and is flexible enough to support modeling and classification applications in various scientific fields. The BioBayesNet server is available at http://biwww3.informatik.uni-freiburg.de:8080/BioBayesNet/.
Affiliation(s)
- Swetlana Nikolajewa
- Department of Bioinformatics, Friedrich-Schiller-University Jena, Ernst-Abbe-Platz 2, 07743 Jena, Germany; Institute of Computer Science, Bioinformatics Group, Albert-Ludwigs-University Freiburg, Georges-Koehler-Allee 106, 79110 Freiburg, Germany; and Genome Analysis, Leibniz Institute for Age Research - Fritz Lipmann Institute, Beutenbergstr. 11, 07745 Jena, Germany
- Rainer Pudimat
- Department of Bioinformatics, Friedrich-Schiller-University Jena, Ernst-Abbe-Platz 2, 07743 Jena, Germany; Institute of Computer Science, Bioinformatics Group, Albert-Ludwigs-University Freiburg, Georges-Koehler-Allee 106, 79110 Freiburg, Germany; and Genome Analysis, Leibniz Institute for Age Research - Fritz Lipmann Institute, Beutenbergstr. 11, 07745 Jena, Germany
- Michael Hiller
- Department of Bioinformatics, Friedrich-Schiller-University Jena, Ernst-Abbe-Platz 2, 07743 Jena, Germany; Institute of Computer Science, Bioinformatics Group, Albert-Ludwigs-University Freiburg, Georges-Koehler-Allee 106, 79110 Freiburg, Germany; and Genome Analysis, Leibniz Institute for Age Research - Fritz Lipmann Institute, Beutenbergstr. 11, 07745 Jena, Germany
- Matthias Platzer
- Department of Bioinformatics, Friedrich-Schiller-University Jena, Ernst-Abbe-Platz 2, 07743 Jena, Germany; Institute of Computer Science, Bioinformatics Group, Albert-Ludwigs-University Freiburg, Georges-Koehler-Allee 106, 79110 Freiburg, Germany; and Genome Analysis, Leibniz Institute for Age Research - Fritz Lipmann Institute, Beutenbergstr. 11, 07745 Jena, Germany
- Rolf Backofen
- Department of Bioinformatics, Friedrich-Schiller-University Jena, Ernst-Abbe-Platz 2, 07743 Jena, Germany; Institute of Computer Science, Bioinformatics Group, Albert-Ludwigs-University Freiburg, Georges-Koehler-Allee 106, 79110 Freiburg, Germany; and Genome Analysis, Leibniz Institute for Age Research - Fritz Lipmann Institute, Beutenbergstr. 11, 07745 Jena, Germany
- *To whom correspondence should be addressed. +49 (761) 203-7461; +49 (761) 203-7462
|
20
|
Baten AKMA, Chang BCH, Halgamuge SK, Li J. Splice site identification using probabilistic parameters and SVM classification. BMC Bioinformatics 2006; 7 Suppl 5:S15. [PMID: 17254299 PMCID: PMC1764471 DOI: 10.1186/1471-2105-7-s5-s15] [Citation(s) in RCA: 69] [Impact Index Per Article: 3.6] [Indexed: 11/10/2022]
Abstract
BACKGROUND Recent advances and automation in DNA sequencing technology have created a vast amount of DNA sequence data. This growth of sequence data demands better and more efficient analysis methods. Identifying genes in this newly accumulated data is an important issue in bioinformatics, and it requires the prediction of the complete gene structure. Accurate identification of splice sites in DNA sequences plays a central role in gene structure prediction in eukaryotes. Effective detection of splice sites requires knowledge of the characteristics, dependencies, and relationships of nucleotides in the region surrounding the splice site. A higher-order Markov model is generally regarded as a useful technique for modeling higher-order dependencies. However, its implementation requires estimating a large number of parameters, which is computationally expensive. RESULTS The proposed method for splice site detection consists of two stages: a first order Markov model (MM1) is used in the first stage and a support vector machine (SVM) with polynomial kernel is used in the second stage. The MM1 serves as a pre-processing step for the SVM and takes DNA sequences as its input. It models the compositional features and dependencies of nucleotides in terms of probabilistic parameters around splice site regions. The probabilistic parameters are then fed into the SVM, which combines them nonlinearly to predict splice sites. When the proposed MM1-SVM model is compared with other existing standard splice site detection methods, it shows superior performance in all cases. CONCLUSION We proposed an effective pre-processing scheme for the SVM and applied it to the identification of splice sites. This is a simple yet effective splice site detection method, which shows better classification accuracy and computational speed than some other more complex methods.
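The first stage summarized in this abstract (a first order Markov model that turns a sequence window into probabilistic parameters) can be illustrated with a minimal sketch. It is not the authors' implementation: the position-specific transition tables, the add-one pseudocount, and the function names `train_mm1` and `mm1_features` are assumptions made for illustration, and the SVM stage is left out.

```python
from collections import defaultdict

ALPHABET = "ACGT"

def train_mm1(seqs, pseudocount=1.0):
    """Estimate position-specific first-order transition probabilities.

    seqs: equal-length aligned windows around known splice sites.
    Returns a list where entry i maps (previous base, current base)
    to P(current at i | previous at i-1). Entry 0 is left empty.
    The pseudocount is an illustrative assumption.
    """
    length = len(seqs[0])
    counts = [defaultdict(float) for _ in range(length)]
    for s in seqs:
        for i in range(1, length):
            counts[i][(s[i - 1], s[i])] += 1.0
    probs = [dict() for _ in range(length)]
    for i in range(1, length):
        for prev in ALPHABET:
            total = sum(counts[i][(prev, c)] for c in ALPHABET)
            total += pseudocount * len(ALPHABET)
            for cur in ALPHABET:
                probs[i][(prev, cur)] = (counts[i][(prev, cur)] + pseudocount) / total
    return probs

def mm1_features(seq, probs):
    """Map a sequence to one transition probability per position (from position 1 on)."""
    return [probs[i][(seq[i - 1], seq[i])] for i in range(1, len(seq))]

# Toy usage with made-up training windows (hypothetical data).
model = train_mm1(["AAGT", "AAGT", "CAGT"])
features = mm1_features("AAGT", model)
```

A feature vector of this kind, one number per position, is what a downstream classifier such as an SVM would receive in a two-stage design like the one described.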
Affiliation(s)
- AKMA Baten
- Dynamic Systems and Control Research Group, DoMME, The University of Melbourne, Victoria 3010, Australia
- BCH Chang
- Dynamic Systems and Control Research Group, DoMME, The University of Melbourne, Victoria 3010, Australia
- SK Halgamuge
- Dynamic Systems and Control Research Group, DoMME, The University of Melbourne, Victoria 3010, Australia
- Jason Li
- Dynamic Systems and Control Research Group, DoMME, The University of Melbourne, Victoria 3010, Australia
|
21
|
Hung CM, Huang YM, Chang MS. Alignment using genetic programming with causal trees for identification of protein functions. NONLINEAR ANALYSIS, THEORY, METHODS & APPLICATIONS 2006; 65:1070-1093. [PMID: 32288048 PMCID: PMC7117053 DOI: 10.1016/j.na.2005.09.048] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 06/11/2023]
Abstract
A hybrid evolutionary model is used to propose a hierarchical homology of protein sequences to identify protein functions systematically. The proposed model offers considerable potential, considering the inconsistency of existing methods for predicting novel proteins. Because some novel proteins might align without meaningful conserved domains, maximizing the score of sequence alignment is not the best criterion for predicting protein functions. This work presents a decision model that can minimize the cost of making a decision for predicting protein functions using the hierarchical homologies. In particular, the model has three characteristics: (i) it is a hybrid evolutionary model with multiple fitness functions that uses genetic programming to predict protein functions on a distantly related protein family, (ii) it incorporates modified robust point matching to accurately compare all feature points using the moment invariant and thin-plate spline theorems, and (iii) the hierarchical homologies holding up a novel protein sequence in the form of a causal tree can effectively demonstrate the relationship between proteins. This work describes the comparisons of nucleocapsid proteins from the putative polyprotein SARS virus and other coronaviruses in other hosts using the model.
Affiliation(s)
- Chun-Min Hung
- Department of Engineering Science, National Cheng Kung University, No.1, Ta-Hsueh Road, Tainan 701, Taiwan, ROC
- Yueh-Min Huang
- Department of Engineering Science, National Cheng Kung University, No.1, Ta-Hsueh Road, Tainan 701, Taiwan, ROC
- Ming-Shi Chang
- Department of Biochemistry, National Cheng Kung University, No.1, Ta-Hsueh Road, Tainan 701, Taiwan, ROC
|
22
|
Grau J, Ben-Gal I, Posch S, Grosse I. VOMBAT: prediction of transcription factor binding sites using variable order Bayesian trees. Nucleic Acids Res 2006; 34:W529-33. [PMID: 16845064 PMCID: PMC1538886 DOI: 10.1093/nar/gkl212] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 11/16/2022]
Abstract
Variable order Markov models and variable order Bayesian trees have been proposed for the recognition of transcription factor binding sites, and it could be demonstrated that they outperform traditional models, such as position weight matrices, Markov models and Bayesian trees. We develop a web server for the recognition of DNA binding sites based on variable order Markov models and variable order Bayesian trees offering the following functionality: (i) given datasets with annotated binding sites and genomic background sequences, variable order Markov models and variable order Bayesian trees can be trained; (ii) given a set of trained models, putative DNA binding sites can be predicted in a given set of genomic sequences and (iii) given a dataset with annotated binding sites and a dataset with genomic background sequences, cross-validation experiments for different model combinations with different parameter settings can be performed. Several of the offered services are computationally demanding, such as genome-wide predictions of DNA binding sites in mammalian genomes or sets of 104-fold cross-validation experiments for different model combinations based on problem-specific data sets. In order to execute these jobs, and in order to serve multiple users at the same time, the web server is attached to a Linux cluster with 150 processors. VOMBAT is available at .
Affiliation(s)
- Irad Ben-Gal
- Department of Industrial Engineering, Tel-Aviv University, Tel-Aviv 69978, Israel
- Ivo Grosse
- Leibniz-Institute of Plant Genetics and Crop Plant Research (IPK), 06466 Gatersleben, Germany
- To whom correspondence should be addressed. Tel: ++49 39482 5755; Fax: ++49 39482 5357
|
23
|
Bradford JR, Needham CJ, Bulpitt AJ, Westhead DR. Insights into protein-protein interfaces using a Bayesian network prediction method. J Mol Biol 2006; 362:365-86. [PMID: 16919296 DOI: 10.1016/j.jmb.2006.07.028] [Citation(s) in RCA: 66] [Impact Index Per Article: 3.5] [Received: 02/09/2006] [Revised: 06/15/2006] [Accepted: 07/13/2006] [Indexed: 11/26/2022]
Abstract
Identifying the interface between two interacting proteins provides important clues to the function of a protein, and is becoming increasingly relevant to drug discovery. Here, surface patch analysis was combined with a Bayesian network to predict protein-protein binding sites with a success rate of 82% on a benchmark dataset of 180 proteins, improving by 6% on previous work and well above the 36% that would be achieved by a random method. A comparable success rate was achieved even when evolutionary information was missing, a further improvement on our previous method which was unable to handle incomplete data automatically. In a case study of the Mog1p family, we showed that our Bayesian network method can aid the prediction of previously uncharacterised binding sites and provide important clues to protein function. On Mog1p itself a putative binding site involved in the SLN1-SKN7 signal transduction pathway was detected, as was a Ran binding site, previously characterized solely by conservation studies, even though our automated method operated without using homologous proteins. On the remaining members of the family (two structural genomics targets, and a protein involved in the photosystem II complex in higher plants) we identified novel binding sites with little correspondence to those on Mog1p. These results suggest that members of the Mog1p family bind to different proteins and probably have different functions despite sharing the same overall fold. We also demonstrated the applicability of our method to drug discovery efforts by successfully locating a number of binding sites involved in the protein-protein interaction network of papilloma virus infection. In a separate study, we attempted to distinguish between the two types of binding site, obligate and non-obligate, within our dataset using a second Bayesian network. This proved difficult although some separation was achieved on the basis of patch size, electrostatic potential and conservation. Such was the similarity between the two interacting patch types that we were able to use obligate binding site properties to predict the location of non-obligate binding sites and vice versa.
Affiliation(s)
- James R Bradford
- Institute of Molecular and Cellular Biology, University of Leeds, Leeds, LS2 9JT, UK
|
24
|
Huang W, Umbach DM, Ohler U, Li L. Optimized mixed Markov models for motif identification. BMC Bioinformatics 2006; 7:279. [PMID: 16749929 PMCID: PMC1534070 DOI: 10.1186/1471-2105-7-279] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Received: 05/08/2006] [Accepted: 06/02/2006] [Indexed: 11/10/2022]
Abstract
BACKGROUND Identifying functional elements, such as transcriptional factor binding sites, is a fundamental step in reconstructing gene regulatory networks and remains a challenging issue, largely due to limited availability of training samples. RESULTS We introduce a novel and flexible model, the Optimized Mixture Markov model (OMiMa), and related methods to allow adjustment of model complexity for different motifs. In comparison with other leading methods, OMiMa can incorporate more than the NNSplice's pairwise dependencies; OMiMa avoids model over-fitting better than the Permuted Variable Length Markov Model (PVLMM); and OMiMa requires smaller training samples than the Maximum Entropy Model (MEM). Testing on both simulated and actual data (regulatory cis-elements and splice sites), we found OMiMa's performance superior to the other leading methods in terms of prediction accuracy, required size of training data or computational time. Our OMiMa system, to our knowledge, is the only motif finding tool that incorporates automatic selection of the best model. OMiMa is freely available at 1. CONCLUSION Our optimized mixture of Markov models represents an alternative to the existing methods for modeling dependent structures within a biological motif. Our model is conceptually simple and effective, and can improve prediction accuracy and/or computational speed over other leading methods.
Affiliation(s)
- Weichun Huang
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27606, USA
- Biostatistics Branch, The National Institute of Environmental Health Sciences, National Institutes of Health, RTP, NC 27709, USA
- Institute for Genome Sciences & Policy, Duke University Medical Center, Durham, NC 27708, USA
- David M Umbach
- Biostatistics Branch, The National Institute of Environmental Health Sciences, National Institutes of Health, RTP, NC 27709, USA
- Uwe Ohler
- Institute for Genome Sciences & Policy, Duke University Medical Center, Durham, NC 27708, USA
- Leping Li
- Biostatistics Branch, The National Institute of Environmental Health Sciences, National Institutes of Health, RTP, NC 27709, USA
|
25
|
Churbanov A, Rogozin IB, Deogun JS, Ali H. Method of predicting splice sites based on signal interactions. Biol Direct 2006; 1:10. [PMID: 16584568 PMCID: PMC1526722 DOI: 10.1186/1745-6150-1-10] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Received: 03/03/2006] [Accepted: 04/03/2006] [Indexed: 02/07/2023]
Abstract
BACKGROUND Predicting and properly ranking canonical splice sites (SSs) is a challenging problem in the bioinformatics and machine learning communities. Any progress in SS recognition will lead to a better understanding of the splicing mechanism. We introduce several new approaches for combining a priori knowledge for improved SS detection. First, we design our new Bayesian SS sensor based on oligonucleotide counting. To further enhance prediction quality, we applied our new de novo motif detection tool MHMMotif to intronic ends and exons. We combine the elements found with sensor information using a Naive Bayesian Network, as implemented in our new tool SpliceScan. RESULTS According to our tests, the Bayesian sensor outperforms the contemporary Maximum Entropy sensor for 5' SS detection. We report a number of putative Exonic (ESE) and Intronic (ISE) Splicing Enhancers found by the MHMMotif tool. T-test statistics on mouse/rat intronic alignments indicate that the detected elements are on average more conserved than other oligos, which supports our assumption of their functional importance. The tool has been shown to outperform the SpliceView, GeneSplicer, NNSplice, Genio and NetUTR tools for the test set of human genes. SpliceScan outperforms all contemporary ab initio gene structural prediction tools on the set of 5' UTR gene fragments. CONCLUSION The designed methods have many attractive properties compared to existing approaches. The Bayesian sensor, MHMMotif program and SpliceScan tools are freely available on our web site. REVIEWERS This article was reviewed by Manyuan Long, Arcady Mushegian and Mikhail Gelfand.
Affiliation(s)
- Alexander Churbanov
- Department of Computer Science, College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE68182-0116, USA
- Igor B Rogozin
- NCBI/NLM/NIH, Bldg.38-A, room 5N505A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Jitender S Deogun
- Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588-0115, USA
- Hesham Ali
- Department of Computer Science, College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE 68182-0116, USA
|
26
|
Larrañaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, Lozano JA, Armañanzas R, Santafé G, Pérez A, Robles V. Machine learning in bioinformatics. Brief Bioinform 2006; 7:86-112. [PMID: 16761367 DOI: 10.1093/bib/bbk007] [Citation(s) in RCA: 368] [Impact Index Per Article: 19.4] [Indexed: 11/13/2022]
Abstract
This article reviews machine learning methods for bioinformatics. It presents modelling methods, such as supervised classification, clustering and probabilistic graphical models for knowledge discovery, as well as deterministic and stochastic heuristics for optimization. Applications in genomics, proteomics, systems biology, evolution and text mining are also shown.
Affiliation(s)
- Pedro Larrañaga
- Intelligent Systems Group, Department of Computer Science and Artificial Intelligence, University of the Basque Country, Paseo Manuel de Lardizabal, 1, 20018 San Sebastian, Spain.
|
27
|
Abstract
Many short DNA motifs, such as transcription factor binding sites (TFBS) and splice sites, exhibit strong local as well as nonlocal dependence. We introduce permuted variable length Markov models (PVLMM) which could capture the potentially important dependencies among positions and apply them to the problem of detecting splice and TFB sites. They have been satisfactory from the viewpoint of prediction performance and also give ready biological interpretations of the sequence dependence observed. The issue of model selection is also studied.
Affiliation(s)
- Xiaoyue Zhao
- Department of Statistics, University of California- Berkeley, 367 Evans Hall, Berkeley CA 94720-3860, USA.
|
28
|
Zheng Y, Anton BP, Roberts RJ, Kasif S. Phylogenetic detection of conserved gene clusters in microbial genomes. BMC Bioinformatics 2005; 6:243. [PMID: 16202130 PMCID: PMC1266350 DOI: 10.1186/1471-2105-6-243] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.7] [Received: 04/12/2005] [Accepted: 10/03/2005] [Indexed: 11/15/2022]
Abstract
Background Microbial genomes contain an abundance of genes with conserved proximity forming clusters on the chromosome. However, the conservation can be a result of many factors, such as vertical inheritance or functional selection. Thus, identification of conserved gene clusters that are under functional selection provides an effective channel for gene annotation, microarray screening, and pathway reconstruction. The problem of devising a robust method to identify these conserved gene clusters and to evaluate the significance of the conservation in multiple genomes has a number of implications for comparative, evolutionary and functional genomics as well as synthetic biology. Results In this paper we describe a new method for detecting conserved gene clusters that incorporates the information captured by a genome phylogenetic tree. We show that our method can overcome the common problem of overestimation of significance due to the bias in the genome database and thereby achieve better accuracy when detecting functionally connected gene clusters. Our results can be accessed at the database GeneChords. Conclusion The methodology described in this paper gives a scalable framework for discovering conserved gene clusters in microbial genomes. It serves as a platform for many other functional genomic analyses in microorganisms, such as operon prediction, regulatory site prediction, functional annotation of genes, and the study of the evolutionary origin and development of gene clusters.
Affiliation(s)
- Yu Zheng
- Bioinformatics Graduate Program, Boston University, Boston, MA, USA
- Brian P Anton
- Bioinformatics Graduate Program, Boston University, Boston, MA, USA
- New England Biolabs, Beverly, MA, USA
- Simon Kasif
- Bioinformatics Graduate Program, Boston University, Boston, MA, USA
- Department of Biomedical Engineering, Boston University, Boston, MA, USA
- Center for Advanced Genomic Technology, Boston University, Boston, MA, USA
|
29
|
Pudimat R, Schukat-Talamazzini EG, Backofen R. A multiple-feature framework for modelling and predicting transcription factor binding sites. Bioinformatics 2005; 21:3082-8. [PMID: 15905283 DOI: 10.1093/bioinformatics/bti477] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.4] [Indexed: 11/14/2022]
Abstract
MOTIVATION The identification of transcription factor binding sites in promoter sequences is an important problem, since it reveals information about the transcriptional regulation of genes. For analysing transcriptional regulation, computational approaches for predicting putative binding sites are applied. Commonly used stochastic models for binding sites are position-specific score matrices, which show weak predictive power. RESULTS We have developed a probabilistic modelling approach that allows diverse characteristic binding site properties to be considered in order to obtain more accurate representations of binding sites. These properties are modelled as random variables in Bayesian networks, which are capable of dealing with dependencies among binding site properties. Cross-validation on several datasets shows improvements in the false positive error rate and the significance (P-value) of true binding sites.
Affiliation(s)
- Rainer Pudimat
- Institut für Informatik, Friedrich-Schiller-Universität Ernst-Abbe-Platz 3, D-07743 Jena, Germany
|
30
|
Roca X, Sachidanandam R, Krainer AR. Determinants of the inherent strength of human 5' splice sites. RNA (NEW YORK, N.Y.) 2005; 11:683-98. [PMID: 15840817 PMCID: PMC1370755 DOI: 10.1261/rna.2040605] [Citation(s) in RCA: 105] [Impact Index Per Article: 5.3] [Received: 01/07/2005] [Accepted: 02/09/2005] [Indexed: 05/24/2023]
Abstract
We previously showed that the authentic 5' splice site (5'ss) of the first exon in the human beta-globin gene is intrinsically stronger than a cryptic 5'ss located 16 nucleotides upstream. Here we examined by mutational analysis the contribution of individual 5'ss nucleotides to discrimination between these two 5'ss. Based on the in vitro splicing efficiencies of a panel of 26 wild-type and mutant substrates in two separate 5'ss competition assays, we established a hierarchy of 5'ss and grouped them into three functional subclasses: strong, intermediate, and weak. Competition between two 5'ss from different subclasses always resulted in selection of the 5'ss that belongs to the stronger subclass. Moreover, each subclass has different characteristic features. Strong and intermediate 5'ss can be distinguished by their predicted free energy of base-pairing to the U1 snRNA 5' terminus (ΔG). Whereas the extent of splicing via the strong 5'ss correlates well with the ΔG, this is not the case for competition between intermediate 5'ss. Weak 5'ss were used only when the competing authentic 5'ss was inactivated by mutation. These results indicate that extensive complementarity to U1 snRNA exerts a dominant effect for 5'ss selection, but in the case of competing 5'ss with similarly modest complementarity to U1, the role of other 5'ss features is more prominent. This study reveals the importance of additional submotifs present in certain 5'ss sequences, whose characterization will be critical for understanding 5'ss selection in human genes.
Affiliation(s)
- Xavier Roca
- Cold Spring Harbor Laboratory, PO Box 100, Cold Spring Harbor, NY 11724, USA
|
31
|
Ben-Gal I, Shani A, Gohr A, Grau J, Arviv S, Shmilovici A, Posch S, Grosse I. Identification of transcription factor binding sites with variable-order Bayesian networks. Bioinformatics 2005; 21:2657-66. [PMID: 15797905 DOI: 10.1093/bioinformatics/bti410] [Citation(s) in RCA: 124] [Impact Index Per Article: 6.2] [Indexed: 11/12/2022]
Abstract
MOTIVATION We propose a new class of variable-order Bayesian network (VOBN) models for the identification of transcription factor binding sites (TFBSs). The proposed models generalize the widely used position weight matrix (PWM) models, Markov models and Bayesian network models. In contrast to these models, where for each position a fixed subset of the remaining positions is used to model dependencies, in VOBN models, these subsets may vary based on the specific nucleotides observed, which are called the context. This flexibility turns out to be of advantage for the classification and analysis of TFBSs, as statistical dependencies between nucleotides in different TFBS positions (not necessarily adjacent) may be taken into account efficiently--in a position-specific and context-specific manner. RESULTS We apply the VOBN model to a set of 238 experimentally verified sigma-70 binding sites in Escherichia coli. We find that the VOBN model can distinguish these 238 sites from a set of 472 intergenic 'non-promoter' sequences with a higher accuracy than fixed-order Markov models or Bayesian trees. We use a replicated stratified-holdout experiment having a fixed true-negative rate of 99.9%. We find that for a foreground inhomogeneous VOBN model of order 1 and a background homogeneous variable-order Markov (VOM) model of order 5, the obtained mean true-positive (TP) rate is 47.56%. In comparison, the best TP rate for the conventional models is 44.39%, obtained from a foreground PWM model and a background 2nd-order Markov model. As the standard deviation of the estimated TP rate is approximately 0.01%, this improvement is highly significant.
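The evaluation scheme mentioned in this abstract (scoring candidate sites with a foreground model against a background model, then reporting the true-positive rate at a fixed true-negative rate) can be sketched for the simplest baseline named there, a foreground PWM. This is an illustrative sketch only: the uniform background (instead of a Markov background), the pseudocount, the toy scores, and the function names are assumptions, and it does not implement VOBN models.

```python
import math

def train_pwm(sites, alphabet="ACGT", pseudocount=1.0):
    """Column-wise nucleotide probabilities from aligned binding sites."""
    pwm = []
    for i in range(len(sites[0])):
        col = {a: pseudocount for a in alphabet}
        for s in sites:
            col[s[i]] += 1.0
        total = sum(col.values())
        pwm.append({a: col[a] / total for a in alphabet})
    return pwm

def log_odds(seq, pwm, background=0.25):
    """Log-likelihood ratio of the PWM versus a uniform background (assumed)."""
    return sum(math.log(pwm[i][c] / background) for i, c in enumerate(seq))

def tp_rate_at_fixed_tn(pos_scores, neg_scores, tn_rate):
    """Pick the threshold that rejects at least tn_rate of the negatives,
    then return the fraction of positives scoring strictly above it."""
    ranked = sorted(neg_scores)
    k = math.ceil(tn_rate * len(ranked))  # negatives that must be rejected
    threshold = ranked[k - 1] if k > 0 else float("-inf")
    accepted = sum(1 for s in pos_scores if s > threshold)
    return accepted / len(pos_scores)

# Toy usage with made-up sites and scores (hypothetical data).
pwm = train_pwm(["ACG", "ACG", "ACT"])
site_score = log_odds("ACG", pwm)
rate = tp_rate_at_fixed_tn([3.0, 1.0, 0.2, 2.5], [0.1, 0.5, 0.3, 0.9], 0.75)
```

Fixing the true-negative rate first and comparing true-positive rates afterwards puts different foreground and background model pairs on the same footing, which is the kind of comparison the abstract reports.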
Affiliation(s)
- I Ben-Gal
- Department of Industrial Engineering, Tel-Aviv University, Tel-Aviv, 69978, Israel.
|
32
|
Degroeve S, Saeys Y, De Baets B, Rouzé P, Van de Peer Y. SpliceMachine: predicting splice sites from high-dimensional local context representations. Bioinformatics 2004; 21:1332-8. [PMID: 15564294 DOI: 10.1093/bioinformatics/bti166] [Citation(s) in RCA: 75] [Impact Index Per Article: 3.6] [Indexed: 11/13/2022]
Abstract
MOTIVATION In this age of complete genome sequencing, finding the location and structure of genes is crucial for further molecular research. The accurate prediction of intron boundaries greatly facilitates the correct prediction of gene structure in nuclear genomes. Many tools for localizing these boundaries on DNA sequences have been developed and are available to researchers through the internet. Nevertheless, these tools still make many false positive predictions. RESULTS This manuscript presents a novel publicly available splice site prediction tool named SpliceMachine that (i) shows state-of-the-art prediction performance on Arabidopsis thaliana and human sequences, (ii) performs a computationally fast annotation and (iii) can be trained by users on their own data. AVAILABILITY Results, figures and software are available at http://www.bioinformatics.psb.ugent.be/supplementary_data/ CONTACT sven.degroeve@psb.ugent.be; yves.vandepeer@psb.ugent.be.
Affiliation(s)
- Sven Degroeve
- Department of Plant Systems Biology, Flanders Interuniversity Institute for Biotechnology (VIB), Technologiepark 927, Gent 9052, Belgium.
|
33
|
Chen TM, Lu CC, Li WH. Prediction of splice sites with dependency graphs and their expanded Bayesian networks. Bioinformatics 2004; 21:471-82. [PMID: 15374869 DOI: 10.1093/bioinformatics/bti025] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.5] [Indexed: 11/14/2022]
Abstract
MOTIVATION Owing to the complete sequencing of human and many other genomes, huge amounts of DNA sequence data have been accumulated. In bioinformatics, an important issue is how to predict the complete structure of genes from the genomic DNA sequence, especially the human genome. A crucial part of gene structure prediction is to determine the precise exon-intron boundaries, i.e. the splice sites, in the coding region. RESULTS We have developed a dependency graph model to fully capture the intrinsic interdependency between base positions in a splice site. The establishment of dependency between two positions is based on a χ²-test from known sample data. To facilitate statistical inference, we have expanded the dependency graph (which is usually a graph with cycles that make probabilistic reasoning very difficult, if not impossible) into a Bayesian network (which is a directed acyclic graph that facilitates statistical reasoning). When compared with existing models such as the weight matrix model, the weight array model, maximal dependence decomposition, Cai et al.'s tree model, as well as the less-studied second-order and third-order Markov chain models, the expanded Bayesian networks from our dependency graph models perform the best in nearly all the cases studied. AVAILABILITY Software (a program called DGSplicer) and datasets used are available at http://csrl.ee.nthu.edu.tw/bioinf/ CONTACT cclu@ee.nthu.edu.tw.
Affiliation(s)
- Te-Ming Chen
- Department of Electrical Engineering, National Tsing Hua University Hsinchu 30013, Taiwan
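The pairwise dependency test described in this abstract can be sketched as a Pearson χ² test of independence between the nucleotides at two positions of aligned sites. This is a minimal illustration, not DGSplicer's implementation: the function names are hypothetical, and the default cutoff of 16.919 is the standard 0.05 critical value for 9 degrees of freedom, which assumes all four nucleotides occur at both positions being compared.

```python
from collections import Counter
from itertools import combinations

def chi2_statistic(sequences, i, j):
    """Pearson chi-square statistic and degrees of freedom for positions i and j."""
    n = len(sequences)
    joint = Counter((s[i], s[j]) for s in sequences)
    row = Counter(s[i] for s in sequences)
    col = Counter(s[j] for s in sequences)
    stat = 0.0
    # Only symbols actually observed are used, so every expected count is > 0.
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            stat += (joint[(a, b)] - expected) ** 2 / expected
    dof = (len(row) - 1) * (len(col) - 1)
    return stat, dof

def dependency_edges(sequences, critical=16.919):
    """Undirected edges between position pairs whose statistic exceeds the cutoff.

    The default cutoff assumes 9 degrees of freedom (a full 4x4 table).
    """
    length = len(sequences[0])
    edges = []
    for i, j in combinations(range(length), 2):
        stat, _ = chi2_statistic(sequences, i, j)
        if stat > critical:
            edges.append((i, j))
    return edges
```

The resulting edge list is an undirected graph that may contain cycles, which is why the abstract describes a further expansion step into a directed acyclic Bayesian network; that step is not shown here.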
|
34
|
Yeo G, Burge CB. Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J Comput Biol 2004; 11:377-94. [PMID: 15285897 DOI: 10.1089/1066527041410418] [Citation(s) in RCA: 1471] [Impact Index Per Article: 70.0] [Indexed: 11/13/2022]
Abstract
We propose a framework for modeling sequence motifs based on the maximum entropy principle (MEP). We recommend approximating short sequence motif distributions with the maximum entropy distribution (MED) consistent with low-order marginal constraints estimated from available data, which may include dependencies between nonadjacent as well as adjacent positions. Many maximum entropy models (MEMs) are specified by simply changing the set of constraints. Such models can be utilized to discriminate between signals and decoys. Classification performance using different MEMs gives insight into the relative importance of dependencies between different positions. We apply our framework to large datasets of RNA splicing signals. Our best models outperform previous probabilistic models in the discrimination of human 5' (donor) and 3' (acceptor) splice sites from decoys. Finally, we discuss mechanistically motivated ways of comparing models.
Affiliation(s)
- Gene Yeo
- Department of Biology, Massachusetts Institute of Technology, 77 Massachusetts Avenue Building 68-223, Cambridge, MA 02139, USA
|
35
|
Zhang L, Luo L. Splice site prediction with quadratic discriminant analysis using diversity measure. Nucleic Acids Res 2003; 31:6214-20. [PMID: 14576308 PMCID: PMC275452 DOI: 10.1093/nar/gkg805] [Citation(s) in RCA: 68] [Impact Index Per Article: 3.1] [Indexed: 11/14/2022]
Abstract
Based on the conservation of nucleotides at splicing sites and the features of base composition and base correlation around these sites, we use the method of increment of diversity combined with quadratic discriminant analysis (IDQD) to study the dependence structure of splicing sites and predict the exons/introns and their boundaries for four model genomes: Caenorhabditis elegans, Arabidopsis thaliana, Drosophila melanogaster and human. The comparison of compositional features between two sequences and the comparison of base dependencies at adjacent or non-adjacent positions of two sequences can be integrated automatically in the increment of diversity (ID). Eight feature variables around a potential splice site are defined in terms of ID. They are integrated in a single formal framework given by IDQD. In our calculations, a 7-base (8-base) region around the donor (acceptor) site is considered in studying the conservation of nucleotides, and sequences of 48 bp on either side of the splice sites are used in studying the compositional and base-correlating features. The windows are enlarged to 16 (donor), 29 (acceptor) and 80 bp (either side) to improve the prediction for human splice sites. The prediction capability of the present method is comparable with that of the leading splice site detector, GeneSplicer.
Affiliation(s)
- Lirong Zhang
- Laboratory of Theoretical Biophysics, Faculty of Science and Technology, Inner Mongolia University, Hohhot, 010021 China
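The abstract does not spell out the increment of diversity, so the sketch below relies on an assumed definition recalled from the diversity-measure literature: for a count vector with total N, D = N log N − Σ nᵢ log nᵢ, and ID(X, S) = D(X + S) − D(X) − D(S). Under that assumption ID is zero when two sources have proportional counts and grows as their compositions diverge. The k-mer featurization and all names are illustrative only, not the eight feature variables of the paper.

```python
import math
from collections import Counter

def diversity(counts):
    """Assumed diversity measure: N log N minus the sum of n_i log n_i."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n) for n in counts if n > 0)

def increment_of_diversity(x, s):
    """Assumed ID(X, S) = D(X + S) - D(X) - D(S) for aligned count vectors."""
    merged = [a + b for a, b in zip(x, s)]
    return diversity(merged) - diversity(x) - diversity(s)

def kmer_counts(sequence, k, kmers):
    """Counts of the given k-mers in a sequence, in a fixed order."""
    seen = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [seen[m] for m in kmers]
```

A candidate window would be compared against reference counts pooled from known sites and from non-sites, and the resulting ID values would serve as inputs to the quadratic discriminant step, which is not shown here.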
|
36
|
Zhang L, Pavlovic V, Cantor CR, Kasif S. Human-mouse gene identification by comparative evidence integration and evolutionary analysis. Genome Res 2003; 13:1190-202. [PMID: 12743024 PMCID: PMC403647 DOI: 10.1101/gr.703903] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.0] [Received: 08/09/2002] [Accepted: 02/03/2003] [Indexed: 11/24/2022]
Abstract
The identification of genes in the human genome remains a challenge, as the actual predictions appear to disagree tremendously and vary dramatically on the basis of the specific gene-finding methodology used. Because the pattern of conservation in coding regions is expected to be different from that in intronic or intergenic regions, a comparative computational analysis can lead, in principle, to an improved computational identification of genes in the human genome by using a reference, such as the mouse genome. However, this comparative methodology critically depends on three important factors: (1) the selection of the most appropriate reference genome (in particular, it is not clear whether the mouse is at the correct evolutionary distance from the human to provide sufficiently distinctive conservation levels in different genomic regions), (2) the selection of comparative features that provide the most benefit to gene recognition, and (3) the selection of an evidence integration architecture that effectively interprets the comparative features. We address the first question by a novel evolutionary analysis that allows us to explicitly correlate the performance of the gene recognition system with the evolutionary distance (time) between the two genomes. Our simulation results indicate that there is a wide range of reference genomes at different evolutionary time points that appear to deliver reasonable comparative prediction of human genes. In particular, the evolutionary time between human and mouse generally falls in the region of good performance; however, better accuracy might be achieved with a reference genome more distant than mouse. To address the second question, we propose several natural comparative measures of conservation for identifying exons and exon boundaries. Finally, we experiment with Bayesian networks for the integration of comparative and compositional evidence.
Affiliation(s)
- Lingang Zhang
- Center for Advanced Biotechnology, Boston University, Boston, Massachusetts 02215, USA
|
37
|
Cowell LG, Davila M, Kepler TB, Kelsoe G. Identification and utilization of arbitrary correlations in models of recombination signal sequences. Genome Biol 2002; 3:RESEARCH0072. [PMID: 12537561 PMCID: PMC151174 DOI: 10.1186/gb-2002-3-12-research0072] [Citation(s) in RCA: 51] [Impact Index Per Article: 2.2] [Received: 06/24/2002] [Revised: 09/04/2002] [Accepted: 10/10/2002] [Indexed: 01/26/2023]
Abstract
BACKGROUND A significant challenge in bioinformatics is to develop methods for detecting and modeling patterns in variable DNA sequence sites, such as protein-binding sites in regulatory DNA. Current approaches sometimes perform poorly when positions in the site do not independently affect protein binding. We developed a statistical technique for modeling the correlation structure in variable DNA sequence sites. The method places no restrictions on the number of correlated positions or on their spatial relationship within the site. No prior empirical evidence for the correlation structure is necessary. RESULTS We applied our method to the recombination signal sequences (RSS) that direct assembly of B-cell and T-cell antigen-receptor genes via V(D)J recombination. The technique is based on model selection by cross-validation and produces models that allow computation of an information score for any signal-length sequence. We also modeled RSS using order zero and order one Markov chains. The scores from all models are highly correlated with measured recombination efficiencies, but the models arising from our technique are better than the Markov models at discriminating RSS from non-RSS. CONCLUSIONS Our model-development procedure produces models that estimate well the recombinogenic potential of RSS and are better at RSS recognition than the order zero and order one Markov models. Our models are, therefore, valuable for studying the regulation of both physiologic and aberrant V(D)J recombination. The approach could be equally powerful for the study of promoter and enhancer elements, splice sites, and other DNA regulatory sites that are highly variable at the level of individual nucleotide positions.
Affiliation(s)
- Lindsay G Cowell
- Department of Immunology, Duke University Medical Center, Durham, NC 27710, USA
|
38
|
Sonnenburg S, Rätsch G, Jagota A, Müller KR. New Methods for Splice Site Recognition. ARTIFICIAL NEURAL NETWORKS — ICANN 2002 2002. [DOI: 10.1007/3-540-46084-5_54] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.8] [Indexed: 01/15/2023]
|