1
Bari ATMG, Reaz MR, Choi HJ, Jeong BS. DNA Encoding for Splice Site Prediction in Large DNA Sequence. DATABASE SYSTEMS FOR ADVANCED APPLICATIONS 2013. [DOI: 10.1007/978-3-642-40270-8_4] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Indexed: 12/13/2022]
2
Bari AG, Choi HJ, Reaz MR, Jeong BS. Survey on Nucleotide Encoding Techniques and SVM Kernel Design for Human Splice Site Prediction. ACTA ACUST UNITED AC 2012. [DOI: 10.4051/ibc.2012.4.4.0014] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/20/2022]
3
Pietsch C, Sreenivasulu N, Wobus U, Röder MS. Linkage mapping of putative regulator genes of barley grain development characterized by expression profiling. BMC PLANT BIOLOGY 2009; 9:4. [PMID: 19134169 PMCID: PMC2648977 DOI: 10.1186/1471-2229-9-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 08/22/2008] [Accepted: 01/09/2009] [Indexed: 05/09/2023]
Abstract
BACKGROUND Barley (Hordeum vulgare L.) seed development is a highly regulated process with fine-tuned interaction of various tissues controlling distinct physiological events during the prestorage, storage and desiccation phases. As potential regulators involved in this process, we studied 172 transcription factors and 204 kinases for their expression behaviour and anchored a subset of them to the barley linkage map to promote marker-assisted studies on barley grains. RESULTS By hierarchical clustering of the expression profiles of 376 potential regulatory genes expressed in 37 different tissues, we found 50 regulators preferentially expressed in one of the three grain tissue fractions pericarp, endosperm and embryo during seed development. In addition, 27 regulators were found to be expressed during both seed development and germination, and 32 additional regulators are characteristically expressed in multiple tissues undergoing cell differentiation events during barley plant ontogeny. Another 96 regulators were, besides in the developing seed, ubiquitously expressed among all tissues of germinating seedlings as well as in reproductive tissues. SNP-marker development for those regulators resulted in anchoring 61 markers on the genetic linkage map of barley and the chromosomal assignment of another 12 loci by using wheat-barley addition lines. The SNP frequency ranged from 0.5 to 1.0 SNP/kb in the parents of the various mapping populations and was 2.3 SNP/kb over all eight lines tested. Exploration of macrosynteny to rice revealed that the chromosomal orders of the mapped putative regulatory factors were predominantly conserved during evolution. CONCLUSION We identified expression patterns of major transcription factors and signaling related genes expressed during barley ontogeny and further assigned possible functions based on likely orthologs functionally well characterized in model plant species. The combined linkage map and reference expression map of regulators defined in the present study offers the possibility of further directed research of the functional role of regulators during seed development in barley.
Affiliation(s)
- Christof Pietsch
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), 06466 Gatersleben, Germany
- Nese Sreenivasulu
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), 06466 Gatersleben, Germany
- Ulrich Wobus
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), 06466 Gatersleben, Germany
- Marion S Röder
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), 06466 Gatersleben, Germany
4
Muth J, Hartje S, Twyman RM, Hofferbert HR, Tacke E, Prüfer D. Precision breeding for novel starch variants in potato. PLANT BIOTECHNOLOGY JOURNAL 2008; 6:576-84. [PMID: 18422889 DOI: 10.1111/j.1467-7652.2008.00340.x] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.6] [Indexed: 05/23/2023]
Abstract
Potato can be used as a source of modified starches for culinary and industrial processes, but its allelic diversity and tetraploid genome make the identification of novel alleles a challenge, and breeding such alleles into elite lines is a slow and difficult process. An efficient and reliable strategy has been developed for the rapid introduction and identification of new alleles in elite potato breeding lines, based on the ethylmethanesulphonate mutagenesis of dihaploid seeds. Using the granule-bound starch synthase I gene (waxy) as a model, a series of point mutations that potentially affect gene expression or enzyme function was identified. The most promising loss-of-function allele (waxy(E1100)) carried a mutation in the 5'-splice donor site of intron 1 that caused mis-splicing and protein truncation. This was used to establish elite breeding lineages lacking granule-bound starch synthase I protein activity and producing high-amylopectin starch. This is the first report of rapid and efficient mutation analysis in potato, a genetically complex and vegetatively propagated crop.
Affiliation(s)
- Jost Muth
- Fraunhofer Institute for Molecular Biology and Applied Ecology, Forckenbeckstrasse 6, 52074 Aachen, Germany
5
Abstract
As the number of sequenced genomes increases, the ability to deduce genome function becomes increasingly salient. For many genome sequences, the only annotation that will be available for the foreseeable future will be based on computational predictions and comparisons with functional elements in related species. Here we discuss computational approaches for automated genome-wide annotation of functional elements in mammalian genomes. These include methods for ab initio and comparative gene-structure predictions. Gene features such as intron splice sites, 3' untranslated regions, promoters, and cis-regulatory elements are discussed, as is a novel method for predicting DNaseI hypersensitive sites. Recent methodologies for predicting noncoding RNA genes, including microRNA genes and their targets, are also reviewed.
Affiliation(s)
- Steven J M Jones
- Genome Sciences Centre, British Columbia Cancer Research Center, Vancouver, British Columbia, V5Z 1L3, Canada.
6
7
Lal A, Radhakrishnan S, Srinivas SS, Najarian K, Mays LE. Splice site detection using pruned maximum likelihood model. CONFERENCE PROCEEDINGS : ... ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL CONFERENCE 2007; 2004:2836-9. [PMID: 17270868 DOI: 10.1109/iembs.2004.1403809] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/13/2023]
Abstract
In this paper we propose a novel method for splice site prediction using the maximum likelihood model. We performed maximum likelihood over the acceptor and donor datasets, and calculated sensitivity to measure the prediction performance. Then, by aggressive pruning of less informative nucleotide sites, while maintaining the high sensitivity of the method, we improved the model's performance in terms of computational speed. In addition, after pruning, fewer nucleotide sites need to be tagged, which in turn simplifies the development of an assay. The proposed method was tested on the human splice dataset. The results indicate that the proposed method was successful at splice site prediction with optimal sensitivity.
Affiliation(s)
- Anuradha Lal
- Coll. of Inf. Technol., North Carolina Univ., Charlotte, NC, USA
8
Churbanov A, Rogozin IB, Deogun JS, Ali H. Method of predicting splice sites based on signal interactions. Biol Direct 2006; 1:10. [PMID: 16584568 PMCID: PMC1526722 DOI: 10.1186/1745-6150-1-10] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Received: 03/03/2006] [Accepted: 04/03/2006] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND Prediction and proper ranking of canonical splice sites (SSs) is a challenging problem in the bioinformatics and machine learning communities. Any progress in SS recognition will lead to a better understanding of the splicing mechanism. We introduce several new approaches for combining a priori knowledge for improved SS detection. First, we design our new Bayesian SS sensor based on oligonucleotide counting. To further enhance prediction quality, we applied our new de novo motif detection tool MHMMotif to intronic ends and exons. We combine the elements found with sensor information using a Naive Bayesian Network, as implemented in our new tool SpliceScan. RESULTS According to our tests, the Bayesian sensor outperforms the contemporary Maximum Entropy sensor for 5' SS detection. We report a number of putative Exonic (ESE) and Intronic (ISE) Splicing Enhancers found by the MHMMotif tool. T-test statistics on mouse/rat intronic alignments indicate that the detected elements are on average more conserved than other oligos, which supports our assumption of their functional importance. The tool has been shown to outperform the SpliceView, GeneSplicer, NNSplice, Genio and NetUTR tools for the test set of human genes. SpliceScan outperforms all contemporary ab initio gene structural prediction tools on the set of 5' UTR gene fragments. CONCLUSION The designed methods have many attractive properties compared to existing approaches. The Bayesian sensor, MHMMotif program and SpliceScan tools are freely available on our web site. REVIEWERS This article was reviewed by Manyuan Long, Arcady Mushegian and Mikhail Gelfand.
Affiliation(s)
- Alexander Churbanov
- Department of Computer Science, College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE 68182-0116, USA
- Igor B Rogozin
- NCBI/NLM/NIH, Bldg.38-A, room 5N505A, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Jitender S Deogun
- Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588-0115, USA
- Hesham Ali
- Department of Computer Science, College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE 68182-0116, USA
9
Rajapakse JC, Ho LS. Markov encoding for detecting signals in genomic sequences. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2005; 2:131-42. [PMID: 17044178 DOI: 10.1109/tcbb.2005.27] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Indexed: 05/12/2023]
Abstract
We present a technique to encode the inputs to neural networks for the detection of signals in genomic sequences. The encoding is based on lower-order Markov models which incorporate known biological characteristics in genomic sequences. The neural networks then learn intrinsic higher-order dependencies of nucleotides at the signal sites. We demonstrate the efficacy of the Markov encoding method in the detection of three genomic signals, namely, splice sites, transcription start sites, and translation initiation sites.
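The idea of a Markov-based input encoding can be illustrated with a minimal sketch. It is only an assumption about what such an encoding might look like, not the scheme of the cited paper: a single first-order model is fitted to training sequences and each base is replaced by its transition probability. The names `fit_first_order` and `markov_encode`, the additive pseudocount, and the uniform value for the first base are all hypothetical choices, and input is assumed to be uppercase A/C/G/T only.

```python
ALPHABET = "ACGT"


def fit_first_order(sequences, pseudocount=1.0):
    """Estimate P(next | prev) from training sequences with additive smoothing."""
    counts = {p: {n: pseudocount for n in ALPHABET} for p in ALPHABET}
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    # Normalise each row so that the probabilities for a given prev sum to 1.
    return {
        p: {n: c / sum(row.values()) for n, c in row.items()}
        for p, row in counts.items()
    }


def markov_encode(seq, probs):
    """Replace each base by P(base | previous base).

    The first base has no predecessor, so it gets a uniform 0.25
    (an assumption made for this sketch).
    """
    if not seq:
        return []
    return [0.25] + [probs[p][n] for p, n in zip(seq, seq[1:])]


probs = fit_first_order(["ACAC"])
encoded = markov_encode("ACA", probs)
print(encoded)
```

The resulting numeric vector could then be fed to any classifier; a position-specific model (one transition table per site offset) would be a natural refinement.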
Affiliation(s)
- Jagath C Rajapakse
- BioInformatics Research Center, School of Computer Engineering, Nanyang Technological University, Singapore 639798.
10
Tyagi AK, Khurana JP, Khurana P, Raghuvanshi S, Gaur A, Kapur A, Gupta V, Kumar D, Ravi V, Vij S, Khurana P, Sharma S. Structural and functional analysis of rice genome. J Genet 2004; 83:79-99. [PMID: 15240912 DOI: 10.1007/bf02715832] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.0] [Indexed: 12/11/2022]
Abstract
Rice is an excellent system for plant genomics as it represents a modest size genome of 430 Mb. It feeds more than half the population of the world. Draft sequences of the rice genome, derived by whole-genome shotgun approach at relatively low coverage (4-6 X), were published and the International Rice Genome Sequencing Project (IRGSP) declared high quality (>10 X), genetically anchored, phase 2 level sequence in 2002. In addition, phase 3 level finished sequence of chromosomes 1, 4 and 10 (out of 12 chromosomes of rice) has already been reported by scientists from IRGSP consortium. Various estimates of genes in rice place the number at >50,000. Already, over 28,000 full-length cDNAs have been sequenced, most of which map to genetically anchored genome sequence. Such information is very useful in revealing novel features of macro- and micro-level synteny of rice genome with other cereals. Microarray analysis is unraveling the identity of rice genes expressing in temporal and spatial manner and should help target candidate genes useful for improving traits of agronomic importance. Simultaneously, functional analysis of rice genome has been initiated by marker-based characterization of useful genes and employing functional knock-outs created by mutation or gene tagging. Integration of this enormous information is expected to catalyze tremendous activity on basic and applied aspects of rice genomics.
Affiliation(s)
- Akhilesh K Tyagi
- Department of Plant Molecular Biology, University of Delhi South Campus, Benito Juarez Road, New Delhi 110 021, India.
11
Samson D, Legeai F, Karsenty E, Reboux S, Veyrieras JB, Just J, Barillot E. GénoPlante-Info (GPI): a collection of databases and bioinformatics resources for plant genomics. Nucleic Acids Res 2003; 31:179-82. [PMID: 12519976 PMCID: PMC165507 DOI: 10.1093/nar/gkg060] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.3] [Indexed: 11/14/2022] Open
Abstract
Génoplante is a partnership program between public French institutes (INRA, CIRAD, IRD and CNRS) and private companies (Biogemma, Bayer CropScience and Bioplante) that aims at developing genome analysis programs for crop species (corn, wheat, rapeseed, sunflower and pea) and model plants (Arabidopsis and rice). The outputs of these programs form a wealth of information (genomic sequence, transcriptome, proteome, allelic variability, mapping and synteny, and mutation data) and tools (databases, interfaces, analysis software), that are being integrated and made public at the public bioinformatics resource centre of Génoplante: GénoPlante-Info (GPI). This continuous flood of data and tools is regularly updated and will grow continuously during the coming two years. Access to the GPI databases and tools is available at http://genoplante-info.infobiogen.fr/.
Affiliation(s)
- Delphine Samson
- Génoplante-Info, Unité de Recherche Génomique-Info, INRA, Infobiogen, 523 Place des Terrasses, F-91000 Evry, France.
12
Mathé C, Sagot MF, Schiex T, Rouzé P. Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res 2002; 30:4103-17. [PMID: 12364589 PMCID: PMC140543 DOI: 10.1093/nar/gkf543] [Citation(s) in RCA: 209] [Impact Index Per Article: 9.5] [Received: 05/28/2002] [Revised: 08/07/2002] [Accepted: 08/07/2002] [Indexed: 11/14/2022] Open
Abstract
While the genomes of many organisms have been sequenced over the last few years, transforming such raw sequence data into knowledge remains a hard task. A great number of prediction programs have been developed that try to address one part of this problem, which consists of locating the genes along a genome. This paper reviews the existing approaches to predicting genes in eukaryotic genomes and underlines their intrinsic advantages and limitations. The main mathematical models and computational algorithms adopted are also briefly described and the resulting software classified according to both the method and the type of evidence used. Finally, the several difficulties and pitfalls encountered by the programs are detailed, showing that improvements are needed and that new directions must be considered.
Affiliation(s)
- Catherine Mathé
- Institut de Pharmacologie et Biologie Structurale, UMR 5089, 205 route de Narbonne, F-31077 Toulouse Cedex, France.
13
Usuka J, Brendel V. Gene structure prediction by spliced alignment of genomic DNA with protein sequences: increased accuracy by differential splice site scoring. J Mol Biol 2000; 297:1075-85. [PMID: 10764574 DOI: 10.1006/jmbi.2000.3641] [Citation(s) in RCA: 40] [Impact Index Per Article: 1.7] [Indexed: 11/22/2022]
Abstract
Gene identification in genomic DNA from eukaryotes is complicated by the vast combinatorial possibilities of potential exon assemblies. If the gene encodes a protein that is closely related to known proteins, gene identification is aided by matching similarity of potential translation products to those target proteins. The genomic DNA and protein sequences can be aligned directly by scoring the implied residues of in-frame nucleotide triplets against the protein residues in conventional ways, while allowing for long gaps in the alignment corresponding to introns in the genomic DNA. We describe a novel method for such spliced alignment. The method derives an optimal alignment based on scoring for both sequence similarity of the predicted gene product to the protein sequence and intrinsic splice site strength of the predicted introns. Application of the method to a representative set of 50 known genes from Arabidopsis thaliana showed significant improvement in prediction accuracy compared to previous spliced alignment methods. The method is also more accurate than ab initio gene prediction methods, provided sufficiently close target proteins are available. In view of the fast growth of public sequence repositories, we argue that close targets will be available for the majority of novel genes, making spliced alignment an excellent practical tool for high-throughput automated genome annotation.
Affiliation(s)
- J Usuka
- Department of Chemistry, Stanford University, Stanford, CA, 94305, USA
14
Sasaki T, Burr B. International Rice Genome Sequencing Project: the effort to completely sequence the rice genome. CURRENT OPINION IN PLANT BIOLOGY 2000; 3:138-41. [PMID: 10712951 DOI: 10.1016/s1369-5266(99)00047-3] [Citation(s) in RCA: 202] [Impact Index Per Article: 8.4] [Indexed: 05/22/2023]
Abstract
The International Rice Genome Sequencing Project (IRGSP) involves researchers from ten countries who are working to completely and accurately sequence the rice genome within a short period. Sequencing uses a map-based clone-by-clone shotgun strategy; shared bacterial artificial chromosome/P1-derived artificial chromosome libraries have been constructed from Oryza sativa ssp. japonica variety 'Nipponbare'. End-sequencing, fingerprinting and marker-aided PCR screening are being used to make sequence-ready contigs. Annotated sequences are immediately released for public use and are made available with supplemental information at each IRGSP member's website. The IRGSP works to promote the development of rice and cereal genomics in addition to producing genome sequence data.
Affiliation(s)
- T Sasaki
- Rice Genome Research Program, National Institute of Agrobiological Resources, Tsukuba, 305-8602, Japan.
15
Chopra S, Brendel V, Zhang J, Axtell JD, Peterson T. Molecular characterization of a mutable pigmentation phenotype and isolation of the first active transposable element from Sorghum bicolor. Proc Natl Acad Sci U S A 1999; 96:15330-5. [PMID: 10611384 PMCID: PMC24819 DOI: 10.1073/pnas.96.26.15330] [Citation(s) in RCA: 87] [Impact Index Per Article: 3.5] [Indexed: 11/18/2022] Open
Abstract
Accumulation of red phlobaphene pigments in sorghum grain pericarp is under the control of the Y gene. A mutable allele of Y, designated as y-cs (y-candystripe), produces a variegated pericarp phenotype. Using probes from the maize p1 gene that cross-hybridize with the sorghum Y gene, we isolated the y-cs allele containing a large insertion element. Our results show that the Y gene is a member of the MYB-transcription factor family. The insertion element, named Candystripe1 (Cs1), is present in the second intron of the Y gene and shares features of the CACTA superfamily of transposons. Cs1 is 23,018 bp in size and is bordered by 20-bp terminal inverted repeat sequences. It generated a 3-bp target site duplication upon insertion within the Y gene and excised from y-cs, leaving a 2-bp footprint in two cases analyzed. Reinsertion of the excised copy of Cs1 was identified by Southern hybridization in the genome of each of seven red pericarp revertant lines tested. Cs1 is the first active transposable element isolated from sorghum. Our analysis suggests that Cs1-homologous sequences are present in low copy number in sorghum and other grasses, including sudangrass, maize, rice, teosinte, and sugarcane. The low copy number and high transposition frequency of Cs1 imply that this transposon could prove to be an efficient gene isolation tool in sorghum.
Affiliation(s)
- S Chopra
- Department of Zoology, Iowa State University, Ames, IA 50011, USA
16
Latijnhouwers MJ, Pairoba CF, Brendel V, Walbot V, Carle-Urioste JC. Test of the combinatorial model of intron recognition in a native maize gene. PLANT MOLECULAR BIOLOGY 1999; 41:637-644. [PMID: 10645723 DOI: 10.1023/a:1006329517740] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.3] [Indexed: 05/23/2023]
Abstract
Previous studies have established that splice site selection and splicing efficiency in plants depend strongly on local compositional contrast consisting of high exon G+C content relative to high intron U content. The combinatorial model of plant intron recognition posits that splice site sequences as well as local intron and exon sequences contribute to splice site selection and splicing efficiency. Most of the previous studies used synthetic or chimeric constructs, often tested in heterologous hosts. To perform a more critical test of the combinatorial model in a native context, the single intron of the maize Bronze2 gene and its flanking exons were modified by site-directed mutagenesis. Splicing efficiency was tested in maize protoplasts. Results show that a higher U content in the flanking 5' exon, whether close to or distant from the 5' splice site, did not modify splicing efficiency. Decreasing exon G+C content dramatically impaired splicing. Increasing intron G+C content or decreasing intron U content adversely impacted splicing. In all constructs splicing occurred exclusively at the original 5' and 3' splice sites. These results are consistent with the hypothesis that exon G+C content and intron U content contribute separate but complementary aspects of intron definition in the native Bz2 transcript.
Affiliation(s)
- M J Latijnhouwers
- Department of Biological Sciences, Stanford University, CA 94305-5020, USA
17
Vignal L, Lisacek F, Quinqueton J, d'Aubenton-Carafa Y, Thermes C. A multi-agent system simulating human splice site recognition. COMPUTERS & CHEMISTRY 1999; 23:219-31. [PMID: 10404617 DOI: 10.1016/s0097-8485(99)00019-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.2] [Indexed: 11/15/2022]
Abstract
The present paper describes a method for detecting splice sites automatically on the basis of sequence data and models of site/signal recognition supported by experimental evidence. The method is designed to simulate splicing and, while doing so, to track prediction failures and missing information and possibly test correcting hypotheses. Correlations between nucleotides in the splice site regions and the various elements of the acceptor region are evaluated and combined to assess compensating interactions between elements of the splicing machinery. A scanning model of the acceptor region and a model of interaction between the splicing complexes (exon definition model) are also incorporated in the detection process. Subsets of sites presenting deficiencies of several splice site elements could be identified. Further examination of these sites helps to determine lacking elements and refine models.
18
Mathé C, Peresetsky A, Déhais P, Van Montagu M, Rouzé P. Classification of Arabidopsis thaliana gene sequences: clustering of coding sequences into two groups according to codon usage improves gene prediction. J Mol Biol 1999; 285:1977-91. [PMID: 9925779 DOI: 10.1006/jmbi.1998.2451] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.2] [Indexed: 11/22/2022]
Abstract
While genomic sequences are accumulating, finding the location of the genes remains a major issue that can be solved only for about a half of them by homology searches. Prediction methods are thus required, but unfortunately are not fully satisfying. Most prediction methods implicitly assume a unique model for genes. This is an oversimplification as demonstrated by the possibility to group coding sequences into several classes in Escherichia coli and other genomes. As no classification existed for Arabidopsis thaliana, we classified genes according to the statistical features of their coding sequences. A clustering algorithm using a codon usage model was developed and applied to coding sequences from A. thaliana, E. coli, and a mixture of both. By using it, Arabidopsis sequences were clustered into two classes. The CU1 and CU2 classes differed essentially by the choice of pyrimidine bases at the codon silent sites: CU2 genes often use C whereas CU1 genes prefer T. This classification discriminated the Arabidopsis genes according to their expressiveness, highly expressed genes being clustered in CU2 and genes expected to have a lower expression, such as the regulatory genes, in CU1. The algorithm separated the sequences of the Escherichia-Arabidopsis mixed data set into five classes according to the species, except for one class. This mixed class contained 89 % Arabidopsis genes from CU1 and 11 % E. coli genes, mostly horizontally transferred. Interestingly, most genes encoding organelle-targeted proteins, except the photosynthetic and photoassimilatory ones, were clustered in CU1. By tailoring the GeneMark CDS prediction algorithm to the observed coding sequence classes, its quality of prediction was greatly improved. Similar improvement can be expected with other prediction systems.
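The C-versus-T preference that separates the CU1 and CU2 classes can be illustrated with a small sketch. It is not the clustering algorithm of the paper: it simply counts C and T at the third base of each codon, using the third position as a rough proxy for silent sites (an assumption, since not every third-position change is synonymous). The function names are hypothetical.

```python
from collections import Counter


def third_position_pyrimidines(cds: str) -> Counter:
    """Count C and T at the third base of each complete codon."""
    cds = cds.upper().replace("U", "T")
    counts = Counter()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        third = cds[i + 2]
        if third in "CT":
            counts[third] += 1
    return counts


def c_preference(cds: str) -> float:
    """Fraction of third-position pyrimidines that are C (0.0 if there are none)."""
    counts = third_position_pyrimidines(cds)
    total = counts["C"] + counts["T"]
    return counts["C"] / total if total else 0.0


# Codons: GCC, GCT, GCC, AAA
print(c_preference("GCCGCTGCCAAA"))
```

Under the description in the abstract, a coding sequence with a high value would look more CU2-like and one with a low value more CU1-like; any actual threshold would have to come from data.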
Affiliation(s)
- C Mathé
- Laboratorium voor Genetica Department of Genetics, Flanders Interuniversity Institute for Biotechnology (VIB), Universiteit Gent, Gent, B-9000, Belgium
19
Brendel V, Kleffe J. Prediction of locally optimal splice sites in plant pre-mRNA with applications to gene identification in Arabidopsis thaliana genomic DNA. Nucleic Acids Res 1998; 26:4748-57. [PMID: 9753745 PMCID: PMC147908 DOI: 10.1093/nar/26.20.4748] [Citation(s) in RCA: 49] [Impact Index Per Article: 1.9] [Indexed: 11/13/2022] Open
Abstract
Prediction of splice site selection and efficiency from sequence inspection is of fundamental interest (testing the current knowledge of requisite sequence features) and practical importance (genome annotation, design of mutant or transgenic organisms). In plants, the dominant variables affecting splice site selection and efficiency include the degree of matching to the extended splice site consensus and the local gradient of U- and G+C-composition (introns being U-rich and exons G+C-rich). We present a novel method for splice site prediction, which was particularly trained for maize and Arabidopsis thaliana. The method extends our previous algorithm based on logitlinear models by considering three variables simultaneously: intrinsic splice site strength, local optimality and fit with respect to the overall splice pattern prediction. We show that the method considerably improves prediction specificity without compromising the high degree of sensitivity required in gene prediction algorithms. Applications to gene identification are illustrated for Arabidopsis and suggest that successful methods must combine scoring for splice sites, coding potential and similarity with potential homologs in non-trivial ways. A WWW version of the SplicePredictor program is available at http://gnomic.stanford.edu/volker/SplicePredictor.html
Affiliation(s)
- V Brendel
- Department of Mathematics, Stanford University, Stanford, CA 94305, USA.
20
Frances H, Bligh J, Larkin PD, Roach PS, Jones CA, Fu H, Park WD. Use of alternate splice sites in granule-bound starch synthase mRNA from low-amylose rice varieties. PLANT MOLECULAR BIOLOGY 1998; 38:407-15. [PMID: 9747848 DOI: 10.1023/a:1006021807799] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.1] [Indexed: 05/04/2023]
Abstract
The rice Waxy gene encodes a granule-bound starch synthase (GBSS) necessary for the synthesis of amylose in endosperm tissue. We have previously shown that a CT microsatellite near the transcriptional start site of the GBSS gene can distinguish 7 alleles that accounted for more than 80% of the variation in apparent amylose content in an extended pedigree of 89 US rice cultivars (Oryza sativa L.). Furthermore, all the cultivars with 18% or less amylose were shown to have the sequence AGTTATA at the putative leader intron 5' splice site, while all cultivars with a higher proportion of amylose had AGGTATA. Here we demonstrate that this single-base mutation reduces the efficiency of GBSS pre-mRNA processing and results in alternate splicing at three cryptic sites. The predominant 5' splice site in CT18 low-amylose varieties is 93 bp upstream of the splice site used in intermediate and high amylose varieties and is immediately 5' to the CT microsatellite that we previously demonstrated to be tightly correlated with amylose content. Use of the leader intron 5' splice site at either -93 or -1 in conjunction with the predominant 3' splice site results in formation of a small open reading frame 38 bp upstream of the normal ATG and out of frame with it. This open reading frame is not produced when any of the 5' leader intron splice sites are used in conjunction with an alternate 3' splice site five bases further downstream which was observed in all rice varieties tested.
Affiliation(s)
- H Frances
- Department of Biochemistry, Queens Medical Centre, University of Nottingham, UK
|
21
|
Brendel V, Kleffe J, Carle-Urioste JC, Walbot V. Prediction of splice sites in plant pre-mRNA from sequence properties. J Mol Biol 1998; 276:85-104. [PMID: 9514728 DOI: 10.1006/jmbi.1997.1523] [Citation(s) in RCA: 22] [Impact Index Per Article: 0.8] [Indexed: 11/22/2022]
Abstract
Heterologous introns are often inaccurately or inefficiently processed in higher plants. The precise features that distinguish the process of pre-mRNA splicing in plants from splicing in yeast and mammals are unclear. One contributing factor is the prominent base compositional contrast between U-rich plant introns and flanking G + C-rich exons. Inclusion of this contrast factor in recently developed statistical methods for splice site prediction from sequence inspection significantly improved prediction accuracy. We applied the prediction tools to re-analyze experimental data on splice site selection and splicing efficiency for native and more than 170 mutated plant introns. In almost all cases, the experimentally determined preferred sites correspond to the highest scoring sites predicted by the model. In native genes, about 90% of splice sites are the locally highest scoring sites within the bounds of the flanking exon and intron. We propose that, in most cases, local context (about 50 bases upstream and downstream from a potential intron end) is sufficient to account for intrinsic splice site strength, and that competition for transacting factors determines splice site selection in vivo. We suggest that computer-aided splice site prediction can be a powerful tool for experimental design and interpretation.
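The compositional contrast described in this abstract can be illustrated with a minimal Python sketch. It is not the statistical model of Brendel et al.; it only scores each candidate GT dinucleotide by the difference in T (U in the pre-mRNA) content between the putative intron side and the exon side, then picks the locally highest-scoring candidate. The function names (`t_fraction`, `donor_contrast`, `best_donor`) and the simple difference score are assumptions made for illustration; the 50-base default window follows the local context size mentioned in the abstract.

```python
def t_fraction(segment: str) -> float:
    """Fraction of T (U in the pre-mRNA) in a DNA segment; 0.0 if empty."""
    return segment.count("T") / len(segment) if segment else 0.0


def donor_contrast(seq: str, pos: int, window: int = 50) -> float:
    """T content of the putative intron side minus that of the exon side.

    `pos` is the index of the G in a candidate GT donor dinucleotide.
    This difference is an illustrative score, not the published model.
    """
    exon_side = seq[max(0, pos - window):pos]
    intron_side = seq[pos:pos + window]
    return t_fraction(intron_side) - t_fraction(exon_side)


def best_donor(seq: str, window: int = 50):
    """Return (position, score) of the highest-contrast GT, or None."""
    seq = seq.upper()
    candidates = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    if not candidates:
        return None
    scored = ((i, donor_contrast(seq, i, window)) for i in candidates)
    return max(scored, key=lambda pair: pair[1])
```

For a toy sequence such as a GC-rich stretch followed by a T-rich stretch, `best_donor` returns the GT at the boundary, which mirrors the abstract's observation that preferred sites tend to be the locally highest-scoring ones.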
Affiliation(s)
- V Brendel
- Department of Mathematics, Stanford University, CA 94305-2125, USA
|
22
|
Modeling dependencies in pre-mRNA splicing signals. COMPUTATIONAL METHODS IN MOLECULAR BIOLOGY 1998. [DOI: 10.1016/s0167-7306(08)60465-2] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.0] [Indexed: 01/09/2023]
|
23
|
Abstract
We present here a new algorithm for functional site analysis. It is based on four main assumptions: each variation of nucleotide composition makes a different contribution to the overall binding free energy of interaction between a functional site and another molecule; nonfunctioning site-like regions (pseudosites) are absent or rare in genomes; there may be errors in the sample of sites; and nucleotides of different site positions are considered to be mutually dependent. In this algorithm, the site set is divided into subsets, each described by a certain consensus. Donor splice sites of the human protein-coding genes were analyzed. Comparing the results with other methods of donor splice site prediction has demonstrated a more accurate prediction of consensus sequences AG/GU(A,G), G/GUnAG, /GU(A,G)AG, /GU(A,G)nGU, and G/GUA than is achieved by weight matrix and consensus (A,C)AG/GU(A,G)AGU with mismatches. The probability of the first type error, E1, for the obtained consensus set was about 0.05, and the probability of the second type error, E2, was 0.15. The analysis demonstrated that accuracy of the functional site prediction could be improved if one takes into account correlations between the site positions. The accuracy of prediction by using human consensus sequences was tested on sequences from different organisms. Some differences in consensus sequences for the plant Arabidopsis sp., the invertebrate Caenorhabditis sp., and the fungus Aspergillus sp. were revealed. For the yeast Saccharomyces sp. only one conservative consensus, /GUA(U,A,C)G(U,A,C), was revealed (E1 = 0.03, E2 = 0.03). Yeast is a very interesting model to use for analysis of molecular mechanisms of splicing.
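As a rough illustration of how a consensus set like the one listed in this abstract could be applied, the Python sketch below converts each consensus into a regular expression and checks whether a candidate exon/intron boundary fits any of them. It is not the authors' algorithm, which derives the subsets and consensuses from data. The helper names are hypothetical, and the definitions used for E1 (share of true sites missed) and E2 (share of pseudosites accepted) are assumptions, since the abstract does not define them.

```python
import re

# Donor consensus set quoted in the abstract ('/' = exon/intron boundary).
CONSENSUS_SET = ["AG/GU(A,G)", "G/GUnAG", "/GU(A,G)AG", "/GU(A,G)nGU", "G/GUA"]


def _to_regex(part: str) -> str:
    """'(A,G)' becomes a character class '[AG]'; 'n' matches any nucleotide."""
    part = re.sub(r"\(([ACGU,]+)\)",
                  lambda m: "[" + m.group(1).replace(",", "") + "]", part)
    return part.replace("n", "[ACGU]")


def compile_consensus(consensus: str):
    """Turn e.g. 'AG/GU(A,G)' into (compiled regex, number of exon positions)."""
    exon, intron = consensus.split("/")
    # Each parenthesised group counts as a single sequence position.
    exon_len = len(re.sub(r"\([^)]*\)", "x", exon))
    return re.compile(_to_regex(exon) + _to_regex(intron)), exon_len


def matches_any(rna: str, boundary: int, compiled) -> bool:
    """True if any consensus fits with its '/' aligned at index `boundary`."""
    for regex, exon_len in compiled:
        start = boundary - exon_len
        if start >= 0 and regex.match(rna, start):
            return True
    return False


def error_rates(true_sites, pseudo_sites, compiled):
    """Return (E1, E2) under the assumed definitions stated above.

    Both arguments are non-empty lists of (rna_sequence, boundary_index).
    """
    missed = sum(not matches_any(s, b, compiled) for s, b in true_sites)
    accepted = sum(matches_any(s, b, compiled) for s, b in pseudo_sites)
    return missed / len(true_sites), accepted / len(pseudo_sites)
```

Matching at a fixed boundary, instead of scanning the whole sequence, keeps the '/' position meaningful, so a GU that appears elsewhere in the sequence is not counted as a hit.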
Affiliation(s)
- I B Rogozin
- Istituto di Tecnologie Biomediche Avanzate, Consiglio Nazionale Delle Ricerche, via Ampere 56, 20131 Milano, Italy
|