101
Wei C, Peng J, Xiong Z, Yang J, Wang J, Jin Q. Subproteomic tools to increase genome annotation complexity. Proteomics 2008; 8:4209-13. [DOI: 10.1002/pmic.200800226] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/06/2022]
102
Sparks ME, Brendel V. MetWAMer: eukaryotic translation initiation site prediction. BMC Bioinformatics 2008; 9:381. [PMID: 18801175 PMCID: PMC2603428 DOI: 10.1186/1471-2105-9-381] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 03/06/2008] [Accepted: 09/18/2008] [Indexed: 11/20/2022]
Abstract
Background Translation initiation site (TIS) identification is an important aspect of the gene annotation process, requisite for the accurate delineation of protein sequences from transcript data. We have developed the MetWAMer package for TIS prediction in eukaryotic open reading frames of non-viral origin. MetWAMer can be used as a stand-alone, third-party tool for post-processing gene structure annotations generated by external computational programs and/or pipelines, or directly integrated into gene structure prediction software implementations. Results MetWAMer currently implements five distinct methods for TIS prediction, the most accurate of which is a routine that combines weighted, signal-based translation initiation site scores and the contrast in coding potential of sequences flanking TISs using a perceptron. Also, our program implements clustering capabilities through use of the k-medoids algorithm, thereby enabling cluster-specific TIS parameter utilization. In practice, our static weight array matrix-based indexing method for parameter set lookup can be used with good results in data sets exhibiting moderate levels of 5'-complete coverage. Conclusion We demonstrate that improvements in statistically-based models for TIS prediction can be achieved by taking the class of each potential start-methionine into account pending certain testing conditions, and that our perceptron-based model is suitable for the TIS identification task. MetWAMer represents a well-documented, extensible, and freely available software system that can be readily re-trained for differing target applications and/or extended with existing and novel TIS prediction methods, to support further research efforts in this area.
Affiliation(s)
- Michael E Sparks
- Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA.
103
Armañanzas R, Inza I, Santana R, Saeys Y, Flores JL, Lozano JA, Van de Peer Y, Blanco R, Robles V, Bielza C, Larrañaga P. A review of estimation of distribution algorithms in bioinformatics. BioData Min 2008; 1:6. [PMID: 18822112 PMCID: PMC2576251 DOI: 10.1186/1756-0381-1-6] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.9] [Received: 01/18/2008] [Accepted: 09/11/2008] [Indexed: 11/10/2022]
Abstract
Evolutionary search algorithms have become an essential asset in the algorithmic toolbox for solving high-dimensional optimization problems across a broad range of bioinformatics problems. Genetic algorithms, the best-known and most representative evolutionary search technique, have been the subject of the major part of such applications. Estimation of distribution algorithms (EDAs) offer a novel evolutionary paradigm that constitutes a natural and attractive alternative to genetic algorithms. They make use of a probabilistic model, learnt from the promising solutions, to guide the search process. In this paper, we set out a basic taxonomy of EDA techniques, underlining the nature and complexity of the probabilistic model of each EDA variant. We review a set of innovative works that make use of EDA techniques to solve challenging bioinformatics problems, emphasizing the EDA paradigm's potential for further research in this domain.
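As an illustration of the idea in the abstract above (a probabilistic model learnt from promising solutions guides the search), here is a minimal Python sketch of one of the simplest EDA variants, a univariate marginal distribution algorithm (UMDA), applied to the toy OneMax problem. The function names, parameter values, and the clamping of probabilities are assumptions made for this example; none of it is taken from the review itself.

```python
import random


def onemax(bits):
    """Toy fitness function: the number of ones in the bit string."""
    return sum(bits)


def umda(fitness, n_bits, pop_size=100, n_select=50, generations=30, seed=0):
    """Univariate EDA: model each bit as an independent Bernoulli variable."""
    rng = random.Random(seed)
    probs = [0.5] * n_bits  # initial model: every bit is 1 with probability 0.5
    best = None
    for _ in range(generations):
        # Sample a new population from the current probabilistic model.
        population = [
            [1 if rng.random() < p else 0 for p in probs]
            for _ in range(pop_size)
        ]
        # Keep the most promising solutions (truncation selection).
        population.sort(key=fitness, reverse=True)
        selected = population[:n_select]
        if best is None or fitness(selected[0]) > fitness(best):
            best = selected[0]
        # Re-estimate the model from the selected solutions.
        probs = [sum(ind[i] for ind in selected) / n_select for i in range(n_bits)]
        # Clamp probabilities so that no bit value becomes impossible to sample
        # (an illustrative safeguard, not part of every UMDA formulation).
        probs = [min(max(p, 0.05), 0.95) for p in probs]
    return best, probs


best, probs = umda(onemax, n_bits=20)
print(onemax(best), [round(p, 2) for p in probs])
```

More elaborate EDAs replace the independent per-bit probabilities with models that capture dependencies between variables, which is the axis along which the review organizes its taxonomy.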
Affiliation(s)
- Rubén Armañanzas
- Department of Computer Science and Artificial Intelligence, University of the Basque Country, Donostia – San Sebastián, Spain
- Iñaki Inza
- Department of Computer Science and Artificial Intelligence, University of the Basque Country, Donostia – San Sebastián, Spain
- Roberto Santana
- Department of Computer Science and Artificial Intelligence, University of the Basque Country, Donostia – San Sebastián, Spain
- Yvan Saeys
- Department of Plant Systems Biology, Ghent University, Ghent, Belgium
- Department of Molecular Genetics, Ghent University, Ghent, Belgium
- Jose Luis Flores
- Department of Computer Science and Artificial Intelligence, University of the Basque Country, Donostia – San Sebastián, Spain
- Jose Antonio Lozano
- Department of Computer Science and Artificial Intelligence, University of the Basque Country, Donostia – San Sebastián, Spain
- Yves Van de Peer
- Department of Plant Systems Biology, Ghent University, Ghent, Belgium
- Department of Molecular Genetics, Ghent University, Ghent, Belgium
- Rosa Blanco
- Department of Statistics and Operations Research, Public University of Navarre, Pamplona, Spain
- Víctor Robles
- Departamento de Arquitectura y Tecnología de Sistemas Informáticos, Universidad Politécnica de Madrid, Madrid, Spain
- Concha Bielza
- Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid, Madrid, Spain
- Pedro Larrañaga
- Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid, Madrid, Spain
104
Ter-Hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M. Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res 2008; 18:1979-90. [PMID: 18757608 DOI: 10.1101/gr.081612.108] [Citation(s) in RCA: 654] [Impact Index Per Article: 40.9] [Indexed: 11/25/2022]
Abstract
We describe a new ab initio algorithm, GeneMark-ES version 2, that identifies protein-coding genes in fungal genomes. The algorithm does not require a predetermined training set to estimate parameters of the underlying hidden Markov model (HMM). Instead, the anonymous genomic sequence in question is used as an input for iterative unsupervised training. The algorithm extends our previously developed method tested on genomes of Arabidopsis thaliana, Caenorhabditis elegans, and Drosophila melanogaster. To better reflect features of fungal gene organization, we enhanced the intron submodel to accommodate sequences with and without branch point sites. This design enables the algorithm to work equally well for species with the kinds of variations in splicing mechanisms seen in the fungal phyla Ascomycota, Basidiomycota, and Zygomycota. Upon self-training, the intron submodel switches on in several steps to reach its full complexity. We demonstrate that the accuracy of the algorithm, at both the exon and the whole-gene level, compares favorably with that of gene finders that employ supervised training. Application of the new method to known fungal genomes indicates substantial improvement over existing annotations. By eliminating the effort necessary to build comprehensive training sets, the new algorithm can streamline and accelerate the process of annotation in a large number of fungal genome sequencing projects.
105
Graf A, Gasser B, Dragosits M, Sauer M, Leparc GG, Tüchler T, Kreil DP, Mattanovich D. Novel insights into the unfolded protein response using Pichia pastoris specific DNA microarrays. BMC Genomics 2008; 9:390. [PMID: 18713468 PMCID: PMC2533675 DOI: 10.1186/1471-2164-9-390] [Citation(s) in RCA: 93] [Impact Index Per Article: 5.8] [Received: 02/08/2008] [Accepted: 08/19/2008] [Indexed: 11/24/2022]
Abstract
Background DNA microarrays are regarded as a valuable tool for basic and applied research in microbiology. However, for many industrially important microorganisms the lack of commercially available microarrays still hampers physiological research. For example, our understanding of protein folding and secretion in the yeast Pichia pastoris presently depends largely on conclusions drawn from analogies to Saccharomyces cerevisiae. To close this gap for a yeast species employed for its high capacity to produce heterologous proteins, we developed full genome DNA microarrays for P. pastoris and analyzed the unfolded protein response (UPR) in this yeast species, as compared to S. cerevisiae. Results By combining the partially annotated gene list of P. pastoris with de novo gene finding, a list of putative open reading frames was generated, for which an oligonucleotide probe set was designed using the probe design tool TherMODO (a thermodynamic model-based oligoset design optimizer). To evaluate the performance of the novel array design, microarrays carrying the oligo set were hybridized with samples from treatments with dithiothreitol (DTT) or a strain overexpressing the UPR transcription factor HAC1, both compared with a wild type strain in normal medium as untreated control. DTT treatment was compared with literature data for S. cerevisiae, and revealed similarities, but also important differences between the two yeast species. Overexpression of HAC1, the most direct control for UPR genes, resulted in significant new understanding of this important regulatory pathway in P. pastoris, and generally in yeasts. Conclusion The differences observed between P. pastoris and S. cerevisiae underline the importance of DNA microarrays for industrial production strains. P. pastoris reacts to DTT treatment mainly by the regulation of genes related to chemical stimulus, electron transport and respiration, while the overexpression of HAC1 induced many genes involved in translation, ribosome biogenesis, and organelle biosynthesis, indicating that the regulatory events triggered by DTT treatment only partially overlap with the reactions to overexpression of HAC1. The high reproducibility of the results achieved with two different oligo sets is a good indication of their robustness, and underlines the importance of less stringent selection of regulated features, in order to avoid a large number of false negative results.
Affiliation(s)
- Alexandra Graf
- Institute of Applied Microbiology, Department of Biotechnology, University of Natural Resources and Applied Life Sciences Vienna, Muthgasse 18, 1190 Vienna, Austria.
106
Singhal P, Jayaram B, Dixit SB, Beveridge DL. Prokaryotic gene finding based on physicochemical characteristics of codons calculated from molecular dynamics simulations. Biophys J 2008; 94:4173-83. [PMID: 18326660 PMCID: PMC2480686 DOI: 10.1529/biophysj.107.116392] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.8] [Received: 06/29/2007] [Accepted: 11/29/2007] [Indexed: 01/27/2023]
Abstract
An ab initio model for gene prediction in prokaryotic genomes is proposed based on physicochemical characteristics of codons calculated from molecular dynamics (MD) simulations. The model requires a specification of three calculated quantities for each codon: the double-helical trinucleotide base pairing energy, the base pair stacking energy, and an index of the propensity of a codon for protein-nucleic acid interactions. The base pairing and stacking energies for each codon are obtained from recently reported MD simulations on all unique tetranucleotide steps, and the third parameter is assigned based on the conjugate rule previously proposed to account for the wobble hypothesis with respect to degeneracies in the genetic code. The third interaction propensity parameter values correlate well with ab initio MD calculated solvation energies and flexibility of codon sequences as well as codon usage in genes and amino acid composition frequencies in approximately 175,000 protein sequences in the Swissprot database. Assignment of these three parameters for each codon enables the calculation of the magnitude and orientation of a cumulative three-dimensional vector for a DNA sequence of any length in each of the six genomic reading frames. Analysis of 372 genomes comprising approximately 350,000 genes shows that the orientations of the gene and nongene vectors are well differentiated and make a clear distinction feasible between genic and nongenic sequences at a level equivalent to or better than currently available knowledge-based models trained on the basis of empirical data, presenting a strong support for the possibility of a unique and useful physicochemical characterization of DNA sequences from codons to genomes.
Affiliation(s)
- Poonam Singhal
- Department of Chemistry and Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology, Hauz Khas, New Delhi 110016, India
107
Levasseur A, Pontarotti P, Poch O, Thompson JD. Strategies for reliable exploitation of evolutionary concepts in high throughput biology. Evol Bioinform Online 2008; 4:121-37. [PMID: 19204813 PMCID: PMC2614184 DOI: 10.4137/ebo.s597] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Indexed: 11/27/2022]
Abstract
The recent availability of the complete genome sequences of a large number of model organisms, together with the immense amount of data being produced by the new high-throughput technologies, means that we can now begin comparative analyses to understand the mechanisms involved in the evolution of the genome and their consequences in the study of biological systems. Phylogenetic approaches provide a unique conceptual framework for performing comparative analyses of all this data, for propagating information between different systems and for predicting or inferring new knowledge. As a result, phylogeny-based inference systems are now playing an increasingly important role in most areas of high throughput genomics, including studies of promoters (phylogenetic footprinting), interactomes (based on the presence and degree of conservation of interacting proteins), and in comparisons of transcriptomes or proteomes (phylogenetic proximity and co-regulation/co-expression). Here we review the recent developments aimed at making automatic, reliable phylogeny-based inference feasible in large-scale projects. We also discuss how evolutionary concepts and phylogeny-based inference strategies are now being exploited in order to understand the evolution and function of biological systems. Such advances will be fundamental for the success of the emerging disciplines of systems biology and synthetic biology, and will have wide-reaching effects in applied fields such as biotechnology, medicine and pharmacology.
Affiliation(s)
- Anthony Levasseur
- Phylogenomics Laboratory, EA 3781 Evolution Biologique, Université de Provence, 13331 Marseille, France
108
Abstract
As the number of sequenced genomes increases, the ability to deduce genome function becomes increasingly salient. For many genome sequences, the only annotation that will be available for the foreseeable future will be based on computational predictions and comparisons with functional elements in related species. Here we discuss computational approaches for automated genome-wide annotation of functional elements in mammalian genomes. These include methods for ab initio and comparative gene-structure predictions. Gene features such as intron splice sites, 3' untranslated regions, promoters, and cis-regulatory elements are discussed, as is a novel method for predicting DNaseI hypersensitive sites. Recent methodologies for predicting noncoding RNA genes, including microRNA genes and their targets, are also reviewed.
Affiliation(s)
- Steven J M Jones
- Genome Sciences Centre, British Columbia Cancer Research Center, Vancouver, British Columbia, V5Z 1L3, Canada.
109
Mena-Chalco JP, Carrer H, Zana Y, Cesar RM. Identification of protein coding regions using the modified Gabor-wavelet transform. IEEE/ACM Trans Comput Biol Bioinform 2008; 5:198-207. [PMID: 18451429 DOI: 10.1109/tcbb.2007.70259] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.7] [Indexed: 05/26/2023]
Abstract
An important topic in genomic sequence analysis is the identification of protein coding regions. In this context, several coding DNA model-independent methods, based on the occurrence of specific patterns of nucleotides at coding regions, have been proposed. Nonetheless, these methods have not been completely suitable due to their dependence on an empirically pre-defined window length required for a local analysis of a DNA region. We introduce a method, based on a modified Gabor-wavelet transform (MGWT), for the identification of protein coding regions. This novel transform is tuned to analyze periodic signal components and presents the advantage of being independent of the window length. We compared the performance of the MGWT with other methods using eukaryote datasets. The results show that the MGWT outperforms all assessed model-independent methods with respect to identification accuracy. These results indicate that the source of at least part of the identification errors produced by the previous methods is the fixed working scale. The new method not only avoids this source of errors, but also makes available a tool for detailed exploration of the nucleotide occurrence.
Affiliation(s)
- Jesús P Mena-Chalco
- Departamento de Ciência da Computação, Instituto de Matemática e Estatística da Universidade de São Paulo, Rua do Matão, Cidade Universitária, São Paulo, SP, Brasil.
110
Abstract
The quest for evolutionary mechanisms providing separation between the coding (exons) and noncoding (introns) parts of genomic DNA remains an important focus of genetics. This work combines an analysis of the most recent achievements of genomics and fundamental concepts of random processes to provide a novel point of view on genome evolution. Exon sizes in sequenced genomes show a lognormal distribution typical of a random Kolmogoroff fractioning process. This implies that the process of intron incretion may be independent of exon size, and therefore could be dependent on intron–exon boundaries. All genomes examined have two distinctive classes of exons, each with different evolutionary histories. In the framework proposed in this article, these two classes of exons can be derived from a hypothetical ancestral genome by (spontaneous) symmetry breaking. We note that one of these exon classes comprises mostly alternatively spliced exons.
Affiliation(s)
- Yaroslav Ryabov
- Department of Chemistry, Purdue University, 560 Oval drive, Box 202, West Lafayette, IN, 47907, USA.
111
Ferro M, Tardif M, Reguer E, Cahuzac R, Bruley C, Vermat T, Nugues E, Vigouroux M, Vandenbrouck Y, Garin J, Viari A. PepLine: a software pipeline for high-throughput direct mapping of tandem mass spectrometry data on genomic sequences. J Proteome Res 2008; 7:1873-83. [PMID: 18348511 DOI: 10.1021/pr070415k] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.7] [Indexed: 11/29/2022]
Abstract
PepLine is a fully automated software pipeline that maps MS/MS fragmentation spectra of tryptic peptides to genomic DNA sequences. The approach is based on Peptide Sequence Tags (PSTs) obtained from partial interpretation of QTOF MS/MS spectra (first module). PSTs are then mapped onto the six-frame translations of genomic sequences (second module), giving hits. Hits are then clustered to detect potential coding regions (third module). Our work aimed at optimizing the algorithms of each component to allow the whole pipeline to proceed in a fully automated manner using raw nucleic acid sequences (i.e., genomes that have not been "reduced" to a database of ORFs or putative exon sequences). The whole pipeline was tested on controlled MS/MS spectra sets from standard proteins and from Arabidopsis thaliana envelope chloroplast samples. Our results demonstrate that PepLine is competitive with protein database searching software and is fast enough to potentially tackle large data sets and/or large genomes. We also illustrate the potential of this approach for the detection of the intron/exon structure of genes.
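The second module described above relies on six-frame translation of genomic DNA. The following Python sketch shows what such a translation looks like in general, using the standard genetic code. It is not PepLine's implementation; the function names and the frame labels ("+1" to "-3") are hypothetical choices for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Translate the three forward and three reverse-strand reading frames."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


for name, protein in six_frame_translations("ATGGCC").items():
    print(name, protein)
```

A peptide sequence tag could then be searched for as a substring in each of the six translated frames, which is one simple way to picture the "hits" mentioned in the abstract.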
Affiliation(s)
- Myriam Ferro
- CEA, DSV, iRTSV, Laboratoire d'Etude de la Dynamique des Protéomes, Grenoble, F-38054, France
112
Abeel T, Saeys Y, Bonnet E, Rouzé P, Van de Peer Y. Generic eukaryotic core promoter prediction using structural features of DNA. Genome Res 2008; 18:310-23. [PMID: 18096745 PMCID: PMC2203629 DOI: 10.1101/gr.6991408] [Citation(s) in RCA: 133] [Impact Index Per Article: 8.3] [Received: 08/03/2007] [Accepted: 11/14/2007] [Indexed: 11/24/2022]
Abstract
Despite many recent efforts, in silico identification of promoter regions is still in its infancy. However, the accurate identification and delineation of promoter regions is important for several reasons, such as improving genome annotation and devising experiments to study and understand transcriptional regulation. Current methods to identify the core region of promoters require large amounts of high-quality training data and often behave like black box models that output predictions that are difficult to interpret. Here, we present a novel approach for predicting promoters in whole-genome sequences by using large-scale structural properties of DNA. Our technique requires no training, is applicable to many eukaryotic genomes, and performs extremely well in comparison with the best available promoter prediction programs. Moreover, it is fast, simple in design, and has no size constraints, and the results are easily interpretable. We compared our approach with 14 current state-of-the-art implementations using human gene and transcription start site data and analyzed the ENCODE region in more detail. We also validated our method on 12 additional eukaryotic genomes, including vertebrates, invertebrates, plants, fungi, and protists.
Affiliation(s)
- Thomas Abeel
- Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), 9052 Gent, Belgium
- Department of Molecular Genetics, Ghent University, 9052 Gent, Belgium
- Yvan Saeys
- Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), 9052 Gent, Belgium
- Department of Molecular Genetics, Ghent University, 9052 Gent, Belgium
- Eric Bonnet
- Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), 9052 Gent, Belgium
- Department of Molecular Genetics, Ghent University, 9052 Gent, Belgium
- Pierre Rouzé
- Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), 9052 Gent, Belgium
- Department of Molecular Genetics, Ghent University, 9052 Gent, Belgium
- Laboratoire Associé de l’INRA (France), Ghent University, 9052 Gent, Belgium
- Yves Van de Peer
- Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), 9052 Gent, Belgium
- Department of Molecular Genetics, Ghent University, 9052 Gent, Belgium
113
Harbers M. The current status of cDNA cloning. Genomics 2008; 91:232-42. [PMID: 18222633 DOI: 10.1016/j.ygeno.2007.11.004] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.9] [Received: 08/17/2007] [Revised: 11/10/2007] [Accepted: 11/17/2007] [Indexed: 11/19/2022]
Abstract
The cloning of cDNAs, copies of cellular RNA, is one of the classical technologies in molecular biology. Over the past 30 years cDNA cloning technologies have been improved to enable the cloning of large cDNA collections, which are fundamental to today's understanding of the utilization of genetic information. With the discovery of noncoding RNAs, additional new approaches to the cloning of short RNAs have been developed. However, with the realization that much larger portions of genomes are transcribed than anticipated from genome annotations, cDNA cloning faces new challenges to uncover rare transcripts and to make the corresponding cDNAs available for functional studies. This review provides an overview on the current status of cDNA cloning and possibilities for the discovery and characterization of new RNA families.
Affiliation(s)
- Matthias Harbers
- DNAFORM, Inc., Leading Venture Plaza 2, 75-1 Ono-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0046, Japan.
114
115
An artificial neural network method for combining gene prediction based on equitable weights. Neurocomputing 2008. [DOI: 10.1016/j.neucom.2007.07.019] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 11/20/2022]
116
Markov G, Lecointre G, Demeneix B, Laudet V. The “street light syndrome”, or how protein taxonomy can bias experimental manipulations. Bioessays 2008; 30:349-57. [DOI: 10.1002/bies.20730] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/08/2022]
117
Dogan RI, Getoor L, Wilbur WJ, Mount SM. Features generated for computational splice-site prediction correspond to functional elements. BMC Bioinformatics 2007; 8:410. [PMID: 17958908 PMCID: PMC2241647 DOI: 10.1186/1471-2105-8-410] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.0] [Received: 03/19/2007] [Accepted: 10/24/2007] [Indexed: 11/16/2022]
Abstract
Background Accurate selection of splice sites during the splicing of precursors to messenger RNA requires both relatively well-characterized signals at the splice sites and auxiliary signals in the adjacent exons and introns. We previously described a feature generation algorithm (FGA) that is capable of achieving high classification accuracy on human 3' splice sites. In this paper, we extend the splice-site prediction to 5' splice sites and explore the generated features for biologically meaningful splicing signals. Results We present examples from the observed features that correspond to known signals, both core signals (including the branch site and pyrimidine tract) and auxiliary signals (including GGG triplets and exon splicing enhancers). We present evidence that features identified by FGA include splicing signals not found by other methods. Conclusion Our generated features capture known biological signals in the expected sequence interval flanking splice sites. The method can be easily applied to other species and to similar classification problems, such as tissue-specific regulatory elements, polyadenylation sites, promoters, etc.
118
Roos FF, Jacob R, Grossmann J, Fischer B, Buhmann JM, Gruissem W, Baginsky S, Widmayer P. PepSplice: cache-efficient search algorithms for comprehensive identification of tandem mass spectra. Bioinformatics 2007; 23:3016-23. [PMID: 17768164 DOI: 10.1093/bioinformatics/btm417] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.6] [Indexed: 11/13/2022]
Abstract
MOTIVATION Tandem mass spectrometry allows for high-throughput identification of complex protein samples. Searching tandem mass spectra against sequence databases is the main analysis method nowadays. Since many peptide variations are possible, including them in the search space seems only logical. However, the search space usually grows exponentially with the number of independent variations and may therefore overwhelm computational resources. RESULTS We provide fast, cache-efficient search algorithms to screen large peptide search spaces including non-tryptic peptides, whole genomes, dozens of posttranslational modifications, unannotated point mutations and even unannotated splice sites. All these search spaces can be screened simultaneously. By optimizing the cache usage, we achieve a calculation speed that closely approaches the limits of the hardware. At the same time, we control the size of the overall search space by limiting the combinations of variations that can co-occur on the same peptide. Using a hypergeometric scoring scheme, we applied these algorithms to a dataset of 1 420 632 spectra. We were able to identify a considerable number of peptide variations within a modest amount of computing time on standard desktop computers.
Affiliation(s)
- Franz F Roos
- Institute of Theoretical Computer Science, Institute of Plant Science, Institute of Computational Science, ETH Zurich, CH-8092 Zurich, Switzerland
119
Yin C, Yau SST. Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence. J Theor Biol 2007; 247:687-94. [PMID: 17509616 DOI: 10.1016/j.jtbi.2007.03.038] [Citation(s) in RCA: 119] [Impact Index Per Article: 7.0] [Received: 10/22/2006] [Revised: 03/24/2007] [Accepted: 03/26/2007] [Indexed: 11/30/2022]
Abstract
With the exponential growth of genomic sequences, there is an increasing demand to accurately identify protein coding regions (exons) from genomic sequences. Despite considerable progress in the identification of protein coding regions by computational methods during the last two decades, the performance and efficiency of the prediction methods still need to be improved. In addition, it is indispensable to develop different prediction methods, since combining different methods may greatly improve the prediction accuracy. A new method to predict protein coding regions is developed in this paper based on the fact that most exon sequences have a 3-base periodicity, while intron sequences do not have this unique feature. The method computes the 3-base periodicity and the background noise of the stepwise DNA segments of the target DNA sequences using nucleotide distributions in the three codon positions of the DNA sequences. Exon and intron sequences can be identified from trends of the ratio of the 3-base periodicity to the background noise in the DNA sequences. Case studies on genes from different organisms show that this method is an effective approach for exon prediction.
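To make the idea of measuring 3-base periodicity from nucleotide distributions at the three codon positions more concrete, here is a rough Python sketch. The score used (squared deviation of each nucleotide's per-position counts from their mean, normalized by the squared segment length) is an assumption chosen for illustration and is not the formula of Yin and Yau; the function names and default window settings are likewise hypothetical.

```python
from collections import Counter


def phase_counts(seq):
    """Count each nucleotide separately at positions 0, 1 and 2 modulo 3."""
    counts = [Counter() for _ in range(3)]
    for i, base in enumerate(seq.upper()):
        if base in "ACGT":
            counts[i % 3][base] += 1
    return counts


def periodicity_score(seq):
    """Illustrative score: 0 when all three positions share the same composition."""
    counts = phase_counts(seq)
    n = sum(sum(c.values()) for c in counts)
    if n == 0:
        return 0.0
    score = 0.0
    for base in "ACGT":
        per_phase = [counts[k][base] for k in range(3)]
        mean = sum(per_phase) / 3
        score += sum((x - mean) ** 2 for x in per_phase)
    return score / n ** 2


def sliding_scores(seq, window=120, step=3):
    """Score stepwise segments; a step that is a multiple of 3 keeps the phase aligned."""
    last_start = max(len(seq) - window, 0)
    return [
        (start, periodicity_score(seq[start:start + window]))
        for start in range(0, last_start + 1, step)
    ]


print(periodicity_score("ATG" * 20))   # strongly periodic toy sequence
print(periodicity_score("ACGT" * 15))  # same composition at every codon position
```

Plotting the output of `sliding_scores` along a sequence gives a profile in which, under this toy score, segments with codon-like structure stand out from segments whose composition does not depend on position.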
Affiliation(s)
- Changchuan Yin
- Department of Mathematics, Statistics and Computer Science, The University of Illinois at Chicago, M/C 249, Chicago, IL 60607-7045, USA
120
Ma BG. How to describe genes: Enlightenment from the quaternary number system. Biosystems 2007; 90:20-7. [PMID: 16945479 DOI: 10.1016/j.biosystems.2006.06.004] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 05/13/2005] [Revised: 06/15/2006] [Accepted: 06/19/2006] [Indexed: 11/17/2022]
Abstract
As an open problem, computational gene identification has been widely studied, and many gene finders (software) have become available. However, little attention has been given to the problem of describing the common features of known genes in databanks to transform raw data into human-understandable knowledge. In this paper, we draw attention to the task of describing genes and propose a trial implementation by treating DNA sequences as quaternary numbers. Under such a treatment, the common features of genes can be represented by a "position weight function", the core concept for a number system. In principle, the "position weight function" can be any real-valued function. In this paper, by approximating the function using trigonometric functions, some characteristic parameters indicating single nucleotide periodicities were obtained for the genome of the bacterium Escherichia coli K12 and the genome of the eukaryote yeast. As a byproduct of this approach, a single-nucleotide-level measure is derived that complements codon-based indexes in describing the coding quality and expression level of an open reading frame (ORF). The ideas presented here have the potential to become a general methodology for biological sequence analysis.
Affiliation(s)
- Bin-Guang Ma
- College of Chemistry and Chemical Engineering, Suzhou University, Suzhou 215006, PR China.
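To make the notion of reading a DNA sequence as a quaternary number concrete, here is a minimal Python sketch. The digit assignment (A=0, C=1, G=2, T=3) and the default weight 4^-(i+1), which reads the sequence as a base-4 fraction, are hypothetical choices for illustration; the abstract only states that the position weight function can be any real-valued function, so the function is passed in as a parameter.

```python
# Hypothetical digit assignment; the paper may use a different mapping.
DIGITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def quaternary_value(seq, weight=lambda i: 4.0 ** -(i + 1)):
    """Sum of digit * weight(position) over the sequence.

    With the default weight the sequence is read as a base-4 fraction in
    [0, 1); any other real-valued position weight function can be supplied.
    """
    return sum(DIGITS[base] * weight(i) for i, base in enumerate(seq.upper()))
```

For example, `quaternary_value("AC")` evaluates to 0 × 4⁻¹ + 1 × 4⁻², which is 0.0625. Passing a trigonometric `weight` would mimic, in spirit only, the approximation mentioned in the abstract.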
|
121
|
Andreini C, Banci L, Bertini I, Elmi S, Rosato A. Non-heme iron through the three domains of life. Proteins 2007; 67:317-24. [PMID: 17286284 DOI: 10.1002/prot.21324] [Citation(s) in RCA: 63] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
Metalloproteins are proteins capable of binding one or more metal ions, which are often required for their biological function or for regulation of their activities or for structural purposes. In high-throughput genome-level protein investigation efforts, such as Structural Genomics, the systematic experimental characterization of metal-binding properties (i.e. the investigation of the metalloproteome) is not always pursued, and remains far from trivial. In the present work we have applied a bioinformatic approach to investigate the occurrence of (putative) non-heme iron-binding proteins in 57 different organisms spanning the entire tree of life. It is found that the non-heme iron-proteome constitutes between 1% and 10% of the entire proteome of an organism. However, the iron-proteome constitutes a higher fraction of the proteome in archaea (on average 7.1% +/- 2.1%) than in bacteria (3.9% +/- 1.6%) and in eukaryota (1.1% +/- 0.4%). The analysis of the function of each putative iron-protein identified suggests that extant organisms have inherited the large majority of their iron-proteome from the last common ancestor.
Affiliation(s)
- Claudia Andreini
- Magnetic Resonance Center (CERM) and Department of Chemistry, University of Florence, 50019 Sesto Fiorentino, Italy
|
122
|
Identification and characterization of insect-specific proteins by genome data analysis. BMC Genomics 2007; 8:93. [PMID: 17407609 PMCID: PMC1852559 DOI: 10.1186/1471-2164-8-93] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2006] [Accepted: 04/04/2007] [Indexed: 11/10/2022] Open
Abstract
Background Insects constitute the vast majority of known species with their importance including biodiversity, agricultural, and human health concerns. It is likely that the successful adaptation of the Insecta clade depends on specific components in its proteome that give rise to specialized features. However, proteome determination is an intensive undertaking. Here we present results from a computational method that uses genome analysis to characterize insect and eukaryote proteomes as an approximation complementary to experimental approaches. Results Homologs in common to Drosophila melanogaster, Anopheles gambiae, Bombyx mori, Tribolium castaneum, and Apis mellifera were compared to the complete genomes of three non-insect eukaryotes (opisthokonts) Homo sapiens, Caenorhabditis elegans and Saccharomyces cerevisiae. This operation yielded 154 groups of orthologous proteins in Drosophila to be insect-specific homologs; 466 groups were determined to be common to eukaryotes (represented by three opisthokonts). ESTs from the hemimetabolous insect Locusta migratoria were also considered in order to approximate their corresponding genes in the insect-specific homologs. Stress and stimulus response proteins were found to constitute a higher fraction in the insect-specific homologs than in the homologs common to eukaryotes. Conclusion The significant representation of stress response and stimulus response proteins in proteins determined to be insect-specific, along with specific cuticle and pheromone/odorant binding proteins, suggests that communication and adaptation to environments may distinguish insect evolution relative to other eukaryotes. The tendency for low Ka/Ks ratios in the insect-specific protein set suggests purifying selection pressure. The generally larger number of paralogs in the insect-specific proteins may indicate adaptation to environment changes. Instances in our insect-specific protein set have been arrived at through experiments reported in the literature, supporting the accuracy of our approach.
|
123
|
Zhu W, Buell CR. Improvement of whole-genome annotation of cereals through comparative analyses. Genome Res 2007; 17:299-310. [PMID: 17284677 PMCID: PMC1800921 DOI: 10.1101/gr.5881807] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
Abstract
Rice is an important model species for the Poaceae and other monocotyledonous plants. With the availability of a near-complete, finished, and annotated rice genome, we performed genome level comparisons between rice and all plant species in which large genomic or transcriptomic data sets are available to determine the utility of cross-species sequence for structural and functional annotation of the rice genome. Through comparative analyses with four plant genome sequence data sets and transcript assemblies from 185 plant species, we were able to confirm and improve the structural annotation of the rice genome. We found support for 38,109 (89.3%) of the total 42,653 nontransposable element-related genes in the rice genome in the form of a rice expressed sequence tag, full-length cDNA, or plant homolog from our comparative analyses. Although the majority of the putative homologs were obtained from Poaceae species, putative homologs were identified in dicotyledonous angiosperms, gymnosperms, and other plants such as algae, moss, and fern. A set of rice genes (7669) lacking a putative homolog was identified which may be lineage-specific genes that evolved after speciation and have a role in species diversity. Improvements to the current rice gene structural annotation could be identified from our comparative alignments and we were able to identify 487 genes which were most likely missed in the current rice genome annotation and another 500 genes for structural annotation review. We were able to demonstrate the utility of cross-species comparative alignments in the identification of noncoding sequences and in confirmation of gene nesting in rice.
Affiliation(s)
- Wei Zhu
- The Institute for Genomic Research, Rockville, Maryland 20850, USA
- C. Robin Buell
- The Institute for Genomic Research, Rockville, Maryland 20850, USA
- Corresponding author. Fax: (301) 838-0208
|
124
|
Bernal A, Crammer K, Hatzigeorgiou A, Pereira F. Global discriminative learning for higher-accuracy computational gene prediction. PLoS Comput Biol 2007; 3:e54. [PMID: 17367206 PMCID: PMC1828702 DOI: 10.1371/journal.pcbi.0030054] [Citation(s) in RCA: 60] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2006] [Accepted: 02/01/2007] [Indexed: 11/18/2022] Open
Abstract
Most ab initio gene predictors use a probabilistic sequence model, typically a hidden Markov model, to combine separately trained models of genomic signals and content. By combining separate models of relevant genomic features, such gene predictors can exploit small training sets and incomplete annotations, and can be trained fairly efficiently. However, that type of piecewise training does not optimize prediction accuracy and has difficulty in accounting for statistical dependencies among different parts of the gene model. With genomic information being created at an ever-increasing rate, it is worth investigating alternative approaches in which many different types of genomic evidence, with complex statistical dependencies, can be integrated by discriminative learning to maximize annotation accuracy. Among discriminative learning methods, large-margin classifiers have become prominent because of the success of support vector machines (SVM) in many classification tasks. We describe CRAIG, a new program for ab initio gene prediction based on a conditional random field model with semi-Markov structure that is trained with an online large-margin algorithm related to multiclass SVMs. Our experiments on benchmark vertebrate datasets and on regions from the ENCODE project show significant improvements in prediction accuracy over published gene predictors that use intrinsic features only, particularly at the gene level and on genes with long introns. We describe a new approach to statistical learning for sequence data that is broadly applicable to computational biology problems and that has experimentally demonstrated advantages over current hidden Markov model (HMM)-based methods for sequence analysis. The methods we describe in this paper, implemented in the CRAIG program, allow researchers to modularly specify and train sequence analysis models that combine a wide range of weakly informative features into globally optimal predictions. 
Our results for the gene prediction problem show significant improvements over existing ab initio gene predictors on a variety of tests, including the especially challenging ENCODE regions. Such improved predictions, particularly on initial and single exons, could benefit researchers who are seeking more accurate means of recognizing such important features as signal peptides and regulatory regions. More generally, we believe that our method, by combining the structure-describing capabilities of HMMs with the accuracy of margin-based classification methods, provides a general tool for statistical learning in biological sequences that will replace HMMs in any sequence modeling task for which there is annotated training data.
Affiliation(s)
- Axel Bernal
- Department of Computer and Information Science, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America.
|
125
|
Danchin EGJ, Levasseur A, Rascol VL, Gouret P, Pontarotti P. The use of evolutionary biology concepts for genome annotation. JOURNAL OF EXPERIMENTAL ZOOLOGY PART B-MOLECULAR AND DEVELOPMENTAL EVOLUTION 2007; 308:26-36. [PMID: 17016828 DOI: 10.1002/jez.b.21131] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Abstract
The past decade has seen the completion of numerous whole-genome sequencing projects, beginning with bacterial genomes and continuing with eukaryotic species from different phyla: fungi, plants and animals. In addition, more biological information is produced and shared thanks to information exchange systems, and more biological concepts, as well as more bioinformatics tools, are available. In this article, we will describe how evolutionary biology concepts, as well as computer science, are useful for a better understanding of biology in general and genome annotation in particular. The genome annotation process consists of taking the raw DNA produced, for example, by the genome sequencing projects, adding the layers of analysis and interpretation necessary to extract its biological significance and placing it in the context of our understanding of biological processes. Genome annotation is a multistep process falling into two broad categories: structural and functional annotation.
Affiliation(s)
- Etienne G J Danchin
- Glycogenomics and Biomedical Structural Biology, AFMB Laboratory, UMR 6098, CNRS, Universités d'Aix-Marseille I et II, 13288 Marseille, France
|
126
|
Savidor A, Donahoo RS, Hurtado-Gonzales O, Verberkmoes NC, Shah MB, Lamour KH, McDonald WH. Expressed peptide tags: an additional layer of data for genome annotation. J Proteome Res 2007; 5:3048-58. [PMID: 17081056 DOI: 10.1021/pr060134x] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
While genome sequencing is becoming ever more routine, genome annotation remains a challenging process. Identification of the coding sequences within the genomic milieu presents a tremendous challenge, especially for eukaryotes with their complex gene architectures. Here, we present a method to assist the annotation process through the use of proteomic data and bioinformatics. Mass spectra of digested protein preparations of the organism of interest were acquired and searched against a protein database created by a six-frame translation of the genome. The identified peptides were mapped back to the genome, compared to the current annotation, and then categorized as supporting or extending the current genome annotation. We named the classified peptides Expressed Peptide Tags (EPTs). The well-annotated bacterium Rhodopseudomonas palustris was used as a control for the method and showed a high degree of correlation between EPT mapping and the current annotation, with 86% of the EPTs confirming existing gene calls and less than 1% of the EPTs expanding on the current annotation. The eukaryotic plant pathogens Phytophthora ramorum and Phytophthora sojae, whose genomes have been recently sequenced and are much less well-annotated, were also subjected to this method. A series of algorithmic steps were taken to increase the confidence of EPT identification for these organisms, including generation of smaller subdatabases to be searched against, and definition of EPT criteria that accommodate the more complex eukaryotic gene architecture. As expected, the analysis of the Phytophthora species showed less correlation between EPT mapping and their current annotation. While approximately 76% of Phytophthora EPTs supported the current annotation, a portion of them (7.7% and 12.9% for P. ramorum and P. sojae, respectively) suggested modification to current gene calls or identified novel genes that were missed by the current genome annotation of these organisms.
Affiliation(s)
- Alon Savidor
- Graduate School of Genome Science and Technology, University of Tennessee-Oak Ridge National Laboratory, Oak Ridge, Tennessee 37830, USA
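The six-frame translation step mentioned in the abstract above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes the standard genetic code (the authors' pipeline and any organism-specific settings are not described here), translates ambiguous codons to `X`, and uses hypothetical function names.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return a dict mapping frame (+1..+3, -1..-3) to its translation."""
    frames = {}
    rc = reverse_complement(seq)
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames
```

In a workflow like the one the abstract describes, each of the six translations would typically be split at stop codons (`*`) to build the protein database that mass spectra are searched against; that splitting step is left out of this sketch.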
|
127
|
Saeys Y, Rouzé P, Van de Peer Y. In search of the small ones: improved prediction of short exons in vertebrates, plants, fungi and protists. Bioinformatics 2007; 23:414-20. [PMID: 17204465 DOI: 10.1093/bioinformatics/btl639] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Prediction of the coding potential for stretches of DNA is crucial in gene calling and genome annotation, where it is used to identify potential exons and to position their boundaries in conjunction with functional sites, such as splice sites and translation initiation sites. The ability to discriminate between coding and non-coding sequences relates to the structure of coding sequences, which are organized in codons, and to their biased usage. For statistical reasons, the longer the sequences, the easier it is to detect this codon bias. However, in many eukaryotic genomes, where genes harbour many introns, both introns and exons might be small and hard to distinguish based on coding potential. RESULTS Here, we present novel approaches that specifically aim at a better detection of coding potential in short sequences. The methods use complementary sequence features, combined with identification of which features are relevant in discriminating between coding and non-coding sequences. These newly developed methods are evaluated on different species, representative of four major eukaryotic kingdoms, and extensively compared to state-of-the-art Markov models, which are often used for predicting coding potential. The main conclusions drawn from our analyses are that (1) combining complementary sequence features clearly outperforms current Markov models for coding potential prediction in short sequence fragments, (2) coding potential prediction benefits from length-specific models, and these models are not necessarily the same for different sequence lengths and (3) comparing the results across several species indicates that, although our combined method consistently performs extremely well, there are important differences across genomes. SUPPLEMENTARY DATA http://bioinformatics.psb.ugent.be/.
Affiliation(s)
- Yvan Saeys
- Department of Plant Systems Biology, Flanders Interuniversity Institute for Biotechnology (VIB), Technologiepark 927, B-9052 Ghent, Belgium.
|
128
|
Suen G, Arshinoff BI, Taylor RG, Welch RD. Practical Applications of Bacterial Functional Genomics. Biotechnol Genet Eng Rev 2007; 24:213-42. [DOI: 10.1080/02648725.2007.10648101] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]
|
129
|
Gorantla M, Babu PR, Lachagari VBR, Reddy AMM, Wusirika R, Bennetzen JL, Reddy AR. Identification of stress-responsive genes in an indica rice (Oryza sativa L.) using ESTs generated from drought-stressed seedlings. JOURNAL OF EXPERIMENTAL BOTANY 2007; 58:253-65. [PMID: 17132712 DOI: 10.1093/jxb/erl213] [Citation(s) in RCA: 60] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/12/2023]
Abstract
The impacts of drought on plant growth and development limit cereal crop production worldwide. Rice (Oryza sativa) productivity and production are severely affected due to recurrent droughts in almost all agroecological zones. With the advent of molecular and genomic technologies, emphasis is now placed on understanding the mechanisms of genetic control of the drought-stress response. In order to identify genes associated with water-stress response in rice, ESTs generated from a normalized cDNA library, constructed from drought-stressed leaf tissue of an indica cultivar, Nagina 22, were used. Analysis of 7794 cDNA sequences led to the identification of 5815 rice ESTs. Of these, 334 exhibited no significant sequence homology with any rice ESTs or full-length cDNAs in public databases, indicating that these transcripts are enriched during drought stress. Analysis of these 5815 ESTs led to the identification of 1677 unique sequences. To characterize this drought transcriptome further and to identify candidate genes associated with the drought-stress response, the rice data were compared with those for abiotic stress-induced sequences obtained from expression profiling studies in Arabidopsis, barley, maize, and rice. This comparative analysis identified 589 putative stress-responsive genes (SRGs) that are shared by these diverse plant species. Further, the identified leaf SRGs were compared to expression profiles for a drought-stressed rice panicle library to identify common sequences. Significantly, 125 genes were found to be expressed under drought stress in both tissues. The functional classification of these 125 genes showed that a majority of them are associated with cellular metabolism, signal transduction, and transcriptional regulation.
Affiliation(s)
- Markandeya Gorantla
- Department of Plant Sciences, School of Life Sciences, University of Hyderabad, Hyderabad-500046, AP, India
|
130
|
Malousi A, Kouidou S, Maglaveras N. Detecting over-represented motifs in alternatively spliced exons using Gibbs sampling. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2007; 2007:139-142. [PMID: 18001908 DOI: 10.1109/iembs.2007.4352242] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/25/2023]
Abstract
Alternative pre-mRNA splicing is a biological mechanism with significant prevalence in complex organisms and experimentally verified association with numerous disease-causing factors. Splicing-related proteins play a significant regulatory role during this process. In this study, we applied a stochastic analysis of alternatively spliced human genes based on Gibbs sampling, in order to identify short consensus sequences that are over-represented compared to a reference Markov model describing constitutive exons of the same genes. The analysis resulted in a set of statistically significant over-represented motifs. The biological importance of these motifs was assessed by estimating the likelihood of being identified by cis-acting elements that correspond to the binding domains of splicing enhancers/silencers. The results indicate that the identified over-represented sequences are often similar to those recognized by known regulatory splicing elements.
Affiliation(s)
- Andigoni Malousi
- Student Member, IEEE, Lab. of Medical Informatics, Faculty of Medicine, Aristotle University of Thessaloniki, 54124, P.O.Box 323, Greece,
|
131
|
Shimizu K, Adachi J, Muraoka Y. ANGLE: a sequencing errors resistant program for predicting protein coding regions in unfinished cDNA. J Bioinform Comput Biol 2006; 4:649-64. [PMID: 16960968 DOI: 10.1142/s0219720006002260] [Citation(s) in RCA: 66] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2005] [Revised: 09/19/2005] [Accepted: 11/20/2005] [Indexed: 11/18/2022]
Abstract
In the process of making full-length cDNA, predicting protein coding regions helps both in the preliminary analysis of genes and in any succeeding process. However, unfinished cDNA contains artifacts including many sequencing errors, which hinder the correct evaluation of coding sequences. In particular, predictions for short sequences are difficult because they provide little information for evaluating coding potential. In this paper, we describe ANGLE, a new program for predicting coding sequences in low quality cDNA. To achieve error-tolerant prediction, ANGLE uses a machine-learning approach, which better represents coding sequences by maximizing the use of limited information from input sequences. Our method utilizes not only codon usage, but also protein structure information, which is difficult to use in stochastic model-based algorithms, and optimizes limited information from a short segment when deciding coding potential, with the result that predictive accuracy does not depend on the length of an input sequence. The performance of ANGLE is compared with ESTSCAN on four datasets, each having a different error rate (one frame-shift error or one substitution error per 200-500 nucleotides), and on one dataset which has no errors. ANGLE outperforms ESTSCAN by 9.26% in average Matthews correlation coefficient on the short sequence dataset (< 1000 bases). On the long sequence dataset, ANGLE achieves comparable performance.
Affiliation(s)
- Kana Shimizu
- Department of Computer Science, Graduate school of Waseda University, Tokyo, 162-0044, Japan.
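The Matthews correlation coefficient used in the comparison above is computed from the four cells of a binary confusion matrix: MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). A minimal Python sketch follows; the function name is hypothetical, and returning 0.0 when the denominator is zero is a common convention assumed here, not something stated in the abstract.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) to +1 (perfect prediction).
    Returns 0.0 when any marginal total is zero (assumed convention).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

For coding-region prediction, the counts would typically be tallied per nucleotide (coding versus non-coding), although the abstract does not specify the level at which the authors counted.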
|
132
|
Knapp K, Chen YPP. An evaluation of contemporary hidden Markov model genefinders with a predicted exon taxonomy. Nucleic Acids Res 2006; 35:317-24. [PMID: 17170005 PMCID: PMC1802560 DOI: 10.1093/nar/gkl1026] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2006] [Revised: 11/13/2006] [Accepted: 11/13/2006] [Indexed: 11/15/2022] Open
Abstract
We present an independent evaluation of six recent hidden Markov model (HMM) genefinders. Each was tested on the new dataset (FSH298), the results of which showed no dramatic improvement over the genefinders tested five years ago. In addition, we introduce a comprehensive taxonomy of predicted exons and classify each resulting exon accordingly. These results are useful in measuring (with finer granularity) the effects of changes in a genefinder. We present an analysis of these results and identify four patterns of inaccuracy common in all HMM-based results.
Affiliation(s)
- Keith Knapp
- Faculty of Science and Technology, Deakin University, Australia
- Yi-Ping Phoebe Chen
- Faculty of Science and Technology, Deakin University, Australia
- Australian Research Council Centre in Bioinformatics, Australia
|
133
|
Segovia-Juarez JL, Colombano S, Kirschner D. Identifying DNA splice sites using hypernetworks with artificial molecular evolution. Biosystems 2006; 87:117-24. [PMID: 17116361 DOI: 10.1016/j.biosystems.2006.09.004] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2005] [Revised: 07/08/2006] [Accepted: 07/15/2006] [Indexed: 11/28/2022]
Abstract
Identifying DNA splice sites is a main task of gene hunting. We introduce the hypernetwork architecture as a novel method for finding DNA splice sites. The hypernetwork architecture is a biologically inspired information processing system composed of networks of molecules forming cells, and a number of cells forming a tissue or organism. Its learning is based on molecular evolution. DNA examples taken from GenBank were translated into binary strings and fed into a hypernetwork for training. We performed experiments to explore the generalization performance of hypernetwork learning in this data set by two-fold cross validation. The hypernetwork generalization performance was comparable to that of well-known classification algorithms. With the best hypernetwork obtained, including local information and heuristic rules, we built a system (HyperExon) to obtain splice site candidates. The HyperExon system outperformed leading splice recognition systems in the list of sequences tested.
Affiliation(s)
- Jose L Segovia-Juarez
- Department of Microbiology and Immunology, University of Michigan, Ann Arbor, MI, USA
|
134
|
Pareek A, Singh A, Kumar M, Kushwaha HR, Lynn AM, Singla-Pareek SL. Whole-genome analysis of Oryza sativa reveals similar architecture of two-component signaling machinery with Arabidopsis. PLANT PHYSIOLOGY 2006; 142:380-97. [PMID: 16891544 PMCID: PMC1586034 DOI: 10.1104/pp.106.086371] [Citation(s) in RCA: 91] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/11/2023]
Abstract
The two-component system (TCS), which works on the principle of histidine-aspartate phosphorelay signaling, is known to play an important role in diverse physiological processes in lower organisms and has recently emerged as an important signaling system in plants. Employing the tools of bioinformatics, we have characterized TCS signaling candidate genes in the genome of Oryza sativa L. subsp. japonica. We present a complete overview of TCS gene families in O. sativa, including gene structures, conserved motifs, chromosome locations, and phylogeny. Our analysis indicates a total of 51 genes encoding 73 putative TCS proteins. Fourteen genes encode 22 putative histidine kinases with a conserved histidine and other typical histidine kinase signature sequences, five phosphotransfer genes encode seven phosphotransfer proteins, and 32 response regulator genes encode 44 proteins. The variations seen between gene and protein numbers are assumed to result from alternative splicing. These putative proteins have high homology with TCS members that have been shown experimentally to participate in several important physiological phenomena in plants, such as ethylene and cytokinin signaling and phytochrome-mediated responses to light. We conclude that the overall architecture of the TCS machinery in O. sativa and Arabidopsis thaliana is similar, and our analysis provides insights into the conservation and divergence of this important signaling machinery in higher plants.
Affiliation(s)
- Ashwani Pareek
- Stress Physiology and Molecular Biology Laboratory, School of Life Sciences, Jawaharlal Nehru University, New Delhi 110067, India.
|
135
|
Pighetti GM, Rambeaud M. Genome conservation between the bovine and human interleukin-8 receptor complex: improper annotation of bovine interleukin-8 receptor b identified. Vet Immunol Immunopathol 2006; 114:335-40. [PMID: 16982101 DOI: 10.1016/j.vetimm.2006.08.008] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2006] [Revised: 07/28/2006] [Accepted: 08/14/2006] [Indexed: 11/29/2022]
Abstract
Interleukin (IL)-8 and its receptors, CXCR1 and CXCR2, are key regulators of inflammation. However, knowledge of these receptors at the genomic level is limited or absent in cattle. Therefore, our objective was to identify bovine orthologs of human CXCR1 and CXCR2. Alignment of bovine CXCR2 reference mRNA to the bovine genome revealed two regions of similarity on BTA2 approximately 20 kb apart and on opposite strands. Comparison with the human genome suggested the more centromeric region to be CXCR2 and the more telomeric region to be CXCR1, which contradicts the current annotation of the bovine CXCR2 reference mRNA. This observation was verified by sequencing RT-PCR products of specific regions within each predicted IL-8 receptor and comparing with human sequences using ClustalW. Further examination of coding and non-coding regions within the IL-8 receptor genome complex revealed that both bovine and canine CXCR1 and CXCR2 genes had more conserved sequences in common with the human genes than either mouse or rat, and may offer more suitable animal models for certain applications. This molecular information provides a stepping stone for greater understanding of the role each IL-8 receptor plays in inflammation and will enhance our ability to develop strategies against inflammation-based diseases.
Affiliation(s)
- Gina M Pighetti
- Department of Animal Science, 114 McCord Hall, 2640 Morgan Circle, The University of Tennessee, Knoxville, TN 37996, USA.
|
136
|
El-Mogharbel N, Wakefield M, Deakin JE, Tsend-Ayush E, Grützner F, Alsop A, Ezaz T, Marshall Graves JA. DMRT gene cluster analysis in the platypus: new insights into genomic organization and regulatory regions. Genomics 2006; 89:10-21. [PMID: 16962738 DOI: 10.1016/j.ygeno.2006.07.017] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.9] [Received: 05/25/2006] [Revised: 07/31/2006] [Accepted: 07/31/2006] [Indexed: 10/24/2022]
Abstract
We isolated and characterized a cluster of platypus DMRT genes and compared their arrangement, location, and sequence across vertebrates. The DMRT gene cluster on human 9p24.3 harbors, in order, DMRT1, DMRT3, and DMRT2, which share a DM domain. DMRT1 is highly conserved and involved in sexual development in vertebrates, and deletions in this region cause sex reversal in humans. Sequence comparisons of DMRT genes between species have been valuable in identifying exons, control regions, and conserved nongenic regions (CNGs). The addition of platypus sequences is expected to be particularly valuable, since monotremes fill a gap in the vertebrate genome coverage. We therefore isolated and fully sequenced platypus BAC clones containing DMRT3 and DMRT2 as well as DMRT1 and then generated multispecies alignments and ran prediction programs followed by experimental verification to annotate this gene cluster. We found that the three genes have 58-66% identity to their human orthologues, lie in the same order as in other vertebrates, and colocate on 1 of the 10 platypus sex chromosomes, X5. We also predict that optimal annotation of the newly sequenced platypus genome will be challenging. The analysis of platypus sequence revealed differences in structure and sequence of the DMRT gene cluster. Multispecies comparison was particularly effective for detecting CNGs, revealing several novel potential regulatory regions within DMRT3 and DMRT2 as well as DMRT1. RT-PCR indicated that platypus DMRT1 and DMRT3 are expressed specifically in the adult testis (and not ovary), but DMRT2 has a wider expression profile, as it does for other mammals. The platypus DMRT1 expression pattern, and its location on an X chromosome, suggests an involvement in monotreme sexual development.
Affiliation(s)
- Nisrine El-Mogharbel
- Comparative Genomics Group, Research School of Biological Sciences, Australian National University, P.O. Box 475, Canberra, ACT 2601, Australia.
|
137
|
Hsieh SJ, Lin CY, Liu NH, Chow WY, Tang CY. GeneAlign: a coding exon prediction tool based on phylogenetical comparisons. Nucleic Acids Res 2006; 34:W280-4. [PMID: 16845010 PMCID: PMC1538901 DOI: 10.1093/nar/gkl307] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Indexed: 11/26/2022] Open
Abstract
GeneAlign is a coding exon prediction tool for predicting protein coding genes by measuring the homologies between a sequence of a genome and related, already annotated sequences of other genomes. Identifying protein coding genes is one of the most important tasks in newly sequenced genomes. With increasing numbers of gene annotations verified by experiments, it is feasible to identify genes in newly sequenced genomes by comparing them with annotated genes of phylogenetically close organisms. GeneAlign applies CORAL, a heuristic linear time alignment tool, to determine if regions flanked by the candidate signals (initiation codon-GT, AG-GT and AG-STOP codon) are similar to annotated coding exons. Employing the conservation of gene structures and sequence homologies between protein coding regions increases the prediction accuracy. GeneAlign was tested on the Projector dataset of 491 human–mouse homologous sequence pairs. At the gene level, both the average sensitivity and the average specificity of GeneAlign are 81%, and they are larger than 96% at the exon level. The rates of missing exons and wrong exons are smaller than 1%. GeneAlign is a free tool available at .
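The exon-level figures quoted in this abstract can be illustrated with a short sketch. It assumes the definitions commonly used when evaluating gene finders (sensitivity = exactly predicted exons / annotated exons, specificity = exactly predicted exons / predicted exons, missing and wrong exons counted by absence of any overlap); the abstract does not state which definitions GeneAlign used, and the function name and coordinates below are hypothetical.

```python
def exon_level_metrics(annotated, predicted):
    """Compare two exon sets given as (start, end) pairs with inclusive coordinates.

    Assumed definitions (not taken from the abstract):
      sensitivity       = exact matches / annotated exons
      specificity       = exact matches / predicted exons
      missing_exon_rate = annotated exons overlapping no prediction / annotated exons
      wrong_exon_rate   = predicted exons overlapping no annotation / predicted exons
    """
    annotated = set(annotated)
    predicted = set(predicted)
    if not annotated or not predicted:
        raise ValueError("both exon sets must be non-empty")

    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    exact = annotated & predicted
    missing = [a for a in annotated if not any(overlaps(a, p) for p in predicted)]
    wrong = [p for p in predicted if not any(overlaps(p, a) for a in annotated)]
    return {
        "sensitivity": len(exact) / len(annotated),
        "specificity": len(exact) / len(predicted),
        "missing_exon_rate": len(missing) / len(annotated),
        "wrong_exon_rate": len(wrong) / len(predicted),
    }


# Hypothetical toy gene: two exons predicted exactly, one with a shifted
# boundary, one annotated exon missed and one spurious prediction.
toy_annotated = [(1, 100), (201, 300), (401, 500), (601, 700)]
toy_predicted = [(1, 100), (201, 300), (401, 510), (801, 900)]
toy_metrics = exon_level_metrics(toy_annotated, toy_predicted)
```

Under these definitions a prediction with a shifted boundary lowers both sensitivity and specificity without counting as a missing or wrong exon, which is why the two pairs of rates are reported separately.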
Affiliation(s)
- Shu Ju Hsieh
- Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan 300, ROC
- Chun Yuan Lin
- Institute of Molecular and Cellular Biology and Department of Life Science, National Tsing Hua University, Hsinchu, Taiwan 300, ROC
- Ning Han Liu
- Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan 300, ROC
- Wei Yuan Chow
- Institute of Molecular and Cellular Biology and Department of Life Science, National Tsing Hua University, Hsinchu, Taiwan 300, ROC
- Chuan Yi Tang
- Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan 300, ROC
- To whom correspondence should be addressed. Tel: 886 3 5731077; Fax: 886 3 5723694
|
138
|
Marashi SA, Eslahchi C, Pezeshk H, Sadeghi M. Impact of RNA structure on the prediction of donor and acceptor splice sites. BMC Bioinformatics 2006; 7:297. [PMID: 16772025 PMCID: PMC1526458 DOI: 10.1186/1471-2105-7-297] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Received: 01/10/2006] [Accepted: 06/13/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Gene identification in genomic DNA sequences by computational methods has become an important task in bioinformatics, and computational gene prediction tools are now essential components of every genome sequencing project. Prediction of splice sites is a key step of all gene structural prediction algorithms. RESULTS We investigated the role of mRNA secondary structures and their information content for five vertebrate and plant splice site datasets. We selected 900-nucleotide sequences centered at each (real or decoy) donor and acceptor site, and predicted their corresponding RNA structures with the Vienna software. Then, based on whether each nucleotide is in a stem or not, the conventional four-letter nucleotide alphabet was translated into an eight-letter alphabet. Zero-, first- and second-order Markov models were selected as the signal detection methods. It is shown that applying the eight-letter alphabet instead of the four-letter alphabet considerably increases the accuracy of both donor and acceptor site predictions in the case of higher-order Markov models. CONCLUSION Our results imply that RNA structure contains important data and future gene prediction programs can take advantage of such information.
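The alphabet translation and Markov scoring described in this abstract can be sketched as follows. The specific encoding (upper case for stem positions, lower case otherwise), the dot-bracket input, the pseudocount, and the use of a single position-independent first-order model are all assumptions made for illustration; the abstract does not give these details.

```python
import math
from collections import Counter

EIGHT_LETTERS = "ACGUacgu"  # assumed encoding: upper case = in a stem, lower case = not


def to_eight_letter(seq, structure):
    """Merge a nucleotide string and a dot-bracket structure into one eight-letter string."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    bases = seq.upper().replace("T", "U")
    return "".join(b if mark in "()" else b.lower() for b, mark in zip(bases, structure))


def train_first_order(sequences, pseudocount=1.0):
    """Estimate log transition probabilities of a first-order Markov model."""
    counts = Counter()
    for s in sequences:
        for a, b in zip(s, s[1:]):
            counts[(a, b)] += 1
    model = {}
    for a in EIGHT_LETTERS:
        total = sum(counts[(a, b)] for b in EIGHT_LETTERS) + pseudocount * len(EIGHT_LETTERS)
        for b in EIGHT_LETTERS:
            model[(a, b)] = math.log((counts[(a, b)] + pseudocount) / total)
    return model


def log_likelihood(s, model):
    """Sum of log transition probabilities along an eight-letter string."""
    return sum(model[(a, b)] for a, b in zip(s, s[1:]))


# Hypothetical hairpin: three paired bases, a three-base loop, three paired bases.
encoded = to_eight_letter("GGGAAACCC", "(((...)))")
toy_model = train_first_order([encoded])
```

In a site-versus-decoy setting one such model would typically be trained on real sites and another on decoys, with the difference of the two log-likelihoods used as the score; that decision rule is likewise an assumption here.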
Affiliation(s)
- Sayed-Amir Marashi
- Department of Biotechnology, University College of Science, University of Tehran, Tehran, Iran
- Changiz Eslahchi
- Faculty of Mathematics, Shahid-Beheshti University, Tehran, Iran
- Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
- Hamid Pezeshk
- Center of Excellence in Biomathematics, School of Mathematics, Statistics and Computer Sciences, University College of Science, University of Tehran, Tehran, Iran
- Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
- Mehdi Sadeghi
- National Institute for Genetic Engineering and Biotechnology, Tehran-Karaj Highway, Tehran, Iran
- Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
|
139
|
Alonso JM, Ecker JR. Moving forward in reverse: genetic technologies to enable genome-wide phenomic screens in Arabidopsis. Nat Rev Genet 2006; 7:524-36. [PMID: 16755288 DOI: 10.1038/nrg1893] [Citation(s) in RCA: 186] [Impact Index Per Article: 10.3] [Indexed: 11/08/2022]
Abstract
Genome sequencing, in combination with various computational and empirical approaches to sequence annotation, has made possible the identification of more than 30,000 genes in Arabidopsis thaliana. Increasingly sophisticated genetic tools are being developed with the long-term goal of understanding how the coordinated activity of these genes gives rise to a complex organism. The combination of classical forward genetics with recently developed genome-wide, gene-indexed mutant collections is beginning to revolutionize the way in which gene functions are studied in plants. High-throughput screens using these mutant populations should provide a means to analyse plant gene functions--the phenome--on a genomic scale.
Affiliation(s)
- Jose M Alonso
- North Carolina State University, Department of Genetics, Raleigh, North Carolina 27695-7614, USA.
|
140
|
Mazzarelli JM, White P, Gorski R, Brestelli J, Pinney DF, Arsenlis A, Katokhin A, Belova O, Bogdanova V, Elisafenko E, Gubina M, Nizolenko L, Perelman P, Puzakov M, Shilov A, Trifonoff V, Vorobjeva N, Kolchanov N, Kaestner KH, Stoeckert CJ. Novel genes identified by manual annotation and microarray expression analysis in the pancreas. Genomics 2006; 88:752-761. [PMID: 16725306 DOI: 10.1016/j.ygeno.2006.04.005] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 04/12/2006] [Accepted: 04/14/2006] [Indexed: 10/24/2022]
Abstract
The mouse PancChip, a microarray developed for studying endocrine pancreatic development and diabetes, represents over 13,000 cDNAs. After computationally assigning the cDNAs on the array to known genes, manual curation of the remaining sequences identified 211 novel transcripts. In microarray experiments, we found that 196 of these transcripts were expressed in total pancreas and/or pancreatic islets. Of 50 randomly selected clones from these 196 transcripts, 92% were confirmed as expressed by qRT-PCR. We evaluated the coding potential of the novel transcripts and found that 74% of the clones had low coding potential. Since the transcripts may be partial mRNAs, we examined their translated proteins for transmembrane or signal peptide domains and found that about 40 proteins had one of these predicted domains. Interestingly, when we investigated the novel transcripts for their overlap with noncoding microRNAs, we found that 1 of the novel transcripts overlapped a known microRNA gene.
Affiliation(s)
- Joan M Mazzarelli
- Center for Bioinformatics, School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Genetics, School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA.
- Peter White
- Department of Genetics, School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Regina Gorski
- Center for Bioinformatics, School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Genetics, School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- John Brestelli
- Center for Bioinformatics, School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Genetics, School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Deborah F Pinney
- Center for Bioinformatics, School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Genetics, School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Athanasios Arsenlis
- Department of Genetics, School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Alexey Katokhin
- Institute of Cytology and Genetics SB RAS, Novosibirsk, Russia
- Olga Belova
- Institute of Cytology and Genetics SB RAS, Novosibirsk, Russia
- Vera Bogdanova
- Institute of Cytology and Genetics SB RAS, Novosibirsk, Russia
- Marina Gubina
- Institute of Cytology and Genetics SB RAS, Novosibirsk, Russia
- Lilia Nizolenko
- Institute of Cytology and Genetics SB RAS, Novosibirsk, Russia
- Polina Perelman
- Institute of Cytology and Genetics SB RAS, Novosibirsk, Russia
- Mikhail Puzakov
- Institute of Cytology and Genetics SB RAS, Novosibirsk, Russia
- Klaus H Kaestner
- Department of Genetics, School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Christian J Stoeckert
- Center for Bioinformatics, School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Genetics, School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
|
141
|
Dana AN, Hillenmeyer ME, Lobo NF, Kern MK, Romans PA, Collins FH. Differential gene expression in abdomens of the malaria vector mosquito, Anopheles gambiae, after sugar feeding, blood feeding and Plasmodium berghei infection. BMC Genomics 2006; 7:119. [PMID: 16712725 PMCID: PMC1508153 DOI: 10.1186/1471-2164-7-119] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.4] [Received: 10/17/2005] [Accepted: 05/19/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Large scale sequencing of cDNA libraries can provide profiles of genes expressed in an organism under defined biological and environmental circumstances. We have analyzed sequences of 4541 Expressed Sequence Tags (ESTs) from 3 different cDNA libraries created from abdomens from Plasmodium infection-susceptible adult female Anopheles gambiae. These libraries were made from sugar fed (S), rat blood fed (RB), and P. berghei-infected (IRB) mosquitoes at 30 hours after the blood meal, when most parasites would be transforming ookinetes or very early oocysts. RESULTS The S, RB and IRB libraries contained 1727, 1145 and 1669 high quality ESTs, respectively, averaging 455 nucleotides (nt) in length. They assembled into 1975 consensus sequences--567 contigs and 1408 singletons. Functional annotation was performed to assign probable molecular functions to the gene products and to identify the biological processes in which they function. Genes represented at high frequency in one or more of the libraries were subjected to digital Northern analysis, and the expression results for 5 of them were verified by qRT-PCR. CONCLUSION 13% of the 1965 ESTs showing identity to the A. gambiae genome sequence represent novel genes. These, together with untranslated regions (UTR) present on many of the ESTs, will inform further genome annotation. We have identified 23 genes encoding products likely to be involved in regulating the cellular oxidative environment and 25 insect immunity genes. We also identified 25 genes as being up or down regulated following blood feeding and/or feeding with P. berghei infected blood relative to their expression levels in sugar fed females.
Affiliation(s)
- Ali N Dana
- Center for Tropical Disease Research and Training, Department of Biological Sciences, University of Notre Dame, Notre Dame, IN 46556, USA
- Neil F Lobo
- Center for Tropical Disease Research and Training, Department of Biological Sciences, University of Notre Dame, Notre Dame, IN 46556, USA
- Marcia K Kern
- Center for Tropical Disease Research and Training, Department of Biological Sciences, University of Notre Dame, Notre Dame, IN 46556, USA
- Patricia A Romans
- Department of Zoology, University of Toronto, Toronto, ON M5S 3G5, Canada
- Frank H Collins
- Center for Tropical Disease Research and Training, Department of Biological Sciences, University of Notre Dame, Notre Dame, IN 46556, USA
|
142
|
Ge B, Gurd S, Gaudin T, Dore C, Lepage P, Harmsen E, Hudson TJ, Pastinen T. Survey of allelic expression using EST mining. Genome Res 2006; 15:1584-91. [PMID: 16251468 PMCID: PMC1310646 DOI: 10.1101/gr.4023805] [Citation(s) in RCA: 98] [Impact Index Per Article: 5.4] [Indexed: 11/25/2022]
Abstract
Cis-acting allelic variation in gene regulation is a source of phenotypic variation. Consequently, recent studies have experimentally screened human genes in an attempt to initiate a catalog of genes possessing cis-acting variants. In this study, we use human EST data in dbEST as the source of allelic expression data, and the HapMap database to provide expected allele frequencies in human populations. We demonstrate a greater concordance of allele frequencies estimated from human ESTs in dbEST with those derived from the CEPH HapMap sample representing Caucasians from northern and western Europe, than population samples obtained in Asia and Africa. Deviations between allele frequencies observed in EST databases and the ones obtained from the CEPH HapMap samples may result from common heritable cis-acting variants altering the relative allele distribution in RNA. We provide in silico as well as experimental evidence that this strategy does allow significant enrichment of genes harboring common heritable cis-acting polymorphisms in linkage disequilibrium with expressed alleles.
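One way to flag a deviation between EST-derived allele counts and an expected population allele frequency is an exact binomial test, sketched below. The abstract does not say which statistic the authors used, so the test itself, the function names, and the counts in the example are assumptions for illustration only.

```python
from math import comb


def binomial_two_sided_p(k, n, p):
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than observing k successes in n trials."""
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # small relative tolerance so that ties are not lost to rounding
    return min(1.0, sum(x for x in pmf if x <= observed * (1 + 1e-9)))


def est_allele_deviation(ref_est_count, alt_est_count, expected_ref_freq):
    """P-value for EST allele counts against an expected reference allele frequency."""
    n = ref_est_count + alt_est_count
    return binomial_two_sided_p(ref_est_count, n, expected_ref_freq)


# Hypothetical SNP: 18 ESTs carry the reference allele and 2 the alternative,
# while the population sample suggests a reference allele frequency of 0.5.
toy_p_value = est_allele_deviation(18, 2, 0.5)
```

A small p-value here only marks a candidate; as the abstract notes, such deviations may result from cis-acting variants, and the authors report experimental follow-up.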
Affiliation(s)
- Bing Ge
- McGill University and Genome Quebec Innovation Centre, Montreal, Quebec H3A 1A, Canada
|
143
|
Horellou MH, Chevreaud C, Mathieux V, Conard J, de Mazancourt P. Fibrinogen Paris IX: a case of symptomatic hypofibrinogenemia with Bbeta Y236C and Bbeta IVS7-1G-->C mutations. J Thromb Haemost 2006; 4:1134-6. [PMID: 16689768 DOI: 10.1111/j.1538-7836.2006.01881.x] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/30/2022]
|
144
|
Dutta S, Singhal P, Agrawal P, Tomer R, Kritee K, Khurana E, Jayaram B. A physicochemical model for analyzing DNA sequences. J Chem Inf Model 2006; 46:78-85. [PMID: 16426042 DOI: 10.1021/ci050119x] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Indexed: 11/30/2022]
Abstract
In search of an ab initio model to characterize DNA sequences as genes and nongenes, we examined some physicochemical properties of each trinucleotide (codon), which could accomplish this task. We constructed three-dimensional vectors for each double-helical trinucleotide sequence considering hydrogen-bonding energy, stacking energy, and a third parameter, which we provisionally identified with DNA-protein interactions. As this three-dimensional vector moves along any genome, the net orientation of the resultant vector should differ significantly for gene and nongene regions to make a distinction feasible, if the underlying model has some merits. An analysis of 331 prokaryotic genomes comprising a total of 294 786 experimentally verified genes (nonoverlapping) and an equal number of nongenes presents a proof of concept of the model without the need for further parametrization. Also, initial analyses on Saccharomyces cerevisiae and Arabidopsis thaliana suggest that the methodology is extendable to eukaryotes. The physicochemical model (ChemGenome1.0) introduced has the potential to be developed into a gene-finding algorithm and, more pressingly, could be employed for an independent assessment of the annotation of DNA sequences.
Affiliation(s)
- Samrat Dutta
- Department of Chemistry and Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology, Hauz Khas, New Delhi
|
145
|
Agrawal R, Stormo GD. Using mRNAs lengths to accurately predict the alternatively spliced gene products in Caenorhabditis elegans. Bioinformatics 2006; 22:1239-44. [PMID: 16595562 DOI: 10.1093/bioinformatics/btl076] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Computational gene prediction methods are an important component of whole genome analyses. While ab initio gene finders have demonstrated major improvements in accuracy, the most reliable methods are evidence-based gene predictors. These algorithms can rely on several different sources of evidence including predictions from multiple ab initio gene finders, matches to known proteins, sequence conservation and partial cDNAs to predict the final product. Despite the success of these algorithms, prediction of complete gene structures, especially for alternatively spliced products, remains a difficult task. RESULTS LOCUS (Length Optimized Characterization of Unknown Spliceforms) is a new evidence-based gene finding algorithm which integrates a length-constraint into a dynamic programming-based framework for prediction of gene products. On a Caenorhabditis elegans test set of alternatively spliced internal exons, its performance exceeds that of current ab initio gene finders and in most cases can accurately predict the correct form of all the alternative products. As the length information used by the algorithm can be obtained in a high-throughput fashion, we propose that integration of such information into a gene-prediction pipeline is feasible and doing so may improve our ability to fully characterize the complete set of mRNAs for a genome. AVAILABILITY LOCUS is available from http://ural.wustl.edu/software.html
Affiliation(s)
- Ritesh Agrawal
- Department of Genetics, Washington University School of Medicine 660 S. Euclid, Campus Box 8232, St. Louis, MO 63110, USA
|
146
|
Andreini C, Banci L, Bertini I, Rosato A. Counting the zinc-proteins encoded in the human genome. J Proteome Res 2006; 5:196-201. [PMID: 16396512 DOI: 10.1021/pr050361j] [Citation(s) in RCA: 702] [Impact Index Per Article: 39.0] [Indexed: 01/20/2023]
Abstract
Metalloproteins are proteins capable of binding one or more metal ions, which may be required for their biological function, or for regulation of their activities or for structural purposes. Genome sequencing projects have provided a huge number of protein primary sequences, but, even though several different elaborate analyses and annotations have been enabled by a rich and ever-increasing portfolio of bioinformatic tools, metal-binding properties remain difficult to predict as well as to investigate experimentally. Consequently, the present knowledge about metalloproteins is only partial. The present bioinformatic research proposes a strategy to answer the question of how many and which proteins encoded in the human genome may require zinc for their physiological function. This is achieved by a combination of approaches, which include: (i) searching in the proteome for the zinc-binding patterns that, in turn, are obtained from all available X-ray data; (ii) using libraries of metal-binding protein domains based on multiple sequence alignments of known metalloproteins obtained from the Pfam database; and (iii) mining the annotations of human gene sequences, which are based on any type of information available. It is found that 1684 proteins in the human proteome are independently identified by all three approaches as zinc-proteins, 746 are identified by two, and 777 are identified by only one method. By assuming that all proteins identified by at least two approaches are truly zinc-binding and inspecting the proteins identified by a single method, it can be proposed that ca. 2800 human proteins are potentially zinc-binding in vivo, corresponding to 10% of the human proteome, with an uncertainty of 400 sequences. Available functional information suggests that the large majority of human zinc-binding proteins are involved in the regulation of gene expression. The most abundant class of zinc-binding proteins in humans is that of zinc-fingers, with Cys4 and Cys2His2 being the most common types of coordination environment.
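The tally by number of supporting methods can be illustrated with a small sketch. The protein identifiers and the function name are hypothetical; only the three counts (1684, 746, 777) come from the abstract, and the arithmetic on them is a plain restatement, not the authors' procedure for inspecting single-method hits.

```python
from collections import Counter


def tally_support(*method_hits):
    """Count, for each protein, how many methods report it, and summarise
    how many proteins are supported by exactly 1, 2, 3, ... methods."""
    support = Counter()
    for hits in method_hits:
        for protein in set(hits):
            support[protein] += 1
    by_level = Counter(support.values())
    return support, by_level


# Hypothetical identifiers standing in for pattern search, Pfam domains
# and annotation mining.
pattern_hits = {"P1", "P2", "P3"}
domain_hits = {"P1", "P2"}
annotation_hits = {"P1", "P4"}
toy_support, toy_levels = tally_support(pattern_hits, domain_hits, annotation_hits)

# Counts reported in the abstract, keyed by number of supporting methods.
REPORTED = {3: 1684, 2: 746, 1: 777}
at_least_two = REPORTED[3] + REPORTED[2]  # 2430 proteins supported by two or more methods
all_candidates = sum(REPORTED.values())   # 3207 proteins flagged by any method
```

The reported counts give 2430 proteins supported by at least two methods out of 3207 candidates. The abstract's estimate of ca. 2800 therefore implies that a portion of the 777 single-method hits was retained after inspection.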
Affiliation(s)
- Claudia Andreini
- Magnetic Resonance Center (CERM), University of Florence, Via L. Sacconi 6, 50019 Sesto Fiorentino, Italy
|
147
|
Shafer P, Lin DM, Yona G. EST2Prot: mapping EST sequences to proteins. BMC Genomics 2006; 7:41. [PMID: 16515706 PMCID: PMC1456965 DOI: 10.1186/1471-2164-7-41] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 09/26/2005] [Accepted: 03/04/2006] [Indexed: 11/12/2022] Open
Abstract
Background EST libraries are used in various biological studies, from microarray experiments to proteomic and genetic screens. These libraries usually contain many uncharacterized ESTs that are typically ignored since they cannot be mapped to known genes. Consequently, new discoveries are possibly overlooked. Results We describe a system (EST2Prot) that uses multiple elements to map EST sequences to their corresponding protein products. EST2Prot uses UniGene clusters, substring analysis, information about protein coding regions in existing DNA sequences and protein database searches to detect protein products related to a query EST sequence. Gene Ontology terms, Swiss-Prot keywords, and protein similarity data are used to map the ESTs to functional descriptors. Conclusion EST2Prot extends and significantly enriches the popular UniGene mapping by utilizing multiple relations between known biological entities. It produces a mapping between ESTs and proteins in real-time through a simple web-interface. The system is part of the Biozon database and is accessible at .
Affiliation(s)
- Paul Shafer
- Department of Computer Science, Cornell University, Ithaca, NY, USA
- David M Lin
- Department of Biomedical Sciences, Cornell University, Ithaca, NY, USA
- Golan Yona
- Department of Computer Science, Cornell University, Ithaca, NY, USA
|
148
|
Larrañaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, Lozano JA, Armañanzas R, Santafé G, Pérez A, Robles V. Machine learning in bioinformatics. Brief Bioinform 2006; 7:86-112. [PMID: 16761367 DOI: 10.1093/bib/bbk007] [Citation(s) in RCA: 360] [Impact Index Per Article: 20.0] [Indexed: 11/13/2022] Open
Abstract
This article reviews machine learning methods for bioinformatics. It presents modelling methods, such as supervised classification, clustering and probabilistic graphical models for knowledge discovery, as well as deterministic and stochastic heuristics for optimization. Applications in genomics, proteomics, systems biology, evolution and text mining are also shown.
Affiliation(s)
- Pedro Larrañaga
- Intelligent Systems Group, Department of Computer Science and Artificial Intelligence, University of the Basque Country, Paseo Manuel de Lardizabal, 1, 20018 San Sebastian, Spain.
|
149
|
Ko P, Narayanan M, Kalyanaraman A, Aluru S. Space-conserving optimal DNA-protein alignment. Proceedings. IEEE Computational Systems Bioinformatics Conference 2006:80-8. [PMID: 16448002 DOI: 10.1109/csb.2004.1332420] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022]
Abstract
DNA-protein alignment algorithms can be used to discover coding sequences in a genomic sequence, if the corresponding protein derivatives are known. They can also be used to identify potential coding sequences of a newly sequenced genome, by using proteins from related species. Previously known algorithms either solve a simplified formulation, or sacrifice optimality to achieve practical implementation. In this paper, we present a comprehensive formulation of the DNA-protein alignment problem, and an algorithm to compute the optimal alignment in O(mn) time using only four tables of size (m + 1) x (n + 1), where m and n are the lengths of the DNA and protein sequences, respectively. We also developed a Protein and DNA Alignment program PanDA that implements the proposed solution. Experimental results indicate that our algorithm produces high quality alignments.
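A much simplified DNA-protein alignment can be written as a single dynamic-programming table of size (m + 1) x (n + 1). The sketch below is not the PanDA formulation (the abstract mentions four tables and a comprehensive problem statement that is not reproduced here); the scoring values, the frameshift move, and the absence of intron handling are all assumptions chosen to keep the example short.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def align_dna_protein(dna, protein, match=2, mismatch=-1, gap=-2, frameshift=-4):
    """Global alignment score of a DNA string against a protein string.

    Allowed moves (a hypothetical scoring scheme):
      codon vs amino acid   (3 nt, 1 aa): match or mismatch
      codon vs gap          (3 nt, 0 aa): gap
      gap vs amino acid     (0 nt, 1 aa): gap
      single skipped base   (1 nt, 0 aa): frameshift
    """
    m, n = len(dna), len(protein)
    neg_inf = float("-inf")
    score = [[neg_inf] * (n + 1) for _ in range(m + 1)]
    score[0][0] = 0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i >= 3 and j >= 1:
                aa = CODON.get(dna[i - 3:i], "X")  # unknown codons never match
                step = match if aa == protein[j - 1] else mismatch
                best = max(best, score[i - 3][j - 1] + step)
            if i >= 3:
                best = max(best, score[i - 3][j] + gap)
            if j >= 1:
                best = max(best, score[i][j - 1] + gap)
            if i >= 1:
                best = max(best, score[i - 1][j] + frameshift)
            score[i][j] = best
    return score[m][n]


# Hypothetical example: ATG GCC AAA encodes M, A, K.
toy_score = align_dna_protein("ATGGCCAAA", "MAK")
```

Each cell is filled in constant time, so the running time is O(mn) as in the abstract, although this sketch keeps the whole table in memory and returns only the score.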
Affiliation(s)
- Pang Ko
- Department of Electrical and Computer Engineering, Iowa State University, USA.
|
150
|
Marashi SA, Goodarzi H, Sadeghi M, Eslahchi C, Pezeshk H. Importance of RNA secondary structure information for yeast donor and acceptor splice site predictions by neural networks. Comput Biol Chem 2005; 30:50-7. [PMID: 16386465 DOI: 10.1016/j.compbiolchem.2005.10.009] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 09/05/2005] [Revised: 10/19/2005] [Accepted: 10/19/2005] [Indexed: 10/25/2022]
Abstract
Previously, Patterson et al. showed that mRNA structure information aids splice site prediction in human genes [Patterson, D.J., Yasuhara, K., Ruzzo, W.L., 2002. Pre-mRNA secondary structure prediction aids splice site prediction. Pac. Symp. Biocomput. 7, 223-234]. Here, we have attempted to predict splice sites in selected genes of Saccharomyces cerevisiae using the information obtained from the secondary structures of the corresponding mRNAs. From the Ares database, 154 genes were selected and their structures were predicted by Mfold. We selected a 20-nucleotide window around each site, each containing 4 nucleotides in the exon region. Based on whether each nucleotide is in a stem or not, the conventional four-letter nucleotide alphabet was translated into an eight-letter alphabet. Two different three-layer-based perceptron neural networks were devised to predict the 5' and 3' splice sites. In the case of 5' site determination, a network with 3 neurons in the hidden layer was chosen, while in the case of the 3' site 20 neurons acted more efficiently. Both neural nets were trained with the Levenberg-Marquardt backpropagation method, using half of the available genes as training inputs and the other half for testing and cross-validation. Non-site GU and AG sequences were used as negative controls. The correlation coefficients in the predictions of 5' and 3' splice sites using the eight-letter alphabet were 98.0% and 69.6%, respectively, while these values were 89.3% and 57.1% when the four-letter alphabet was applied. Our results suggest that considering the secondary structure of mRNA molecules positively affects both donor and acceptor site predictions by increasing the capacity of neural networks in learning the patterns.
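The window selection described in this abstract (20 nucleotides per site, 4 of them exonic) can be sketched as follows. The coordinate convention, the function names, and the one-hot input encoding are assumptions; the abstract does not say how the windows were presented to the networks.

```python
def splice_window(seq, boundary, kind, size=20, exon_nt=4):
    """Cut a fixed-size window around a splice site.

    boundary (assumed convention): index of the first intron base for a donor
    (5') site, index of the first exon base after the intron for an acceptor
    (3') site. The window always contains exon_nt exonic positions.
    """
    if kind == "donor":
        start = boundary - exon_nt           # exon bases come first
    elif kind == "acceptor":
        start = boundary - (size - exon_nt)  # intron bases come first
    else:
        raise ValueError("kind must be 'donor' or 'acceptor'")
    end = start + size
    if start < 0 or end > len(seq):
        raise ValueError("window falls outside the sequence")
    return seq[start:end]


def one_hot(window, alphabet="ACGUacgu"):
    """Flatten a window into a 0/1 vector, one block of len(alphabet) per position."""
    vec = []
    for ch in window:
        vec.extend(1.0 if ch == a else 0.0 for a in alphabet)
    return vec


# Hypothetical pre-mRNA: 10-nt exon, intron GU...AG of 34 nt, 10-nt exon.
toy_seq = "A" * 10 + "GU" + "C" * 30 + "AG" + "G" * 10
donor_window = splice_window(toy_seq, 10, "donor")
acceptor_window = splice_window(toy_seq, 44, "acceptor")
```

With an eight-letter alphabet each 20-nucleotide window becomes a 160-element input vector under this encoding, compared with 80 elements for the four-letter alphabet.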
Affiliation(s)
- Sayed-Amir Marashi
- Department of Biotechnology, Faculty of Science, University of Tehran, Tehran, Iran
|