1
|
Baichoo S, Ouzounis CA. Computational complexity of algorithms for sequence comparison, short-read assembly and genome alignment. Biosystems 2017; 156-157:72-85. [PMID: 28392341 DOI: 10.1016/j.biosystems.2017.03.003] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/07/2017] [Revised: 03/21/2017] [Accepted: 03/22/2017] [Indexed: 12/12/2022]
Abstract
A multitude of algorithms for sequence comparison, short-read assembly and whole-genome alignment have been developed in the general context of molecular biology, to support technology development for high-throughput sequencing, numerous applications in genome biology and fundamental research on comparative genomics. The computational complexity of these algorithms has been previously reported in original research papers, yet this often neglected property has not been reviewed previously in a systematic manner and for a wider audience. We provide a review of space and time complexity of key sequence analysis algorithms and highlight their properties in a comprehensive manner, in order to identify potential opportunities for further research in algorithm or data structure optimization. The complexity aspect is poised to become pivotal as we will be facing challenges related to the continuous increase of genomic data on unprecedented scales and complexity in the foreseeable future, when robust biological simulation at the cell level and above becomes a reality.
Collapse
Affiliation(s)
- Shakuntala Baichoo
- Department of Computer Science & Engineering, University of Mauritius, Réduit 80837, Mauritius.
| | - Christos A Ouzounis
- Biological Computation & Process Laboratory, Chemical Process & Energy Resources Institute, Centre for Research & Technology Hellas, Thessalonica 57001, Greece.
| |
Collapse
|
2
|
|
3
|
|
4
|
Mukherjee S, Mitra S. HIDDEN MARKOV MODELS, GRAMMARS, AND BIOLOGY: A TUTORIAL. J Bioinform Comput Biol 2011; 3:491-526. [PMID: 15852517 DOI: 10.1142/s0219720005001077] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2004] [Revised: 01/05/2004] [Accepted: 01/06/2005] [Indexed: 11/18/2022]
Abstract
Biological sequences and structures have been modelled using various machine learning techniques and abstract mathematical concepts. This article surveys methods using Hidden Markov Model and functional grammars for this purpose. We provide a formal introduction to Hidden Markov Model and grammars, stressing on a comprehensive mathematical description of the methods and their natural continuity. The basic algorithms and their application to analyzing biological sequences and modelling structures of bio-molecules like proteins and nucleic acids are discussed. A comparison of the different approaches is discussed, and possible areas of work and problems are highlighted. Related databases and softwares, available on the internet, are also mentioned.
Collapse
Affiliation(s)
- Shibaji Mukherjee
- Association for Studies in Computational Biology, Kolkata 700 018, India.
| | | |
Collapse
|
5
|
Pérez AJ, Rodríguez A, Trelles O, Thode G. A computational strategy for protein function assignment which addresses the multidomain problem. Comp Funct Genomics 2010; 3:423-40. [PMID: 18629055 PMCID: PMC2447339 DOI: 10.1002/cfg.208] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2002] [Accepted: 08/12/2002] [Indexed: 11/25/2022] Open
Abstract
A method for assigning functions to unknown sequences based on finding correlations between short signals and functional annotations in a protein database is presented.
This approach is based on keyword (KW) and feature (FT) information stored in
the SWISS-PROT database. The former refers to particular protein characteristics
and the latter locates these characteristics at a specific sequence position. In this way,
a certain keyword is only assigned to a sequence if sequence similarity is found in
the position described by the FT field. Exhaustive tests performed over sequences
with homologues (cluster set) and without homologues (singleton set) in the database
show that assigning functions is much ’cleaner’ when information about domains (FT
field) is used, than when only the keywords are used.
Collapse
Affiliation(s)
- A J Pérez
- Genetics Department, University of Málaga, Málaga 29071, Spain.
| | | | | | | |
Collapse
|
6
|
Ferreira PG, Azevedo PJ. Evaluating deterministic motif significance measures in protein databases. Algorithms Mol Biol 2007; 2:16. [PMID: 18157916 PMCID: PMC2254621 DOI: 10.1186/1748-7188-2-16] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2007] [Accepted: 12/24/2007] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Assessing the outcome of motif mining algorithms is an essential task, as the number of reported motifs can be very large. Significance measures play a central role in automatically ranking those motifs, and therefore alleviating the analysis work. Spotting the most interesting and relevant motifs is then dependent on the choice of the right measures. The combined use of several measures may provide more robust results. However caution has to be taken in order to avoid spurious evaluations. RESULTS From the set of conducted experiments, it was verified that several of the selected significance measures show a very similar behavior in a wide range of situations therefore providing redundant information. Some measures have proved to be more appropriate to rank highly conserved motifs, while others are more appropriate for weakly conserved ones. Support appears as a very important feature to be considered for correct motif ranking. We observed that not all the measures are suitable for situations with poorly balanced class information, like for instance, when positive data is significantly less than negative data. Finally, a visualization scheme was proposed that, when several measures are applied, enables an easy identification of high scoring motifs. CONCLUSION In this work we have surveyed and categorized 14 significance measures for pattern evaluation. Their ability to rank three types of deterministic motifs was evaluated. Measures were applied in different testing conditions, where relations were identified. This study provides some pertinent insights on the choice of the right set of significance measures for the evaluation of deterministic motifs extracted from protein databases.
Collapse
Affiliation(s)
- Pedro Gabriel Ferreira
- Department of Informatics, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal
| | - Paulo J Azevedo
- Department of Informatics, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal
| |
Collapse
|
7
|
Elloumi F, Nason M. SEARCHPATTOOL: a new method for mining the most specific frequent patterns for binding sites with application to prokaryotic DNA sequences. BMC Bioinformatics 2007; 8:354. [PMID: 17883842 PMCID: PMC2082047 DOI: 10.1186/1471-2105-8-354] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/26/2007] [Accepted: 09/20/2007] [Indexed: 11/18/2022] Open
Abstract
Background Computational methods to predict transcription factor binding sites (TFBS) based on exhaustive algorithms are guaranteed to find the best patterns but are often limited to short ones or impose some constraints on the pattern type. Many patterns for binding sites in prokaryotic species are not well characterized but are known to be large, between 16–30 base pairs (bp) and contain at least 2 conserved bases. The length of prokaryotic species promoters (about 400 bp) and our interest in studying a small set of genes that could be a cluster of co-regulated genes from microarray experiments led to the development of a new exhaustive algorithm targeting these large patterns. Results We present Searchpattool, a new method to search for and select the most specific (conservative) frequent patterns. This method does not impose restrictions on the density or the structure of the pattern. The best patterns (motifs) are selected using several statistics, including a new application of a z-score based on the number of matching sequences. We compared Searchpattool against other well known algorithms on a Bacillus subtilis group of 14 input sequences and found that in our experiments Searchpattool always performed the best based on performance scores. Conclusion Searchpattool is a new method for pattern discovery relative to transcription factor binding sites for species or genes with short promoters. It outputs the most specific significant patterns and helps the biologist to choose the best candidates.
Collapse
Affiliation(s)
- Fathi Elloumi
- Research Technology Branch, National Institute of Allergy and Infectious Diseases, National Institutes of Health, 9000 Rockville Pike, Blg50/R5505, Bethesda, MD 20892, USA
| | - Martha Nason
- Biostatistics Research Branch, Office of Clinical Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, 6700B Rockledge Dr. MSC 7609, Bethesda, MD 20892-7609, USA
| |
Collapse
|
8
|
Davey NE, Shields DC, Edwards RJ. SLiMDisc: short, linear motif discovery, correcting for common evolutionary descent. Nucleic Acids Res 2006; 34:3546-54. [PMID: 16855291 PMCID: PMC1524906 DOI: 10.1093/nar/gkl486] [Citation(s) in RCA: 84] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/31/2023] Open
Abstract
Many important interactions of proteins are facilitated by short, linear motifs (SLiMs) within a protein's primary sequence. Our aim was to establish robust methods for discovering putative functional motifs. The strongest evidence for such motifs is obtained when the same motifs occur in unrelated proteins, evolving by convergence. In practise, searches for such motifs are often swamped by motifs shared in related proteins that are identical by descent. Prediction of motifs among sets of biologically related proteins, including those both with and without detectable similarity, were made using the TEIRESIAS algorithm. The number of motif occurrences arising through common evolutionary descent were normalized based on treatment of BLAST local alignments. Motifs were ranked according to a score derived from the product of the normalized number of occurrences and the information content. The method was shown to significantly outperform methods that do not discount evolutionary relatedness, when applied to known SLiMs from a subset of the eukaryotic linear motif (ELM) database. An implementation of Multiple Spanning Tree weighting outperformed two other weighting schemes, in a variety of settings.
Collapse
Affiliation(s)
| | - Denis C. Shields
- To whom correspondence should be addressed. Tel: +353 1 7166831;
| | | |
Collapse
|
9
|
Bridging Lossy and Lossless Compression by Motif Pattern Discovery. LECTURE NOTES IN COMPUTER SCIENCE 2006. [DOI: 10.1007/11889342_51] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/15/2023]
|
10
|
Tao T, Zhai CX, Lu X, Fang H. A study of statistical methods for function prediction of protein motifs. ACTA ACUST UNITED AC 2005; 3:115-24. [PMID: 15693737 DOI: 10.2165/00822942-200403020-00006] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/02/2022]
Abstract
Automatic discovery of new protein motifs (i.e. amino acid patterns) is one of the major challenges in bioinformatics. Several algorithms have been proposed that can extract statistically significant motif patterns from any set of protein sequences. With these methods, one can generate a large set of candidate motifs that may be biologically meaningful. This article examines methods to predict the functions of these candidate motifs. We use several statistical methods: a popularity method, a mutual information method and probabilistic translation models. These methods capture, from different perspectives, the correlations between the matched motifs of a protein and its assigned Gene Ontology terms that characterise the function of the protein. We evaluate these different methods using the known motifs in the InterPro database. Each method is used to rank candidate terms for each motif. We then use the expected mean reciprocal rank to evaluate the performance. The results show that, in general, all these methods perform well, suggesting that they can all be useful for predicting the function of an unknown motif. Among the methods tested, a probabilistic translation model with a popularity prior performs the best.
Collapse
Affiliation(s)
- Tao Tao
- Department of Computer Science, University of Illinois at Urbana-Champaign, 506 S. Matthews Avenue, Urbana, IL 61801, USA.
| | | | | | | |
Collapse
|
11
|
Darzentas N, Rigoutsos I, Ouzounis CA. Sensitive detection of sequence similarity using combinatorial pattern discovery: A challenging study of two distantly related protein families. Proteins 2005; 61:926-37. [PMID: 16224785 DOI: 10.1002/prot.20608] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
We investigate the performance of combinatorial pattern discovery to detect remote sequence similarities in terms of both biological accuracy and computational efficiency for a pair of distantly related families, as a case study. The two families represent the cupredoxins and multicopper oxidases, both containing blue copper-binding domains. These families present a challenging case due to low sequence similarity, different local structure, and variable sequence conservation at their copper-binding active sites. In this study, we investigate a new approach for automatically identifying weak sequence similarities that is based on combinatorial pattern discovery. We compare its performance with a traditional, HMM-based scheme and obtain estimates for sensitivity and specificity of the two approaches. Our analysis suggests that pattern discovery methods can be substantially more sensitive in detecting remote protein relationships while at the same time guaranteeing high specificity.
Collapse
Affiliation(s)
- Nikos Darzentas
- Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge, UK
| | | | | |
Collapse
|
12
|
Stenberg P, Pettersson F, Saura AO, Berglund A, Larsson J. Sequence signature analysis of chromosome identity in three Drosophila species. BMC Bioinformatics 2005; 6:158. [PMID: 15975141 PMCID: PMC1181806 DOI: 10.1186/1471-2105-6-158] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2004] [Accepted: 06/23/2005] [Indexed: 11/30/2022] Open
Abstract
Background All eukaryotic organisms need to distinguish each of their chromosomes. A few protein complexes have been described that recognise entire, specific chromosomes, for instance dosage compensation complexes and the recently discovered autosome-specific Painting of Fourth (POF) protein in Drosophila. However, no sequences have been found that are chromosome-specific and distributed over the entire length of the respective chromosome. Here, we present a new, unbiased, exhaustive computational method that was used to probe three Drosophila genomes for chromosome-specific sequences. Results By combining genome annotations and cytological data with multivariate statistics related to three Drosophila genomes we found sequence signatures that distinguish Muller's F-elements (chromosome 4 in D. melanogaster) from all other chromosomes in Drosophila that are not attributable to differences in nucleotide composition, simple sequence repeats or repeated elements. Based on these signatures we identified complex motifs that are strongly overrepresented in the F-elements and found indications that the D. melanogaster motif may be involved in POF-binding to the F-element. In addition, the X-chromosomes of D. melanogaster and D. yakuba can be distinguished from the other chromosomes, albeit to a lesser extent. Surprisingly, the conservation of the F-element sequence signatures extends not only between species separated by approximately 55 Myr, but also linearly along the sequenced part of the F-elements. Conclusion Our results suggest that chromosome-distinguishing features are not exclusive to the sex chromosomes, but are also present on at least one autosome (the F-element) in Drosophila.
Collapse
Affiliation(s)
| | - Fredrik Pettersson
- Research Group for Chemometrics, Department of Chemistry, Umeå University, Umeå, Sweden
| | - Anja O Saura
- Department of Genetics, University of Helsinki, Helsinki, Finland
| | - Anders Berglund
- Research Group for Chemometrics, Department of Chemistry, Umeå University, Umeå, Sweden
| | | |
Collapse
|
13
|
Blekas K, Fotiadis DI, Likas A. Motif-based protein sequence classification using neural networks. J Comput Biol 2005; 12:64-82. [PMID: 15725734 DOI: 10.1089/cmb.2005.12.64] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
We present a system for multi-class protein classification based on neural networks. The basic issue concerning the construction of neural network systems for protein classification is the sequence encoding scheme that must be used in order to feed the neural network. To deal with this problem we propose a method that maps a protein sequence into a numerical feature space using the matching scores of the sequence to groups of conserved patterns (called motifs) into protein families. We consider two alternative ways for identifying the motifs to be used for feature generation and provide a comparative evaluation of the two schemes. We also evaluate the impact of the incorporation of background features (2-grams) on the performance of the neural system. Experimental results on real datasets indicate that the proposed method is highly efficient and is superior to other well-known methods for protein classification.
Collapse
Affiliation(s)
- Konstantinos Blekas
- Department of Computer Science and Biomedical Research Institute-FORTH, University of Ioannina, GR-45110 Ioannina, Greece.
| | | | | |
Collapse
|
14
|
|
15
|
Helman P, Veroff R, Atlas SR, Willman C. A Bayesian network classification methodology for gene expression data. J Comput Biol 2005; 11:581-615. [PMID: 15579233 DOI: 10.1089/cmb.2004.11.581] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
We present new techniques for the application of a Bayesian network learning framework to the problem of classifying gene expression data. The focus on classification permits us to develop techniques that address in several ways the complexities of learning Bayesian nets. Our classification model reduces the Bayesian network learning problem to the problem of learning multiple subnetworks, each consisting of a class label node and its set of parent genes. We argue that this classification model is more appropriate for the gene expression domain than are other structurally similar Bayesian network classification models, such as Naive Bayes and Tree Augmented Naive Bayes (TAN), because our model is consistent with prior domain experience suggesting that a relatively small number of genes, taken in different combinations, is required to predict most clinical classes of interest. Within this framework, we consider two different approaches to identifying parent sets which are supported by the gene expression observations and any other currently available evidence. One approach employs a simple greedy algorithm to search the universe of all genes; the second approach develops and applies a gene selection algorithm whose results are incorporated as a prior to enable an exhaustive search for parent sets over a restricted universe of genes. Two other significant contributions are the construction of classifiers from multiple, competing Bayesian network hypotheses and algorithmic methods for normalizing and binning gene expression data in the absence of prior expert knowledge. Our classifiers are developed under a cross validation regimen and then validated on corresponding out-of-sample test sets. The classifiers attain a classification rate in excess of 90% on out-of-sample test sets for two publicly available datasets. We present an extensive compilation of results reported in the literature for other classification methods run against these same two datasets. Our results are comparable to, or better than, any we have found reported for these two sets, when a train-test protocol as stringent as ours is followed.
Collapse
Affiliation(s)
- Paul Helman
- Computer Science Department, University of New Mexico, Albuquerque, NM 87131, USA.
| | | | | | | |
Collapse
|
16
|
Huynh T, Rigoutsos I. The web server of IBM's Bioinformatics and Pattern Discovery group: 2004 update. Nucleic Acids Res 2004; 32:W10-5. [PMID: 15215340 PMCID: PMC441505 DOI: 10.1093/nar/gkh367] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
In this report, we provide an update on the services and content which are available on the web server of IBM's Bioinformatics and Pattern Discovery group. The server, which is operational around the clock, provides access to a large number of methods that have been developed and published by the group's members. There is an increasing number of problems that these tools can help tackle; these problems range from the discovery of patterns in streams of events and the computation of multiple sequence alignments, to the discovery of genes in nucleic acid sequences, the identification--directly from sequence--of structural deviations from alpha-helicity and the annotation of amino acid sequences for antimicrobial activity. Additionally, annotations for more than 130 archaeal, bacterial, eukaryotic and viral genomes are now available on-line and can be searched interactively. The tools and code bundles continue to be accessible from http://cbcsrv.watson.ibm.com/Tspd.html whereas the genomics annotations are available at http://cbcsrv.watson.ibm.com/Annotations/.
Collapse
Affiliation(s)
- Tien Huynh
- Bioinformatics and Pattern Discovery group, IBM T.J. Watson Research Center, PO Box 218, Yorktown Heights, NY 10598, USA
| | | |
Collapse
|
17
|
Abstract
Herpesviruses represent an exceptionally suitable model to analyze evolutionary old pathogens, their competency to adapt to existing and changing molecular niches in host species, and the modulation of the gene content and function to comply with the requirements of life. The basis for numerous studies dealing with these questions are reliable statements about the gene content of herpesviral genomes and the functions of viral proteins. The recent determination of the coding strategy of the chimpanzee cytomegalovirus genome and the re-evaluation of the gene content of the human cytomegalovirus genome made it also necessary to restructure the putative transcription map of the Tupaia herpesvirus (THV) genome. Twenty-three THV-specific ORFs formerly predicted to be coding for viral proteins were deleted from the THV transcription map resulting in a gene layout that is now characterized by the presence of conserved genes in the genome center, that probably reflect the genome structure of common herpesviral ancestors, and species-specific genes at the termini. The conserved regions in the THV genome are characterized by high G + C contents between 60% and 80%, a high CpG dinucleotide frequency, and the presence of densely packed putative CpG islands. The genome termini seem to provide the requirements of large scale rearrangements and complements of the gene content to adapt to new environmental demands. With the help of the recently designed method of dictionary-driven, pattern-based protein annotation it was possible to assign putative functions to almost all potential THV proteins, e.g. 123 were found to be putative membrane or secreted proteins, putative signal domains were identified in 69, and 29 proteins were predicted to be glycosylated. The present study adds new aspects to the knowledge about the precise gene composition of herpesvirus genomes and viral protein functions that are of exceptional importance for studies dealing with the phylogeny, the evolution, vaccine vector development, virus-host interactions, pathogenesis and the determination of protein functions of herpesviruses.
Collapse
Affiliation(s)
- Udo Bahr
- Hygiene-Institut, Abteilung Virologie, Universität Heidelberg, Im Neuenheimer Feld 324, D-69120 Heidelberg, Germany
| | | |
Collapse
|
18
|
A Tennis Video Indexing Approach Through Pattern Discovery in Interactive Process. ADVANCES IN MULTIMEDIA INFORMATION PROCESSING - PCM 2004 2004. [DOI: 10.1007/978-3-540-30541-5_7] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/07/2023]
|
19
|
Abstract
We examine the problem of extracting maximal irredundant motifs from a string. A combinatorial argument poses a linear bound on the total number of such motifs, thereby opening the way to the quest for the fastest and most efficient methods of extraction. The basic paradigm explored here is that of iterated updates of the set of irredundant motifs in a string under consecutive unit symbol extensions of the string itself. This approach exposes novel characterizations for the base set of motifs in a string, hinged on notions of partial order. Such properties support the design of ad hoc data structures and constructs, and lead to develop an O(n(3)) time incremental discovery algorithm.
Collapse
Affiliation(s)
- Alberto Apostolico
- Dipartimento di Ingegneria dell' Informazione, Università di Padova, Padova, Italy.
| | | |
Collapse
|
20
|
Huynh T, Rigoutsos I, Parida L, Platt D, Shibuya T. The web server of IBM's Bioinformatics and Pattern Discovery group. Nucleic Acids Res 2003; 31:3645-50. [PMID: 12824385 PMCID: PMC169027 DOI: 10.1093/nar/gkg621] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2003] [Revised: 04/08/2003] [Accepted: 04/08/2003] [Indexed: 11/12/2022] Open
Abstract
We herein present and discuss the services and content which are available on the web server of IBM's Bioinformatics and Pattern Discovery group. The server is operational around the clock and provides access to a variety of methods that have been published by the group's members and collaborators. The available tools correspond to applications ranging from the discovery of patterns in streams of events and the computation of multiple sequence alignments, to the discovery of genes in nucleic acid sequences and the interactive annotation of amino acid sequences. Additionally, annotations for more than 70 archaeal, bacterial, eukaryotic and viral genomes are available on-line and can be searched interactively. The tools and code bundles can be accessed beginning at http://cbcsrv.watson.ibm.com/Tspd.html whereas the genomics annotations are available at http://cbcsrv.watson.ibm.com/Annotations/.
Collapse
Affiliation(s)
- Tien Huynh
- Bioinformatics and Pattern Discovery Group, IBM TJ Watson Research Center, PO BOX 218, Yorktown Heights, NY 10598, USA.
| | | | | | | | | |
Collapse
|
21
|
Rigoutsos I, Novotny J, Huynh T, Chin-Bow ST, Parida L, Platt D, Coleman D, Shenk T. In silico pattern-based analysis of the human cytomegalovirus genome. J Virol 2003; 77:4326-44. [PMID: 12634390 PMCID: PMC150618 DOI: 10.1128/jvi.77.7.4326-4344.2003] [Citation(s) in RCA: 52] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/10/2002] [Accepted: 12/23/2002] [Indexed: 11/20/2022] Open
Abstract
More than 200 open reading frames (ORFs) from the human cytomegalovirus genome have been reported as potentially coding for proteins. We have used two pattern-based in silico approaches to analyze this set of putative viral genes. With the help of an objective annotation method that is based on the Bio-Dictionary, a comprehensive collection of amino acid patterns that describes the currently known natural sequence space of proteins, we have reannotated all of the previously reported putative genes of the human cytomegalovirus. Also, with the help of MUSCA, a pattern-based multiple sequence alignment algorithm, we have reexamined the original human cytomegalovirus gene family definitions. Our analysis of the genome shows that many of the coded proteins comprise amino acid combinations that are unique to either the human cytomegalovirus or the larger group of herpesviruses. We have confirmed that a surprisingly large portion of the analyzed ORFs encode membrane proteins, and we have discovered a significant number of previously uncharacterized proteins that are predicted to be G-protein-coupled receptor homologues. The analysis also indicates that many of the encoded proteins undergo posttranslational modifications such as hydroxylation, phosphorylation, and glycosylation. ORFs encoding proteins with similar functional behavior appear in neighboring regions of the human cytomegalovirus genome. All of the results of the present study can be found and interactively explored online (http://cbcsrv.watson.ibm.com/virus/).
Collapse
Affiliation(s)
- Isidore Rigoutsos
- Bioinformatics and Pattern Discovery Group, IBM TJ Watson Research Center, Yorktown Heights, New York 10598, USA.
| | | | | | | | | | | | | | | |
Collapse
|
22
|
Bicciato S, Pandin M, Didonè G, Di Bello C. Pattern identification and classification in gene expression data using an autoassociative neural network model. Biotechnol Bioeng 2003; 81:594-606. [PMID: 12514809 DOI: 10.1002/bit.10505] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
The application of DNA microarray technology for analysis of gene expression creates enormous opportunities to accelerate the pace in understanding living systems and identification of target genes and pathways for drug development and therapeutic intervention. Parallel monitoring of the expression profiles of thousands of genes seems particularly promising for a deeper understanding of cancer biology and the identification of molecular signatures supporting the histological classification schemes of neoplastic specimens. However, the increasing volume of data generated by microarray experiments poses the challenge of developing equally efficient methods and analysis procedures to extract, interpret, and upgrade the information content of these databases. Herein, a computational procedure for pattern identification, feature extraction, and classification of gene expression data through the analysis of an autoassociative neural network model is described. The identified patterns and features contain critical information about gene-phenotype relationships observed during changes in cell physiology. They represent a rational and dimensionally reduced base for understanding the basic biology of the onset of diseases, defining targets of therapeutic intervention, and developing diagnostic tools for the identification and classification of pathological states. The proposed method has been tested on two different microarray datasets-Golub's analysis of acute human leukemia [Golub et al. (1999) Science 286:531-537], and the human colon adenocarcinoma study presented by Alon et al. [1999; Proc Natl Acad Sci USA 97:10101-10106]. The analysis of the neural network internal structure allows the identification of specific phenotype markers and the extraction of peculiar associations among genes and physiological states. At the same time, the neural network outputs provide assignment to multiple classes, such as different pathological conditions or tissue samples, for previously unseen instances.
Collapse
Affiliation(s)
- Silvio Bicciato
- Department of Chemical Process Engineering, University of Padova, via Marzolo, 9, 35131, Padova, Italy.
| | | | | | | |
Collapse
|
23
|
Rigoutsos I, Huynh T, Floratos A, Parida L, Platt D. Dictionary-driven protein annotation. Nucleic Acids Res 2002; 30:3901-16. [PMID: 12202776 PMCID: PMC137405 DOI: 10.1093/nar/gkf464] [Citation(s) in RCA: 20] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2002] [Revised: 06/04/2002] [Accepted: 06/04/2002] [Indexed: 11/14/2022] Open
Abstract
Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/ bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were released publicly after we built the Bio-Dictionary that is used in our experiments. Finally, we have computed the annotations of more than 70 complete genomes and made them available on the World Wide Web at http://cbcsrv.watson.ibm.com/Annotations/.
Collapse
Affiliation(s)
- Isidore Rigoutsos
- Bioinformatics and Pattern Discovery Group, IBM TJ Watson Research Center, Yorktown Heights, NY 10598, USA.
| | | | | | | | | |
Collapse
|
24
|
Abstract
Gene identification, also known as gene finding or gene recognition, is among the important problems of molecular biology that have been receiving increasing attention with the advent of large scale sequencing projects. Previous strategies for solving this problem can be categorized into essentially two schools of thought: one school employs sequence composition statistics, whereas the other relies on database similarity searches. In this paper, we propose a new gene identification scheme that combines the best characteristics from each of these two schools. In particular, our method determines gene candidates among the ORFs that can be identified in a given DNA strand through the use of the Bio-Dictionary, a database of patterns that covers essentially all of the currently available sample of the natural protein sequence space. Our approach relies entirely on the use of redundant patterns as the agents on which the presence or absence of genes is predicated and does not employ any additional evidence, e.g. ribosome-binding site signals. The Bio-Dictionary Gene Finder (BDGF), the algorithm's implementation, is a single computational engine able to handle the gene identification task across distinct archaeal and bacterial genomes. The engine exhibits performance that is characterized by simultaneous very high values of sensitivity and specificity, and a high percentage of correctly predicted start sites. Using a collection of patterns derived from an old (June 2000) release of the Swiss-Prot/TrEMBL database that contained 451 602 proteins and fragments, we demonstrate our method's generality and capabilities through an extensive analysis of 17 complete archaeal and bacterial genomes. Examples of previously unreported genes are also shown and discussed in detail.
Collapse
Affiliation(s)
- Tetsuo Shibuya
- Exploratory Technology, IBM Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa 242-8502, Japan
| | | |
Collapse
|
25
|
Burgard AP, Moore GL, Maranas CD. Review of the TEIRESIAS-based tools of the IBM Bioinformatics and Pattern Discovery Group. Metab Eng 2001; 3:285-8. [PMID: 11676564 DOI: 10.1006/mben.2001.0195] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Affiliation(s)
- A P Burgard
- Department of Chemical Engineering, The Pennsylvania State University, University Park, PA 16802, USA
| | | | | |
Collapse
|