151
Kim WK, In YJ, Kim JH, Cho HJ, Kim JH, Kang S, Lee CY, Lee SC. Quantitative relationship of dioxin-responsive gene expression to dioxin response element in Hep3B and HepG2 human hepatocarcinoma cell lines. Toxicol Lett 2006; 165:174-81. [PMID: 16697128 DOI: 10.1016/j.toxlet.2006.03.007] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Received: 08/25/2005] [Revised: 03/03/2006] [Accepted: 03/10/2006] [Indexed: 11/29/2022]
Abstract
Dioxin response element (DRE) is a cis-acting DNA sequence mediating the 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD)-induced gene expression. The present study was undertaken to elucidate TCDD-responsive gene expression profiles and their relationships to the number of DREs in liver cancer cells. Hep3B and HepG2 human hepatocarcinoma cells were exposed to 50 nM TCDD for 0, 1, 2 and 4 h in culture, after which gene expression profiles were analyzed by microarray hybridization using a chip containing 24,000 cDNAs prepared from the human liver. The TCDD-responsive expression level of each gene was calculated by dividing the densitometric values of the hybridization signal for h1, h2 and h4 by that of h0, followed by transformation of the resulting data into a log scale with the base of 2. Up- and down-regulated gene expressions were defined as >0.585 and <-0.585 by the log scale (>1.5 and <1/1.5 arithmetically), respectively, exhibited at any time after h0. Hep3B and HepG2 cells had 27 and 58 TCDD-responsive, up-regulated genes, respectively, of which 78% (21/27) and 62% (36/58) had one or more DREs. Of these 85, 80 genes were up-regulated exclusively in one of the two lines, with CYP1A1 and PPP1R15A being so regulated in both lines. Expression levels of the up-regulated genes at h1, h2 and h4 were correlated with each other (P<0.01) and the mean of these regressed to the number of DRE(s) in both lines (P<0.01). However, expression of a total of 93 TCDD-responsive, down-regulated genes, of which 46% contained DRE(s), had no relation to the number of DRE(s). In conclusion, results suggest that DREs may cooperatively mediate the expression of TCDD-responsive genes in liver cancer cells.
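The log-ratio rule described in this abstract can be illustrated with a minimal Python sketch. It divides each later signal by the h0 signal, takes log base 2, and labels a gene by the 0.585 threshold (log2 of 1.5). The function names and the choice to test "up" before "down" when a gene crosses both thresholds are assumptions made for illustration; the abstract does not say how such genes were handled.

```python
import math

# Thresholds from the abstract: log2(1.5) is about 0.585.
UP_THRESHOLD = math.log2(1.5)
DOWN_THRESHOLD = -UP_THRESHOLD


def log2_ratios(signal_h0, signals_later):
    """Return log2(signal_t / signal_h0) for each later time point."""
    return [math.log2(s / signal_h0) for s in signals_later]


def classify(ratios):
    """Label a gene from its log2 ratios at any time after h0.

    Assumption (not stated in the abstract): "up" is checked first,
    so a gene crossing both thresholds is reported as "up".
    """
    if any(r > UP_THRESHOLD for r in ratios):
        return "up"
    if any(r < DOWN_THRESHOLD for r in ratios):
        return "down"
    return "unchanged"


# Hypothetical densitometric values for h0 and for h1, h2, h4.
print(classify(log2_ratios(100.0, [120.0, 160.0, 110.0])))
```

A signal that rises from 100 to 160 gives a ratio of 1.6, which exceeds the 1.5-fold cutoff, so the hypothetical gene above is labelled as up-regulated.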
Affiliation(s)
- Won Kon Kim
- Systemic Proteomics Research Center, Korea Research Institute of Bioscience and BioTechnology (KRIBB), Daejeon, South Korea
152
Bryson K, Loux V, Bossy R, Nicolas P, Chaillou S, van de Guchte M, Penaud S, Maguin E, Hoebeke M, Bessières P, Gibrat JF. AGMIAL: implementing an annotation strategy for prokaryote genomes as a distributed system. Nucleic Acids Res 2006; 34:3533-45. [PMID: 16855290 PMCID: PMC1524909 DOI: 10.1093/nar/gkl471] [Citation(s) in RCA: 80] [Impact Index Per Article: 4.2] [Indexed: 11/23/2022]
Abstract
We have implemented a genome annotation system for prokaryotes called AGMIAL. Our approach embodies a number of key principles. First, expert manual annotators are seen as a critical component of the overall system; user interfaces were cyclically refined to satisfy their needs. Second, the overall process should be orchestrated in terms of a global annotation strategy; this facilitates coordination between a team of annotators and automatic data analysis. Third, the annotation strategy should allow progressive and incremental annotation from a time when only a few draft contigs are available, to when a final finished assembly is produced. The overall architecture employed is modular and extensible, being based on the W3 standard Web services framework. Specialized modules interact with two independent core modules that are used to annotate, respectively, genomic and protein sequences. AGMIAL is currently being used by several INRA laboratories to analyze genomes of bacteria relevant to the food-processing industry, and is distributed under an open source license.
Affiliation(s)
- S. Chaillou
- Flore Lactique et Environnement Carné, INRA, 78352 Jouy-en-Josas Cedex, France
- S. Penaud
- Génétique Microbienne, INRA, 78352 Jouy-en-Josas Cedex, France
- E. Maguin
- Génétique Microbienne, INRA, 78352 Jouy-en-Josas Cedex, France
- J-F Gibrat
- To whom correspondence should be addressed. Tel: +33 1 34 65 28 97; Fax: +33 1 34 65 29 01; E-mail:
153
Lozada-Chávez I, Janga SC, Collado-Vides J. Bacterial regulatory networks are extremely flexible in evolution. Nucleic Acids Res 2006; 34:3434-45. [PMID: 16840530 PMCID: PMC1524901 DOI: 10.1093/nar/gkl423] [Citation(s) in RCA: 140] [Impact Index Per Article: 7.4] [Indexed: 11/29/2022]
Abstract
Over millions of years the structure and complexity of the transcriptional regulatory network (TRN) in bacteria has changed, reorganized and enabled them to adapt to almost every environmental niche on earth. In order to understand the plasticity of TRNs in bacteria, we studied the conservation of currently known TRNs of the two model organisms Escherichia coli K12 and Bacillus subtilis across complete genomes including Bacteria, Archaea and Eukarya at three different levels: individual components of the TRN, pairs of interactions and regulons. We found that transcription factors (TFs) evolve much faster than the target genes (TGs) across phyla. We show that global regulators are poorly conserved across the phylogenetic spectrum and hence TFs could be the major players responsible for the plasticity and evolvability of the TRNs. We also found that there is only a small fraction of significantly conserved transcriptional regulatory interactions among different phyla of bacteria and that there is no constraint on the elements of the interaction to co-evolve. Finally, our results suggest that the majority of the regulons in bacteria are rapidly lost, implying a high-order flexibility in the TRNs. We hypothesize that during the divergence of bacteria certain essential cellular processes like the synthesis of arginine, biotin and ribose, transport of amino acids and iron, availability of phosphate, replication process and the SOS response are well conserved in evolution. From our comparative analysis, it is possible to infer that transcriptional regulation is more flexible than the genetic component of the organisms and that its complexity and structure play an important role in phenotypic adaptation.
Affiliation(s)
- Irma Lozada-Chávez
- Programa de Genomica Computacional, Centro de Ciencias Genomicas, Universidad Nacional Autonoma de Mexico, Apdo. Postal 565-A, Avenue Universidad, Cuernavaca, Morelos, 62100 Mexico, Mexico. [corrected]
154
Han L, Cui J, Lin H, Ji Z, Cao Z, Li Y, Chen Y. Recent progresses in the application of machine learning approach for predicting protein functional class independent of sequence similarity. Proteomics 2006; 6:4023-37. [PMID: 16791826 DOI: 10.1002/pmic.200500938] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.2] [Indexed: 01/22/2023]
Abstract
Protein sequence contains clues to its function. Functional prediction from sequence presents a challenge particularly for proteins that have low or no sequence similarity to proteins of known function. Recently, machine learning methods have been explored for predicting functional class of proteins from sequence-derived properties independent of sequence similarity, which showed promising potential for low- and non-homologous proteins. These methods can thus be explored as potential tools to complement alignment- and clustering-based methods for predicting protein function. This article reviews the strategies, current progresses, and underlying difficulties in using machine learning methods for predicting the functional class of proteins. The relevant software and web-servers are described. The reported prediction performances in the application of these methods are also presented, which need to be interpreted with caution as they are dependent on such factors as datasets used and choice of parameters.
Affiliation(s)
- Lianyi Han
- Department of Computational Science, National University of Singapore, Singapore, Singapore
155
Brylinski M, Konieczny L, Roterman I. Ligation site in proteins recognized in silico. Bioinformation 2006; 1:127-9. [PMID: 17597871 PMCID: PMC1891674 DOI: 10.6026/97320630001127] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Received: 03/27/2006] [Accepted: 04/05/2006] [Indexed: 11/23/2022]
Abstract
Recognition of a ligation site in a protein molecule is important for identifying its biological activity. The model for in silico recognition of ligation sites in proteins is presented. The idealized hydrophobic core stabilizing protein structure is represented by a three-dimensional Gaussian function. The experimentally observed distribution of hydrophobicity compared with the theoretical distribution reveals differences. The area of high differences indicates the ligation site. AVAILABILITY: http://bioinformatics.cm-uj.krakow.pl/activesite.
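The comparison described in this abstract can be sketched roughly as follows: an idealized three-dimensional Gaussian centred on the molecule gives a theoretical hydrophobicity value per residue, and residues are ranked by how far the observed values depart from it. Everything concrete here is an assumption for illustration only: the function names, the use of one point per residue, the normalisation of both profiles to sum to 1, and the choice of sigma values are not taken from the paper.

```python
import math


def gaussian_profile(coords, sigmas):
    """Theoretical hydrophobicity per point from a 3D Gaussian centred on
    the geometric centre of the points, normalised to sum to 1."""
    n = len(coords)
    center = [sum(c[i] for c in coords) / n for i in range(3)]
    raw = [
        math.exp(-sum((c[i] - center[i]) ** 2 / (2 * sigmas[i] ** 2)
                      for i in range(3)))
        for c in coords
    ]
    total = sum(raw)
    return [v / total for v in raw]


def normalise(values):
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def rank_by_difference(coords, observed, sigmas):
    """Return point indices ordered by |theoretical - observed|, largest first."""
    theory = gaussian_profile(coords, sigmas)
    obs = normalise(observed)
    diffs = [t - o for t, o in zip(theory, obs)]
    return sorted(range(len(coords)), key=lambda i: abs(diffs[i]), reverse=True)


# Hypothetical toy data: five points, the last one far less hydrophobic
# than the idealized core would predict.
coords = [(0, 0, 0), (1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]
observed = [1.0, 0.6, 0.6, 0.6, 0.05]
ranking = rank_by_difference(coords, observed, sigmas=(1.0, 1.0, 1.0))
print(ranking[0])
```

In this toy case the last point shows the largest discrepancy, so under the abstract's reasoning it would be the first candidate for a ligation site.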
Affiliation(s)
- Michal Brylinski
- Department of Bioinformatics and Telemedicine, Collegium Medicum - Jagiellonian University, Kopernika 17, 31-501 Krakow, Poland
- Faculty of Chemistry, Jagiellonian University, Ingardena 3, 30-060 Krakow, Poland
- Leszek Konieczny
- Institute of Medical Biochemistry, Collegium Medicum Jagiellonian University, Kopernika 7, 31-034 Krakow, Poland
- Irena Roterman
- Department of Bioinformatics and Telemedicine, Collegium Medicum - Jagiellonian University, Kopernika 17, 31-501 Krakow, Poland
- Faculty of Physics, Jagiellonian University, Reymonta 4, 30-060 Krakow, Poland
156
Schneider G, Neuberger G, Wildpaner M, Tian S, Berezovsky I, Eisenhaber F. Application of a sensitive collection heuristic for very large protein families: evolutionary relationship between adipose triglyceride lipase (ATGL) and classic mammalian lipases. BMC Bioinformatics 2006; 7:164. [PMID: 16551354 PMCID: PMC1435942 DOI: 10.1186/1471-2105-7-164] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.5] [Received: 09/27/2005] [Accepted: 03/21/2006] [Indexed: 11/30/2022]
Abstract
Background Manually finding subtle yet statistically significant links to distantly related homologues becomes practically impossible for very populated protein families due to the sheer number of similarity searches to be invoked and analyzed. The unclear evolutionary relationship between classical mammalian lipases and the recently discovered human adipose triglyceride lipase (ATGL; a patatin family member) is an exemplary case for such a problem. Results We describe an unsupervised, sensitive sequence segment collection heuristic suitable for assembling very large protein families. It is based on fan-like expanding, iterative database searches. To prevent inclusion of unrelated hits, additional criteria are introduced: minimal alignment length and overlap with starting sequence segments, finding starting sequences in reciprocal searches, automated filtering for compositional bias and repetitive patterns. This heuristic was implemented as FAMILYSEARCHER in the ANNIE sequence analysis environment and applied to search for protein links between the classical lipase family and the patatin-like group. Conclusion The FAMILYSEARCHER is an efficient tool for tracing distant evolutionary relationships involving large protein families. Although classical lipases and ATGL have no obvious sequence similarity and differ with regard to fold and catalytic mechanism, homology links detected with FAMILYSEARCHER show that they are evolutionarily related. The conserved sequence parts can be narrowed down to an ancestral core module consisting of three β-strands, one α-helix and a turn containing the typical nucleophilic serine. Moreover, this ancestral module also appears in numerous enzymes with various substrate specificities, but that critically rely on nucleophilic attack mechanisms.
Affiliation(s)
- Georg Schneider
- IMP - Research Institute of Molecular Pathology, Dr. Bohr-Gasse 7, A-1030 Vienna, Republic of Austria
- Georg Neuberger
- IMP - Research Institute of Molecular Pathology, Dr. Bohr-Gasse 7, A-1030 Vienna, Republic of Austria
- Michael Wildpaner
- IMP - Research Institute of Molecular Pathology, Dr. Bohr-Gasse 7, A-1030 Vienna, Republic of Austria
- Sun Tian
- IMP - Research Institute of Molecular Pathology, Dr. Bohr-Gasse 7, A-1030 Vienna, Republic of Austria
- Igor Berezovsky
- Department of Chemistry and Chemical Biology, Harvard University, 12 Oxford str., M-105, 02138 Cambridge, MA, USA
- Frank Eisenhaber
- IMP - Research Institute of Molecular Pathology, Dr. Bohr-Gasse 7, A-1030 Vienna, Republic of Austria
157
Wu J, Hu Z, DeLisi C. Gene annotation and network inference by phylogenetic profiling. BMC Bioinformatics 2006; 7:80. [PMID: 16503966 PMCID: PMC1388238 DOI: 10.1186/1471-2105-7-80] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Received: 10/07/2005] [Accepted: 02/17/2006] [Indexed: 12/30/2022]
Abstract
BACKGROUND Phylogenetic analysis is emerging as one of the most informative computational methods for the annotation of genes and identification of evolutionary modules of functionally related genes. The effectiveness with which phylogenetic profiles can be utilized to assign genes to pathways depends on an appropriate measure of correlation between gene profiles, and an effective decision rule to use the correlate. Current methods, though useful, perform at a level well below what is possible, largely because performance of the latter deteriorates rapidly as coverage increases. RESULTS We introduce, test and apply a new decision rule, correlation enrichment (CE), for assigning genes to functional categories at various levels of resolution. Among the results are: (1) CE performs better than standard guilt by association (SGA, assignment to a functional category when a simple correlate exceeds a pre-specified threshold) irrespective of the number of genes assigned (i.e. coverage); improvement is greatest at high coverage where precision (positive predictive value) of CE is approximately 6-fold higher than that of SGA. (2) CE is estimated to allocate each of the 2918 unannotated orthologs to KEGG pathways with an average precision of 49% (approximately 7-fold higher than SGA). (3) An estimated 94% of the 1846 unannotated orthologs in the COG ontology can be assigned a function with an average precision of 0.4 or greater. (4) Dozens of functional and evolutionarily conserved cliques or quasi-cliques can be identified, many having previously unannotated genes. CONCLUSION The method serves as a general computational tool for annotating large numbers of unknown genes, uncovering evolutionary and functional modules. It appears to perform substantially better than extant stand-alone high-throughput methods.
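The baseline rule this abstract calls standard guilt by association (SGA) can be sketched in a few lines of Python: each gene has a presence/absence profile across genomes, and an unannotated gene inherits the category of any annotated gene whose profile correlation exceeds a threshold. This is an illustrative sketch only. The use of Pearson correlation, the threshold value, and all gene and pathway names are assumptions, and the paper's own correlation enrichment (CE) rule is not reproduced because the abstract does not specify it.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length numeric profiles.
    Returns 0.0 when either profile has no variance."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def sga_assign(query_profile, annotated, threshold):
    """Return the categories of annotated genes whose profile correlates
    with the query above the threshold.

    annotated maps gene name -> (profile, category).
    """
    hits = set()
    for profile, category in annotated.values():
        if pearson(query_profile, profile) > threshold:
            hits.add(category)
    return hits


# Hypothetical profiles: 1 = ortholog present in a genome, 0 = absent.
annotated = {
    "geneA": ([1, 1, 0, 0, 1, 0], "pathway_X"),
    "geneB": ([0, 0, 1, 1, 0, 1], "pathway_Y"),
}
query = [1, 1, 0, 0, 1, 1]
print(sga_assign(query, annotated, threshold=0.6))
```

Raising the threshold trades coverage for precision, which is the trade-off the abstract refers to when it compares decision rules at different coverage levels.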
Affiliation(s)
- Jie Wu
- Department of Biomedical Engineering, Boston University, 24 Cummington St., Boston, MA, 02215, USA
- Zhenjun Hu
- Bioinformatics and Systems Biology, Boston University, 24 Cummington St., Boston, MA, 02215, USA
- Charles DeLisi
- Department of Biomedical Engineering, Boston University, 24 Cummington St., Boston, MA, 02215, USA
- Bioinformatics and Systems Biology, Boston University, 24 Cummington St., Boston, MA, 02215, USA
158
Ahmad I, Hoessli DC, Walker-Nasir E, Rafik SM, Shakoori AR. Oct-2 DNA binding transcription factor: functional consequences of phosphorylation and glycosylation. Nucleic Acids Res 2006; 34:175-84. [PMID: 16431844 PMCID: PMC1326018 DOI: 10.1093/nar/gkj401] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.4] [Indexed: 11/27/2022]
Abstract
Phosphorylation and O-GlcNAc modification often induce conformational changes and allow the protein to specifically interact with other proteins. Interplay of phosphorylation and O-GlcNAc modification at the same conserved site may result in the protein undergoing functional switches. We describe that at conserved Ser/Thr residues of human Oct-2, alternative phosphorylation and O-GlcNAc modification (Yin Yang sites) can be predicted by the YinOYang1.2 method. We propose here that alternative phosphorylation and O-GlcNAc modification at Ser191 in the N-terminal region, Ser271 and 274 in the linker region of two POU sub-domains and Thr301 and Ser323 in the POUh subdomain are involved in the differential binding behavior of Oct-2 to the octamer DNA motif. This implies that phosphorylation or O-GlcNAc modification of the same amino acid may result in a different binding capacity of the modified protein. In the C-terminal domain, Ser371, 389 and 394 are additional Yin Yang sites that could be involved in the modulation of Oct-2 binding properties.
Affiliation(s)
- Ishtiaq Ahmad
- Institute of Molecular Sciences and Bioinformatics, Lahore, Pakistan
159
Pretzer G, Snel J, Molenaar D, Wiersma A, Bron PA, Lambert J, de Vos WM, van der Meer R, Smits MA, Kleerebezem M. Biodiversity-based identification and functional characterization of the mannose-specific adhesin of Lactobacillus plantarum. J Bacteriol 2005; 187:6128-36. [PMID: 16109954 PMCID: PMC1196140 DOI: 10.1128/jb.187.17.6128-6136.2005] [Citation(s) in RCA: 223] [Impact Index Per Article: 11.2] [Indexed: 12/24/2022]
Abstract
Lactobacillus plantarum is a frequently encountered inhabitant of the human intestinal tract, and some strains are marketed as probiotics. Their ability to adhere to mannose residues is a potentially interesting characteristic with regard to proposed probiotic features such as colonization of the intestinal surface and competitive exclusion of pathogens. In this study, the variable capacity of 14 L. plantarum strains to agglutinate Saccharomyces cerevisiae in a mannose-specific manner was determined and subsequently correlated with an L. plantarum WCFS1-based genome-wide genotype database. This led to the identification of four candidate mannose adhesin-encoding genes. Two genes primarily predicted to code for sortase-dependent cell surface proteins displayed a complete gene-trait match. Their involvement in mannose adhesion was corroborated by the finding that a sortase (srtA) mutant of L. plantarum WCFS1 lost the capacity to agglutinate S. cerevisiae. The postulated role of these two candidate genes was investigated by gene-specific deletion and overexpression in L. plantarum WCFS1. Subsequent evaluation of the mannose adhesion capacity of the resulting mutant strains showed that inactivation of one candidate gene (lp_0373) did not affect mannose adhesion properties. In contrast, deletion of the other gene (lp_1229) resulted in a complete loss of yeast agglutination ability, while its overexpression quantitatively enhanced this phenotype. Therefore, this gene was designated to encode the mannose-specific adhesin (Msa; gene name, msa) of L. plantarum. Domain homology analysis of the predicted 1,000-residue Msa protein identified known carbohydrate-binding domains, further supporting its role as a mannose adhesin that is likely to be involved in the interaction of L. plantarum with its host in the intestinal tract.
160
Francke C, Siezen RJ, Teusink B. Reconstructing the metabolic network of a bacterium from its genome. Trends Microbiol 2005; 13:550-8. [PMID: 16169729 DOI: 10.1016/j.tim.2005.09.001] [Citation(s) in RCA: 124] [Impact Index Per Article: 6.2] [Received: 05/05/2005] [Revised: 08/25/2005] [Accepted: 09/08/2005] [Indexed: 10/25/2022]
Abstract
The prospect of understanding the relationship between the genome and the physiology of an organism is an important incentive to reconstruct metabolic networks. The first steps in the process can be automated and it does not take much effort to obtain an initial metabolic reconstruction from a genome sequence. However, such a reconstruction is certainly not flawless and correction of the many imperfections is laborious. It requires the combined analysis of the available information on protein sequence, phylogeny, gene-context and co-occurrence but is also aided by high-throughput experimental data. Simultaneously, the reconstructed network provides the opportunity to visualize the "omics" data within a relevant biological functional context and thus aids the interpretation of those data.
Affiliation(s)
- Christof Francke
- Wageningen Centre for Food Sciences, PO Box 557, 6700 AN Wageningen, the Netherlands.
161
McCarthy FM, Burgess SC, van den Berg BHJ, Koter MD, Pharr GT. Differential detergent fractionation for non-electrophoretic eukaryote cell proteomics. J Proteome Res 2005; 4:316-24. [PMID: 15822906 DOI: 10.1021/pr049842d] [Citation(s) in RCA: 72] [Impact Index Per Article: 3.6] [Indexed: 11/30/2022]
Abstract
Differential detergent fractionation (DDF), which relies on detergents to sequentially extract proteins from eukaryotic cells, has been used to increase proteome coverage of 2D-PAGE. Here, we used DDF extraction in conjunction with the nonelectrophoretic proteomics method of liquid chromatography and electrospray ionization tandem mass spectrometry. We demonstrate that DDF can be used with 2D-LC ESI MS2 for comprehensive cellular proteomics, including a large proportion of membrane proteins. Compared to some published methods designed to isolate membrane proteins specifically, DDF extraction yields comprehensive proteomes which include twice as many membrane proteins. Two-thirds of these membrane proteins have more than one trans-membrane domain. Since DDF separates proteins based upon their physicochemistry and subcellular localization, this method also provides data useful for functional genome annotation. As more genome sequences are completed, methods which can aid in functional annotation will become increasingly important.
Affiliation(s)
- Fiona M McCarthy
- College of Veterinary Medicine, PO Box 6100, Mississippi State University, Mississippi 39762, USA
162
Abstract
The functional characterization of all genes and their gene products is the main challenge of the postgenomic era. Recent experimental and computational techniques have enabled the study of interactions among all proteins on a large scale. In this paper, approaches will be presented to exploit interaction information for the inference of protein structure, function, signalling pathways and ultimately entire interactomes. Interaction networks can be modelled as graphs, showing the operation of gene function in terms of protein interactions. Since the architecture of biological networks differs distinctly from random networks, these functional maps contain a signal that can be used for predictive purposes. Protein function and structure can be predicted by matching interaction patterns, without the requirement of sequence similarity. Moving on to a higher level definition of protein function, the question arises how to decompose complex networks into meaningful subsets. An algorithm will be demonstrated, which extracts whole signal-transduction pathways from noisy graphs derived from text-mining the biological literature. Finally, an algorithmic strategy is formulated that enables the proteomics community to build a reliable scaffold of the interactome in a fraction of the time compared with uncoordinated efforts.
Affiliation(s)
- M Lappe
- Max-Planck Institute of Molecular Genetics, Ihnestrasse 73, 14195 Berlin, Germany
163
Gilks WR, Audit B, de Angelis D, Tsoka S, Ouzounis CA. Percolation of annotation errors through hierarchically structured protein sequence databases. Math Biosci 2005; 193:223-34. [PMID: 15748731 DOI: 10.1016/j.mbs.2004.08.001] [Citation(s) in RCA: 52] [Impact Index Per Article: 2.6] [Received: 01/10/2003] [Accepted: 08/30/2004] [Indexed: 11/18/2022]
Abstract
Databases of protein sequences have grown rapidly in recent years as a result of genome sequencing projects. Annotating protein sequences with descriptions of their biological function ideally requires careful experimentation, but this work lags far behind. Instead, biological function is often imputed by copying annotations from similar protein sequences. This gives rise to annotation errors, and more seriously, to chains of misannotation. [Percolation of annotation errors in a database of protein sequences (2002)] developed a probabilistic framework for exploring the consequences of this percolation of errors through protein databases, and applied their theory to a simple database model. Here we apply the theory to hierarchically structured protein sequence databases, and draw conclusions about database quality at different levels of the hierarchy.
Affiliation(s)
- Walter R Gilks
- Medical Research Council Biostatistics Unit, Institute of Public Health, University Forvie Site, Robinson Way, Cambridge CB2 2SR, UK.
164
Korbel JO, Doerks T, Jensen LJ, Perez-Iratxeta C, Kaczanowski S, Hooper SD, Andrade MA, Bork P. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol 2005; 3:e134. [PMID: 15799710 PMCID: PMC1073694 DOI: 10.1371/journal.pbio.0030134] [Citation(s) in RCA: 105] [Impact Index Per Article: 5.3] [Received: 11/30/2004] [Accepted: 02/02/2005] [Indexed: 11/23/2022]
Abstract
One of the major challenges of functional genomics is to unravel the connection between genotype and phenotype. So far no global analysis has attempted to explore those connections in the light of the large phenotypic variability seen in nature. Here, we use an unsupervised, systematic approach for associating genes and phenotypic characteristics that combines literature mining with comparative genome analysis. We first mine the MEDLINE literature database for terms that reflect phenotypic similarities of species. Subsequently we predict the likely genomic determinants: genes specifically present in the respective genomes. In a global analysis involving 92 prokaryotic genomes we retrieve 323 clusters containing a total of 2,700 significant gene–phenotype associations. Some clusters contain mostly known relationships, such as genes involved in motility or plant degradation, often with additional hypothetical proteins associated with those phenotypes. Other clusters comprise unexpected associations; for example, a group of terms related to food and spoilage is linked to genes predicted to be involved in bacterial food poisoning. Among the clusters, we observe an enrichment of pathogenicity-related associations, suggesting that the approach reveals many novel genes likely to play a role in infectious diseases. The combination of text mining and comparative genomics is shown to be a powerful approach to predicting phenotypes that are associated with particular genes in bacterial genomes.
Affiliation(s)
- Jan O Korbel
- European Molecular Biology Laboratory, Heidelberg, Germany
- Tobias Doerks
- European Molecular Biology Laboratory, Heidelberg, Germany
- Lars J Jensen
- European Molecular Biology Laboratory, Heidelberg, Germany
- Max Delbrück Center for Molecular Medicine, Berlin-Buch, Germany
- Szymon Kaczanowski
- Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Warsaw, Poland
- Sean D Hooper
- European Molecular Biology Laboratory, Heidelberg, Germany
- Miguel A Andrade
- Ontario Genomics Innovation Centre, Ottawa Health Research Institute, Ottawa, Canada
- Peer Bork
- European Molecular Biology Laboratory, Heidelberg, Germany
- Max Delbrück Center for Molecular Medicine, Berlin-Buch, Germany
165
Worthey EA, Myler PJ. Protozoan genomes: gene identification and annotation. Int J Parasitol 2005; 35:495-512. [PMID: 15826642 DOI: 10.1016/j.ijpara.2005.02.008] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 11/15/2004] [Revised: 01/25/2005] [Accepted: 02/06/2005] [Indexed: 12/01/2022]
Abstract
The draft sequence of several complete protozoan genomes is now available and genome projects are ongoing for a number of other species. Different strategies are being implemented to identify and annotate protein coding and RNA genes in these genomes, as well as study their genomic architecture. Since the genomes vary greatly in size, GC-content, nucleotide composition, and degree of repetitiveness, genome structure is often a factor in choosing the methodology utilised for annotation. In addition, the approach taken is dictated, to a greater or lesser extent, by the particular reasons for carrying out genome-wide analyses and the level of funding available for projects. Nevertheless, these projects have provided a plethora of material that will aid in understanding the biology and evolution of these parasites, as well as identifying new targets that can be used to design urgently required drug treatments for the diseases they cause.
Affiliation(s)
- E A Worthey
- Seattle Biomedical Research Institute, 307 Westlake Ave N., Seattle, WA 98109-2591, USA
166
Sampson EM, Johnson CLV, Bobik TA. Biochemical evidence that the pduS gene encodes a bifunctional cobalamin reductase. Microbiology (Reading) 2005; 151:1169-1177. [PMID: 15817784 DOI: 10.1099/mic.0.27755-0] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.0] [Indexed: 11/18/2022]
Abstract
Salmonella enterica degrades 1,2-propanediol (1,2-PD) by a pathway that requires coenzyme B(12) (adenosylcobalamin; AdoCbl). The genes specifically involved in 1,2-PD utilization (pdu) are found in a large contiguous cluster, the pdu locus. Earlier studies have indicated that this locus includes genes for the conversion of vitamin B(12) (cyanocobalamin; CNCbl) to AdoCbl and that the pduO gene encodes an ATP : cob(I)alamin adenosyltransferase which catalyses the terminal step of this process. Here, in vitro evidence is presented that the pduS gene encodes a bifunctional cobalamin reductase that catalyses two reductive steps needed for the conversion of CNCbl into AdoCbl. The PduS enzyme was produced in high levels in Escherichia coli. Enzyme assays showed that cell extracts from the PduS expression strain reduced cob(III)alamin (hydroxycobalamin) to cob(II)alamin at a rate of 91 nmol min(-1) mg(-1) and cob(II)alamin to cob(I)alamin at a rate of 7.8 nmol min(-1) mg(-1). In contrast, control extracts had only 9.9 nmol min(-1) mg(-1) cob(III)alamin reductase activity and no detectable cob(II)alamin reductase activity. Thus, these results indicated that the PduS enzyme is a bifunctional cobalamin reductase. Enzyme assays also showed that the PduS enzyme reduced cob(II)alamin to cob(I)alamin for conversion into AdoCbl by purified PduO adenosyltransferase. Moreover, studies in which iodoacetate was used as a chemical trap for cob(I)alamin indicated that the PduS and PduO enzymes physically interact and that cob(I)alamin is sequestered during the conversion of cob(II)alamin to AdoCbl by these two enzymes. This is likely to be important physiologically, since cob(I)alamin is extremely reactive and would need to be protected from unproductive by-reactions. Lastly, bioinformatic analyses showed that the PduS enzyme is unrelated in amino acid sequence to enzymes of known function currently present in GenBank. Hence, results indicate that the PduS enzyme represents a new class of cobalamin reductase.
Affiliation(s)
- Edith M Sampson
- Department of Pediatrics, University of Florida, Gainesville, FL 32611, USA
- Thomas A Bobik
- Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, IA 50011, USA
|
167
|
Prigent V, Thierry JC, Poch O, Plewniak F. DbW: automatic update of a functional family-specific multiple alignment. Bioinformatics 2004; 21:1437-42. [PMID: 15598832 DOI: 10.1093/bioinformatics/bti218] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.2] [Indexed: 11/13/2022]
Abstract
MOTIVATION: Recent advances in gene sequencing have provided complete sequence information for a number of genomes and as a result the amount of data in the sequence databases is growing at an exponential rate. We introduce here a new program, DbW, to automate the update of a functional family-specific multiple alignment that tries to include relevant sequences. The program is based on the use of different sources of information: sequences and annotations in databases. RESULTS: The advantages of DbW are demonstrated using the 20 families of aminoacyl-tRNA synthetases, where DbW detects a maximum of homologous sequences in the Swiss-Prot and SPTREMBL databases. The global specificity of DbW in this test is 98.4% (1.6% of the sequences included in the alignment did not belong to the family according to their function), and the global sensitivity of DbW is estimated to be 95.2%. Thus, DbW provides a reliable basis for the many applications that rely on accurate multiple alignments, e.g. functional residue identification, 2D/3D structure prediction or homology modeling. AVAILABILITY: The DbW software is available for download at ftp://ftp-igbmc.u-strasbg.fr/pub/DbW/DbW.tar and online at http://titus.u-strasbg.fr/DbW CONTACT: prigent@igbmc.u-strasbg.fr.
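The two figures quoted in this abstract can be reproduced from set counts. The sketch below is illustrative only and is not part of DbW: the function name and sequence identifiers are hypothetical, "specificity" follows the abstract's usage (the share of included sequences that belong to the family), and "sensitivity" is assumed to mean the share of true family members that were included.

```python
def inclusion_metrics(included, true_family):
    """Return (specificity, sensitivity) as fractions in [0, 1].

    specificity: share of included sequences that belong to the family
                 (the abstract's usage; often called precision elsewhere).
    sensitivity: share of true family members that were included
                 (assumed definition; the abstract does not spell it out).
    """
    included = set(included)
    true_family = set(true_family)
    if not included or not true_family:
        raise ValueError("both sets must be non-empty")
    correct = included & true_family
    return len(correct) / len(included), len(correct) / len(true_family)


# Hypothetical identifiers, for illustration only.
aligned = {"seq1", "seq2", "seq3", "seq4"}
family = {"seq1", "seq2", "seq3", "seq5", "seq6"}
specificity, sensitivity = inclusion_metrics(aligned, family)
print(f"specificity={specificity:.2f} sensitivity={sensitivity:.2f}")
```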
Affiliation(s)
- V Prigent
- Laboratoire de Biologie et Génomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire, (CNRS/INSERM/ULP) BP 10142, 67404 Illkirch, Cedex, France.
|
168
|
Hidden localization motifs: naturally occurring peroxisomal targeting signals in non-peroxisomal proteins. Genome Biol 2004; 5:R97. [PMID: 15575971 PMCID: PMC545800 DOI: 10.1186/gb-2004-5-12-r97] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.4] [Received: 05/25/2004] [Revised: 10/11/2004] [Accepted: 11/09/2004] [Indexed: 11/13/2022]
Abstract
Functional but silent peroxisomal targeting signals have been found in non-peroxisomal proteins. This discovery has important implications for sequence-based signal prediction and for evolution. Background: Can sequence segments coding for subcellular targeting or for posttranslational modifications occur in proteins that are not substrates in either of these processes? Although considerable effort has been invested in achieving low false-positive prediction rates, even accurate sequence-analysis tools for the recognition of these motifs generate a small but noticeable number of protein hits that lack the appropriate biological context but cannot be rationalized as false positives. Results: We show that the carboxyl termini of a set of definitely non-peroxisomal proteins with predicted peroxisomal targeting signals interact with the peroxisomal matrix protein receptor peroxin 5 (PEX5) in a yeast two-hybrid test. Moreover, we show that examples of these proteins (chicken lysozyme, human tyrosinase and the yeast mitochondrial ribosomal protein L2, encoded by MRP7) are imported into peroxisomes in vivo if their original sorting signals are disguised. We also show that even prokaryotic proteins can contain peroxisomal targeting sequences. Conclusions: Thus, functional localization signals can evolve in unrelated protein sequences as a result of neutral mutations, and subcellular targeting is hierarchically organized, with signal accessibility playing a decisive role. The occurrence of silent functional motifs in unrelated proteins is important for the development of sequence-based function prediction tools and the interpretation of their results. Silent functional signals have the potential to acquire importance in future evolutionary scenarios and in pathological conditions.
|
169
|
Yu X, Lin J, Shi T, Li Y. A novel domain-based method for predicting the functional classes of proteins. ACTA ACUST UNITED AC 2004. [DOI: 10.1007/bf03183426] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/30/2022]
|
170
|
Galperin MY, Koonin EV. 'Conserved hypothetical' proteins: prioritization of targets for experimental study. Nucleic Acids Res 2004; 32:5452-63. [PMID: 15479782 PMCID: PMC524295 DOI: 10.1093/nar/gkh885] [Citation(s) in RCA: 309] [Impact Index Per Article: 14.7] [Indexed: 11/14/2022]
Abstract
Comparative genomics shows that a substantial fraction of the genes in sequenced genomes encodes 'conserved hypothetical' proteins, i.e. those that are found in organisms from several phylogenetic lineages but have not been functionally characterized. Here, we briefly discuss recent progress in functional characterization of prokaryotic 'conserved hypothetical' proteins and the possible criteria for prioritizing targets for experimental study. Based on these criteria, the chief one being wide phyletic spread, we offer two 'top 10' lists of highly attractive targets. The first list consists of proteins for which biochemical activity could be predicted with reasonable confidence but the biological function was predicted only in general terms, if at all ('known unknowns'). The second list includes proteins for which there is no prediction of biochemical activity, even if, for some, general biological clues exist ('unknown unknowns'). The experimental characterization of these and other 'conserved hypothetical' proteins is expected to reveal new, crucial aspects of microbial biology and could also lead to better functional prediction for medically relevant human homologs.
Affiliation(s)
- Michael Y Galperin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|
171
|
Sun YV, Boverhof DR, Burgoon LD, Fielden MR, Zacharewski TR. Comparative analysis of dioxin response elements in human, mouse and rat genomic sequences. Nucleic Acids Res 2004; 32:4512-23. [PMID: 15328365 PMCID: PMC516056 DOI: 10.1093/nar/gkh782] [Citation(s) in RCA: 167] [Impact Index Per Article: 8.0] [Indexed: 11/13/2022]
Abstract
Comparative approaches were used to identify human, mouse and rat dioxin response elements (DREs) in genomic sequences unambiguously assigned to a nucleotide RefSeq accession number. A total of 13 bona fide DREs, all including the substitution intolerant core sequence (GCGTG) and adjacent variable sequences, were used to establish a position weight matrix and a matrix similarity (MS) score threshold to rank identified DREs. DREs with MS scores above the threshold were disproportionately distributed in close proximity to the transcription start site in all three species. Gene expression assays in hepatic mouse tissue confirmed the responsiveness of 192 genes possessing a putative DRE. Previously identified functional DREs in well-characterized AhR-regulated genes including Cyp1a1 and Cyp1b1 were corroborated. Putative DREs were identified in 48 out of 2437 human-mouse-rat orthologous genes between -1500 and the transcriptional start site, of which 19 of these genes possessed positionally conserved DREs as determined by multiple sequence alignment. Seven of these nineteen genes exhibited 2,3,7,8-tetrachlorodibenzo-p-dioxin-mediated regulation, although there were significant discrepancies between in vivo and in vitro results. Interestingly, of the mouse-rat orthologous genes with a DRE between -1500 and +1500, only 37% had an equivalent human ortholog. These results suggest that AhR-mediated gene expression may not be well conserved across species, which could have significant implications in human risk assessment.
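As a rough illustration of the first step described here, the sketch below locates the substitution-intolerant core sequence (GCGTG) on both strands of a DNA string. It is not the authors' pipeline: it performs only an exact core match and does not implement the position weight matrix or the matrix similarity score, and the function names and example sequence are hypothetical.

```python
CORE = "GCGTG"  # substitution-intolerant DRE core named in the abstract
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_dre_cores(seq):
    """Return sorted (0-based start, strand) pairs for every exact core match.

    A '-' hit means the core occurs on the reverse strand, i.e. the forward
    strand contains the reverse complement of the core at that position.
    """
    seq = seq.upper()
    hits = []
    for strand, motif in (("+", CORE), ("-", reverse_complement(CORE))):
        start = seq.find(motif)
        while start != -1:
            hits.append((start, strand))
            start = seq.find(motif, start + 1)  # allow overlapping matches
    return sorted(hits)


# Hypothetical promoter fragment, for illustration only.
print(find_dre_cores("AAGCGTGTTCACGCAA"))
```

A full reimplementation would score the flanking variable positions against a matrix built from the bona fide DREs; the exact-match scan only yields candidate positions for such scoring.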
Affiliation(s)
- Y V Sun
- Department of Biochemistry and Molecular Biology, National Food Safety and Toxicology Center, Michigan State University, East Lansing, MI 48824, USA
|
172
|
Abstract
Background: Joining a model for the molecular evolution of a protein family to the paleontological and geological records (geobiology), and then to the chemical structures of substrates, products, and protein folds, is emerging as a broad strategy for generating hypotheses concerning function in a post-genomic world. This strategy expands systems biology to a planetary context, necessary for a notion of fitness to underlie (as it must) any discussion of function within a biomolecular system. Results: Here, we report an example of such an expansion, where tools from planetary biology were used to analyze three genes from the pig Sus scrofa that encode cytochrome P450 aromatases, enzymes that convert androgens into estrogens. The evolutionary history of the vertebrate aromatase gene family was reconstructed. Transition redundant exchange silent substitution metrics were used to interpolate dates for the divergence of family members, the paleontological record was consulted to identify changes in physiology that correlated in time with the change in molecular behavior, and new aromatase sequences from peccary were obtained. Metrics that detect changing function in proteins were then applied, including KA/KS values and those that exploit structural biology. These identified specific amino acid replacements that were associated with changing substrate and product specificity during the time of presumed adaptive change. The combined analysis suggests that aromatase paralogs arose in pigs as a result of selection for Suoidea with larger litters than their ancestors, and permitted the Suoidea to survive the global climatic trauma that began in the Eocene. Conclusions: This combination of bioinformatics analysis, molecular evolution, paleontology, cladistics, global climatology, structural biology, and organic chemistry serves as a paradigm in planetary biology. As the geological, paleontological, and genomic records improve, this approach should become widely useful to make systems biology statements about high-level function for biomolecular systems.
|
173
|
Yu H, Luscombe NM, Lu HX, Zhu X, Xia Y, Han JDJ, Bertin N, Chung S, Vidal M, Gerstein M. Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. Genome Res 2004; 14:1107-18. [PMID: 15173116 PMCID: PMC419789 DOI: 10.1101/gr.1774904] [Citation(s) in RCA: 402] [Impact Index Per Article: 19.1] [Indexed: 11/24/2022]
Abstract
Proteins function mainly through interactions, especially with DNA and other proteins. While some large-scale interaction networks are now available for a number of model organisms, their experimental generation remains difficult. Consequently, interolog mapping (the transfer of interaction annotation from one organism to another using comparative genomics) is of significant value. Here we quantitatively assess the degree to which interologs can be reliably transferred between species as a function of the sequence similarity of the corresponding interacting proteins. Using interaction information from Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, and Helicobacter pylori, we find that protein-protein interactions can be transferred when a pair of proteins has a joint sequence identity >80% or a joint E-value <10⁻⁷⁰. (These "joint" quantities are the geometric means of the identities or E-values for the two pairs of interacting proteins.) We generalize our interolog analysis to protein-DNA binding, finding such interactions are conserved at specific thresholds between 30% and 60% sequence identity depending on the protein family. Furthermore, we introduce the concept of a "regulog", a conserved regulatory relationship between proteins across different species. We map interologs and regulogs from yeast to a number of genomes with limited experimental annotation (e.g., Arabidopsis thaliana) and make these available through an online database at http://interolog.gersteinlab.org. Specifically, we are able to transfer approximately 90,000 potential protein-protein interactions to the worm. We test a number of these in two-hybrid experiments and are able to verify 45 overlaps, which we show to be statistically significant.
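The "joint" quantities defined in this abstract are geometric means, so the transfer rule can be sketched in a few lines. The code below is an illustration rather than the authors' implementation: function names are hypothetical, identities are taken as percentages, E-values are assumed to be strictly positive, and the geometric mean of the E-values is computed in log10 space to avoid floating-point underflow.

```python
import math


def joint_identity(identity_a, identity_b):
    """Geometric mean of the two percent identities (one per ortholog pair)."""
    return math.sqrt(identity_a * identity_b)


def joint_log10_evalue(evalue_a, evalue_b):
    """log10 of the geometric mean of two strictly positive E-values."""
    return (math.log10(evalue_a) + math.log10(evalue_b)) / 2


def is_transferable(identity_a, identity_b, evalue_a, evalue_b):
    """Apply the abstract's thresholds: joint identity >80% or joint E-value <1e-70."""
    return (
        joint_identity(identity_a, identity_b) > 80.0
        or joint_log10_evalue(evalue_a, evalue_b) < -70.0
    )


# Hypothetical ortholog pairs, for illustration only.
print(is_transferable(90.0, 72.0, 1e-5, 1e-5))
print(is_transferable(50.0, 50.0, 1e-80, 1e-64))
print(is_transferable(64.0, 100.0, 1e-10, 1e-10))
```

Because the geometric mean is pulled down by the weaker of the two values, a single highly conserved partner cannot compensate for a poorly conserved one as easily as it would under an arithmetic mean.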
Affiliation(s)
- Haiyuan Yu
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
|
174
|
Abstract
Classification of proteins into families is one of the main goals of functional analysis. Proteins are usually assigned to a family on the basis of the presence of family-specific patterns, domains, or structural elements. Whereas proteins belonging to the same family are generally similar to each other, the extent of similarity varies widely across families. Some families are characterized by short, well-defined motifs, whereas others contain longer, less-specific motifs. We present a simple method for visualizing such differences. We applied our method to the Arabidopsis thaliana families listed at The Arabidopsis Information Resource (TAIR) Web site and for 76% of the nontrivial families (families with more than one member), our method identifies simple similarity measures that are necessary and sufficient to cluster members of the family together. Our visualization method can be used as part of an annotation pipeline to identify potentially incorrectly defined families. We also describe how our method can be extended to identify novel families and to assign unclassified proteins into known families.
Affiliation(s)
- Vamsi Veeramachaneni
- Department of Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
|
175
|
Lau AY, Chasman DI. Functional classification of proteins and protein variants. Proc Natl Acad Sci U S A 2004; 101:6576-81. [PMID: 15087495 PMCID: PMC404087 DOI: 10.1073/pnas.0305043101] [Citation(s) in RCA: 19] [Impact Index Per Article: 0.9] [Indexed: 11/18/2022]
Abstract
To help characterize the diversity in biological function of proteins emerging from the analysis of whole genomes, we present an operational definition of biological function that provides an explicit link between the functional classification of proteins and the effects of genetic variation or mutation on protein function. Using phylogenetic information, we establish definite criteria for functional relatedness among proteins and a companion procedure for predicting deleterious alleles or mutations. Applied to the functional classification of sequences similar to 13 human tumor suppressor proteins, our methods predict there are functional properties unique to mammals for three of them, BRCA1, BRCA2, and WT1. We examine protein variants caused by nonsynonymous single-nucleotide polymorphisms in a set of clinically important genes and estimate the magnitude of a disproportionate propensity for disruption of function among the nonsynonymous single-nucleotide polymorphisms that are maintained at low frequency in the human population.
Affiliation(s)
- Albert Y Lau
- Variagenics, Incorporated, 60 Hampshire Street, Cambridge, MA 02139, USA.
|
176
|
Abstract
One approach for facilitating protein function prediction is to classify proteins into functional families. Recent studies on the classification of G-protein coupled receptors and other proteins suggest that a statistical learning method, support vector machines (SVM), may be potentially useful for protein classification into functional families. In this work, SVM is applied and tested on the classification of enzymes into functional families defined by the Enzyme Nomenclature Committee of IUBMB. The SVM classification system for each family is trained from representative enzymes of that family and seed proteins of Pfam curated protein families. The classification accuracy for enzymes from 46 families and for non-enzymes is in the range of 50.0% to 95.7% and 79.0% to 100% respectively. The corresponding Matthews correlation coefficient is in the range of 54.1% to 96.1%. Moreover, 80.3% of the 8,291 correctly classified enzymes are uniquely classified into a specific enzyme family by using a scoring function, indicating that SVM may have a certain level of unique prediction capability. Testing results also suggest that SVM in some cases is capable of classification of distantly related enzymes and homologous enzymes of different functions. Effort is being made to use a more comprehensive set of enzymes as training sets and to incorporate multi-class SVM classification systems to further enhance the unique prediction accuracy. Our results suggest the potential of SVM for enzyme family classification and for facilitating protein function prediction. Our software is accessible at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi.
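The Matthews correlation coefficient reported here is a standard summary of a binary confusion matrix. The sketch below shows the usual formula and is not taken from the paper's software; the function name is hypothetical, and returning 0.0 when the denominator is zero is a common convention adopted here as an assumption.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    Returns 0.0 if any marginal total is zero (convention, assumed here).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for one enzyme family versus all other proteins.
print(round(matthews_corrcoef(tp=90, tn=80, fp=20, fn=10), 3))
```

The abstract quotes the coefficient as a percentage; multiplying the value returned above by 100 gives the same scale.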
Affiliation(s)
- C Z Cai
- Department of Applied Physics, Chongqing University, Chongqing, Peoples Republic of China
|
177
|
Jim K, Parmar K, Singh M, Tavazoie S. A cross-genomic approach for systematic mapping of phenotypic traits to genes. Genome Res 2004; 14:109-15. [PMID: 14707173 PMCID: PMC314287 DOI: 10.1101/gr.1586704] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.5] [Indexed: 11/24/2022]
Abstract
We present a computational method for de novo identification of gene function using only cross-organismal distribution of phenotypic traits. Our approach assumes that proteins necessary for a set of phenotypic traits are preferentially conserved among organisms that share those traits. This method combines organism-to-phenotype associations, along with phylogenetic profiles, to identify proteins that have high propensities for the query phenotype; it does not require the use of any functional annotations for any proteins. We first present the statistical foundations of this approach and then apply it to a range of phenotypes to assess how its performance depends on the frequency and specificity of the phenotype. Our analysis shows that statistically significant associations are possible as long as the phenotype is neither extremely rare nor extremely common; results on the flagella, pili, thermophily, and respiratory tract tropism phenotypes suggest that reliable associations can be inferred when the phenotype does not arise from many alternate mechanisms.
Affiliation(s)
- Kam Jim
- Department of Computer Science, Princeton University, Princeton, New Jersey, 08544 USA
|
178
|
von Mering C, Zdobnov EM, Tsoka S, Ciccarelli FD, Pereira-Leal JB, Ouzounis CA, Bork P. Genome evolution reveals biochemical networks and functional modules. Proc Natl Acad Sci U S A 2003; 100:15428-33. [PMID: 14673105 PMCID: PMC307584 DOI: 10.1073/pnas.2136809100] [Citation(s) in RCA: 120] [Impact Index Per Article: 5.5] [Indexed: 12/25/2022]
Abstract
The analysis of completely sequenced genomes uncovers an astonishing variability between species in terms of gene content and order. During genome history, the genes are frequently rearranged, duplicated, lost, or transferred horizontally between genomes. These events appear to be stochastic, yet they are under selective constraints resulting from the functional interactions between genes. These genomic constraints form the basis for a variety of techniques that employ systematic genome comparisons to predict functional associations among genes. The most powerful techniques to date are based on conserved gene neighborhood, gene fusion events, and common phylogenetic distributions of gene families. Here we show that these techniques, if integrated quantitatively and applied to a sufficiently large number of genomes, have reached a resolution which allows the characterization of function at a higher level than that of the individual gene: global modularity becomes detectable in a functional protein network. In Escherichia coli, the predicted modules can be benchmarked by comparison to known metabolic pathways. We found as many as 74% of the known metabolic enzymes clustering together in modules, with an average pathway specificity of at least 84%. The modules extend beyond metabolism, and have led to hundreds of reliable functional predictions both at the protein and pathway level. The results indicate that modularity in protein networks is intrinsically encoded in present-day genomes.
Affiliation(s)
- Christian von Mering
- European Molecular Biology Laboratory, Meyerhofstrasse 1, D-69117 Heidelberg, Germany
|
179
|
Abstract
Enzyme function conservation has been used to derive the threshold of sequence identity necessary to transfer function from a protein of known function to an unknown protein. Using pairwise sequence comparison, several studies suggested that when the sequence identity is above 40%, enzyme function is well conserved. In contrast, Rost argued that because of database bias, the results from such simple pairwise comparisons might be misleading. Thus, by grouping enzyme sequences into families based on sequence similarity and selecting representative sequences for comparison, he showed that enzyme function starts to diverge quickly when the sequence identity is below 70%. Here, we employ a strategy similar to Rost's to reduce the database bias; however, we classify enzyme families based not only on sequence similarity, but also on functional similarity, i.e. sequences in each family must have the same four digits or the same first three digits of the enzyme commission (EC) number. Furthermore, instead of selecting representative sequences for comparison, we calculate the function conservation of each enzyme family and then average the degree of enzyme function conservation across all enzyme families. Our analysis suggests that for functional transferability, 40% sequence identity can still be used as a confident threshold to transfer the first three digits of an EC number; however, to transfer all four digits of an EC number, above 60% sequence identity is needed to have at least 90% accuracy. Moreover, when PSI-BLAST is used, the magnitude of the E-value is found to be weakly correlated with the extent of enzyme function conservation in the third iteration of PSI-BLAST. As a result, functional annotation based on the E-values from PSI-BLAST should be used with caution. We also show that by employing an enzyme family-specific sequence identity threshold above which 100% functional conservation is required, functional inference of unknown sequences can be accurately accomplished. However, this comes at a cost: those true positive sequences below this threshold cannot be uniquely identified.
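The two global thresholds quoted in this abstract translate into a simple annotation-transfer rule. The sketch below is a hypothetical illustration, not the authors' method: the function name is invented, the handling of the exact boundary values (40% inclusive, 60% exclusive) is an assumption, and the family-specific thresholds mentioned in the final sentences are not modeled.

```python
from typing import Optional


def transferable_ec(ec_number: str, percent_identity: float) -> Optional[str]:
    """Return the part of an EC number that the abstract's thresholds allow
    to be transferred at a given sequence identity, or None if neither applies.

    > 60% identity  -> all four digits
    >= 40% identity -> first three digits (fourth written as '-')
    """
    digits = ec_number.split(".")
    if len(digits) != 4:
        raise ValueError("expected an EC number with four fields, e.g. 1.1.1.1")
    if percent_identity > 60.0:
        return ec_number
    if percent_identity >= 40.0:
        return ".".join(digits[:3]) + ".-"
    return None


# EC 1.1.1.1 (alcohol dehydrogenase) with hypothetical identities.
for identity in (75.0, 45.0, 30.0):
    print(identity, transferable_ec("1.1.1.1", identity))
```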
Affiliation(s)
- Weidong Tian
- Center of Excellence in Bioinformatics, University at Buffalo, The State University of New York, 901 Washington Street, Buffalo, NY 14203, USA
|
180
|
Nair R, Rost B. Better prediction of sub-cellular localization by combining evolutionary and structural information. Proteins 2003; 53:917-30. [PMID: 14635133 DOI: 10.1002/prot.10507] [Citation(s) in RCA: 62] [Impact Index Per Article: 2.8] [Indexed: 11/09/2022]
Abstract
The native sub-cellular compartment of a protein is one aspect of its function. Thus, predicting localization is an important step toward predicting function. Short zip code-like sequence fragments regulate some of the shuttling between compartments. Cataloguing and predicting such motifs is the most accurate means of determining localization in silico. However, only a few motifs are currently known, and not all the trafficking appears regulated in this way. The amino acid composition of a protein correlates with its localization. All general prediction methods employed this observation. Here, we explored the evolutionary information contained in multiple alignments and aspects of protein structure to predict localization in the absence of homology and targeting motifs. Our final system combined statistical rules and a variety of neural networks to achieve an overall four-state accuracy above 65%, a significant improvement over systems using only composition. The system was at its best for extra-cellular and nuclear proteins; it was significantly less accurate than TargetP for mitochondrial proteins. Interestingly, all methods that were developed on SWISS-PROT sequences failed grossly when fed with sequences from proteins of known structures taken from PDB. We therefore developed two separate systems: one for proteins of known structure and one for proteins of unknown structure. Finally, we applied the PDB-based system along with homology-based inferences and automatic text analysis to annotate all eukaryotic proteins in the PDB (http://cubic.bioc.columbia.edu/db/LOC3D). We imagine that this pilot method, certainly in combination with similar tools, may be valuable for target selection in structural genomics.
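The composition feature that this abstract describes as the basis of general prediction methods is a 20-element frequency vector. The following is a minimal sketch of how such a vector might be computed; it is not the authors' feature pipeline, and the function name and example sequence are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return {residue: fraction} over the 20 standard amino acids.

    Non-standard symbols (e.g. X, B, Z, gaps) are ignored, which is an
    assumption of this sketch.
    """
    counts = Counter(r for r in sequence.upper() if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return {residue: counts[residue] / total for residue in AMINO_ACIDS}


# Hypothetical peptide, for illustration only.
composition = amino_acid_composition("MKKLLPTAAAGLLLLAAQPAMA")
print({aa: round(f, 2) for aa, f in composition.items() if f > 0})
```

A vector of this kind can be passed to any classifier; the abstract's point is that adding evolutionary and structural information improves on composition alone.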
Affiliation(s)
- Rajesh Nair
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA.
|
181
|
Abascal F, Valencia A. Automatic annotation of protein function based on family identification. Proteins 2003; 53:683-92. [PMID: 14579359 DOI: 10.1002/prot.10449] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.1] [Indexed: 11/09/2022]
Abstract
Although genomes are being sequenced at an impressive rate, the information generated tells us little about protein function, which is slow to characterize by traditional methods. Automatic protein function annotation based on computational methods has alleviated this imbalance. The most powerful current approach for inferring the function of new proteins is by studying the annotations of their homologues, since their common origin is assumed to be reflected in their structure and function. Unfortunately, as proteins evolve they acquire new functions, so annotation based on homology must be carried out in the context of orthologues or subfamilies. Evolution adds new complications through domain shuffling: homology (or orthology) frequently corresponds to domains rather than complete proteins. Moreover, the function of a protein may be seen as the result of combining the functions of its domains. Additionally, automatic annotation has to deal with problems related to the annotations in the databases: errors (which are likely to be propagated), inconsistencies, or different degrees of function specification. We describe a method that addresses these difficulties for the annotation of protein function. Sequence relationships are detected and measured to obtain a map of the sequence space, which is searched for differentiated groups of proteins (similar to islands on the map), which are expected to have a common function and correspond to groups of orthologues or subfamilies. This mapmaking is done by applying a clustering algorithm based on Normalized cuts in graphs. The domain problem is addressed in a simple way: pairwise local alignments are analyzed to determine the extent to which they cover the entire sequence lengths of the two proteins. This analysis determines both what homologues are preferred for functional inheritance and the level of confidence of the annotation. To alleviate the problems associated with database annotations, the information on all the homologues that are grouped together with the query protein is taken into account to select the most representative functional descriptors. This method has been applied for the annotation of the genome of Buchnera aphidicola (specific host Baizongia pistaciae). Human inspection of the annotations allowed an estimation of accuracy of 94%; the different kinds of error that may appear when using this approach are described. Results can be accessed at http://www.pdg.cnb.uam.es/funcut.html. The programs are available upon request, although installation in other systems may be complicated.
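The coverage analysis that this abstract uses to address the domain problem can be illustrated as follows. This is a sketch under stated assumptions, not the published program: coordinates are taken to be 1-based and inclusive (as in typical local-alignment reports), the function names are hypothetical, and the 0.8 cutoff is an arbitrary illustrative value that does not come from the paper.

```python
def alignment_coverage(q_start, q_end, q_len, s_start, s_end, s_len):
    """Fractions of the query and subject sequences spanned by one local alignment.

    Coordinates are assumed to be 1-based and inclusive.
    """
    if q_len <= 0 or s_len <= 0:
        raise ValueError("sequence lengths must be positive")
    q_cov = (q_end - q_start + 1) / q_len
    s_cov = (s_end - s_start + 1) / s_len
    return q_cov, s_cov


def match_scope(q_cov, s_cov, cutoff=0.8):
    """Label a hit 'full-length' only if both proteins are largely covered.

    The cutoff is a hypothetical value chosen for illustration.
    """
    return "full-length" if min(q_cov, s_cov) >= cutoff else "domain-level"


# Hypothetical hit: the whole 100-residue query aligns to half of a 200-residue subject.
q_cov, s_cov = alignment_coverage(1, 100, 100, 11, 110, 200)
print(q_cov, s_cov, match_scope(q_cov, s_cov))
```

Under this scheme a domain-level hit would be a weaker basis for inheriting a whole-protein annotation than a full-length one, which mirrors the abstract's statement that coverage sets both the preferred homologues and the confidence level.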
Affiliation(s)
- Federico Abascal
- Protein Design Group, National Centre for Biotechnology, CNB-CSIC, Cantoblanco, Madrid, Spain.
|
182
|
Kaplan N, Vaaknin A, Linial M. PANDORA: keyword-based analysis of protein sets by integration of annotation sources. Nucleic Acids Res 2003; 31:5617-26. [PMID: 14500825 PMCID: PMC206469 DOI: 10.1093/nar/gkg769] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.1] [Indexed: 11/12/2022]
Abstract
Recent advances in high-throughput methods and the application of computational tools for automatic classification of proteins have made it possible to carry out large-scale proteomic analyses. Biological analysis and interpretation of sets of proteins is a time-consuming undertaking carried out manually by experts. We have developed PANDORA (Protein ANnotation Diagram ORiented Analysis), a web-based tool that provides an automatic representation of the biological knowledge associated with any set of proteins. PANDORA uses a unique approach of keyword-based graphical analysis that focuses on detecting subsets of proteins that share unique biological properties and the intersections of such sets. PANDORA currently supports SwissProt keywords, NCBI Taxonomy, InterPro entries and the hierarchical classification terms from ENZYME, SCOP and GO databases. The integrated study of several annotation sources simultaneously allows a representation of biological relations of structure, function, cellular location, taxonomy, domains and motifs. PANDORA is also integrated into the ProtoNet system, thus allowing testing thousands of automatically generated clusters. We illustrate how PANDORA enhances the biological understanding of large, non-uniform sets of proteins originating from experimental and computational sources, without the need for prior biological knowledge on individual proteins.
Affiliation(s)
- Noam Kaplan
- Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University, Jerusalem 91904, Israel
|
183
|
Cai CZ, Han LY, Ji ZL, Chen X, Chen YZ. SVM-Prot: Web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res 2003; 31:3692-7. [PMID: 12824396 PMCID: PMC169006 DOI: 10.1093/nar/gkg600] [Citation(s) in RCA: 366] [Impact Index Per Article: 16.6] [Indexed: 11/13/2022]
Abstract
Prediction of protein function is of significance in studying biological processes. One approach for function prediction is to classify a protein into a functional family. Support vector machine (SVM) is a useful method for such classification, which may involve proteins with diverse sequence distribution. We have developed web-based software, SVMProt, for SVM classification of a protein into a functional family from its primary sequence. The SVMProt classification system is trained from representative proteins of a number of functional families and seed proteins of Pfam curated protein families. It currently covers 54 functional families and additional families will be added in the near future. The computed accuracy for protein family classification is found to be in the range of 69.1-99.6%. SVMProt shows a certain degree of capability for the classification of distantly related proteins and homologous proteins of different function and thus may be used as a protein function prediction tool that complements sequence alignment methods. SVMProt can be accessed at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi.
Affiliation(s)
- C Z Cai
- Department of Computational Science, National University of Singapore, Blk SOC1, Level 7, 3 Science Drive 2, Singapore 117543, Singapore
|
184
|
Nair R, Rost B. LOC3D: annotate sub-cellular localization for protein structures. Nucleic Acids Res 2003; 31:3337-40. [PMID: 12824321 PMCID: PMC168921 DOI: 10.1093/nar/gkg514] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.4] [Indexed: 11/14/2022]
Abstract
LOC3D (http://cubic.bioc.columbia.edu/db/LOC3d/) is both a weekly-updated database and a web server for predictions of sub-cellular localization for eukaryotic proteins of known three-dimensional (3D) structure. Localization is predicted using four different methods: (i) PredictNLS, prediction of nuclear proteins through nuclear localization signals; (ii) LOChom, inferring localization through sequence homology; (iii) LOCkey, inferring localization through automatic text analysis of SWISS-PROT keywords; and (iv) LOC3Dini, ab initio prediction through a system of neural networks and support vector machines. The final prediction is based on the method that predicts localization with the highest confidence. The LOC3D database currently contains predictions for >8700 eukaryotic protein chains taken from the Protein Data Bank (PDB). The web server can be used to predict sub-cellular localization for proteins for which only a predicted structure is available from threading servers. This makes the resource of particular interest to structural genomics initiatives.
Affiliation(s)
- Rajesh Nair
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA.
|
185
|
Eisenhaber F, Eisenhaber B, Kubina W, Maurer-Stroh S, Neuberger G, Schneider G, Wildpaner M. Prediction of lipid posttranslational modifications and localization signals from protein sequences: big-Pi, NMT and PTS1. Nucleic Acids Res 2003; 31:3631-4. [PMID: 12824382 PMCID: PMC168944 DOI: 10.1093/nar/gkg537] [Citation(s) in RCA: 69] [Impact Index Per Article: 3.1] [Indexed: 11/13/2022]
Abstract
Many posttranslational modifications (N-myristoylation or glycosylphosphatidylinositol (GPI) lipid anchoring) and localization signals (the peroxisomal targeting signal PTS1) are encoded in short, partly compositionally biased regions at the N- or C-terminus of the protein sequence. These sequence signals are not well defined in terms of amino acid type preferences but they have significant interpositional correlations. Although the number of verified protein examples is small, the quantification of several physical conditions necessary for productive protein binding with the enzyme complexes executing the respective transformations can lead to predictors that recognize the signals from the amino acid sequence of queries alone. Taxon-specific prediction functions are required due to the divergent evolution of the active complexes. The big-Pi tool for the prediction of the C-terminal signal for GPI lipid anchor attachment is available for metazoan, protozoan and plant sequences. The myristoyl transferase (NMT) predictor recognizes glycine N-myristoylation sites (at the N-terminus and for fragments after processing) of higher eukaryotes (including their viruses) and fungi. The PTS1 signal predictor finds proteins with a C-terminus appropriate for peroxisomal import (for metazoa and fungi). Guidelines for application of the three WWW-based predictors (http://mendel.imp.univie.ac.at/) and for the interpretation of their output are described.
Affiliation(s)
- Frank Eisenhaber
- Research Institute of Molecular Pathology, Dr. Bohr-Gasse 7, A-1030 Vienna, Republic of Austria.
|
186
|
Jackson DB, Minch E, Munro RE. Bioinformatics. EXS 2003:31-69. [PMID: 12613171 DOI: 10.1007/978-3-0348-7997-2_3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/01/2023]
|
187
|
Abstract
In the genomics era, the interactions between proteins are at the center of attention. Genomic-context methods used to predict these interactions have been put on a quantitative basis, revealing that they are at least on an equal footing with genomics experimental data. A survey of experimentally confirmed predictions proves the applicability of these methods, and new concepts to predict protein interactions in eukaryotes have been described. Finally, the interaction networks that can be obtained by combining the predicted pair-wise interactions have enough internal structure to detect higher levels of organization, such as 'functional modules'.
Affiliation(s)
- Martijn A Huynen
- Nijmegen Center for Molecular Life Sciences, Center for Molecular and Biomolecular Informatics, Toernooiveld 1, 6525 ED, Nijmegen, The Netherlands.
|
188
|
Brunk CF, Lee LC, Tran AB, Li J. Complete sequence of the mitochondrial genome of Tetrahymena thermophila and comparative methods for identifying highly divergent genes. Nucleic Acids Res 2003; 31:1673-82. [PMID: 12626709 PMCID: PMC152872 DOI: 10.1093/nar/gkg270] [Citation(s) in RCA: 56] [Impact Index Per Article: 2.5] [Received: 11/24/2002] [Revised: 01/16/2003] [Accepted: 01/16/2003] [Indexed: 11/13/2022]
Abstract
The complete sequence of the mitochondrial genome of Tetrahymena thermophila has been determined and compared with the mitochondrial genome of Tetrahymena pyriformis. The sequence similarity clearly indicates homology of the entire T. thermophila and T. pyriformis mitochondrial genomes. The T. thermophila genome is very compact: most of the intergenic regions are short (only three are longer than 63 bp) and together they comprise only 3.8% of the genome. The nad9 gene is tandemly duplicated in T. thermophila. Long terminal inverted repeats and the nad9 genes are undergoing concerted evolution. There are 55 putative genes: three ribosomal RNA genes, eight transfer RNA genes, 22 proteins with putatively assigned functions and 22 additional open reading frames of unknown function. In order to extend indications of homology beyond amino acid sequence similarity we have examined a number of physico-chemical properties of the mitochondrial proteins, including theoretical pI, molecular weight and particularly the predicted transmembrane spanning regions. This approach has allowed us to identify homologs to ymf58 (nad4L), ymf62 (nad6) and ymf60 (rpl6).
Affiliation(s)
- Clifford F Brunk
- Department of Organismic Biology, Ecology and Evolution, University of California-Los Angeles, Los Angeles, CA 90095-1606, USA.
|
189
|
Boneca IG, de Reuse H, Epinat JC, Pupin M, Labigne A, Moszer I. A revised annotation and comparative analysis of Helicobacter pylori genomes. Nucleic Acids Res 2003; 31:1704-14. [PMID: 12626712 PMCID: PMC152854 DOI: 10.1093/nar/gkg250] [Citation(s) in RCA: 62] [Impact Index Per Article: 2.8] [Indexed: 11/14/2022]
Abstract
Huge amounts of genomic information are currently being generated. Therefore, biologists require structured, exhaustive and comparative databases. The PyloriGene database (http://genolist.pasteur.fr/PyloriGene) was developed to respond to these needs, by integrating and connecting the information generated during the sequencing of two distinct strains of Helicobacter pylori. This led to the need for a general annotation consensus, as the physical and functional annotations of the two strains differed significantly in some cases. A revised functional classification system was created to accommodate the existing data and to make it possible to classify coding sequences (CDS) into several functional categories to harmonize CDS classification. The annotation of the two complete genomes was revised in the light of new data, allowing us to reduce the percentage of hypothetical proteins from approximately 40 to 33%. This resulted in the reassignment of functions for 108 CDS (approximately 7% of all CDS). Interestingly, the functions of only approximately 13% of CDS (222 out of 1658 CDS) were annotated as a result of work done directly on H.pylori genes. Finally, comparison of the two published genomes revealed a significant amount of size variation between corresponding (orthologous) CDS. Most of these size variations were due to natural polymorphisms, although other sources of variation were identified, such as pseudogenes, new genes potentially regulated by slipped-strand mispairing mechanism, or frame-shifts. 113 of these differences were due to different start codon assignments, a common problem when constructing physical annotations.
Affiliation(s)
- Ivo G Boneca
- Unité de Pathogénie Bactérienne des Muqueuses, Institut Pasteur, Paris, France.
|
190
|
Abstract
Mapping, and ultimately preventing, the dissemination of infectious agents is an important topic in public health. Newly developed molecular-microbiological methods have contributed significantly to recent advances in the efficient tracking of the nosocomial and environmental spread of microbial pathogens. Not only has the application of novel technologies led to improved understanding of microbial epidemiology, but the concepts of population structure and dynamics of many of the medically significant microorganisms have advanced significantly also. Currently, genetic identification of microbes is also within the reach of clinical microbiology laboratory professionals including those without specialized technology research interests. This review summarizes the possibilities for high-throughput molecular-microbiological typing in adequately equipped medical microbiology laboratories from both clinical and fundamental research perspectives. First, the development and application of methods for large-scale comparative typing of serially isolated microbial strains are discussed. The outcome of studies employing these methods allows for long-term epidemiologic surveillance of infectious diseases. Second, recent methods enable an almost nucleotide-by-nucleotide genetic comparison of smaller numbers of strains, thereby facilitating the identification of the genetic basis of, for instance, medically relevant microbiological traits. Whereas the first approach provides insights into the dynamic spread of infectious agents, the second provides insights into intragenomic dynamics and genetic functionality. The current state of technology is summarized, and future perspectives are sketched.
Affiliation(s)
- A van Belkum
- Erasmus MC, Department of Medical Microbiology & Infectious Diseases, Rotterdam, The Netherlands.
|
191
|
Chen YZ, Ung CY. Computer automated prediction of potential therapeutic and toxicity protein targets of bioactive compounds from Chinese medicinal plants. Am J Chin Med 2002; 30:139-54. [PMID: 12067089 DOI: 10.1142/s0192415x02000156] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.1] [Indexed: 11/18/2022]
Abstract
Understanding the molecular mechanism and pharmacology of bioactive compounds from Chinese medicinal plants (CMP) is important in facilitating scientific evaluation of novel therapeutic approaches in traditional Chinese medicine. It is also of significance in new drug development based on the mechanism of Chinese medicine. A key step towards this task is the determination of the therapeutic and toxicity protein targets of CMP compounds. In this work, the newly developed computer software INVDOCK is used for automated identification of potential therapeutic and toxicity targets of several bioactive compounds isolated from Chinese medicinal plants. This software searches a protein database to find proteins to which a CMP compound can bind or weakly bind. INVDOCK results on three CMP compounds (allicin, catechin and camptothecin) show that 60% of computer-identified potential therapeutic protein targets and 27% of computer-identified potential toxicity targets have been implicated or confirmed by experiments. This software may potentially be used as a relatively fast and low-cost tool for facilitating the study of molecular mechanism and pharmacology of bioactive compounds from Chinese medicinal plants and natural products from other sources.
Affiliation(s)
- Y Z Chen
- Department of Computational Science, National University of Singapore, Singapore.
|
192
|
Abstract
The effects of genes on phenotype are mediated by processes that are typically unknown but whose determination is desirable. The conversion from gene to phenotype is not a simple function of individual genes, but involves the complex interactions of many genes; it is what is known as a nonlinear mapping problem. A computational method called genetic programming allows the representation of candidate nonlinear mappings in several possible trees. To find the best model, the trees are 'evolved' by processes akin to mutation and recombination, and the trees that more closely represent the actual data are preferentially selected. The result is an improved tree of rules that represent the nonlinear mapping directly. In this way, the encoding of cellular and higher-order activities by genes is seen as directly analogous to computer programs. This analogy is of utility in biological genetics and in problems of genotype-phenotype mapping.
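The evolve-and-select loop described in the abstract above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the authors' method: the operator set, the terminal names `g1` and `g2`, the depth cap, the population settings and the target phenotype `g1 * g2 + g1` are all hypothetical choices made for this example.

```python
import random

# Hypothetical building blocks: binary operators and terminals (two "gene" inputs and a constant).
OPS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b, "mul": lambda a, b: a * b}
TERMINALS = ["g1", "g2", 1.0]

def random_tree(rng, depth=2):
    """Grow a random expression tree: (op, left, right) tuples with terminal leaves."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    return (rng.choice(sorted(OPS)), random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def depth(tree):
    return 1 + max(depth(tree[1]), depth(tree[2])) if isinstance(tree, tuple) else 0

def evaluate(tree, env):
    """Evaluate a tree for one genotype, given as a dict of input values."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, env), evaluate(right, env))
    return env[tree] if isinstance(tree, str) else tree

def error(tree, data):
    """Sum of squared differences between predicted and observed phenotype."""
    return sum((evaluate(tree, env) - target) ** 2 for env, target in data)

def mutate(tree, rng):
    """Replace one randomly chosen subtree with a fresh random subtree."""
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return random_tree(rng)
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(left, rng), right)
    return (op, left, mutate(right, rng))

def crossover(a, b, rng):
    """Recombination: graft a random subtree of b into a random position of a."""
    donor = b
    while isinstance(donor, tuple) and rng.random() < 0.5:
        donor = rng.choice(donor[1:])
    if not isinstance(a, tuple) or rng.random() < 0.3:
        return donor
    op, left, right = a
    if rng.random() < 0.5:
        return (op, crossover(left, b, rng), right)
    return (op, left, crossover(right, b, rng))

def evolve(data, rng, pop_size=40, generations=30, max_depth=4):
    population = [random_tree(rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda t: error(t, data))
        history.append(error(population[0], data))
        survivors = population[: pop_size // 2]  # better-fitting trees are kept
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.choice(survivors), rng.choice(survivors)
            child = mutate(crossover(a, b, rng), rng)
            # Oversized children are discarded in favour of a parent to bound tree size.
            children.append(child if depth(child) <= max_depth else a)
        population = survivors + children
    best = min(population, key=lambda t: error(t, data))
    return best, history + [error(best, data)]

# Hypothetical genotype-phenotype data: phenotype = g1 * g2 + g1.
data = [({"g1": x, "g2": y}, x * y + x) for x in range(4) for y in range(4)]
best, history = evolve(data, random.Random(0))
print(best, history[-1])
```

Because the better half of each generation is carried over unchanged, the best error recorded in `history` can never increase from one generation to the next; whether the exact target expression is recovered depends on the random seed and settings.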
|
193
|
Rigoutsos I, Huynh T, Floratos A, Parida L, Platt D. Dictionary-driven protein annotation. Nucleic Acids Res 2002; 30:3901-16. [PMID: 12202776 PMCID: PMC137405 DOI: 10.1093/nar/gkf464] [Citation(s) in RCA: 19] [Impact Index Per Article: 0.8] [Received: 01/17/2002] [Revised: 06/04/2002] [Accepted: 06/04/2002] [Indexed: 11/14/2022]
Abstract
Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes.
Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were released publicly after we built the Bio-Dictionary that is used in our experiments. Finally, we have computed the annotations of more than 70 complete genomes and made them available on the World Wide Web at http://cbcsrv.watson.ibm.com/Annotations/.
Affiliation(s)
- Isidore Rigoutsos
- Bioinformatics and Pattern Discovery Group, IBM TJ Watson Research Center, Yorktown Heights, NY 10598, USA.
|
194
|
Van Regenmortel MHV. Reductionism and the search for structure-function relationships in antibody molecules. J Mol Recognit 2002; 15:240-7. [PMID: 12447900 DOI: 10.1002/jmr.584] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.0] [Indexed: 11/10/2022]
Abstract
One of the claims of reductionism is that it can explain all the features of living systems by an analysis of their physico-chemical constituents. Such a claim disregards the existence in biological systems of emergent properties that do not exist in their isolated components but which allow autonomous organisms to be directively organized in a self-regulated and integrated manner. It is not possible to describe biological systems adequately without using functional language that is meaningless in the physical sciences. The description of biological functions is also an essential part of immunology and functional explanations are more useful than causal explanations also in this discipline. Since causality is not a relation between a material object and an event, the structure of an antibody cannot be the cause of its binding activity. When structure-function relationships are analysed, the search should be for correlations rather than for causal relations. Methods used to find correlations between the atomic structure of antibody binding sites and their binding activity are mostly based on mutagenesis studies. Since the effect of any mutation depends on the molecular context, it is usually very difficult to predict the effects of multiple mutations on antibody function. Our knowledge of the molecular basis of antigen-antibody recognition has led to the expectation that it may be possible to develop new vaccines using molecular design principles. Such unwarranted hopes arise because of a confusion between antigenicity and immunogenicity. Although knowledge of antibody structure is of little use in vaccine design, it may help to develop therapeutic inhibitors and antibodies effective in the passive immunotherapy of viral infection.
|
195
|
Wu LF, Hughes TR, Davierwala AP, Robinson MD, Stoughton R, Altschuler SJ. Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nat Genet 2002; 31:255-65. [PMID: 12089522 DOI: 10.1038/ng906] [Citation(s) in RCA: 221] [Impact Index Per Article: 9.6] [Indexed: 11/08/2022]
Abstract
Genome sequencing has led to the discovery of tens of thousands of potential new genes. Six years after the sequencing of the well-studied yeast Saccharomyces cerevisiae and the discovery that its genome encodes approximately 6,000 predicted proteins, more than 2,000 have not yet been characterized experimentally, and determining their functions seems far from a trivial task. One crucial constraint is the generation of useful hypotheses about protein function. Using a new approach to interpret microarray data, we assign likely cellular functions with confidence values to these new yeast proteins. We perform extensive genome-wide validations of our predictions and offer visualization methods for exploration of the large numbers of functional predictions. We identify potential new members of many existing functional categories including 285 candidate proteins involved in transcription, processing and transport of non-coding RNA molecules. We present experimental validation confirming the involvement of several of these proteins in ribosomal RNA processing. Our methodology can be applied to a variety of genomics data types and organisms.
Affiliation(s)
- Lani F Wu
- Rosetta Inpharmatics, Kirkland, Washington, USA
|
196
|
Hoover DM, Lubkowski J. DNAWorks: an automated method for designing oligonucleotides for PCR-based gene synthesis. Nucleic Acids Res 2002; 30:e43. [PMID: 12000848 PMCID: PMC115297 DOI: 10.1093/nar/30.10.e43] [Citation(s) in RCA: 394] [Impact Index Per Article: 17.1] [Indexed: 11/12/2022]
Abstract
The availability of sequences of entire genomes has dramatically increased the number of protein targets, many of which will need to be overexpressed in cells other than the original source of DNA. Gene synthesis often provides a fast and economically efficient approach. The synthetic gene can be optimized for expression and constructed for easy mutational manipulation without regard to the parent genome. Yet design and construction of synthetic genes, especially those coding for large proteins, can be a slow, difficult and confusing process. We have written a computer program that automates the design of oligonucleotides for gene synthesis. Our program requires simple input information, i.e. amino acid sequence of the target protein and melting temperature (needed for the gene assembly) of synthetic oligonucleotides. The program outputs a series of oligonucleotide sequences with codons optimized for expression in an organism of choice. Those oligonucleotides are characterized by highly homogeneous melting temperatures and a minimized tendency for hairpin formation. With the help of this program and a two-step PCR method, we have successfully constructed numerous synthetic genes, ranging from 139 to 1042 bp. The approach presented here simplifies the production of proteins from a wide variety of organisms for genomics-based studies.
Affiliation(s)
- David M Hoover
- Macromolecular Crystallography Laboratory, National Cancer Institute at Frederick, MD 21702, USA
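The two inputs named in the DNAWorks abstract above (an amino acid sequence and a melting temperature) suggest the following rough Python sketch of the general idea: back-translate a protein with a preferred-codon table, cut the gene into overlapping fragments, and estimate the melting temperature of each overlap. It is not the DNAWorks algorithm. The codon table is a hypothetical partial table (each codon does encode the listed amino acid in the standard genetic code, but the preferences are invented), the fragment and overlap lengths are toy values, and the melting temperature uses the simple Wallace rule, which is only a rough approximation for short oligonucleotides. Strand alternation and hairpin checks are left out.

```python
# Hypothetical preferred-codon table for a few amino acids (illustration only;
# a real design would use a full, organism-specific codon-usage table).
PREFERRED_CODON = {
    "M": "ATG", "K": "AAA", "L": "CTG", "V": "GTG", "A": "GCG",
    "G": "GGC", "S": "AGC", "T": "ACC", "E": "GAA", "D": "GAT",
}

def back_translate(protein):
    """Map each amino acid to its preferred codon and join the codons."""
    return "".join(PREFERRED_CODON[aa] for aa in protein)

def wallace_tm(oligo):
    """Rough melting temperature in degrees C by the Wallace rule: 2*(A+T) + 4*(G+C)."""
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc

def split_with_overlap(gene, length, overlap):
    """Cut the gene into fragments of `length` nt, each sharing `overlap` nt with the next."""
    step = length - overlap
    return [gene[i:i + length] for i in range(0, len(gene) - overlap, step)]

gene = back_translate("MKLVAGSTED")  # hypothetical 10-residue target
fragments = split_with_overlap(gene, length=12, overlap=6)
# Melting temperature of the region each fragment shares with its successor.
overlap_tms = [wallace_tm(fragment[-6:]) for fragment in fragments[:-1]]
print(gene, fragments, overlap_tms)
```

A design tool would then adjust codon choices or fragment boundaries until the overlap temperatures are close to the requested value; in this toy run they differ (20, 24 and 20 degrees C), which is the kind of inhomogeneity such a tool aims to minimize.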
|
197
|
Chen YZ, Ung CY. Prediction of potential toxicity and side effect protein targets of a small molecule by a ligand-protein inverse docking approach. J Mol Graph Model 2002; 20:199-218. [PMID: 11766046 DOI: 10.1016/s1093-3263(01)00109-7] [Citation(s) in RCA: 112] [Impact Index Per Article: 4.9] [Indexed: 11/26/2022]
Abstract
Determination of potential drug toxicity and side effects in the early stages of drug development is important in reducing the cost and time of drug discovery. In this work, we explore a computer method for predicting potential toxicity and side effect protein targets of a small molecule. A ligand-protein inverse docking approach is used for computer-automated search of a protein cavity database to identify protein targets. This database is developed from protein 3D structures in the Protein Data Bank (PDB). Docking is conducted by a procedure involving multiple conformer shape-matching alignment of a molecule to a cavity followed by molecular-mechanics torsion optimization and energy minimization on both the molecule and the protein residues at the binding region. Potential protein targets are selected by evaluation of molecular mechanics energy and, where applicable, further analysis of its binding competitiveness against other ligands that bind to the same receptor site in at least one PDB entry. Our results on several drugs show that 83% of the experimentally known toxicity and side effect targets for these drugs are predicted. The computer search successfully predicted 38 and missed five experimentally confirmed or implicated protein targets with available structure and in which binding involves no covalent bond. There are an additional 30 predicted targets yet to be validated experimentally. Application of this computer approach can potentially facilitate the prediction of toxicity and side effects of a drug or drug lead.
Affiliation(s)
- Y Z Chen
- Department of Computational Science, National University of Singapore.
|
198
|
Caron M, Imam-Sghiouar N, Poirier F, Le Caër JP, Labas V, Joubert-Caron R. Proteomic map and database of lymphoblastoid proteins. J Chromatogr B Analyt Technol Biomed Life Sci 2002; 771:197-209. [PMID: 12015999 DOI: 10.1016/s1570-0232(02)00040-5] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.0] [Indexed: 11/19/2022]
Abstract
Advances in genomics have led to the accumulation of an unprecedented amount of data, giving rise to a new field in biochemistry, proteomics. We used a combination of two-dimensional gel electrophoresis, analysis and annotation using third-generation software, and mass spectrometry to establish the proteome maps of lymphoblastoid B-cells, a prerequisite for analysis of drug effects and lymphocyte cell diseases. About 1200 protein spots were detected and characterised in terms of their isoelectric point, molecular mass and expression. The present status of proteomic technologies and the usefulness of a proteomic database of human hematopoietic cells are discussed.
Affiliation(s)
- Michel Caron
- Université Paris 13, UFR SMBHI Leonard de Vinci, Bobigny, France.
|
199
|
Yano N, Fadden-Paiva KJ, Endoh M, Sakai H, Kurokawa K, Dworkin LD, Rifai A. Profiling the IgA nephropathy renal transcriptome: analysis by complementary DNA array hybridization. Nephrology (Carlton) 2002. [DOI: 10.1111/j.1440-1797.2002.tb00524.x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 11/29/2022]
|
200
|
Snel B, Bork P, Huynen MA. The identification of functional modules from the genomic association of genes. Proc Natl Acad Sci U S A 2002; 99:5890-5. [PMID: 11983890 PMCID: PMC122872 DOI: 10.1073/pnas.092632599] [Citation(s) in RCA: 195] [Impact Index Per Article: 8.5] [Received: 11/28/2001] [Indexed: 11/18/2022]
Abstract
By combining the pairwise interactions between proteins, as predicted by the conserved co-occurrence of their genes in operons, we obtain protein interaction networks. Here we study the properties of such networks to identify functional modules: sets of proteins that together are involved in a biological process. The complete network contains 3,033 orthologous groups of proteins in 38 genomes. It consists of one giant component, containing 1,611 orthologous groups, and of 516 small disjointed clusters that, on average, contain only 2.7 orthologous groups. These small clusters have a homogeneous functional composition and thus represent functional modules in themselves. Analysis of the giant component reveals that it is a scale-free, small-world network with a high degree of local clustering (C = 0.6). It consists of locally highly connected subclusters that are connected to each other by linker proteins. The linker proteins tend to have multiple functions, or are involved in multiple processes and have an above average probability of being essential. By splitting up the giant component at these linker proteins, we identify 265 subclusters that tend to have a homogeneous functional composition. The rare functional inhomogeneities in our subclusters reflect the mixing of different types of (molecular) functions in a single cellular process, exemplified by subclusters containing both metabolic enzymes as well as the transcription factors that regulate them. Comparative genome analysis, thus, allows identification of a level of functional interaction between that of pairwise interactions, and of the complete genome.
Affiliation(s)
- Berend Snel
- European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany.
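The network quantities mentioned in the Snel et al. abstract above (a giant component, small disjoint clusters, and a clustering coefficient C) can be illustrated with a short Python sketch. It assumes that predicted pairwise interactions are available as an undirected edge list and that C is taken as the average of the local clustering coefficients; the toy edges and node names below are hypothetical and unrelated to the paper's data.

```python
from collections import defaultdict

def build_graph(edges):
    """Return an undirected adjacency map from (node, node) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def connected_components(adj):
    """List the connected components as sets of nodes, largest first."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)

def local_clustering(adj, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    neighbours = list(adj[node])
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbours[j] in adj[neighbours[i]]
    )
    return 2.0 * links / (k * (k - 1))

def average_clustering(adj, nodes):
    """Mean local clustering coefficient over the given nodes."""
    nodes = list(nodes)
    return sum(local_clustering(adj, n) for n in nodes) / len(nodes)

# Hypothetical toy network: a triangle with one pendant node, plus an isolated pair.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("E", "F")]
adj = build_graph(edges)
components = connected_components(adj)
giant = components[0]
c_giant = average_clustering(adj, giant)
print(len(components), sorted(giant), round(c_giant, 3))
```

In this toy case the largest component holds four of the six nodes, and the remaining pair plays the role of a small disjoint cluster.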
|