101
Hu L, Huang T, Shi X, Lu WC, Cai YD, Chou KC. Predicting functions of proteins in mouse based on weighted protein-protein interaction network and protein hybrid properties. PLoS One 2011; 6:e14556. [PMID: 21283518 PMCID: PMC3023709 DOI: 10.1371/journal.pone.0014556] [Citation(s) in RCA: 130] [Impact Index Per Article: 10.0] [Received: 06/20/2010] [Accepted: 12/21/2010] [Indexed: 11/27/2022]
Abstract
Background: With the huge amount of uncharacterized protein sequences generated in the post-genomic age, it is highly desirable to develop effective computational methods for quickly and accurately predicting their functions. The information thus obtained would be very useful for both basic research and drug development in a timely manner. Methodology/Principal Findings: Although many efforts have been made in this regard, most of them were based on either sequence similarity or protein-protein interaction (PPI) information. However, the former often fails to work if a query protein has little or no sequence similarity to any protein of known function, while the latter has a similar problem if the relevant PPI information is not available. In view of this, a new approach is proposed by hybridizing the PPI information and the biochemical/physicochemical features of protein sequences. The overall first-order success rates by the new predictor for the functions of mouse proteins on training set and test set were 69.1% and 70.2%, respectively, and the success rate covered by the results of the top-4 order from a total of 24 orders was 65.2%. Conclusions/Significance: The results indicate that the new approach is quite promising and may open a new avenue for addressing this difficult and complicated problem.
Affiliation(s)
- Lele Hu
  - Institute of Systems Biology, Shanghai University, Shanghai, China
  - Department of Chemistry, College of Sciences, Shanghai University, Shanghai, China
- Tao Huang
  - Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
  - Shanghai Center for Bioinformation Technology, Shanghai, China
- Xiaohe Shi
  - Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences and Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Wen-Cong Lu
  - Department of Chemistry, College of Sciences, Shanghai University, Shanghai, China
- Yu-Dong Cai
  - Institute of Systems Biology, Shanghai University, Shanghai, China
  - Centre for Computational Systems Biology, Fudan University, Shanghai, China
  - Gordon Life Science Institute, San Diego, California, United States of America
  - * E-mail:
- Kuo-Chen Chou
  - Gordon Life Science Institute, San Diego, California, United States of America
102
Plett D, Toubia J, Garnett T, Tester M, Kaiser BN, Baumann U. Dichotomy in the NRT gene families of dicots and grass species. PLoS One 2010; 5:e15289. [PMID: 21151904 PMCID: PMC2997785 DOI: 10.1371/journal.pone.0015289] [Citation(s) in RCA: 88] [Impact Index Per Article: 6.3] [Received: 09/07/2010] [Accepted: 11/04/2010] [Indexed: 11/19/2022]
Abstract
A large proportion of the nitrate (NO₃⁻) acquired by plants from soil is actively transported via members of the NRT families of NO₃⁻ transporters. In Arabidopsis, the NRT1 family has eight functionally characterised members and predominantly comprises low-affinity transporters; the NRT2 family contains seven members which appear to be high-affinity transporters; and there are two NRT3 (NAR2) family members which are known to participate in high-affinity transport. A modified reciprocal best hit (RBH) approach was used to identify putative orthologues of the Arabidopsis NRT genes in the four fully sequenced grass genomes (maize, rice, sorghum, Brachypodium). We also included the poplar genome in our analysis to establish whether differences between Arabidopsis and the grasses may be generally applicable to monocots and dicots. Our analysis reveals fundamental differences between Arabidopsis and the grass species in the gene number and family structure of all three families of NRT transporters. All grass species possessed additional NRT1.1 orthologues and appear to lack NRT1.6/NRT1.7 orthologues. There is significant separation in the NRT2 phylogenetic tree between NRT2 genes from dicots and grass species. This indicates that determining the function of NRT2 genes in grass species will not be possible based simply on sequence homology to functionally characterised Arabidopsis NRT2 genes and that proper functional analysis will be required. Arabidopsis has a unique NRT3.2 gene which may be a fusion of the NRT3.1 and NRT3.2 genes present in all other species examined here. This work provides a framework for future analysis of NO₃⁻ transporters and NO₃⁻ transport in grass crop species.
Affiliation(s)
- Darren Plett
  - Australian Centre for Plant Functional Genomics, Waite Research Institute, University of Adelaide, Adelaide, South Australia, Australia
- John Toubia
  - Australian Centre for Plant Functional Genomics, Waite Research Institute, University of Adelaide, Adelaide, South Australia, Australia
- Trevor Garnett
  - Australian Centre for Plant Functional Genomics, Waite Research Institute, University of Adelaide, Adelaide, South Australia, Australia
- Mark Tester
  - Australian Centre for Plant Functional Genomics, Waite Research Institute, University of Adelaide, Adelaide, South Australia, Australia
- Brent N. Kaiser
  - School of Agriculture, Food and Wine, Waite Research Institute, University of Adelaide, Adelaide, South Australia, Australia
  - * E-mail:
- Ute Baumann
  - Australian Centre for Plant Functional Genomics, Waite Research Institute, University of Adelaide, Adelaide, South Australia, Australia
103
Cloning, characterization, and expression analysis of Toll-like receptor-7 cDNA from common carp, Cyprinus carpio L. COMPARATIVE BIOCHEMISTRY AND PHYSIOLOGY D-GENOMICS & PROTEOMICS 2010; 5:245-55. [DOI: 10.1016/j.cbd.2010.07.001] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.7] [Received: 05/12/2010] [Revised: 07/05/2010] [Accepted: 07/13/2010] [Indexed: 01/02/2023]
104
Schröder A, Eichner J, Supper J, Eichner J, Wanke D, Henneges C, Zell A. Predicting DNA-binding specificities of eukaryotic transcription factors. PLoS One 2010; 5:e13876. [PMID: 21152420 PMCID: PMC2994704 DOI: 10.1371/journal.pone.0013876] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 06/11/2010] [Accepted: 10/14/2010] [Indexed: 11/18/2022]
Abstract
Today, annotated amino acid sequences of more and more transcription factors (TFs) are readily available. Quantitative information about their DNA-binding specificities, however, is hard to obtain. Position frequency matrices (PFMs), the most widely used models to represent binding specificities, are experimentally characterized only for a small fraction of all TFs. Even for some of the most intensively studied eukaryotic organisms (i.e., human, rat and mouse), roughly one-sixth of all proteins with an annotated DNA-binding domain have been characterized experimentally. Here, we present a new method based on support vector regression for predicting quantitative DNA-binding specificities of TFs in different eukaryotic species. This approach estimates a quantitative measure for the PFM similarity of two proteins, based on various features derived from their protein sequences. The method is trained and tested on a dataset containing 1,239 TFs with known DNA-binding specificity, and used to predict specific DNA target motifs for 645 TFs with high accuracy.
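As an illustration of the PFM representation mentioned in this abstract, the sketch below counts nucleotides per column over a set of aligned binding sites. It is a generic example, not the authors' method: the function name `position_frequency_matrix` and the four example sites are assumptions made for the illustration.

```python
from typing import Dict, List


def position_frequency_matrix(sites: List[str], alphabet: str = "ACGT") -> Dict[str, List[int]]:
    """Count how often each base occurs at each column of equal-length aligned sites."""
    if not sites:
        raise ValueError("at least one binding site is required")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all binding sites must have the same length")
    pfm = {base: [0] * length for base in alphabet}
    for site in sites:
        for column, base in enumerate(site.upper()):
            # A base outside the alphabet (e.g. "N") raises KeyError here.
            pfm[base][column] += 1
    return pfm


# Hypothetical aligned binding sites, used only to exercise the function.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATATT"]
pfm = position_frequency_matrix(sites)
for base in "ACGT":
    print(base, pfm[base])
```

Each column of the matrix sums to the number of sites, so dividing by that number would give per-position base frequencies.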
Affiliation(s)
- Adrian Schröder
  - Center for Bioinformatics Tübingen (ZBIT), University of Tübingen, Tübingen, Germany.
105
Horst JA, Samudrala R. A protein sequence meta-functional signature for calcium binding residue prediction. Pattern Recognit Lett 2010; 31:2103-2112. [PMID: 20824111 PMCID: PMC2932634 DOI: 10.1016/j.patrec.2010.04.012] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Indexed: 12/23/2022]
Abstract
The diversity of characterized protein functions found amongst experimentally interrogated proteins suggests that a vast array of unknown functions remains undiscovered. These protein functions are imparted by specific geometric distributions of amino acid residue chemical moieties, each contributing a functional interaction. We hypothesize that individual residue function contributions are predictable through sequence analytic knowledge based algorithms, and that they can be recombined to understand composite protein function by predicting spatial relation in tertiary structure. We assess the former by training a meta-functional signature algorithm to specifically predict calcium ion binding residues from protein sequence. We estimate the latter by testing for match between predictive contribution of positions in predicted secondary structures and patterns of side chain proximity forced by secondary structure moieties. Specific training for calcium binding results in 83% area under the receiver operator characteristic curve added value over random (AUCoR) and p < 10⁻³⁰⁰ significance as measured by Kendall's τ in ten-fold cross-validation for parallel sets of 811 residues in 336 proteins and 696 residues in 299 proteins. Training for generalized function results in 63% AUCoR and p ≅ 10⁻²²¹ for the same tests. Including inference of side chain proximity improves predictive ability by 2% AUCoR consistently. The results demonstrate that protein meta-functional signatures can be trained to predict specific protein functions by considering amino acid identity and structural features accessible from sequence, laying the groundwork for composite sequence based function site prediction.
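The abstract reports performance as "AUC over random" (AUCoR) without defining the rescaling. The sketch below computes a plain ROC AUC from ranked scores and then applies one plausible normalization, 2·AUC − 1, under which a random predictor scores 0% and a perfect one 100%. That formula is an assumption of this example, not a definition taken from the paper, and the scores and labels are made up.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def auc_over_random(auc: float) -> float:
    # Assumed rescaling: AUC 0.5 (random) maps to 0.0, AUC 1.0 (perfect) maps to 1.0.
    return 2.0 * auc - 1.0


# Hypothetical per-residue scores; label 1 marks a binding residue.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
print(f"AUC = {auc:.3f}, rescaled = {auc_over_random(auc):.3f}")
```

The pairwise formulation is quadratic in the number of residues; it is used here only because it makes the definition of AUC explicit.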
Affiliation(s)
- Jeremy A Horst
  - Department of Oral Biology, School of Dentistry, University of Washington, 1959 NE Pacific St #357132, Seattle, WA 98195
  - Department of Microbiology, School of Medicine, University of Washington, 1959 NE Pacific St #357132, Seattle, WA 98195
- Ram Samudrala
  - Department of Oral Biology, School of Dentistry, University of Washington, 1959 NE Pacific St #357132, Seattle, WA 98195
  - Department of Microbiology, School of Medicine, University of Washington, 1959 NE Pacific St #357132, Seattle, WA 98195
106
de Almeida JMGCF. BiDiBlast: comparative genomics pipeline for the PC. GENOMICS PROTEOMICS & BIOINFORMATICS 2010; 8:135-8. [PMID: 20691399 PMCID: PMC5054440 DOI: 10.1016/s1672-0229(10)60015-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/23/2022]
Abstract
Bi-directional BLAST is a simple approach to detect, annotate, and analyze candidate orthologous or paralogous sequences in a single go. This procedure is usually confined to the realm of customized Perl scripts, usually tuned for UNIX-like environments. Porting those scripts to other operating systems involves refactoring them, and also the installation of the Perl programming environment with the required libraries. To overcome these limitations, a data pipeline was implemented in Java. This application submits two batches of sequences to local versions of the NCBI BLAST tool, manages result lists, and refines both bi-directional and simple hits. GO Slim terms are attached to hits, several statistics are derived, and molecular evolution rates are estimated through PAML. The results are written to a set of delimited text tables intended for further analysis. The provided graphic user interface allows a friendly interaction with this application, which is documented and available to download at http://moodle.fct.unl.pt/course/view.php?id=2079 or https://sourceforge.net/projects/bidiblast/ under the GNU GPL license.
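The bi-directional best-hit idea described in this abstract can be sketched in a few lines. This is a minimal Python illustration, not the BiDiBlast implementation (which the abstract says is written in Java): the hit tuples, the helper names `best_hits` and `reciprocal_best_hits`, and the use of bit score as the ranking criterion are all assumptions made for the example.

```python
from typing import Dict, Iterable, Set, Tuple

# A hit is (query_id, subject_id, bit_score). In practice these would be parsed
# from tabular BLAST output; the identifiers below are hypothetical.
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query to its highest-scoring subject (the first hit wins ties)."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit]) -> Set[Tuple[str, str]]:
    """Return pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


a_vs_b = [("a1", "b1", 250.0), ("a1", "b2", 90.0), ("a2", "b2", 120.0), ("a3", "b1", 80.0)]
b_vs_a = [("b1", "a1", 245.0), ("b2", "a2", 118.0), ("b2", "a1", 95.0)]

rbh = reciprocal_best_hits(a_vs_b, b_vs_a)
# Best hits in one direction only, i.e. "simple" hits that are not reciprocal.
simple = set(best_hits(a_vs_b).items()) - rbh
print(sorted(rbh), sorted(simple))
```

Here `a3` finds `b1` as its best hit, but `b1` prefers `a1`, so the pair is kept only as a one-directional hit rather than a candidate orthologue.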
Affiliation(s)
- João M G C F de Almeida
  - Centro de Recursos Microbiológicos (CREM), Departamento de Ciências da Vida, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, Quinta da Torre, Caparica, Portugal.
107
MacPherson JI, Dickerson JE, Pinney JW, Robertson DL. Patterns of HIV-1 protein interaction identify perturbed host-cellular subsystems. PLoS Comput Biol 2010; 6:e1000863. [PMID: 20686668 PMCID: PMC2912648 DOI: 10.1371/journal.pcbi.1000863] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.4] [Received: 03/06/2010] [Accepted: 06/21/2010] [Indexed: 01/12/2023]
Abstract
Human immunodeficiency virus type 1 (HIV-1) exploits a diverse array of host cell functions in order to replicate. This is mediated through a network of virus-host interactions. A variety of recent studies have catalogued this information. In particular the HIV-1, Human Protein Interaction Database (HHPID) has provided a unique depth of protein interaction detail. However, as a map of HIV-1 infection, the HHPID is problematic, as it contains curation error and redundancy; in addition, it is based on a heterogeneous set of experimental methods. Based on identifying shared patterns of HIV-host interaction, we have developed a novel methodology to delimit the core set of host-cellular functions and their associated perturbation from the HHPID. Initially, using biclustering, we identify 279 significant sets of host proteins that undergo the same types of interaction. The functional cohesiveness of these protein sets was validated using a human protein-protein interaction network, gene ontology annotation and sequence similarity. Next, using a distance measure, we group host protein sets and identify 37 distinct higher-level subsystems. We further demonstrate the biological significance of these subsystems by cross-referencing with global siRNA screens that have been used to detect host factors necessary for HIV-1 replication, and investigate the seemingly small intersect between these data sets. Our results highlight significant host-cell subsystems that are perturbed during the course of HIV-1 infection. Moreover, we characterise the patterns of interaction that contribute to these perturbations. Thus, our work disentangles the complex set of HIV-1-host protein interactions in the HHPID, reconciles these with siRNA screens and provides an accessible and interpretable map of infection.
Affiliation(s)
- Jamie I. MacPherson
  - Faculty of Life Sciences, Michael Smith Building, University of Manchester, Manchester, United Kingdom
- Jonathan E. Dickerson
  - Faculty of Life Sciences, Michael Smith Building, University of Manchester, Manchester, United Kingdom
- John W. Pinney
  - Centre for Bioinformatics, Division of Molecular Biosciences, Imperial College London, London, United Kingdom
- David L. Robertson
  - Faculty of Life Sciences, Michael Smith Building, University of Manchester, Manchester, United Kingdom
  - * E-mail:
108
Wong WC, Maurer-Stroh S, Eisenhaber F. More than 1,001 problems with protein domain databases: transmembrane regions, signal peptides and the issue of sequence homology. PLoS Comput Biol 2010; 6:e1000867. [PMID: 20686689 PMCID: PMC2912341 DOI: 10.1371/journal.pcbi.1000867] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.4] [Received: 03/11/2010] [Accepted: 06/25/2010] [Indexed: 12/16/2022]
Abstract
Large-scale genome sequencing gained general importance for life science because functional annotation of otherwise experimentally uncharacterized sequences is made possible by the theory of biomolecular sequence homology. Historically, the paradigm of similarity of protein sequences implying common structure, function and ancestry was generalized based on studies of globular domains. Having the same fold imposes strict conditions over the packing in the hydrophobic core requiring similarity of hydrophobic patterns. The implications of sequence similarity among non-globular protein segments have not been studied to the same extent; nevertheless, homology considerations are silently extended for them. This appears especially detrimental in the case of transmembrane helices (TMs) and signal peptides (SPs) where sequence similarity is necessarily a consequence of physical requirements rather than common ancestry. Thus, matching of SPs/TMs creates the illusion of matching hydrophobic cores. Therefore, inclusion of SPs/TMs into domain models can give rise to wrong annotations. More than 1,001 domains among the 10,340 models of Pfam release 23 and 18 domains of SMART version 6 (out of 809) contain SP/TM regions. As expected, fragment-mode HMM searches generate promiscuous hits limited to solely the SP/TM part among clearly unrelated proteins. More worryingly, we show explicit examples that the scores of clearly false-positive hits, even in global-mode searches, can be elevated into the significance range just by matching the hydrophobic runs. In the PIR iProClass database v3.74 using conservative criteria, we find that at least between 2.1% and 13.6% of its annotated Pfam hits appear unjustified for a set of validated domain models. Thus, false-positive domain hits enforced by SP/TM regions can lead to dramatic annotation errors where the hit has nothing in common with the problematic domain model except the SP/TM region itself. We suggest a workflow of flagging problematic hits arising from SP/TM-containing models for critical reconsideration by annotation users.
Sequence homology is a fundamental principle of biology. It implies common phylogenetic ancestry of genes and, subsequently, similarity of their protein products with regard to amino acid sequence, three-dimensional structure and molecular and cellular function. Originally an esoteric concept, homology with the proxy of sequence similarity is used to justify the transfer of functional annotation from well-studied protein examples to new sequences. Yet, functional annotation via sequence similarity seems to have hit a plateau in recent years since relentless annotation transfer led to error propagation across sequence databases; thus, leading experimental follow-up work astray. It must be emphasized that the trinity of sequence, 3D structural and functional similarity has only been proven for globular segments of proteins. For non-globular regions, similarity of sequence is not necessarily a result of divergent evolution from a common ancestor but the consequence of amino acid sequence bias. In our investigation, we found that protein domain databases contain many domain models with transmembrane regions and signal peptides, non-globular segments of proteins having hydrophobic bias. Many proteins have inherited completely wrong function assignments from these domain models. We fear that future function predictions will turn out futile if this issue is not immediately addressed.
Affiliation(s)
- Wing-Cheong Wong
  - Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), Singapore
  - * E-mail: (WCW); (SMS); (FE)
- Sebastian Maurer-Stroh
  - Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), Singapore
  - School of Biological Sciences (SBS), Nanyang Technological University (NTU), Singapore
  - * E-mail: (WCW); (SMS); (FE)
- Frank Eisenhaber
  - Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), Singapore
  - Department of Biological Sciences (DBS), National University of Singapore (NUS), Singapore
  - School of Computer Engineering (SCE), Nanyang Technological University (NTU), Singapore
  - * E-mail: (WCW); (SMS); (FE)
109
Rawat A, Gust KA, Deng Y, Garcia-Reyero N, Quinn MJ, Johnson MS, Indest KJ, Elasri MO, Perkins EJ. From raw materials to validated system: the construction of a genomic library and microarray to interpret systemic perturbations in Northern bobwhite. Physiol Genomics 2010; 42:219-35. [PMID: 20406850 PMCID: PMC3032282 DOI: 10.1152/physiolgenomics.00022.2010] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.9] [Received: 01/28/2010] [Accepted: 04/16/2010] [Indexed: 01/02/2023]
Abstract
The limited availability of genomic tools and data for nonmodel species impedes computational and systems biology approaches in nonmodel organisms. Here we describe the development, functional annotation, and utilization of genomic tools for the avian wildlife species Northern bobwhite (Colinus virginianus) to determine the molecular impacts of exposure to 2,6-dinitrotoluene (2,6-DNT), a field contaminant of military concern. Massively parallel pyrosequencing of a normalized multitissue library of Northern bobwhite cDNAs yielded 71,384 unique transcripts that were annotated with gene ontology (GO), pathway information, and protein domain analysis. Comparative genome analyses with model organisms revealed functional homologies in 8,825 unique Northern bobwhite genes that are orthologous to 48% of Gallus gallus protein-coding genes. Pathway analysis and GO enrichment of genes differentially expressed in livers of birds exposed for 60 days (d) to 10 and 60 mg/kg/d 2,6-DNT revealed several impacts validated by RT-qPCR including: prostaglandin pathway-mediated inflammation, increased expression of a heme synthesis pathway in response to anemia, and a shift in energy metabolism toward protein catabolism via inhibition of control points for glucose and lipid metabolic pathways, PCK1 and PPARGC1, respectively. This research effort provides the first comprehensive annotated gene library for Northern bobwhite. Transcript expression analysis provided insights into the metabolic perturbations underlying several observed toxicological phenotypes in a 2,6-DNT exposure case study. Furthermore, the systemic impact of dinitrotoluenes on liver function appears conserved across species as PPAR signaling is similarly affected in fathead minnow liver tissue after exposure to 2,4-DNT.
Affiliation(s)
- Arun Rawat
  - Department of Biological Sciences, University of Southern Mississippi, Hattiesburg, MS, USA
110
Comparative transcriptome and secretome analysis of wood decay fungi Postia placenta and Phanerochaete chrysosporium. Appl Environ Microbiol 2010; 76:3599-610. [PMID: 20400566 DOI: 10.1128/aem.00058-10] [Citation(s) in RCA: 210] [Impact Index Per Article: 15.0] [Indexed: 11/20/2022]
Abstract
Cellulose degradation by brown rot fungi, such as Postia placenta, is poorly understood relative to the phylogenetically related white rot basidiomycete, Phanerochaete chrysosporium. To elucidate the number, structure, and regulation of genes involved in lignocellulosic cell wall attack, secretome and transcriptome analyses were performed on both wood decay fungi cultured for 5 days in media containing ball-milled aspen or glucose as the sole carbon source. Using liquid chromatography-tandem mass spectrometry (LC-MS/MS), a total of 67 and 79 proteins were identified in the extracellular fluids of P. placenta and P. chrysosporium cultures, respectively. Viewed together with transcript profiles, P. chrysosporium employs an array of extracellular glycosyl hydrolases to simultaneously attack cellulose and hemicelluloses. In contrast, under these same conditions, P. placenta secretes an array of hemicellulases but few potential cellulases. The two species display distinct expression patterns for oxidoreductase-encoding genes. In P. placenta, these patterns are consistent with an extracellular Fenton system and include the upregulation of genes involved in iron acquisition, in the synthesis of low-molecular-weight quinones, and possibly in redox cycling reactions.
111
Heo HS, Oh SJ, Kim JM, Kim HS, Chung HY. TREP_DB: transcriptional regulatory elements pattern database. Biochem Biophys Res Commun 2010; 394:309-316. [PMID: 20206134 DOI: 10.1016/j.bbrc.2010.02.169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2010] [Accepted: 02/26/2010] [Indexed: 05/28/2023]
Abstract
Predicting and assigning functions for putative genes and hypothetical proteins are important goals in the post-genomic era. Many methods have been developed for this challenge, the most straightforward of which is function prediction using sequence homology. Homology-based function prediction applies sequence-alignment tools to find homology relationships between known genes and putative genes, and transfers the functions of the most similar known genes to the putative genes. This approach fails completely for about 30% of genes, and only 3% have any supporting experimental evidence. According to supporting evidence, genes are known to be regulated by a common transcriptional regulatory element if the expression profiles of the coregulated genes are highly correlated. We propose a new conceptual approach and method for nonhomology-based function prediction of putative genes and hypothetical proteins. We have established patterns, also considered to be combinations, of common transcriptional regulatory elements for functional classes of mouse (Mus musculus) transcripts (the TREP_DB). Using these results, we have also established a function-prediction method for putative genes and hypothetical proteins.
Affiliation(s)
- Hyoung-Sam Heo
  - Department of Pharmacy, College of Pharmacy and Molecular Inflammation Research Center for Aging Intervention, Pusan National University, Gumjung-gu, Busan 609-735, Republic of Korea
112
Tang ZQ, Lin HH, Zhang HL, Han LY, Chen X, Chen YZ. Prediction of functional class of proteins and peptides irrespective of sequence homology by support vector machines. Bioinform Biol Insights 2009; 1:19-47. [PMID: 20066123 PMCID: PMC2789692 DOI: 10.4137/bbi.s315] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
Abstract
Various computational methods have been used for the prediction of protein and peptide function based on their sequences. A particular challenge is to derive functional properties from sequences that show low or no homology to proteins of known function. Recently, a machine learning method, the support vector machine (SVM), has been explored for predicting the functional class of proteins and peptides from amino acid sequence derived properties independent of sequence similarity, and has shown promising potential for a wide spectrum of protein and peptide classes including some of the low- and non-homologous proteins. This method can thus be explored as a potential tool to complement alignment-based, clustering-based, and structure-based methods for predicting protein function. This article reviews the strategies, current progress, and underlying difficulties in using SVM for predicting the functional class of proteins. The relevant software and web-servers are described. The reported prediction performances in the application of these methods are also presented.
Affiliation(s)
- Zhi Qun Tang
  - Department of Pharmacy and Department of Computational Science, National University of Singapore, Republic of Singapore, 117543
- Hong Huang Lin
  - Department of Pharmacy and Department of Computational Science, National University of Singapore, Republic of Singapore, 117543
- Hai Lei Zhang
  - Department of Pharmacy and Department of Computational Science, National University of Singapore, Republic of Singapore, 117543
- Lian Yi Han
  - Department of Pharmacy and Department of Computational Science, National University of Singapore, Republic of Singapore, 117543
- Xin Chen
  - Department of Biotechnology, Zhejiang University, Hang Zhou, Zhejiang Province, P. R. China, 310029
- Yu Zong Chen
  - Department of Pharmacy and Department of Computational Science, National University of Singapore, Republic of Singapore, 117543
  - Shanghai Center for Bioinformatics Technology, Shanghai, P. R. China, 201203
113
Stubben CJ, Duffield ML, Cooper IA, Ford DC, Gans JD, Karlyshev AV, Lingard B, Oyston PCF, de Rochefort A, Song J, Wren BW, Titball RW, Wolinsky M. Steps toward broad-spectrum therapeutics: discovering virulence-associated genes present in diverse human pathogens. BMC Genomics 2009; 10:501. [PMID: 19874620 PMCID: PMC2774872 DOI: 10.1186/1471-2164-10-501] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.2] [Received: 04/25/2009] [Accepted: 10/29/2009] [Indexed: 11/10/2022]
Abstract
BACKGROUND: New and improved antimicrobial countermeasures are urgently needed to counteract increased resistance to existing antimicrobial treatments and to combat currently untreatable or new emerging infectious diseases. We demonstrate that computational comparative genomics, together with experimental screening, can identify potential generic (i.e., conserved across multiple pathogen species) and novel virulence-associated genes that may serve as targets for broad-spectrum countermeasures. RESULTS: Using phylogenetic profiles of protein clusters from completed microbial genome sequences, we identified seventeen protein candidates that are common to diverse human pathogens and absent or uncommon in non-pathogens. Mutants of 13 of these candidates were successfully generated in Yersinia pseudotuberculosis and the potential role of the proteins in virulence was assayed in an animal model. Six candidate proteins are suggested to be involved in the virulence of Y. pseudotuberculosis, none of which have previously been implicated in the virulence of Y. pseudotuberculosis and three have no record of involvement in the virulence of any bacteria. CONCLUSION: This work demonstrates a strategy for the identification of potential virulence factors that are conserved across a number of human pathogenic bacterial species, confirming the usefulness of this tool.
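The selection step described in this abstract (clusters common in pathogens, absent or uncommon in non-pathogens) can be pictured as a filter over presence/absence profiles. The sketch below is only an illustration under stated assumptions: the function `candidate_clusters`, the 75% and 25% thresholds, and the genome and cluster names are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set


def candidate_clusters(
    profiles: Dict[str, Set[str]],
    pathogens: Set[str],
    non_pathogens: Set[str],
    min_pathogen_fraction: float = 0.75,      # assumed threshold
    max_non_pathogen_fraction: float = 0.25,  # assumed threshold
) -> List[str]:
    """Return clusters present in most pathogens and in few non-pathogens."""
    selected = []
    for cluster, genomes in profiles.items():
        pathogen_fraction = len(genomes & pathogens) / len(pathogens)
        non_pathogen_fraction = len(genomes & non_pathogens) / len(non_pathogens)
        if (pathogen_fraction >= min_pathogen_fraction
                and non_pathogen_fraction <= max_non_pathogen_fraction):
            selected.append(cluster)
    return sorted(selected)


pathogens = {"P1", "P2", "P3", "P4"}
non_pathogens = {"N1", "N2", "N3", "N4"}

# Each profile lists the genomes in which a protein cluster has a member.
profiles = {
    "clusterA": {"P1", "P2", "P3", "P4", "N1"},   # widespread in pathogens, rare elsewhere
    "clusterB": {"P1", "P2", "N1", "N2", "N3"},   # not pathogen-specific
    "clusterC": {"P1", "P2", "P3"},               # pathogen-only
    "clusterD": pathogens | non_pathogens,        # universal, e.g. housekeeping
}

candidates = candidate_clusters(profiles, pathogens, non_pathogens)
print(candidates)
```

A universal cluster such as `clusterD` is excluded because it is just as common in non-pathogens, which is the point of contrasting the two genome sets.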
Affiliation(s)
- Chris J Stubben
  - Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, USA.
114
Herbert JMJ, Buffa FM, Vorschmitt H, Egginton S, Bicknell R. A new procedure for determining the genetic basis of a physiological process in a non-model species, illustrated by cold induced angiogenesis in the carp. BMC Genomics 2009; 10:490. [PMID: 19852815 PMCID: PMC2771047 DOI: 10.1186/1471-2164-10-490] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 05/05/2009] [Accepted: 10/23/2009] [Indexed: 12/26/2022]
Abstract
BACKGROUND Physiological processes occur in many species for which there is as yet no sequenced genome and for which we would like to identify the genetic basis. For example, some species increase their vascular network to minimise the effects of reduced oxygen diffusion and increased blood viscosity associated with low temperatures. Since many angiogenic and endothelial genes have been discovered in man, functional homolog relationships between carp, zebrafish and human were used to predict the genetic basis of cold-induced angiogenesis in Cyprinus carpio (carp). In this work, carp sequences were collected and built into contigs. Human-carp functional homolog relationships were derived via zebrafish using a new Conditional Stepped Reciprocal Best Hit (CSRBH) protocol. Data sources including publications, Gene Ontology and cDNA libraries were then used to predict the identity of known or potential angiogenic genes. Finally, re-analyses of cold carp microarray data identified carp genes up-regulated in response to low temperatures in heart and muscle. RESULTS The CSRBH approach outperformed all other methods and attained 8,726 carp to human functional homolog relationships for 16,650 contiguous sequences. This represented 3,762 non-redundant genes and 908 of them were predicted to have a role in angiogenesis. The total number of up-regulated differentially expressed genes was 698 and 171 of them were putatively angiogenic. Of these, 5 genes representing the functional homologs NCL, RHOA, MMP9, GRN and MAPK1 are angiogenesis-related genes expressed in response to low temperature. CONCLUSION We show that CSRBH functional homolog relationships and re-analyses of gene expression data can be combined in a non-model species to predict genes of biological interest before a genome sequence is fully available. Programs to run these analyses locally are available from http://www.cbrg.ox.ac.uk/~jherbert/.
Affiliation(s)
- John M J Herbert
- Cancer Research UK Angiogenesis Group, Institute for Biomedical Research, Schools of Immunity and Infection and Cancer studies, College of Medicine and Dentistry, University of Birmingham, Birmingham, B15 2TT, UK.
|
115
|
Bergholdt R, Brorsson C, Lage K, Nielsen JH, Brunak S, Pociot F. Expression profiling of human genetic and protein interaction networks in type 1 diabetes. PLoS One 2009; 4:e6250. [PMID: 19609442 PMCID: PMC2707614 DOI: 10.1371/journal.pone.0006250] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 03/24/2009] [Accepted: 06/17/2009] [Indexed: 01/07/2023] Open
Abstract
Proteins contributing to a complex disease are often members of the same functional pathways. Elucidation of such pathways may provide increased knowledge about functional mechanisms underlying disease. By combining genetic interactions in Type 1 Diabetes (T1D) with protein interaction data we have previously identified sets of genes, likely to represent distinct cellular pathways involved in T1D risk. Here we evaluate the candidate genes involved in these putative interaction networks not only at the single gene level, but also in the context of the networks of which they form an integral part. mRNA expression levels for each gene were evaluated and profiling was performed by measuring and comparing constitutive expression in human islets versus cytokine-stimulated expression levels, and for lymphocytes by comparing expression levels among controls and T1D individuals. We identified differential regulation of several genes. In one of the networks four out of nine genes showed significant down regulation in human pancreatic islets after cytokine exposure supporting our prediction that the interaction network as a whole is a risk factor. In addition, we measured the enrichment of T1D associated SNPs in each of the four interaction networks to evaluate evidence of significant association at network level. This method provided additional support, in an independent data set, that two of the interaction networks could be involved in T1D and highlights the following processes as risk factors: oxidative stress, regulation of transcription and apoptosis. To understand biological systems, integration of genetic and functional information is necessary, and the current study has used this approach to improve understanding of T1D and the underlying biological mechanisms.
Affiliation(s)
- Regine Bergholdt
- Hagedorn Research Institute and Steno Diabetes Center, Gentofte, Denmark.
|
116
|
Janky R, Helden JV, Babu MM. Investigating transcriptional regulation: From analysis of complex networks to discovery of cis-regulatory elements. Methods 2009; 48:277-86. [DOI: 10.1016/j.ymeth.2009.04.022] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/06/2009] [Revised: 04/17/2009] [Accepted: 04/18/2009] [Indexed: 10/20/2022] Open
|
117
|
Sam LT, Mendonça EA, Li J, Blake J, Friedman C, Lussier YA. PhenoGO: an integrated resource for the multiscale mining of clinical and biological data. BMC Bioinformatics 2009; 10 Suppl 2:S8. [PMID: 19208196 PMCID: PMC2646241 DOI: 10.1186/1471-2105-10-s2-s8] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.8] [Indexed: 11/18/2022] Open
Abstract
The evolving complexity of genome-scale experiments has increasingly centralized the role of a highly computable, accurate, and comprehensive resource spanning multiple biological scales and viewpoints. To provide a resource to meet this need, we have significantly extended the PhenoGO database with gene-disease specific annotations and included an additional ten species. This computationally derived resource is primarily intended to provide phenotypic context (cell type, tissue, organ, and disease) for mining existing associations between gene products and GO terms specified in the Gene Ontology databases. Automated natural language processing (BioMedLEE) and computational ontology (PhenOS) methods were used to derive these relationships from the literature, expanding the database with information from ten additional species to include over 600,000 phenotypic contexts spanning eleven species from five GO annotation databases. A comprehensive evaluation of the mappings (n = 300) found precision (positive predictive value) at 85%, and recall (sensitivity) at 76%. Phenotypes are encoded in general purpose ontologies such as Cell Ontology, the Unified Medical Language System, and in specialized ontologies such as the Mouse Anatomy and the Mammalian Phenotype Ontology. A web portal has also been developed, allowing for advanced filtering and querying of the database as well as download of the entire dataset.
Affiliation(s)
- Lee T Sam
- Center for Biomedical Informatics, Department of Medicine, The University of Chicago, Chicago, IL, USA.
|
118
|
Abstract
As genome sequencing outstrips the rate of high-quality, low-throughput biochemical and genetic experimentation, accurate annotation of protein function becomes a bottleneck in the progress of the biomolecular sciences. Most gene products are now annotated by homology, in which an experimentally determined function is applied to a similar sequence. This procedure becomes error-prone between more divergent sequences and can contaminate biomolecular databases. Here, we propose a computational method of assignment of function, termed Generalized Functional Linkages (GFL), that combines nonhomology-based methods with other types of data. Functional linkages describe pairwise relationships between proteins that work together to perform a biological task. GFL provides a Bayesian framework that improves annotation by arbitrating a competition among biological process annotations to best describe the target protein. GFL addresses the unequal strengths of functional linkages among proteins, the quality of existing annotations, and the similarity among them while incorporating available knowledge about the cellular location or individual molecular function of the target protein. We demonstrate GFL with functional linkages defined by an algorithm known as zorch that quantifies connectivity in protein-protein interaction networks. Even when using proteins linked only by indirect or high-throughput interactions, GFL predicts the biological processes of many proteins in Saccharomyces cerevisiae, improving the accuracy of annotation by 20% over majority voting.
|
119
|
|
120
|
Koonin EV, Wolf YI. Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res 2008; 36:6688-719. [PMID: 18948295 PMCID: PMC2588523 DOI: 10.1093/nar/gkn668] [Citation(s) in RCA: 468] [Impact Index Per Article: 29.3] [Indexed: 01/04/2023] Open
Abstract
The first bacterial genome was sequenced in 1995, and the first archaeal genome in 1996. Soon after these breakthroughs, an exponential rate of genome sequencing was established, with a doubling time of approximately 20 months for bacteria and approximately 34 months for archaea. Comparative analysis of the hundreds of sequenced bacterial and dozens of archaeal genomes leads to several generalizations on the principles of genome organization and evolution. A crucial finding that enables functional characterization of the sequenced genomes and evolutionary reconstruction is that the majority of archaeal and bacterial genes have conserved orthologs in other, often, distant organisms. However, comparative genomics also shows that horizontal gene transfer (HGT) is a dominant force of prokaryotic evolution, along with the loss of genetic material resulting in genome contraction. A crucial component of the prokaryotic world is the mobilome, the enormous collection of viruses, plasmids and other selfish elements, which are in constant exchange with more stable chromosomes and serve as HGT vehicles. Thus, the prokaryotic genome space is a tightly connected, although compartmentalized, network, a novel notion that undermines the ‘Tree of Life’ model of evolution and requires a new conceptual framework and tools for the study of prokaryotic evolution.
Affiliation(s)
- Eugene V Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
|
121
|
Discovering functional novelty in metagenomes: examples from light-mediated processes. J Bacteriol 2008; 191:32-41. [PMID: 18849420 DOI: 10.1128/jb.01084-08] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.0] [Indexed: 12/12/2022] Open
Abstract
The emerging coverage of diverse habitats by metagenomic shotgun data opens new avenues of discovering functional novelty using computational tools. Here, we apply three different concepts for predicting novel functions within light-mediated microbial pathways in five diverse environments. Using phylogenetic approaches, we discovered two novel deep-branching subfamilies of photolyases (involved in light-mediated repair) distributed abundantly in high-UV environments. Using neighborhood approaches, we were able to assign seven novel functional partners in luciferase synthesis, nitrogen metabolism, and quorum sensing to BLUF domain-containing proteins (involved in light sensing). Finally, by domain analysis, for RcaE proteins (involved in chromatic adaptation), we predict 16 novel domain architectures that indicate novel functionalities in habitats with little or no light. Quantification of protein abundance in the various environments supports our findings that bacteria utilize light for sensing, repair, and adaptation far more widely than previously thought. While the discoveries illustrate the opportunities in function discovery, we also discuss the immense conceptual and practical challenges that come along with this new type of data.
|
122
|
Abstract
The idea behind the gene neighbor method is that conservation of gene order in evolutionarily distant prokaryotes indicates functional association. The procedure presented here starts with the organization of all the genomes into pairs of adjacent genes. Then, pairs of genes in a genome of interest are mapped to their corresponding orthologs in other, informative, genomes. The final step is to determine whether the orthologs of each original pair of genes are also adjacent in the informative genome.
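The three steps described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the published procedure: the function names, the toy gene identifiers and the ortholog map are hypothetical, genomes are treated as linear lists of genes, and strand and circular chromosomes are ignored.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def adjacent_pairs(gene_order: List[str]) -> Set[FrozenSet[str]]:
    """Step 1: organize a genome into unordered pairs of adjacent genes."""
    return {frozenset((a, b)) for a, b in zip(gene_order, gene_order[1:])}


def conserved_neighbors(
    query_order: List[str],
    informative_order: List[str],
    orthologs: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Return adjacent query gene pairs whose orthologs are also adjacent
    in the informative genome."""
    informative_pairs = adjacent_pairs(informative_order)
    conserved = []
    for a, b in zip(query_order, query_order[1:]):
        # Step 2: map each gene of the pair to its ortholog, if it has one.
        oa, ob = orthologs.get(a), orthologs.get(b)
        if oa is None or ob is None:
            continue
        # Step 3: check whether the two orthologs are adjacent as well.
        if frozenset((oa, ob)) in informative_pairs:
            conserved.append((a, b))
    return conserved


# Toy data (hypothetical identifiers).
query = ["q1", "q2", "q3", "q4"]
informative = ["i9", "i2", "i1", "i7", "i4"]
ortholog_map = {"q1": "i1", "q2": "i2", "q3": "i7", "q4": "i4"}

print(conserved_neighbors(query, informative, ortholog_map))
# [('q1', 'q2'), ('q3', 'q4')]
```

In practice the ortholog map would come from a separate ortholog detection step, and evidence would be accumulated over many informative genomes rather than one.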
|
123
|
Yano N, Fadden-Paiva KJ, Endoh M, Sakai H, Kurokawa K, Dworkin LD, Rifai A. Profiling the IgA nephropathy renal transcriptome: analysis by complementary DNA array hybridization. Nephrology (Carlton) 2008. [DOI: 10.1046/j.1440-1797.7.s3.10.x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/20/2022]
|
124
|
Eisenhaber F. From a heap of facts to predictive biological theory: the future of life sciences viewed through the prism of a bioinformatics textbook introduction to bioinformatics 3rd edition. (2008). By Arthur M. Lesk. Oxford University Press. 482 pp. ISBN 978-0-19-920804-3. Bioessays 2008. [DOI: 10.1002/bies.20819] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022]
|
125
|
Bergholdt R, Størling ZM, Lage K, Karlberg EO, Olason PI, Aalund M, Nerup J, Brunak S, Workman CT, Pociot F. Integrative analysis for finding genes and networks involved in diabetes and other complex diseases. Genome Biol 2008; 8:R253. [PMID: 18045462 PMCID: PMC2258178 DOI: 10.1186/gb-2007-8-11-r253] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.9] [Received: 07/07/2007] [Revised: 10/31/2007] [Accepted: 11/28/2007] [Indexed: 01/17/2023] Open
Abstract
An integrative analysis combining genetic interactions and protein interactions can be used to identify candidate genes/proteins for type 1 diabetes and other complex diseases. We have developed an integrative analysis method combining genetic interactions, identified using type 1 diabetes genome scan data, and a high-confidence human protein interaction network. Resulting networks were ranked by the significance of the enrichment of proteins from interacting regions. We identified a number of new protein network modules and novel candidate genes/proteins for type 1 diabetes. We propose this type of integrative analysis as a general method for the elucidation of genes and networks involved in diabetes and other complex diseases.
Affiliation(s)
- Regine Bergholdt
- Steno Diabetes Center, Niels Steensensvej 2, DK-2820 Gentofte, Denmark.
|
126
|
Tran MK, Schultz CJ, Baumann U. Conserved upstream open reading frames in higher plants. BMC Genomics 2008; 9:361. [PMID: 18667093 PMCID: PMC2527020 DOI: 10.1186/1471-2164-9-361] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.4] [Received: 04/08/2008] [Accepted: 07/31/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Upstream open reading frames (uORFs) can down-regulate the translation of the main open reading frame (mORF) through two broad mechanisms: ribosomal stalling and reducing reinitiation efficiency. In distantly related plants, such as rice and Arabidopsis, it has been found that conserved uORFs are rare in these transcriptomes with approximately 100 loci. It is unclear how prevalent conserved uORFs are in closely related plants. RESULTS We used a homology-based approach to identify conserved uORFs in five cereals (monocots) that could potentially regulate translation. Our approach used a modified reciprocal best hit method to identify putative orthologous sequences that were then analysed by a comparative R-nomics program called uORFSCAN to find conserved uORFs. CONCLUSION This research identified new genes that may be controlled at the level of translation by conserved uORFs. We report that conserved uORFs are rare (<150 loci contain them) in cereal transcriptomes, are generally short (less than 100 nt), highly conserved (50% median amino acid sequence similarity), position independent in their 5'-UTRs, and their start codon context and the usage of rare codons for translation does not appear to be important.
Affiliation(s)
- Michael K Tran
- Australian Centre for Plant Functional Genomics PMB 1 Glen Osmond SA 5064, Australia.
|
127
|
Ahmad I, Hoessli DC, Qazi WM, Khurshid A, Mehmood A, Walker‐Nasir E, Ahmad M, Shakoori AR, Nasir‐ud‐Din. MAPRes: An efficient method to analyze protein sequence around post‐translational modification sites. J Cell Biochem 2008; 104:1220-31. [DOI: 10.1002/jcb.21699] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/07/2022]
|
128
|
Martinez-Guerrero CE, Ciria R, Abreu-Goodger C, Moreno-Hagelsieb G, Merino E. GeConT 2: gene context analysis for orthologous proteins, conserved domains and metabolic pathways. Nucleic Acids Res 2008; 36:W176-80. [PMID: 18511460 PMCID: PMC2447741 DOI: 10.1093/nar/gkn330] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.1] [Indexed: 11/13/2022] Open
Abstract
The Gene Context Tool (GeConT) allows users to visualize the genomic context of a gene or a group of genes and their orthologous relationships within fully sequenced bacterial genomes. The new version of the server incorporates information from the COG, Pfam and KEGG databases, allowing users to have an integrated graphical representation of the function of genes at multiple levels, their phylogenetic distribution and their genomic context. The sequence of any of the genes can be easily retrieved, as well as the 5′ or 3′ regulatory regions, greatly facilitating further types of analysis. GeConT 2 is available at: http://bioinfo.ibt.unam.mx/gecont.
Affiliation(s)
- C E Martinez-Guerrero
- Departmento de Microbiología Molecular, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
|
129
|
Khwaja TA, Wajahat T, Ahmad I, Hoessli DC, Walker-Nasir E, Kaleem A, Qazi WM, Shakoori AR, Din NU. In silico modulation of apoptotic Bcl-2 proteins by mistletoe lectin-1: functional consequences of protein modifications. J Cell Biochem 2008; 103:479-91. [PMID: 17583555 DOI: 10.1002/jcb.21412] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/11/2022]
Abstract
The mistletoe lectin-1 (ML-1) modulates tumor cell apoptosis by triggering signaling cascades through the complex interplay of phosphorylation and O-linked N-acetylglucosamine (O-GlcNAc) modification in pro- and anti-apoptotic proteins. In particular, ML-1 is predicted to induce dephosphorylation of Bcl-2-family proteins and their alternative O-GlcNAc modification at specific, conserved Ser/Thr residues. The sites for phosphorylation and glycosylation were predicted and analyzed using Netphos 2.0 and YinOYang 1.2. The involvement of modified Ser/Thr, and among them the potential Yin Yang sites that may undergo both types of posttranslational modification, is proposed to mediate apoptosis modulation by ML-1.
Affiliation(s)
- Tasneem A Khwaja
- Institute of Molecular Sciences and Bioinformatics, Lahore, Pakistan
|
130
|
Sridhar J, Rafi ZA. Functional annotations in bacterial genomes based on small RNA signatures. Bioinformation 2008; 2:284-95. [PMID: 18478081 PMCID: PMC2374372 DOI: 10.6026/97320630002284] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 03/12/2008] [Accepted: 03/25/2008] [Indexed: 02/01/2023] Open
Abstract
One of the key challenges in computational genomics is annotating coding genes and identifying regulatory RNAs in complete genomes. An attempt is made in this study which uses the regulatory RNA locations and their conserved flanking genes identified within the genomic backbone of a template genome to search for similar RNA locations in query genomes. The search is based on the recently reported coexistence of small RNAs and their conserved flanking genes in related genomes. Based on our study, 54 additional sRNA locations and functions of 96 uncharacterized genes are predicted in two draft genomes, viz., Serratia marcescens Db1 and Yersinia enterocolitica 8081. Although most of the identified additional small RNA regions and their corresponding flanking genes are homologous in nature, the proposed anchoring technique could also successfully identify four non-homologous small RNA regions in the Y. enterocolitica genome. The KEGG Orthology (KO) based automated functional predictions confirm the predicted functions of 65 flanking genes having defined KO numbers, out of the total 96 predictions made by this method. This coexistence based method shows more sensitivity than controlled vocabularies in locating orthologous gene pairs even in the absence of defined Orthology numbers. All functional predictions made by this study in Y. enterocolitica 8081 were confirmed by the recently published complete genome sequence and annotations. This study also reports the possible regions of gene rearrangements in these two genomes and further characterization of such RNA regions could shed more light on their possible role in genome evolution.
Affiliation(s)
- Jayavel Sridhar
- Centre of Excellence in Bioinformatics, School of Biotechnology, Madurai Kamaraj University, Madurai 625021, Tamilnadu, India
- Ziauddin Ahamed Rafi
- Centre of Excellence in Bioinformatics, School of Biotechnology, Madurai Kamaraj University, Madurai 625021, Tamilnadu, India
|
131
|
Gonzalez O, Zimmer R. Assigning functional linkages to proteins using phylogenetic profiles and continuous phenotypes. Bioinformatics 2008; 24:1257-63. [PMID: 18381403 DOI: 10.1093/bioinformatics/btn106] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 11/12/2022]
Abstract
MOTIVATION A class of non-homology-based methods for protein function prediction relies on the assumption that genes linked to a phenotypic trait are preferentially conserved among organisms that share the trait. These methods typically compare pairs of binary strings, where one string encodes the phylogenetic distribution of a trait and the other of a protein. In this work, we extended the approach to automatically deal with continuous phenotypes. RESULTS Rather than use a priori rules, which can be very subjective, to construct binary profiles from continuous phenotypes, we propose to systematically explore thresholds which can meaningfully separate the phenotype values. We illustrate our method by analyzing optimal growth temperatures, and demonstrate its usefulness by automatically retrieving genes which have been associated with thermophilic growth. We also apply the general approach, for the first time, to optimal growth pH, and make novel predictions. Finally, we show that our method can also be applied to other properties which may not be classically considered as phenotypes. Specifically, we studied correlations between genome size and the distribution of genes.
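The threshold exploration described in this abstract can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: candidate thresholds are taken as midpoints between consecutive distinct phenotype values, and a threshold is scored by the fraction of organisms where the binarized trait agrees with the gene's presence/absence profile. The abstract does not state which candidates or which scoring function the authors actually use.

```python
from typing import List, Optional, Tuple


def candidate_thresholds(values: List[float]) -> List[float]:
    """Midpoints between consecutive distinct phenotype values (an assumption)."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]


def best_threshold(
    phenotype: List[float], presence: List[bool]
) -> Optional[Tuple[float, float]]:
    """Return (threshold, agreement) for the threshold whose binarized trait
    best matches the gene's phylogenetic profile, or None if all phenotype
    values are identical and no threshold can separate them."""
    best = None
    for t in candidate_thresholds(phenotype):
        trait = [v > t for v in phenotype]  # binary trait profile at this threshold
        agreement = sum(p == q for p, q in zip(trait, presence)) / len(presence)
        if best is None or agreement > best[1]:
            best = (t, agreement)
    return best


# Toy data: optimal growth temperatures (degrees C) for five organisms and
# the presence of a hypothetical gene in each of them.
temps = [25.0, 30.0, 37.0, 70.0, 85.0]
gene_present = [False, False, False, True, True]

print(best_threshold(temps, gene_present))
# (53.5, 1.0)
```

A simple agreement fraction is used only to keep the example short; a statistical score that accounts for profile composition would be a more careful choice on real data.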
Affiliation(s)
- Orland Gonzalez
- Institute for Informatics, Ludwig-Maximilians-Universität München, Amalienstr. 17, 80333 Munich, Germany.
|
132
|
Linghu B, Snitkin ES, Holloway DT, Gustafson AM, Xia Y, DeLisi C. High-precision high-coverage functional inference from integrated data sources. BMC Bioinformatics 2008; 9:119. [PMID: 18298847 PMCID: PMC2292694 DOI: 10.1186/1471-2105-9-119] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.6] [Received: 06/15/2007] [Accepted: 02/25/2008] [Indexed: 11/15/2022] Open
Abstract
Background Information obtained from diverse data sources can be combined in a principled manner using various machine learning methods to increase the reliability and range of knowledge about protein function. The result is a weighted functional linkage network (FLN) in which linked neighbors share at least one function with high probability. Precision is, however, low. Aiming to provide precise functional annotation for as many proteins as possible, we explore and propose a two-step framework for functional annotation: (1) construction of a high-coverage and reliable FLN via machine learning techniques; (2) development of a decision rule for the constructed FLN to optimize functional annotation. Results We first apply this framework to Saccharomyces cerevisiae. In the first step, we demonstrate that four commonly used machine learning methods, Linear SVM, Linear Discriminant Analysis, Naïve Bayes, and Neural Network, all combine heterogeneous data to produce reliable and high-coverage FLNs, in which the linkage weight more accurately estimates functional coupling of linked proteins than using individual data sources alone. In the second step, empirical tuning of an adjustable decision rule on the constructed FLN reveals that basing annotation on maximum edge weight results in the most precise annotation at high coverages. In particular, at low coverage all rules evaluated perform comparably. At coverage above approximately 50%, however, they diverge rapidly. At full coverage, the maximum weight decision rule still has a precision of approximately 70%, whereas for other methods, precision ranges from a high of slightly more than 30%, down to 3%. In addition, a scoring scheme to estimate the precisions of individual predictions is also provided. Finally, tests of the robustness of the framework indicate that our framework can be successfully applied to less studied organisms.
Conclusion We provide a general two-step function-annotation framework, and show that high coverage, high precision annotations can be achieved by constructing a high-coverage and reliable FLN via data integration followed by applying a maximum weight decision rule.
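A maximum weight decision rule of the kind named in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the FLN is represented as a plain dictionary of weighted edges, the function name and toy proteins are hypothetical, and the rule simply transfers the functions of the annotated neighbor with the largest edge weight. The paper's actual rule and its precision scoring scheme may differ in detail.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: protein -> list of (neighbor, linkage weight).
FLN = Dict[str, List[Tuple[str, float]]]


def annotate_by_max_weight(
    fln: FLN, known: Dict[str, Set[str]]
) -> Dict[str, Tuple[Set[str], float]]:
    """For each unannotated protein, copy the functions of its annotated
    neighbor with the highest edge weight and report that weight."""
    predictions = {}
    for protein, edges in fln.items():
        if protein in known:
            continue  # already annotated, nothing to predict
        annotated = [(w, n) for n, w in edges if n in known]
        if not annotated:
            continue  # no annotated neighbor, so no prediction is made
        weight, neighbor = max(annotated)  # strongest linkage wins
        predictions[protein] = (known[neighbor], weight)
    return predictions


# Toy network (hypothetical proteins and weights).
network = {
    "P1": [("A", 0.9), ("B", 0.4)],
    "P2": [("B", 0.2), ("C", 0.6)],  # C has no known function
    "A": [("P1", 0.9)],
}
known_functions = {"A": {"transport"}, "B": {"DNA repair"}}

print(annotate_by_max_weight(network, known_functions))
# {'P1': ({'transport'}, 0.9), 'P2': ({'DNA repair'}, 0.2)}
```

Returning the winning edge weight alongside each prediction makes it possible to rank predictions or discard weak ones, which is one way to trade coverage against precision.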
Affiliation(s)
- Bolan Linghu
- Bioinformatics Graduate Program, Boston University, Boston, MA, 02215, USA.
|
133
|
Nair R, Rost B. Protein subcellular localization prediction using artificial intelligence technology. Methods Mol Biol 2008; 484:435-63. [PMID: 18592195 DOI: 10.1007/978-1-59745-398-1_27] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Indexed: 05/20/2023]
Abstract
Proteins perform many important tasks in living organisms, such as catalysis of biochemical reactions, transport of nutrients, and recognition and transmission of signals. The plethora of aspects of the role of any particular protein is referred to as its "function." One aspect of protein function that has been the target of intensive research by computational biologists is its subcellular localization. Proteins must be localized in the same subcellular compartment to cooperate toward a common physiological function. Aberrant subcellular localization of proteins can result in several diseases, including kidney stones, cancer, and Alzheimer's disease. To date, sequence homology remains the most widely used method for inferring the function of a protein. However, the application of advanced artificial intelligence (AI)-based techniques in recent years has resulted in significant improvements in our ability to predict the subcellular localization of a protein. The prediction accuracy has risen steadily over the years, in large part due to the application of AI-based methods such as hidden Markov models (HMMs), neural networks (NNs), and support vector machines (SVMs), although the availability of larger experimental datasets has also played a role. Automatic methods that mine textual information from the biological literature and molecular biology databases have considerably sped up the process of annotation for proteins for which some information regarding function is available in the literature. State-of-the-art methods based on NNs and HMMs can predict the presence of N-terminal sorting signals extremely accurately. Ab initio methods that predict subcellular localization for any protein sequence using only the native amino acid sequence and features predicted from the native sequence have shown the most remarkable improvements. The prediction accuracy of these methods has increased by over 30% in the past decade. 
The accuracy of these methods is now on par with high-throughput methods for predicting localization, and they are beginning to play an important role in directing experimental research. In this chapter, we review some of the most important methods for the prediction of subcellular localization.
Affiliation(s)
- Rajesh Nair
- CUBIC Department of Biochemistry and Molecular Biophysics and Center for Computational Biology and Bioinformatics, Columbia University, New York, NY, USA
|
134
|
Lee D, Redfern O, Orengo C. Predicting protein function from sequence and structure. Nat Rev Mol Cell Biol 2007; 8:995-1005. [PMID: 18037900 DOI: 10.1038/nrm2281] [Citation(s) in RCA: 359] [Impact Index Per Article: 21.1] [Indexed: 11/09/2022]
|
135
|
Moreno-Hagelsieb G, Latimer K. Choosing BLAST options for better detection of orthologs as reciprocal best hits. Bioinformatics 2007; 24:319-24. [PMID: 18042555 DOI: 10.1093/bioinformatics/btm585] [Citation(s) in RCA: 341] [Impact Index Per Article: 20.1] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION The analysis of the increasing number of genome sequences requires shortcuts for the detection of orthologs, such as Reciprocal Best Hits (RBH), where orthologs are assumed if two genes, each in a different genome, find each other as the best hit in the other genome. Two BLAST options seem to affect alignment scores the most, and thus the choice of a best hit: the filtering of low information sequence segments and the algorithm used to produce the final alignment. Thus, we decided to test whether such options would help better detect orthologs. RESULTS Using Escherichia coli K12 as an example, we compared the number and quality of orthologs detected as RBH. We tested four different conditions derived from two options: filtering of low-information segments, hard (default) versus soft; and alignment algorithm, default (based on matching words) versus Smith-Waterman. All options resulted in significant differences in the number of orthologs detected, with the highest numbers obtained with the combination of soft filtering with Smith-Waterman alignments. We compared these results with those of Reciprocal Shortest Distances (RSD), supposed to be superior to RBH because it uses an evolutionary measure of distance, rather than BLAST statistics, to rank homologs and thus detect orthologs. RSD barely increased the number of orthologs detected over those found with RBH. Error estimates, based on analyses of conservation of gene order, found small differences in the quality of orthologs detected using RBH. However, RSD showed the highest error rates. Thus, RSD has no advantages over RBH. AVAILABILITY Orthologs detected as Reciprocal Best Hits using soft masking and Smith-Waterman alignments can be downloaded from http://popolvuh.wlu.ca/Orthologs.
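The RBH criterion defined in this abstract is easy to state in code. The sketch below assumes the search results have already been reduced to (query, subject, score) tuples, for example from tabular BLAST output; the function names and the toy hits are hypothetical, and ranking by a single score is a simplification of how a best hit might be chosen in practice.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, str, float]  # (query gene, subject gene, alignment score)


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query gene to its highest-scoring subject gene."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit]) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy search results between genome A and genome B (hypothetical scores).
a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 150.0), ("a2", "b2", 90.0), ("a3", "b3", 50.0)]
b_vs_a = [("b1", "a1", 198.0), ("b2", "a1", 140.0), ("b2", "a2", 95.0), ("b3", "a4", 60.0)]

print(reciprocal_best_hits(a_vs_b, b_vs_a))
# [('a1', 'b1')]
```

Here a2 picks b2 as its best hit, but b2 prefers a1, so the pair is not reciprocal and is left out. Because the best hit depends on the scores, any option that changes alignment scores, such as masking or the alignment algorithm, can change which pairs pass this test.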
Affiliation(s)
- Gabriel Moreno-Hagelsieb
- Department of Biology, Wilfrid Laurier University, 75 University Avenue West, Waterloo, ON, Canada, N2L 3C5.
|
136
|
Gotzek D, Ross KG. Genetic regulation of colony social organization in fire ants: an integrative overview. Quarterly Review of Biology 2007; 82:201-26. [PMID: 17937246 DOI: 10.1086/519965] [Citation(s) in RCA: 69] [Impact Index Per Article: 4.1] [Indexed: 11/04/2022]
Abstract
Expression of colony social organization in fire ants appears to be under the control of a single Mendelian factor of large effect. Variation in colony queen number in Solenopsis invicta and its relatives is associated with allelic variation at the gene Gp-9, but not with variation at other unlinked genes; workers regulate queen identity and number on the basis of Gp-9 genotypic compatibility. Nongenetic factors, such as prior social experience, queen reproductive status, and local environment, have negligible effects on queen numbers which illustrates the nearly complete penetrance of Gp-9. As predicted, queen number can be manipulated experimentally by altering worker Gp-9 genotype frequencies. The Gp-9 allele lineage associated with polygyny in South American fire ants has been retained across multiple speciation events, which may signal the action of balancing selection to maintain social polymorphism in these species. Moreover, positive selection is implicated in driving the molecular evolution of Gp-9 in association with the origin of polygyny. The identity of the product of Gp-9 as an odorant-binding protein suggests plausible scenarios for its direct involvement in the regulation of queen number via a role in chemical communication. While these and other lines of evidence show that Gp-9 represents a legitimate candidate gene of major effect, studies aimed at determining (i) the biochemical pathways in which GP-9 functions; (ii) the phenotypic effects of molecular variation at Gp-9 and other pathway genes; and (iii) the potential involvement of genes in linkage disequilibrium with Gp-9 are needed to elucidate the genetic architecture underlying social organization in fire ants. 
Information that reveals the links between molecular variation, individual phenotype, and colony-level behaviors, combined with behavioral models that incorporate details of the chemical communication involved in regulating queen number, will yield a novel integrated view of the evolutionary changes underlying a key social adaptation.
Affiliation(s)
- Dietrich Gotzek
- Department of Ecology and Evolution, University of Lausanne 1015 Lausanne, Switzerland.
|
137
|
McLaughlin WA, Chen K, Hou T, Wang W. On the detection of functionally coherent groups of protein domains with an extension to protein annotation. BMC Bioinformatics 2007; 8:390. [PMID: 17937820 PMCID: PMC2151957 DOI: 10.1186/1471-2105-8-390] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 03/21/2007] [Accepted: 10/16/2007] [Indexed: 01/31/2023] Open
Abstract
Background Protein domains coordinate to perform multifaceted cellular functions, and domain combinations serve as the functional building blocks of the cell. The available methods to identify functional domain combinations are limited in their scope, e.g. to the identification of combinations falling within individual proteins or within specific regions in a translated genome. Further effort is needed to identify groups of domains that span across two or more proteins and are linked by a cooperative function. Such functional domain combinations can be useful for protein annotation. Results Using a new computational method, we have identified 114 groups of domains, referred to as domain assembly units (DASSEM units), in the proteome of budding yeast Saccharomyces cerevisiae. The units participate in many important cellular processes such as transcription regulation, translation initiation, and mRNA splicing. Within the units the domains were found to function in a cooperative manner; and each domain contributed to a different aspect of the unit's overall function. The member domains of DASSEM units were found to be significantly enriched among proteins contained in transcription modules, defined as genes sharing similar expression profiles and presumably similar functions. The observation further confirmed the functional coherence of DASSEM units. The functional linkages of units were found in both functionally characterized and uncharacterized proteins, which enabled the assessment of protein function based on domain composition. Conclusion A new computational method was developed to identify groups of domains that are linked by a common function in the proteome of Saccharomyces cerevisiae. These groups can either lie within individual proteins or span across different proteins. We propose that the functional linkages among the domains within the DASSEM units can be used as a non-homology based tool to annotate uncharacterized proteins.
Affiliation(s)
- William A McLaughlin
- Department of Chemistry and Biochemistry, Center for Theoretical Biological Physics, University of California, San Diego, 9500 Gilman Drive La Jolla, CA 92093-0359, USA.
|
138
|
Harrington ED, Singh AH, Doerks T, Letunic I, von Mering C, Jensen LJ, Raes J, Bork P. Quantitative assessment of protein function prediction from metagenomics shotgun sequences. Proc Natl Acad Sci U S A 2007; 104:13913-8. [PMID: 17717083 PMCID: PMC1955820 DOI: 10.1073/pnas.0702636104] [Citation(s) in RCA: 63] [Impact Index Per Article: 3.7] [Indexed: 11/18/2022] Open
Abstract
To assess the potential of protein function prediction in environmental genomics data, we analyzed shotgun sequences from four diverse and complex habitats. Using homology searches as well as customized gene neighborhood methods that incorporate intergenic and evolutionary distances, we inferred specific functions for 76% of the 1.4 million predicted ORFs in these samples (83% when nonspecific functions are considered). Surprisingly, these fractions are only slightly smaller than the corresponding ones in completely sequenced genomes (83% and 86%, respectively, by using the same methodology) and considerably higher than previously thought. For as many as 75,448 ORFs (5% of the total), only neighborhood methods can assign functions, illustrated here by a previously undescribed gene associated with the well characterized heme biosynthesis operon and a potential transcription factor that might regulate a coupling between fatty acid biosynthesis and degradation. Our results further suggest that, although functions can be inferred for most proteins on earth, many functions remain to be discovered in numerous small, rare protein families.
Affiliation(s)
- E. D. Harrington
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
- A. H. Singh
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
- T. Doerks
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
- I. Letunic
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
- C. von Mering
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
- L. J. Jensen
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
- J. Raes
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
- P. Bork
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
- Max Delbrück Centre for Molecular Medicine, D-13092 Berlin, Germany
- To whom correspondence should be addressed. E-mail:
|
139
|
Ahmad I, Hoessli DC, Gupta R, Walker-Nasir E, Rafik SM, Choudhary MI, Shakoori AR. In silico determination of intracellular glycosylation and phosphorylation sites in human selectins: implications for biological function. J Cell Biochem 2007; 100:1558-72. [PMID: 17230456 DOI: 10.1002/jcb.21156] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/06/2022]
Abstract
Post-translational modifications provide the proteins with the possibility to perform functions in addition to those determined by their primary sequence. However, analysis of multifunctional protein structures in the environment of cells and body fluids is made especially difficult by the presence of other interacting proteins. Bioinformatics tools are therefore helpful to predict protein multifunctionality through the identification of serine and threonine residues wherein the hydroxyl group is likely to become modified by phosphorylation or glycosylation. Moreover, serines and threonines where both modifications are likely to occur can also be predicted (YinYang sites), to suggest further functional versatility. Structural modifications of hydroxyl groups of P-, E-, and L-selectins have been predicted and possible functions resulting from such modifications are proposed. Functional changes of the three selectins are based on the assumption that transitory and reversible protein modifications by phosphate and O-GlcNAc cause specific conformational changes and generate binding sites for other proteins. The computer-assisted prediction of glycosylation and phosphorylation sites in selectins should be helpful to assess the contribution of dynamic protein modifications in selectin-mediated inflammatory responses and cell-cell adhesion processes that are difficult to determine experimentally.
Affiliation(s)
- Ishtiaq Ahmad
- Institute of Molecular Sciences and Bioinformatics, Lahore, Pakistan
|
140
|
Quantitative assessment of relationship between sequence similarity and function similarity. BMC Genomics 2007; 8:222. [PMID: 17620139 PMCID: PMC1949826 DOI: 10.1186/1471-2164-8-222] [Citation(s) in RCA: 58] [Impact Index Per Article: 3.4] [Received: 07/05/2006] [Accepted: 07/09/2007] [Indexed: 11/16/2022] Open
Abstract
Background Comparative sequence analysis is considered the first step towards annotating new proteins in genome annotation. However, sequence comparison may lead to creation and propagation of function assignment errors. Thus, it is important to perform a thorough analysis of the quality of sequence-based function assignment using large-scale data in a systematic way. Results We present an analysis of the relationship between sequence similarity and function similarity for the proteins in four model organisms, i.e., Arabidopsis thaliana, Saccharomyces cerevisiae, Caenorhabditis elegans, and Drosophila melanogaster. Using a measure of functional similarity based on the three categories of Gene Ontology (GO) classifications (biological process, molecular function, and cellular component), we quantified the correlation between functional similarity and sequence similarity measured by sequence identity or statistical significance of the alignment and compared such a correlation against randomly chosen protein pairs. Conclusion Various sequence-function relationships were identified from BLAST versus PSI-BLAST, sequence identity versus Expectation Value, GO indices versus semantic similarity approaches, and within genome versus between genome comparisons, for the three GO categories. Our study provides a benchmark to estimate the confidence in assignment of functions purely based on sequence similarity.
|
141
|
Ahmad I, Hoessli DC, Walker-Nasir E, Choudhary MI, Rafik SM, Shakoori AR. Phosphorylation and glycosylation interplay: protein modifications at hydroxy amino acids and prediction of signaling functions of the human beta3 integrin family. J Cell Biochem 2007; 99:706-18. [PMID: 16676352 DOI: 10.1002/jcb.20814] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Indexed: 11/07/2022]
Abstract
Protein functions are determined by their three-dimensional structures, and the folded 3-D structure is in turn governed by the primary structure and post-translational modifications the protein undergoes during synthesis and transport. Defining protein functions in vivo in the cellular and extracellular environments is made very difficult in the presence of other molecules. However, the modifications taking place during and after protein folding are determined by the modification potential of amino acids and not by the primary structure or sequence. These post-translational modifications, like phosphorylation and O-linked N-acetylglucosamine (O-GlcNAc) modifications, are dynamic and result in temporary conformational changes that regulate many functions of the protein. Computer-assisted studies can help determine protein functions by assessing the modification potentials of a given protein. Integrins are important membrane receptors involved in bi-directional (outside-in and inside-out) signaling events. The beta3 integrin family, including alpha(IIb)beta3 and alpha(v)beta3, has been studied for its role in platelet aggregation during clot formation and clot retraction based on hydroxyl group modification by phosphate and GlcNAc on Ser, Thr, or Tyr and their interplay on Ser and Thr in the cytoplasmic domain of the beta3 subunit. An antagonistic role of phosphate and GlcNAc interplay at Thr758 for controlling both inside-out and outside-in signaling events is proposed. Additionally, interplay of GlcNAc and phosphate at Ser752 has been proposed to control activation and inactivation of integrin-associated Src kinases. This study describes the multifunctional behavior of integrins based on their modification potential at hydroxyl groups of amino acids as a source of interplay.
Affiliation(s)
- Ishtiaq Ahmad
- Institute of Molecular Sciences and Bioinformatics, Lahore, Pakistan
|
142
|
Raes J, Harrington ED, Singh AH, Bork P. Protein function space: viewing the limits or limited by our view? Curr Opin Struct Biol 2007; 17:362-9. [PMID: 17574832 DOI: 10.1016/j.sbi.2007.05.010] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.6] [Received: 04/25/2007] [Revised: 04/25/2007] [Accepted: 05/31/2007] [Indexed: 12/13/2022]
Abstract
Given that the number of protein functions on earth is finite, the rapid expansion of biological knowledge and the concomitant exponential increase in the number of protein sequences should, at some point, enable the estimation of the limits of protein function space. The functional coverage of protein sequences can be investigated using computational methods, especially given the massive amount of data being generated by large-scale environmental sequencing (metagenomics). In completely sequenced genomes, the fraction of proteins to which at least some functional features can be assigned has recently risen to as much as approximately 85%. Although this fraction is more uncertain in metagenomics surveys, because of environmental complexities and differences in analysis protocols, our global knowledge of protein functions still appears to be considerable. However, when we consider protein families, continued sequencing seems to yield an ever-increasing number of novel families. Until we reconcile these two views, the limits of protein space will remain obscured.
Affiliation(s)
- Jeroen Raes
- European Molecular Biology Laboratory, Meyerhofstrasse 1, D-69117 Heidelberg, Germany
|
143
|
Bryliński M, Prymula K, Jurkowski W, Kochańczyk M, Stawowczyk E, Konieczny L, Roterman I. Prediction of functional sites based on the fuzzy oil drop model. PLoS Comput Biol 2007; 3:e94. [PMID: 17530916 PMCID: PMC1876487 DOI: 10.1371/journal.pcbi.0030094] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.4] [Received: 08/14/2006] [Accepted: 04/11/2007] [Indexed: 11/19/2022] Open
Abstract
A description of many biological processes requires knowledge of the 3-D structure of proteins and, in particular, the defined active site responsible for biological function. Many proteins, the genes of which have been identified as the result of human genome sequencing, and which were synthesized experimentally, await identification of their biological activity. Currently used methods do not always yield satisfactory results, and new algorithms need to be developed to recognize the localization of active sites in proteins. This paper describes a computational model that can be used to identify potential areas that are able to interact with other molecules (ligands, substrates, inhibitors, etc.). The model for active site recognition is based on the analysis of hydrophobicity distribution in protein molecules. It is shown, based on the analyses of proteins with known biological activity and of proteins of unknown function, that the region of significantly irregular hydrophobicity distribution in proteins appears to be function related.
Affiliation(s)
- Michał Bryliński
- Department of Bioinformatics and Telemedicine, Jagiellonian University–Collegium Medicum, Kraków, Poland
- Faculty of Chemistry, Jagiellonian University, Kraków, Poland
- Katarzyna Prymula
- Department of Bioinformatics and Telemedicine, Jagiellonian University–Collegium Medicum, Kraków, Poland
- Faculty of Chemistry, Jagiellonian University, Kraków, Poland
- Wiktor Jurkowski
- Department of Bioinformatics and Telemedicine, Jagiellonian University–Collegium Medicum, Kraków, Poland
- Marek Kochańczyk
- Department of Bioinformatics and Telemedicine, Jagiellonian University–Collegium Medicum, Kraków, Poland
- Faculty of Physics, Astronomy and Applied Computer Science, Jagiellonian University, Kraków, Poland
- Ewa Stawowczyk
- Department of Bioinformatics and Telemedicine, Jagiellonian University–Collegium Medicum, Kraków, Poland
- Leszek Konieczny
- Institute of Medical Biochemistry, Jagiellonian University–Collegium Medicum, Kraków, Poland
- Irena Roterman
- Department of Bioinformatics and Telemedicine, Jagiellonian University–Collegium Medicum, Kraków, Poland
- Faculty of Physics, Astronomy and Applied Computer Science, Jagiellonian University, Kraków, Poland
- * To whom correspondence should be addressed. E-mail:
|
144
|
Brylinski M, Kochanczyk M, Broniatowska E, Roterman I. Localization of ligand binding site in proteins identified in silico. J Mol Model 2007; 13:665-75. [PMID: 17394030 DOI: 10.1007/s00894-007-0191-x] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Received: 11/20/2006] [Accepted: 02/26/2007] [Indexed: 01/21/2023]
Abstract
Knowledge-based models for protein folding assume that the early-stage structural form of a polypeptide is determined by the backbone conformation, followed by hydrophobic collapse. Side chain-side chain interactions, mostly of hydrophobic character, lead to the formation of the hydrophobic core, which seems to stabilize the structure of the protein in its natural environment. The fuzzy-oil-drop model is employed to represent the idealized hydrophobicity distribution in the protein molecule. Comparing it with the one empirically observed in the protein molecule reveals that they are not in agreement. It is shown in this study that the irregularity of hydrophobic distributions is aim-oriented. The character and strength of these irregularities in the organization of the hydrophobic core point to the specificity of a particular protein's structure/function. When the location of these irregularities is determined versus the idealized fuzzy-oil-drop, function-related areas in the protein molecule can be identified. The presented model can also be used to identify ways in which protein-protein complexes can possibly be created. Active sites can be predicted for any protein structure according to the presented model with the free prediction server at http://www.bioinformatics.cm-uj.krakow.pl/activesite. The implication based on the model presented in this work suggests the necessity of active presence of ligand during the protein folding process simulation.
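The comparison of an idealized and an observed hydrophobicity distribution can be illustrated with a small sketch. It assumes, purely for illustration, that the idealized distribution is a three-dimensional Gaussian centred on the molecule (the abstract does not give its functional form), that coordinates are already centred on the origin, and that per-residue observed hydrophobicity values are supplied by some other procedure. The function names, the `sigmas` values, the threshold and the toy data are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def idealized_profile(coords: Sequence[Point], sigmas: Point) -> List[float]:
    """Normalized 3-D Gaussian value at each residue position.

    Assumption: the Gaussian stands in for the idealized distribution.
    """
    raw = [
        math.exp(-sum((c * c) / (2.0 * s * s) for c, s in zip(point, sigmas)))
        for point in coords
    ]
    total = sum(raw)
    return [value / total for value in raw]


def normalize(values: Sequence[float]) -> List[float]:
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def irregular_residues(
    coords: Sequence[Point],
    observed: Sequence[float],
    sigmas: Point,
    threshold: float,
) -> List[int]:
    """Indices where |idealized - observed| exceeds the threshold."""
    ideal = idealized_profile(coords, sigmas)
    obs = normalize(observed)
    return [i for i, (t, o) in enumerate(zip(ideal, obs)) if abs(t - o) > threshold]


# Hypothetical toy molecule: residue centres and observed hydrophobicity.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (3.0, 3.0, 3.0)]
observed = [4.0, 3.5, 2.4, 2.0]

flagged = irregular_residues(coords, observed, sigmas=(2.0, 2.0, 2.0), threshold=0.1)
print(flagged)
```

In the toy data the fourth residue lies far from the centre yet carries substantial observed hydrophobicity, so it is the one flagged as irregular. Whether such positions correspond to function-related areas is the claim of the paper, not something this sketch demonstrates.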
Affiliation(s)
- Michal Brylinski
- Department of Bioinformatics and Telemedicine, Jagiellonian University-Collegium Medicum, Łazarza 16, 31-530, Krakow, Poland
|
145
|
Bi R, Zhou Y, Lu F, Wang W. Predicting Gene Ontology functions based on support vector machines and statistical significance estimation. Neurocomputing 2007. [DOI: 10.1016/j.neucom.2006.10.006] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Indexed: 11/28/2022]
|
146
|
Duan ZH, Hughes B, Reichel L, Perez DM, Shi T. The relationship between protein sequences and their gene ontology functions. BMC Bioinformatics 2006; 7 Suppl 4:S11. [PMID: 17217503 PMCID: PMC1780109 DOI: 10.1186/1471-2105-7-s4-s11] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/10/2022] Open
Abstract
Background One main research challenge in the post-genomic era is to understand the relationship between protein sequences and their biological functions. In recent years, several automated annotation systems have been developed for the functional assignment of uncharacterized proteins. The underlying assumption of these systems is that similar sequences imply similar biological functions. However, it has been noted that matching sequences do not always imply similar functions. Results In this paper, we present the correlation between protein sequences and protein functions for the yeast proteome in the context of gene ontology. A novel measure is introduced to define the overall similarity between two protein sequences. The effects of the level as well as the size of a gene ontology group on the degree of similarity were studied. The similarity distributions at different levels of gene ontology trees are presented. To evaluate the theoretical prediction power of similar sequences, we computed the posterior probability of correct predictions. Conclusion The results indicate that protein pairs of similar biological functions tend to have higher sequence similarity, although the similarity distribution in each functional group is heterogeneous and varies from group to group. We conclude that sequence similarity can serve as a key measure in protein function prediction. However, the resulting annotations must be verified through other means. A method that combines a broader range of measures is more likely to provide more accurate predictions. Our study indicates that the posterior probability of a correct prediction could serve as one of the key measures.
Affiliation(s)
- Zhong-Hui Duan
- Department of Computer Science, University of Akron, Akron, OH, 44325, USA
- Brent Hughes
- Department of Computer Science, University of Akron, Akron, OH, 44325, USA
- Lothar Reichel
- Department of Mathematical Sciences, Kent State University, Kent, OH, 44242, USA
- Dianne M Perez
- Department of Molecular Cardiology, Lerner Research Institute, Cleveland Clinic Foundation, Cleveland, OH, 44195, USA
- Ting Shi
- Department of Molecular Cardiology, Lerner Research Institute, Cleveland Clinic Foundation, Cleveland, OH, 44195, USA
|
147
|
Liu Y, Li J, Sam L, Goh CS, Gerstein M, Lussier YA. An integrative genomic approach to uncover molecular mechanisms of prokaryotic traits. PLoS Comput Biol 2006; 2:e159. [PMID: 17112314 PMCID: PMC1636675 DOI: 10.1371/journal.pcbi.0020159] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Received: 05/05/2006] [Accepted: 10/10/2006] [Indexed: 11/18/2022] Open
Abstract
With the mounting availability of genomic and phenotypic databases, data integration and mining become increasingly challenging. While efforts have been put forward to analyze prokaryotic phenotypes, current computational technologies either lack high throughput capacity for genomic scale analysis, or are limited in their capability to integrate and mine data across different scales of biology. Consequently, simultaneous analysis of associations among genomes, phenotypes, and gene functions is prohibited. Here, we developed a high throughput computational approach, and demonstrated for the first time the feasibility of integrating large quantities of prokaryotic phenotypes along with genomic datasets for mining across multiple scales of biology (protein domains, pathways, molecular functions, and cellular processes). Applying this method over 59 fully sequenced prokaryotic species, we identified the genetic basis and molecular mechanisms underlying the phenotypes in bacteria. We identified 3,711 significant correlations between 1,499 distinct Pfam and 63 phenotypes, with 2,650 correlations and 1,061 anti-correlations. Manual evaluation of a random sample of these significant correlations showed a minimal precision of 30% (95% confidence interval: 20%-42%; n = 50). We stratified the most significant 478 predictions and subjected 100 to manual evaluation, of which 60 were corroborated in the literature. We furthermore unveiled 10 significant correlations between phenotypes and KEGG pathways, eight of which were corroborated in the evaluation, and 309 significant correlations between phenotypes and 166 GO concepts evaluated using a random sample (minimal precision = 72%; 95% confidence interval: 60%-80%; n = 50). Additionally, we conducted a novel large-scale phenomic visualization analysis to provide insight into the modular nature of common molecular mechanisms spanning multiple biological scales and reused by related phenotypes (metaphenotypes).
We propose that this method elucidates which classes of molecular mechanisms are associated with phenotypes or metaphenotypes and holds promise in facilitating a computable systems biology approach to genomic and biomedical research.
Affiliation(s)
- Yang Liu
- Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, Illinois, United States of America
- Center for Biomedical Informatics, Department of Medicine, University of Chicago, Chicago, Illinois, United States of America
- Jianrong Li
- Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, Illinois, United States of America
- Center for Biomedical Informatics, Department of Medicine, University of Chicago, Chicago, Illinois, United States of America
- Lee Sam
- Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, Illinois, United States of America
- Center for Biomedical Informatics, Department of Medicine, University of Chicago, Chicago, Illinois, United States of America
- Chern-Sing Goh
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America
- Mark Gerstein
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America
- Department of Computer Science, Yale University, New Haven, Connecticut, United States of America
- * To whom correspondence should be addressed. E-mail: (MG); (YAL)
- Yves A Lussier
- Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, Illinois, United States of America
- Center for Biomedical Informatics, Department of Medicine, University of Chicago, Chicago, Illinois, United States of America
- Department of Biomedical Informatics, Columbia University, New York, New York, United States of America
- * To whom correspondence should be addressed. E-mail: (MG); (YAL)
|
148
|
Scheeff ED, Bourne PE. Application of protein structure alignments to iterated hidden Markov model protocols for structure prediction. BMC Bioinformatics 2006; 7:410. [PMID: 16970830 PMCID: PMC1622756 DOI: 10.1186/1471-2105-7-410] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 04/18/2006] [Accepted: 09/14/2006] [Indexed: 11/30/2022] Open
Abstract
Background One of the most powerful methods for the prediction of protein structure from sequence information alone is the iterative construction of profile-type models. Because profiles are built from sequence alignments, the sequences included in the alignment and the method used to align them will be important to the sensitivity of the resulting profile. The inclusion of highly diverse sequences will presumably produce a more powerful profile, but distantly related sequences can be difficult to align accurately using only sequence information. Therefore, it would be expected that the use of protein structure alignments to improve the selection and alignment of diverse sequence homologs might yield improved profiles. However, the actual utility of such an approach has remained unclear. Results We explored several iterative protocols for the generation of profile hidden Markov models. These protocols were tailored to allow the inclusion of protein structure alignments in the process, and were used for large-scale creation and benchmarking of structure alignment-enhanced models. We found that models using structure alignments did not provide an overall improvement over sequence-only models for superfamily-level structure predictions. However, the results also revealed that the structure alignment-enhanced models were complementary to the sequence-only models, particularly at the edge of the "twilight zone". When the two sets of models were combined, they provided improved results over sequence-only models alone. In addition, we found that the beneficial effects of the structure alignment-enhanced models could not be realized if the structure-based alignments were replaced with sequence-based alignments. Our experiments with different iterative protocols for sequence-only models also suggested that simple protocol modifications were unable to yield equivalent improvements to those provided by the structure alignment-enhanced models.
Finally, we found that models using structure alignments provided fold-level structure assignments that were superior to those produced by sequence-only models. Conclusion When attempting to predict the structure of remote homologs, we advocate a combined approach in which both traditional models and models incorporating structure alignments are used.
Affiliation(s)
- Eric D Scheeff
- San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Dr., La Jolla, CA 92093-0537, USA
- Present address: Razavi-Newman Center for Bioinformatics, The Salk Institute for Biological Studies, 10010 North Torrey Pines Rd., La Jolla, CA 92037, USA
- Philip E Bourne
- San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Dr., La Jolla, CA 92093-0537, USA
- Department of Pharmacology, University of California, San Diego, 9500 Gilman Dr., La Jolla, CA 92093, USA
|
149
|
Rossi A, Marti-Renom MA, Sali A. Localization of binding sites in protein structures by optimization of a composite scoring function. Protein Sci 2006; 15:2366-80. [PMID: 16963645 PMCID: PMC2242385 DOI: 10.1110/ps.062247506] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Indexed: 10/24/2022]
Abstract
The rise in the number of functionally uncharacterized protein structures is increasing the demand for structure-based methods for functional annotation. Here, we describe a method for predicting the location of a binding site of a given type on a target protein structure. The method first constructs a scoring function and then uses Monte Carlo optimization to find a high-scoring patch on the protein surface. The scoring function is a weighted linear combination of the z-scores of various properties of protein structure and sequence, including amino acid residue conservation, compactness, protrusion, convexity, rigidity, hydrophobicity, and charge density; the weights are calculated from a set of previously identified instances of the binding-site type on known protein structures. The scoring function can easily incorporate different types of information useful in localization, thus increasing the applicability and accuracy of the approach. To test the method, 1008 known protein structures were split into 20 different groups according to the type of the bound ligand. For nonsugar ligands, such as various nucleotides, binding sites were correctly identified in 55%-73% of the cases. The method is completely automated (http://salilab.org/patcher) and can be applied on a large scale in a structural genomics setting.
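As an illustration only, a weighted linear combination of z-scores of the kind described in this abstract might be sketched as follows. The property names, raw values, and weights are hypothetical, the z-scores are taken over the candidate patches themselves (the abstract does not state the reference distribution), and the Monte Carlo search over surface patches is not shown.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize raw values to zero mean and unit (population) variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant property carries no ranking information.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def composite_scores(properties, weights):
    """Return one weighted sum of z-scores per candidate patch.

    properties: property name -> list of raw values, one per patch
    weights:    property name -> weight (hypothetical values here)
    """
    n_patches = len(next(iter(properties.values())))
    totals = [0.0] * n_patches
    for name, raw in properties.items():
        for i, z in enumerate(z_scores(raw)):
            totals[i] += weights[name] * z
    return totals


# Hypothetical values for three candidate surface patches.
properties = {
    "conservation": [0.2, 0.8, 0.5],
    "hydrophobicity": [0.1, 0.3, 0.9],
}
weights = {"conservation": 1.0, "hydrophobicity": 0.5}

scores = composite_scores(properties, weights)
best_patch = max(range(len(scores)), key=scores.__getitem__)
```

In this toy input the second patch ranks highest because its conservation z-score dominates; an optimizer would compare such scores across many candidate patches.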
Affiliation(s)
- Andrea Rossi
- Department of Biopharmaceutical Sciences and Pharmaceutical Chemistry, California Institute for Quantitative Biomedical Research, University of California, San Francisco, California 94143-2552, USA.
|
150
|
Kim WK, In YJ, Kim JH, Cho HJ, Kim JH, Kang S, Lee CY, Lee SC. Quantitative relationship of dioxin-responsive gene expression to dioxin response element in Hep3B and HepG2 human hepatocarcinoma cell lines. Toxicol Lett 2006; 165:174-81. [PMID: 16697128 DOI: 10.1016/j.toxlet.2006.03.007] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Received: 08/25/2005] [Revised: 03/03/2006] [Accepted: 03/10/2006] [Indexed: 11/29/2022]
Abstract
The dioxin response element (DRE) is a cis-acting DNA sequence that mediates 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD)-induced gene expression. The present study was undertaken to elucidate TCDD-responsive gene expression profiles and their relationships to the number of DREs in liver cancer cells. Hep3B and HepG2 human hepatocarcinoma cells were exposed to 50 nM TCDD for 0, 1, 2 and 4 h in culture, after which gene expression profiles were analyzed by microarray hybridization using a chip containing 24,000 cDNAs prepared from the human liver. The TCDD-responsive expression level of each gene was calculated by dividing the densitometric values of the hybridization signal for h1, h2 and h4 by that of h0, followed by transformation of the resulting data into a log scale with the base of 2. Up- and down-regulated gene expressions were defined as >0.585 and <-0.585 by the log scale (>1.5 and <1/1.5 arithmetically), respectively, exhibited at any time after h0. Hep3B and HepG2 cells had 27 and 58 TCDD-responsive, up-regulated genes, respectively, of which 78% (21/27) and 62% (36/58) had one or more DREs. Of these 85, 80 genes were up-regulated exclusively in one of the two lines, with CYP1A1 and PPP1R15A being so regulated in both lines. Expression levels of the up-regulated genes at h1, h2 and h4 were correlated with each other (P<0.01) and the mean of these regressed on the number of DRE(s) in both lines (P<0.01). However, expression of a total of 93 TCDD-responsive, down-regulated genes, of which 46% contained DRE(s), had no relation to the number of DRE(s). In conclusion, the results suggest that DREs may cooperatively mediate the expression of TCDD-responsive genes in liver cancer cells.
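The log-ratio calculation and the 0.585 cutoff described in this abstract can be illustrated with a minimal sketch. The signal values below are invented for illustration, the function names are hypothetical, and the tie-breaking rule for a gene that crosses both cutoffs at different time points is an assumption, since the abstract does not address that case.

```python
import math

# Cutoff quoted in the abstract; log2(1.5) is approximately 0.585.
THRESHOLD = 0.585


def log2_ratios(h0_signal, later_signals):
    """Divide each later signal (h1, h2, h4) by the h0 signal and take log2."""
    return [math.log2(value / h0_signal) for value in later_signals]


def classify(ratios, threshold=THRESHOLD):
    """Label a gene by whether any time point crosses the cutoff.

    Assumption: up-regulation is checked first, so a gene crossing both
    cutoffs at different time points is reported as "up".
    """
    if any(r > threshold for r in ratios):
        return "up"
    if any(r < -threshold for r in ratios):
        return "down"
    return "unchanged"


# Invented densitometric signals: h0 followed by [h1, h2, h4].
up_gene = classify(log2_ratios(100.0, [120.0, 160.0, 140.0]))
down_gene = classify(log2_ratios(100.0, [90.0, 60.0, 80.0]))
flat_gene = classify(log2_ratios(100.0, [110.0, 105.0, 95.0]))
```

Here a 1.6-fold increase at h2 (log2 of about 0.68) is enough to call the first gene up-regulated, and a 0.6-fold signal (log2 of about -0.74) marks the second as down-regulated.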
Affiliation(s)
- Won Kon Kim
- Systemic Proteomics Research Center, Korea Research Institute of Bioscience and BioTechnology (KRIBB), Daejeon, South Korea
|