1
Vello F, Filippini F, Righetto I. Bioinformatics Goes Viral: I. Databases, Phylogenetics and Phylodynamics Tools for Boosting Virus Research. Viruses 2024; 16:1425. [PMID: 39339901 PMCID: PMC11437414 DOI: 10.3390/v16091425] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2024] [Revised: 08/21/2024] [Accepted: 09/03/2024] [Indexed: 09/30/2024]
Abstract
Computer-aided analysis of proteins or nucleic acids seems like a matter of course nowadays; however, the history of Bioinformatics and Computational Biology is quite recent. The advent of high-throughput sequencing has led to the production of "big data", which has also affected the field of virology. The collaboration between the communities of bioinformaticians and virologists started a few decades ago and was strongly enhanced by the recent SARS-CoV-2 pandemic. In this article, which is the first in a series on how bioinformatics can enhance virus research, we show that highly useful information is retrievable from selected general and dedicated databases. Indeed, an enormous amount of information, both in terms of nucleotide/protein sequences and their annotation, is deposited in the general databases of international organisations participating in the International Nucleotide Sequence Database Collaboration (INSDC). However, more and more virus-specific databases have been established and are progressively enriched with the contents and features reported in this article. Since viruses are obligate intracellular parasites, a special focus is given to host-pathogen protein-protein interaction databases. Finally, we illustrate several phylogenetic and phylodynamic tools, combining information on algorithms and features with practical information on how to use them and case studies that validate their usefulness. Databases and tools for functional inference will be covered in the next article of this series: Bioinformatics goes viral: II. Sequence-based and structure-based functional analyses for boosting virus research.
Affiliation(s)
- Francesco Filippini
- Synthetic Biology and Biotechnology Unit, Department of Biology, University of Padua, 35131 Padua, Italy
2
Fisher KJ, Kryazhimskiy S, Lang GI. Detecting genetic interactions using parallel evolution in experimental populations. Philos Trans R Soc Lond B Biol Sci 2019; 374:20180237. [PMID: 31154981 DOI: 10.1098/rstb.2018.0237] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Indexed: 01/01/2023]
Abstract
Eukaryotic genomes contain thousands of genes organized into complex and interconnected genetic interaction networks. Most of our understanding of how genetic variation affects these networks comes from quantitative-trait loci mapping and from the systematic analysis of double-deletion (or knockdown) mutants, primarily in the yeast Saccharomyces cerevisiae. Evolve and re-sequence experiments are an alternative approach for identifying novel functional variants and genetic interactions, particularly between non-loss-of-function mutations. These experiments leverage natural selection to obtain genotypes with functionally important variants and positive genetic interactions. However, no systematic methods for detecting genetic interactions in these data are yet available. Here, we introduce a computational method based on the idea that variants in genes that interact will co-occur in evolved genotypes more often than expected by chance. We apply this method to a previously published yeast experimental evolution dataset. We find that genetic targets of selection are distributed non-uniformly among evolved genotypes, indicating that genetic interactions had a significant effect on evolutionary trajectories. We identify individual gene pairs with a statistically significant genetic interaction score. The strongest interaction is between genes TRK1 and PHO84, genes that have not been reported to interact in previous systematic studies. Our work demonstrates that leveraging parallelism in experimental evolution is useful for identifying genetic interactions that have escaped detection by other methods. This article is part of the theme issue 'Convergent evolution in the genomics era: new insights and directions'.
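The co-occurrence idea in this abstract can be illustrated with a minimal sketch. This is not the authors' method: it assumes a plain hypergeometric null model in which mutations in two genes are distributed independently across evolved populations, and the function name `cooccurrence_pvalue` and the gene labels are hypothetical.

```python
from math import comb

def cooccurrence_pvalue(genotypes, gene_a, gene_b):
    """Return (observed co-occurrences, one-sided hypergeometric p-value).

    genotypes: list of sets, each holding the mutated genes of one evolved population.
    The null model (an assumption of this sketch) is that hits in gene_a and gene_b
    are placed independently across populations.
    """
    n = len(genotypes)
    a = sum(gene_a in g for g in genotypes)                  # populations with gene_a hit
    b = sum(gene_b in g for g in genotypes)                  # populations with gene_b hit
    k = sum(gene_a in g and gene_b in g for g in genotypes)  # populations with both hit
    # P(X >= k) when b populations are drawn at random from n, of which a carry gene_a.
    tail = sum(comb(a, x) * comb(n - a, b - x) for x in range(k, min(a, b) + 1))
    return k, tail / comb(n, b)

# Hypothetical toy data: four evolved populations and their mutated genes.
populations = [{"geneA", "geneB"}, {"geneA", "geneB"}, {"geneC"}, {"geneD"}]
observed, p_value = cooccurrence_pvalue(populations, "geneA", "geneB")
```

A small p-value would suggest that the two genes are hit together more often than this simple null predicts. A real analysis would also need to account for differences in mutation rates and target sizes between genes.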
Affiliation(s)
- Kaitlin J Fisher
- 1 Department of Biological Sciences, Lehigh University, Bethlehem, PA 18015, USA
- Sergey Kryazhimskiy
- 2 Division of Biological Sciences, University of California San Diego, La Jolla, CA 92093, USA
- Gregory I Lang
- 1 Department of Biological Sciences, Lehigh University, Bethlehem, PA 18015, USA
3
Savel D, Koyutürk M. Characterizing human genomic coevolution in locus-gene regulatory interactions. BioData Min 2019; 12:8. [PMID: 30923571 PMCID: PMC6419833 DOI: 10.1186/s13040-019-0195-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2018] [Accepted: 02/19/2019] [Indexed: 11/10/2022]
Abstract
Background: Coevolution has been used to identify and predict interactions and functional relationships between proteins of many different organisms, including humans. Current efforts in annotating the human genome increasingly show that non-coding DNA sequence has important functional and regulatory interactions. Furthermore, regulatory elements do not necessarily reside in close proximity to the coding region of their target genes. Results: We characterize coevolution as it appears in locus-gene interactions in the human genome, focusing on expression quantitative trait locus (eQTL) interactions. Our results show that in these interactions the conservation status of the loci is predictive of the conservation status of their target genes. Furthermore, comparing the phylogenetic histories of intra-chromosomal pairs of loci and transcription start sites, we find that pairs that appear coevolved are enriched for cis-eQTL interactions. Exploring this property, we found that coevolution might be useful in prioritizing association tests in cis-eQTL detection. Conclusions: The relationship between the conservation status of pairs of loci and protein-coding transcription start sites reveals correlations with regulatory interactions. Pairs that appear coevolved are enriched for intra-chromosomal regulatory interactions; thus, our results suggest that measures of coevolution can be useful for prediction and detection of new interactions. Measures of coevolution are genome-wide and could potentially be used to prioritize the detection of distant or inter-chromosomal interactions such as trans-eQTL interactions in the human genome.
Affiliation(s)
- Daniel Savel
- 1 Department of Electrical Engineering and Computer Science, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, OH 44106, USA
- Mehmet Koyutürk
- 1 Department of Electrical Engineering and Computer Science, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, OH 44106, USA
- 2 Center for Proteomics and Bioinformatics, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, OH 44106, USA
4
Hochberg R, Milam TL. Data Structures for Parsimony Correlation and Biosequence Co-Evolution. J Comput Biol 2014; 21:361-9. [DOI: 10.1089/cmb.2008.0107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
Affiliation(s)
- Robert Hochberg
- Department of Computer Science, East Carolina University, Greenville, North Carolina
- Treena Larrew Milam
- Department of Computer Science, East Carolina University, Greenville, North Carolina
5
Liu J, Duan X, Sun J, Yin Y, Li G, Wang L, Liu B. Bi-factor analysis based on noise-reduction (BIFANR): a new algorithm for detecting coevolving amino acid sites in proteins. PLoS One 2013; 8:e79764. [PMID: 24278175 PMCID: PMC3835919 DOI: 10.1371/journal.pone.0079764] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/19/2013] [Accepted: 09/29/2013] [Indexed: 11/23/2022]
Abstract
Previous statistical analyses have shown that amino acid sites in a protein evolve in a correlated way instead of independently. Even though located distantly in the linear sequence, the coevolved amino acids could be spatially adjacent in the tertiary structure, and constitute specific protein sectors. Moreover, these protein sectors are independent of one another in structure, function, and even evolution. Thus, systematic studies on protein sectors inside a protein will contribute to the clarification of protein function. In this paper, we propose a new algorithm, BIFANR (Bi-factor Analysis Based on Noise-reduction), for detecting protein sectors in amino acid sequences. After applying BIFANR to the S1A and PDZ families, we carried out an internal correlation test, a statistical independence test, an evolutionary rate analysis, an evolutionary independence analysis, and a function analysis to assess the prediction. The results showed that the amino acids within a given predicted protein sector are closely correlated in structure, function, and evolution, while protein sectors are nearly statistically independent. The results also indicated that the protein sectors have distinct evolutionary directions. In addition, compared with other algorithms, BIFANR has higher accuracy and robustness under the influence of noise sites.
Affiliation(s)
- Juntao Liu
- School of Mathematics, Shandong University, Jinan, China
- Xiaoyun Duan
- School of Life Science, Shandong University, Jinan, China
- Jianyang Sun
- School of Mathematics, Shandong University, Jinan, China
- Yanbin Yin
- Department of Biological Sciences, Northern Illinois University, DeKalb, Illinois, United States of America
- Guojun Li
- School of Mathematics, Shandong University, Jinan, China
- Lushan Wang
- School of Life Science, Shandong University, Jinan, China
- Bingqiang Liu
- School of Mathematics, Shandong University, Jinan, China
- * E-mail: Bingqiang Liu:
6
Muley VY, Ranjan A. Effect of reference genome selection on the performance of computational methods for genome-wide protein-protein interaction prediction. PLoS One 2012; 7:e42057. [PMID: 22844541 PMCID: PMC3406042 DOI: 10.1371/journal.pone.0042057] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Received: 12/03/2011] [Accepted: 07/02/2012] [Indexed: 12/20/2022]
Abstract
Background: Recent progress in computational methods for predicting physical and functional protein-protein interactions has provided new insights into the complexity of biological processes. Most of these methods assume that functionally interacting proteins are likely to have a shared evolutionary history. This history can be traced out for the protein pairs of a query genome by correlating different evolutionary aspects of their homologs in multiple genomes known as the reference genomes. These methods include phylogenetic profiling, gene neighborhood and co-occurrence of the orthologous protein coding genes in the same cluster or operon. These are collectively known as genomic context methods. On the other hand, a method called mirrortree is based on the similarity of phylogenetic trees between two interacting proteins. Comprehensive performance analyses of these methods have been frequently reported in the literature. However, very few studies provide insight into the effect of reference genome selection on detection of meaningful protein interactions. Methods: We analyzed the performance of four methods and their variants to understand the effect of reference genome selection on prediction efficacy. We used six sets of reference genomes, sampled in accordance with phylogenetic diversity and relationship between organisms from 565 bacteria. We used Escherichia coli as a model organism and the gold standard datasets of interacting proteins reported in DIP, EcoCyc and KEGG databases to compare the performance of the prediction methods. Conclusions: Higher performance for predicting protein-protein interactions was achievable even with 100–150 bacterial genomes out of 565 genomes. Inclusion of archaeal genomes in the reference genome set improves performance. We find that in order to obtain a good performance, it is better to sample a few genomes of related genera of prokaryotes from the large number of available genomes. Moreover, such a sampling allows for selecting 50–100 genomes for comparable accuracy of predictions when computational resources are limited.
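The tree-similarity idea behind mirrortree-style methods can be sketched as follows. This is an illustrative simplification, not the procedure evaluated in the paper: it assumes that tree similarity is approximated by the Pearson correlation between the pairwise evolutionary distances of two protein families over the same reference species. The function names, species labels and distance values are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def tree_similarity(dist_a, dist_b, species):
    """Correlate the pairwise distances of two protein families.

    dist_a, dist_b: dicts mapping frozenset({species1, species2}) to a distance,
    one entry per pair of reference species shared by both families.
    """
    pairs = [frozenset(p) for p in combinations(species, 2)]
    return pearson([dist_a[p] for p in pairs], [dist_b[p] for p in pairs])

# Hypothetical toy distances for two protein families over three reference species.
species = ["sp1", "sp2", "sp3"]
family_a = {frozenset(("sp1", "sp2")): 0.1,
            frozenset(("sp1", "sp3")): 0.4,
            frozenset(("sp2", "sp3")): 0.5}
family_b = {pair: 2 * d for pair, d in family_a.items()}  # same tree shape, faster rate
score = tree_similarity(family_a, family_b, species)
```

The set of reference species determines which distance pairs enter the correlation, which is one way to see why reference genome selection can change the score.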
Affiliation(s)
- Vijaykumar Yogesh Muley
- Computational and Functional Genomics Group, Centre for DNA Fingerprinting and Diagnostics, Hyderabad, Andhra Pradesh, India
- Department of Biotechnology, Dr. Babasaheb Ambedkar Marathwada University, Sub-centre, Osmanabad, Maharashtra, India
- Akash Ranjan
- Computational and Functional Genomics Group, Centre for DNA Fingerprinting and Diagnostics, Hyderabad, Andhra Pradesh, India
- * E-mail:
7
MS_RHII-RSD, a dual-function RNase HII-(p)ppGpp synthetase from Mycobacterium smegmatis. J Bacteriol 2012; 194:4003-14. [PMID: 22636779 DOI: 10.1128/jb.00258-12] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.3] [Indexed: 12/12/2022]
Abstract
In the noninfectious soil saprophyte Mycobacterium smegmatis, intracellular levels of the stress alarmones guanosine tetraphosphate and guanosine pentaphosphate, together termed (p)ppGpp, are regulated by the enzyme Rel(Msm). This enzyme consists of a single, bifunctional polypeptide chain that is capable of both synthesizing and hydrolyzing (p)ppGpp. The rel(Msm) knockout strain of M. smegmatis (Δrel(Msm)) is expected to show a (p)ppGpp null [(p)ppGpp(0)] phenotype. Contrary to this expectation, the strain is capable of synthesizing (p)ppGpp in vivo. In this study, we identify and functionally characterize the open reading frame (ORF), MSMEG_5849, that encodes a second functional (p)ppGpp synthetase in M. smegmatis. In addition to (p)ppGpp synthesis, the 567-amino-acid-long protein encoded by this gene is capable of hydrolyzing RNA·DNA hybrids and bears similarity to the conventional RNase HII enzymes. We have classified this protein as actRel(Msm) in accordance with the recent nomenclature proposed and have named it MS_RHII-RSD, indicating the two enzymatic activities present [RHII, RNase HII domain, originally identified as domain of unknown function 429 (DUF429), and RSD, RelA_SpoT nucleotidyl transferase domain, the SYNTH domain responsible for (p)ppGpp synthesis activity]. MS_RHII-RSD is expressed and is constitutively active in vivo and behaves like a monofunctional (p)ppGpp synthetase in vitro. The occurrence of the RNase HII and (p)ppGpp synthetase domains together on the same polypeptide chain is suggestive of an in vivo role for this novel protein as a link connecting the essential life processes of DNA replication, repair, and transcription to the highly conserved stress survival pathway, the stringent response.
8
Mészáros B, Tóth J, Vértessy BG, Dosztányi Z, Simon I. Proteins with complex architecture as potential targets for drug design: a case study of Mycobacterium tuberculosis. PLoS Comput Biol 2011; 7:e1002118. [PMID: 21814507 PMCID: PMC3140968 DOI: 10.1371/journal.pcbi.1002118] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Received: 02/07/2011] [Accepted: 05/24/2011] [Indexed: 02/04/2023]
Abstract
Lengthy co-evolution of Homo sapiens and Mycobacterium tuberculosis, the main causative agent of tuberculosis, resulted in a dramatically successful pathogen species that presents a considerable challenge for modern medicine. The continuous and ever-increasing appearance of multi-drug resistant mycobacteria necessitates the identification of novel drug targets and drugs with new mechanisms of action. However, further insights are needed to establish automated protocols for target selection based on the available complete genome sequences. In the present study, we perform complete proteome level comparisons between M. tuberculosis, mycobacteria, other prokaryotes and available eukaryotes based on protein domains, local sequence similarities and protein disorder. We show that the enrichment of certain domains in the genome can indicate an important function specific to M. tuberculosis. We identified two families, termed pkn and PE/PPE, that stand out in this respect. The common property of these two protein families is a complex domain organization that combines species-specific regions, commonly occurring domains and disordered segments. Besides highlighting promising novel drug target candidates in M. tuberculosis, the presented analysis can also be viewed as a general protocol to identify proteins involved in species-specific functions in a given organism. We conclude that target selection protocols should be extended to include proteins with complex domain architectures instead of focusing on sequentially unique and essential proteins only.
Affiliation(s)
- Bálint Mészáros
- Institute of Enzymology, Hungarian Academy of Sciences, Budapest, Hungary
- Judit Tóth
- Institute of Enzymology, Hungarian Academy of Sciences, Budapest, Hungary
- Beáta G. Vértessy
- Institute of Enzymology, Hungarian Academy of Sciences, Budapest, Hungary
- Department of Applied Biotechnology, Budapest University of Technology and Economics, Budapest, Hungary
- Zsuzsanna Dosztányi
- Institute of Enzymology, Hungarian Academy of Sciences, Budapest, Hungary
- * E-mail: (ZD); (IS)
- István Simon
- Institute of Enzymology, Hungarian Academy of Sciences, Budapest, Hungary
- * E-mail: (ZD); (IS)
9
Abstract
Bioinformatic methods to predict protein-protein interactions (PPI) via coevolutionary analysis have positioned themselves to compete alongside established in vitro methods, despite a lack of understanding of the underlying molecular mechanisms of the coevolutionary process. Investigating the alignment of coevolutionary predictions of PPI with experimental data can focus the effective scope of prediction and lead to better accuracies. A new rate-based coevolutionary method, MMM, preferentially finds obligate interacting proteins that form complexes, conforming to results from studies based on coimmunoprecipitation coupled with mass spectrometry. Using gold-standard databases as a benchmark for accuracy, MMM surpasses methods based on abundance ratios, suggesting that correlated evolutionary rates may yet be better than coexpression at predicting interacting proteins. At the level of protein domains, coevolution is difficult to detect, even with MMM, except when considering small-scale experimental data involving proteins with multiple domains. Overall, these findings confirm that coevolutionary methods can be confidently used in predicting PPI, either independently or as drivers of coimmunoprecipitation experiments.
10
Koyutürk M. Algorithmic and analytical methods in network biology. Wiley Interdiscip Rev Syst Biol Med 2010; 2:277-292. [PMID: 20836029 PMCID: PMC3087298 DOI: 10.1002/wsbm.61] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.9] [Indexed: 12/21/2022]
Abstract
During the genomic revolution, algorithmic and analytical methods for organizing, integrating, analyzing, and querying biological sequence data proved invaluable. Today, increasing availability of high-throughput data pertaining to functional states of biomolecules, as well as their interactions, enables genome-scale studies of the cell from a systems perspective. The past decade witnessed significant efforts on the development of computational infrastructure for large-scale modeling and analysis of biological systems, commonly using network models. Such efforts lead to novel insights into the complexity of living systems, through development of sophisticated abstractions, algorithms, and analytical techniques that address a broad range of problems, including the following: (1) inference and reconstruction of complex cellular networks; (2) identification of common and coherent patterns in cellular networks, with a view to understanding the organizing principles and building blocks of cellular signaling, regulation, and metabolism; and (3) characterization of cellular mechanisms that underlie the differences between living systems, in terms of evolutionary diversity, development and differentiation, and complex phenotypes, including human disease. These problems pose significant algorithmic and analytical challenges because of the inherent complexity of the systems being studied; limitations of data in terms of availability, scope, and scale; intractability of resulting computational problems; and limitations of reference models for reliable statistical inference. This article provides a broad overview of existing algorithmic and analytical approaches to these problems, highlights key biological insights provided by these approaches, and outlines emerging opportunities and challenges in computational systems biology.
Affiliation(s)
- Mehmet Koyutürk
- Department of Electrical Engineering & Computer Science, Case Western Reserve University, Cleveland, OH 44106, USA
- Center for Proteomics and Bioinformatics, Case Western Reserve University, Cleveland, OH 44106, USA
11
Frech C, Kommenda M, Dorfer V, Kern T, Hintner H, Bauer JW, Onder K. Improved homology-driven computational validation of protein-protein interactions motivated by the evolutionary gene duplication and divergence hypothesis. BMC Bioinformatics 2009; 10:21. [PMID: 19152684 PMCID: PMC2637843 DOI: 10.1186/1471-2105-10-21] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 06/09/2008] [Accepted: 01/19/2009] [Indexed: 11/10/2022]
Abstract
Background: Protein-protein interaction (PPI) data sets generated by high-throughput experiments are contaminated by large numbers of erroneous PPIs. Therefore, computational methods for PPI validation are necessary to improve the quality of such data sets. Against the background of the theory that most extant PPIs arose as a consequence of gene duplication, the sensitive search for homologous PPIs, i.e. for PPIs descending from a common ancestral PPI, should be a successful strategy for PPI validation. Results: To validate an experimentally observed PPI, we combine FASTA and PSI-BLAST to perform a sensitive sequence-based search for pairs of interacting homologous proteins within a large, integrated PPI database. A novel scoring scheme that incorporates both quality and quantity of all observed matches allows us (1) to consider also tentative paralogs and orthologs in this analysis and (2) to combine search results from more than one homology detection method. ROC curves illustrate the high efficacy of this approach and its improvement over other homology-based validation methods. Conclusion: New PPIs are primarily derived from preexisting PPIs and not invented de novo. Thus, the hallmark of true PPIs is the existence of homologous PPIs. The sensitive search for homologous PPIs within a large body of known PPIs is an efficient strategy to separate biologically relevant PPIs from the many spurious PPIs reported by high-throughput experiments.
Affiliation(s)
- Christian Frech
- Upper Austria University of Applied Sciences, Hagenberg, Austria
12
Molecular Coevolution and the Three-Dimensionality of Natural Selection. Evol Biol 2009. [DOI: 10.1007/978-3-642-00952-5_14] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
13
Jiang Z. Protein Function Predictions Based on the Phylogenetic Profile Method. Crit Rev Biotechnol 2008; 28:233-8. [PMID: 19051102 DOI: 10.1080/07388550802512633] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 10/21/2022]
14
Yellaboina S, Dudekula DB, Ko MS. Prediction of evolutionarily conserved interologs in Mus musculus. BMC Genomics 2008; 9:465. [PMID: 18842131 PMCID: PMC2571111 DOI: 10.1186/1471-2164-9-465] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Received: 05/05/2008] [Accepted: 10/08/2008] [Indexed: 12/03/2022]
Abstract
Background: Identification of protein-protein interactions is an important first step to understand living systems. High-throughput experimental approaches have accumulated a large amount of information on protein-protein interactions in human and other model organisms. Such interaction information has been successfully transferred to other species, in which the experimental data are limited. However, the annotation transfer method could yield false positive interologs due to the lack of conservation of interactions when applied to phylogenetically distant organisms. Results: To address this issue, we used the phylogenetic profile method to filter false positives in interologs based on the notion that evolutionarily conserved interactions show similar patterns of occurrence along the genomes. The approach was applied to Mus musculus, in which the experimentally identified interactions are limited. We first inferred the protein-protein interactions in Mus musculus by using two approaches: i) identifying mouse orthologs of interacting proteins (interologs) based on the experimental protein-protein interaction data from other organisms; and ii) analyzing frequency of mouse ortholog co-occurrence in predicted operons of bacteria. We then filtered possible false positives in the predicted interactions using the phylogenetic profiles. We found that this filtering method significantly increased the frequency of interacting protein pairs coexpressed in the same cells/tissues in the Gene Expression Omnibus (GEO) database as well as the frequency of interacting protein pairs that shared similar Gene Ontology (GO) terms for biological processes and cellular localizations. The data support the notion that the phylogenetic profile helps to reduce the number of false positives in interologs. Conclusion: We have developed a protein-protein interaction database in mouse, which contains 41109 interologs. We have also developed a web interface to facilitate the use of the database.
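The interolog transfer step described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `predict_interologs`, the protein identifiers and the ortholog table are all hypothetical, and the phylogenetic-profile filtering step is not included.

```python
def predict_interologs(source_ppis, orthologs):
    """Transfer interactions from a source organism to a target organism.

    source_ppis: iterable of (protein_a, protein_b) pairs observed in the source organism.
    orthologs:   dict mapping a source protein to a list of target-organism orthologs.
    Returns a set of sorted tuples, one per predicted target-organism interaction.
    """
    predicted = set()
    for prot_a, prot_b in source_ppis:
        # A pair is transferred only if both partners have at least one ortholog.
        for orth_a in orthologs.get(prot_a, ()):
            for orth_b in orthologs.get(prot_b, ()):
                predicted.add(tuple(sorted((orth_a, orth_b))))
    return predicted

# Hypothetical source interactions and ortholog assignments.
source_ppis = [("Y1", "Y2"), ("Y3", "Y4")]
orthologs = {"Y1": ["M1"], "Y2": ["M2", "M2b"], "Y3": ["M3"]}
interologs = predict_interologs(source_ppis, orthologs)
```

Because every ortholog combination is transferred, one-to-many ortholog relationships inflate the candidate set, which is one reason a subsequent filter is useful.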
Affiliation(s)
- Sailu Yellaboina
- Developmental Genomics and Aging Section, Laboratory of Genetics, National Institute on Aging, National Institutes of Health, Baltimore, MD 21224, USA
15
Abstract
Non-independent evolution of amino acid sites has become a noticeable limitation of most methods aimed at identifying selective constraints at functionally important amino acid sites or protein regions. The need for a generalised framework to account for non-independence of amino acid sites has fuelled the design and development of new mathematical models and computational tools centred on resolving this problem. Molecular coevolution is one of the most active areas of research, with an increasing rate of new models and methods being developed every day. Both parametric and non-parametric methods have been developed to account for correlated variability of amino acid sites. These methods have been utilised for detecting phylogenetic, functional and structural coevolution as well as for identifying surfaces of amino acid sites involved in protein-protein interactions. Here we discuss and briefly describe these methods, and identify their advantages and limitations.
Affiliation(s)
- Francisco M. Codoñer
- Evolutionary Genetics and Bioinformatics Laboratory, Department of Genetics, Smurfit Institute of Genetics, University of Dublin, Trinity College
- Institute of Immunology, Biology Department, National University of Ireland Maynooth
- Mario A. Fares
- Evolutionary Genetics and Bioinformatics Laboratory, Department of Genetics, Smurfit Institute of Genetics, University of Dublin, Trinity College
16
Kensche PR, van Noort V, Dutilh BE, Huynen MA. Practical and theoretical advances in predicting the function of a protein by its phylogenetic distribution. J R Soc Interface 2008; 5:151-70. [PMID: 17535793 PMCID: PMC2405902 DOI: 10.1098/rsif.2007.1047] [Citation(s) in RCA: 76] [Impact Index Per Article: 4.8] [Received: 03/15/2007] [Revised: 05/05/2007] [Accepted: 05/05/2007] [Indexed: 11/12/2022]
Abstract
The gap between the amount of genome information released by genome sequencing projects and our knowledge about the proteins' functions is rapidly increasing. To fill this gap, various 'genomic-context' methods have been proposed that exploit sequenced genomes to predict the functions of the encoded proteins. One class of methods, phylogenetic profiling, predicts protein function by correlating the phylogenetic distribution of genes with that of other genes or phenotypic characteristics. The functions of a number of proteins, including ones of medical relevance, have thus been predicted and subsequently confirmed experimentally. Additionally, various approaches to measure the similarity of phylogenetic profiles and to account for the phylogenetic bias in the data have been proposed. We review the successful applications of phylogenetic profiling and analyse the performance of various profile similarity measures with a set of one microsporidial and 25 fungal genomes. In the fungi, phylogenetic profiling yields high-confidence predictions for the highest and only the highest scoring gene pairs illustrating both the power and the limitations of the approach. Both practical examples and theoretical considerations suggest that in order to get a reliable and specific picture of a protein's function, results from phylogenetic profiling have to be combined with other sources of evidence.
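A phylogenetic profile comparison of the kind reviewed here can be sketched with binary presence/absence vectors. The example is illustrative only: the Jaccard index is just one of several possible profile similarity measures and is an assumption of this sketch, and the gene names and profiles are hypothetical.

```python
def jaccard_similarity(profile_a, profile_b):
    """Jaccard index of two binary phylogenetic profiles.

    Each profile lists, genome by genome and in the same genome order,
    whether a gene is present (1) or absent (0).
    """
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0

# Hypothetical profiles over five reference genomes.
profiles = {
    "geneX": [1, 1, 0, 0, 1],
    "geneY": [1, 1, 0, 0, 0],
    "geneZ": [0, 0, 1, 1, 0],
}
similar = jaccard_similarity(profiles["geneX"], profiles["geneY"])
dissimilar = jaccard_similarity(profiles["geneX"], profiles["geneZ"])
```

A measure like this ignores the phylogenetic relationships among the reference genomes, so closely related genomes are counted as independent observations. This is the phylogenetic bias that the abstract says several approaches try to account for.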
Collapse
Affiliation(s)
- Philip R. Kensche
- Centre for Molecular and Biomolecular Informatics/Nijmegen, Centre for Molecular Life Sciences, Radboud University Medical Centre, PO Box 9101, 6500 HB Nijmegen, The Netherlands
- Vera van Noort
- European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
- Bas E. Dutilh
- Centre for Molecular and Biomolecular Informatics/Nijmegen, Centre for Molecular Life Sciences, Radboud University Medical Centre, PO Box 9101, 6500 HB Nijmegen, The Netherlands
- Martijn A. Huynen
- Centre for Molecular and Biomolecular Informatics/Nijmegen, Centre for Molecular Life Sciences, Radboud University Medical Centre, PO Box 9101, 6500 HB Nijmegen, The Netherlands
|
17
|
Jothi R, Przytycka TM, Aravind L. Discovering functional linkages and uncharacterized cellular pathways using phylogenetic profile comparisons: a comprehensive assessment. BMC Bioinformatics 2007; 8:173. [PMID: 17521444 PMCID: PMC1904249 DOI: 10.1186/1471-2105-8-173] [Citation(s) in RCA: 69] [Impact Index Per Article: 4.1] [Received: 02/28/2007] [Accepted: 05/23/2007] [Indexed: 11/10/2022]
Abstract
BACKGROUND A widely used approach for discovering functional and physical interactions among proteins involves phylogenetic profile comparisons (PPCs). Here, proteins with similar profiles are inferred to be functionally related under the assumption that proteins involved in the same metabolic pathway or cellular system are likely to have been co-inherited during evolution.
RESULTS Our experimentation with E. coli and yeast proteins with 16 different carefully composed reference sets of genomes revealed that the phyletic patterns of proteins in prokaryotes alone could be adequate to make reasonably accurate functional linkage predictions. A slight improvement in performance is observed on adding a few eukaryotes to the reference set, but a noticeable drop-off in performance is observed with an increased number of eukaryotes. Inclusion of most parasitic, pathogenic or vertebrate genomes, or of multiple strains of the same species, in the reference set does not necessarily contribute to improved sensitivity or accuracy. Interestingly, we also found that the evolutionary histories of individual pathways have a significant effect on the performance of the PPC approach with respect to a particular reference set. For example, to accurately predict functional links in carbohydrate or lipid metabolism, a reference set solely composed of prokaryotic (or bacterial) genomes performed among the best compared to one composed of genomes from all three super-kingdoms; this is in contrast to predicting functional links in translation, for which a reference set composed of prokaryotic (or bacterial) genomes performed the worst. We also demonstrate that the widely used random null model to quantify the statistical significance of profile similarity is incomplete, which could result in an increased number of false positives.
CONCLUSION Contrary to previous proposals, it is not merely the number of genomes but a careful selection of informative genomes in the reference set that influences the prediction accuracy of the PPC approach. We note that the predictive power of the PPC approach, especially in eukaryotes, is heavily influenced by the primary endosymbiosis and subsequent bacterial contributions. The over-representation of parasitic unicellular eukaryotes and vertebrates additionally makes eukaryotes less useful in the reference sets. Reference sets composed of a highly non-redundant set of genomes from all three super-kingdoms fare better with pathways showing considerable vertical inheritance and strong conservation (e.g. translation apparatus), while reference sets solely composed of prokaryotic genomes fare better for more variable pathways like carbohydrate metabolism. The differential performance of the PPC approach on various pathways and a weak positive correlation between functional and profile similarities suggest that caution should be exercised when interpreting functional linkages inferred from genome-wide large-scale profile comparisons using a single reference set.
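To make the idea of a random null model for profile similarity concrete, here is a minimal sketch under a stated assumption: that the null model is the hypergeometric one, in which the genomes containing one gene are treated as a random draw from the reference set. The abstract does not specify this formulation, and it is exactly the kind of simple model the authors describe as incomplete, since it ignores the relatedness of genomes. The function name and parameters are hypothetical.

```python
from math import comb

def hypergeom_pvalue(n_genomes, n_a, n_b, overlap):
    """P(X >= overlap) under a simple random null model.

    n_genomes: size of the reference set
    n_a, n_b:  number of genomes containing gene A and gene B
    overlap:   number of genomes containing both genes
    Assumes gene B's genomes are drawn at random, without replacement.
    """
    total = comb(n_genomes, n_b)
    upper = min(n_a, n_b)
    favourable = sum(
        comb(n_a, k) * comb(n_genomes - n_a, n_b - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total

# Hypothetical example: 8 genomes, gene A in 5, gene B in 4, both in 4.
print(hypergeom_pvalue(8, 5, 4, 4))
```

A small p-value here only says the overlap is unlikely if genomes were independent draws; closely related genomes in the reference set violate that assumption, which is one way such a model can inflate false positives.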
Affiliation(s)
- Raja Jothi
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Teresa M Przytycka
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- L Aravind
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|