1
Rallis D, Baltogianni M, Kapetaniou K, Kosmeri C, Giapros V. Bioinformatics in Neonatal/Pediatric Medicine-A Literature Review. J Pers Med 2024; 14:767. [PMID: 39064021 PMCID: PMC11277633 DOI: 10.3390/jpm14070767] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2024] [Revised: 07/14/2024] [Accepted: 07/16/2024] [Indexed: 07/28/2024] Open
Abstract
Bioinformatics is a scientific field that uses computer technology to gather, store, analyze, and share biological data and information. DNA sequences of genes or entire genomes, protein amino acid sequences, nucleic acid, and protein-nucleic acid complex structures are examples of traditional bioinformatics data. Moreover, proteomics, the distribution of proteins in cells, interactomics, the patterns of interactions between proteins and nucleic acids, and metabolomics, the types and patterns of small-molecule transformations by the biochemical pathways in cells, are further data streams. Currently, the objectives of bioinformatics are integrative, focusing on how various data combinations might be utilized to comprehend organisms and diseases. Bioinformatic techniques have become popular as novel instruments for examining the fundamental mechanisms behind neonatal diseases. In the first few weeks of newborn life, these methods can be utilized in conjunction with clinical data to identify the most vulnerable neonates and to gain a better understanding of certain mortalities, including respiratory distress, bronchopulmonary dysplasia, sepsis, or inborn errors of metabolism. In the current study, we performed a literature review to summarize the current application of bioinformatics in neonatal medicine. Our aim was to provide evidence that could supply novel insights into the underlying mechanism of neonatal pathophysiology and could be used as an early diagnostic tool in neonatal care.
Affiliation(s)
- Dimitrios Rallis
- Neonatal Intensive Care Unit, School of Medicine, University of Ioannina, 45110 Ioannina, Greece; (D.R.); (M.B.)
- Maria Baltogianni
- Neonatal Intensive Care Unit, School of Medicine, University of Ioannina, 45110 Ioannina, Greece; (D.R.); (M.B.)
- Konstantina Kapetaniou
- Department of Pediatrics, School of Medicine, University of Ioannina, 45110 Ioannina, Greece; (K.K.); (C.K.)
- Chrysoula Kosmeri
- Department of Pediatrics, School of Medicine, University of Ioannina, 45110 Ioannina, Greece; (K.K.); (C.K.)
- Vasileios Giapros
- Neonatal Intensive Care Unit, School of Medicine, University of Ioannina, 45110 Ioannina, Greece; (D.R.); (M.B.)
2
Vasileiou D, Karapiperis C, Baltsavia I, Chasapi A, Ahrén D, Janssen PJ, Iliopoulos I, Promponas VJ, Enright AJ, Ouzounis CA. CGG toolkit: Software components for computational genomics. PLoS Comput Biol 2023; 19:e1011498. [PMID: 37934729 PMCID: PMC10629618 DOI: 10.1371/journal.pcbi.1011498] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2023] [Accepted: 09/07/2023] [Indexed: 11/09/2023] Open
Abstract
Public-domain availability for bioinformatics software resources is a key requirement that ensures long-term permanence and methodological reproducibility for research and development across the life sciences. These issues are particularly critical for widely used, efficient, and well-proven methods, especially those developed in research settings that often face funding discontinuities. We re-launch a range of established software components for computational genomics, as legacy version 1.0.1, suitable for sequence matching, masking, searching, clustering and visualization for protein family discovery, annotation and functional characterization on a genome scale. These applications are made available online as open source and include MagicMatch, GeneCAST, support scripts for CoGenT-like sequence collections, GeneRAGE and DifFuse, supported by centrally administered bioinformatics infrastructure funding. The toolkit may also be conceived as a flexible genome comparison software pipeline that supports research in this domain. We illustrate basic use by examples and pictorial representations of the registered tools, which are further described with appropriate documentation files in the corresponding GitHub release.
Affiliation(s)
- Dimitrios Vasileiou
- Biological Computation & Process Laboratory, Chemical Process & Energy Resources Institute, Centre for Research & Technology Hellas, Thessalonica, Greece
- Christos Karapiperis
- Biological Computation & Process Laboratory, Chemical Process & Energy Resources Institute, Centre for Research & Technology Hellas, Thessalonica, Greece
- Biological Computation & Computational Biology Group, AIIA Lab, School of Informatics, Aristotle University of Thessalonica, Thessalonica, Greece
- Ismini Baltsavia
- Computational Biology Group, Faculty of Medicine, University of Crete, Heraklion, Greece
- Anastasia Chasapi
- Biological Computation & Process Laboratory, Chemical Process & Energy Resources Institute, Centre for Research & Technology Hellas, Thessalonica, Greece
- Dag Ahrén
- Department of Biology, Microbial Ecology Group, Lund University, Lund, Sweden
- Paul J. Janssen
- Nuclear Medical Applications, Belgian Nuclear Research Centre SCK CEN, Mol, Belgium
- Ioannis Iliopoulos
- Computational Biology Group, Faculty of Medicine, University of Crete, Heraklion, Greece
- Vasilis J. Promponas
- Bioinformatics Research Laboratory, Department of Biological Sciences, New Campus, University of Cyprus, Nicosia, Cyprus
- Anton J. Enright
- Department of Pathology, University of Cambridge, Tennis Court Road, Cambridge, United Kingdom
- Christos A. Ouzounis
- Biological Computation & Process Laboratory, Chemical Process & Energy Resources Institute, Centre for Research & Technology Hellas, Thessalonica, Greece
- Biological Computation & Computational Biology Group, AIIA Lab, School of Informatics, Aristotle University of Thessalonica, Thessalonica, Greece
- SysBioBio.info (SBBI), Thessalonica, Greece
3
Role of Bioinformatics in Biological Sciences. Adv Bioinformatics 2021. [DOI: 10.1007/978-981-33-6191-1_3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022] Open
4
Noreña-P A, González Muñoz A, Mosquera-Rendón J, Botero K, Cristancho MA. Colombia, an unknown genetic diversity in the era of Big Data. BMC Genomics 2018; 19:859. [PMID: 30537922 PMCID: PMC6288850 DOI: 10.1186/s12864-018-5194-8] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Indexed: 01/09/2023] Open
Abstract
BACKGROUND Latin America harbors some of the most biodiverse countries in the world, including Colombia. Despite the increasing use of cutting-edge technologies in genomics and bioinformatics in several biological science fields around the world, the region has fallen behind in the inclusion of these approaches in biodiversity studies. In this study, we used data mining methods to search in four main public databases of genetic sequences such as: NCBI Nucleotide and BioProject, Pathosystems Resource Integration Center, and Barcode of Life Data Systems databases. We aimed to determine how much of the Colombian biodiversity is contained in genetic data stored in these public databases and how much of this information has been generated by national institutions. Additionally, we compared this data for Colombia with other countries of high biodiversity in Latin America, such as Brazil, Argentina, Costa Rica, Mexico, and Peru. RESULTS In Nucleotide, we found that 66.84% of total records for Colombia have been published at the national level, and this data represents less than 5% of the total number of species reported for the country. In BioProject, 70.46% of records were generated by national institutions and the great majority of them is represented by microorganisms. In BOLD Systems, 26% of records have been submitted by national institutions, representing 258 species for Colombia. This number of species reported for Colombia span approximately 0.46% of the total biodiversity reported for the country (56,343 species). Finally, in PATRIC database, 13.25% of the reported sequences were contributed by national institutions. Colombia has a better biodiversity representation in public databases in comparison to other Latin American countries, like Costa Rica and Peru. Mexico and Argentina have the highest representation of species at the national level, despite Brazil and Colombia, which actually hold the first and second places in biodiversity worldwide. 
CONCLUSIONS Our findings show gaps in the representation of the Colombian biodiversity at the molecular and genetic levels in widely consulted public databases. National funding for high-throughput molecular research, NGS technologies costs, and access to genetic resources are limiting factors. This fact should be taken as an opportunity to foster the development of collaborative projects between research groups in the Latin American region to study the vast biodiversity of these countries using 'omics' technologies.
Affiliation(s)
- Alejandra Noreña-P
- Bioinformatics Unit, Centro de Bioinformática y Biología Computacional de Colombia – BIOS, Manizales, Colombia
- Andrea González Muñoz
- Bioinformatics Unit, Centro de Bioinformática y Biología Computacional de Colombia – BIOS, Manizales, Colombia
- Jeanneth Mosquera-Rendón
- Bioinformatics Unit, Centro de Bioinformática y Biología Computacional de Colombia – BIOS, Manizales, Colombia
- Kelly Botero
- Bioinformatics Unit, Centro de Bioinformática y Biología Computacional de Colombia – BIOS, Manizales, Colombia
- Marco A. Cristancho
- Bioinformatics Unit, Centro de Bioinformática y Biología Computacional de Colombia – BIOS, Manizales, Colombia
- Vicerrectoría de Investigaciones, Universidad de los Andes, Bogotá, Colombia
5
Structure-based function analysis of putative conserved proteins with isomerase activity from Haemophilus influenzae. 3 Biotech 2015; 5:741-763. [PMID: 28324524 PMCID: PMC4569619 DOI: 10.1007/s13205-014-0274-1] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 06/26/2014] [Accepted: 12/18/2014] [Indexed: 01/09/2023] Open
Abstract
Haemophilus influenzae, a Gram-negative bacterium and a member of the family Pasteurellaceae, causes chronic bronchitis, bacteremia, meningitis, etc. H. influenzae is the first organism whose genome was completely sequenced and annotated. Here, we have extensively analyzed the genome of H. influenzae using available protein structure and function analysis tools. The objective of this analysis is to assign a precise function to hypothetical proteins (HPs) whose functions are not determined so far. Function prediction of these proteins is helpful in precise understanding of mechanisms of pathogenesis and biochemical pathways important for selecting novel therapeutic targets. After an extensive analysis of the H. influenzae genome we have found 13 HPs showing a high level of sequence and structural similarity to the enzyme isomerase. Consequently, the structures of HPs have been modeled and analyzed to determine their precise functions. We found these HPs are alanine racemase, lysine 2,3-aminomutase, topoisomerase DNA-binding C4 zinc finger, pseudouridine synthase B, C and E (Rlu B, C and E), hydroxypyruvate isomerase, nucleoside-diphosphate-sugar epimerase, amidophosphoribosyltransferase, aldose-1-epimerase, tautomerase/MIF, Xylose isomerase-like, have TIM barrel domain and sedoheptulose-7-phosphate isomerase like activity, signifying their corresponding functions in H. influenzae. This work provides a better understanding of the role of HPs with isomerase activities in the survival and pathogenesis of H. influenzae.
6
Shahbaaz M, Ahmad F, Imtaiyaz Hassan M. Structure-based functional annotation of putative conserved proteins having lyase activity from Haemophilus influenzae. 3 Biotech 2015; 5:317-336. [PMID: 28324295 PMCID: PMC4434415 DOI: 10.1007/s13205-014-0231-z] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 04/13/2014] [Accepted: 05/28/2014] [Indexed: 12/20/2022] Open
Abstract
Haemophilus influenzae is a small pleomorphic Gram-negative bacteria which causes several chronic diseases, including bacteremia, meningitis, cellulitis, epiglottitis, septic arthritis, pneumonia, and empyema. Here we extensively analyzed the sequenced genome of H. influenzae strain Rd KW20 using protein family databases, protein structure prediction, pathways and genome context methods to assign a precise function to proteins whose functions are unknown. These proteins are termed as hypothetical proteins (HPs), for which no experimental information is available. Function prediction of these proteins would surely be supportive to precisely understand the biochemical pathways and mechanism of pathogenesis of Haemophilus influenzae. During the extensive analysis of H. influenzae genome, we found the presence of eight HPs showing lyase activity. Subsequently, we modeled and analyzed three-dimensional structure of all these HPs to determine their functions more precisely. We found these HPs possess cystathionine-β-synthase, cyclase, carboxymuconolactone decarboxylase, pseudouridine synthase A and C, D-tagatose-1,6-bisphosphate aldolase and aminodeoxychorismate lyase-like features, indicating their corresponding functions in the H. influenzae. Lyases are actively involved in the regulation of biosynthesis of various hormones, metabolic pathways, signal transduction, and DNA repair. Lyases are also considered as a key player for various biological processes. These enzymes are critically essential for the survival and pathogenesis of H. influenzae and, therefore, these enzymes may be considered as a potential target for structure-based rational drug design. Our structure–function relationship analysis will be useful to search and design potential lead molecules based on the structure of these lyases, for drug design and discovery.
Affiliation(s)
- Mohd Shahbaaz
- Department of Computer Science, Jamia Millia Islamia, New Delhi, 110025, India
- Faizan Ahmad
- Center for Interdisciplinary Research in Basic Sciences, Jamia Millia Islamia, Jamia Nagar, New Delhi, 110025, India
- Md Imtaiyaz Hassan
- Center for Interdisciplinary Research in Basic Sciences, Jamia Millia Islamia, Jamia Nagar, New Delhi, 110025, India.
7
Szilágyi SM, Szilágyi L. A fast hierarchical clustering algorithm for large-scale protein sequence data sets. Comput Biol Med 2014; 48:94-101. [PMID: 24657908 DOI: 10.1016/j.compbiomed.2014.02.016] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 11/13/2013] [Revised: 02/10/2014] [Accepted: 02/25/2014] [Indexed: 10/25/2022]
Abstract
TRIBE-MCL is a Markov clustering algorithm that operates on a graph built from pairwise similarity information of the input data. Edge weights stored in the stochastic similarity matrix are alternately fed to the two main operations, inflation and expansion, and are normalized in each main loop to maintain the probabilistic constraint. In this paper we propose an efficient implementation of the TRIBE-MCL clustering algorithm, suitable for fast and accurate grouping of protein sequences. A modified sparse matrix structure is introduced that can efficiently handle most operations of the main loop. Taking advantage of the symmetry of the similarity matrix, a fast matrix squaring formula is also introduced to facilitate the time-consuming expansion. The proposed algorithm was tested on protein sequence databases like SCOP95. In terms of efficiency, the proposed solution improves execution speed by two orders of magnitude, compared to recently published efficient solutions, reducing the total runtime well below 1 min in the case of the 11,944 proteins of SCOP95. This improvement in computation time is reached without losing anything from the partition quality. Convergence is generally reached in approximately 50 iterations. The efficient execution enabled us to perform a thorough evaluation of classification results and to formulate recommendations regarding the choice of the algorithm's parameter values.
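The expansion/inflation loop summarized in this abstract can be illustrated with a minimal dense-matrix sketch. This is not the authors' sparse implementation or their fast squaring formula. The function names, the default inflation value of 2.0, the added self-loops, and the rule for reading clusters off the converged matrix are all assumptions made for illustration.

```python
import numpy as np

def normalize_columns(m):
    # Keep every column summing to 1 (the "probabilistic constraint").
    return m / m.sum(axis=0, keepdims=True)

def markov_cluster(similarity, inflation=2.0, max_iter=50, tol=1e-6):
    """Plain dense MCL sketch; parameter defaults are illustrative assumptions."""
    m = np.asarray(similarity, dtype=float).copy()
    # Assumption: add self-loops so that no column is all zeros.
    np.fill_diagonal(m, np.maximum(m.diagonal(), 1.0))
    m = normalize_columns(m)
    for _ in range(max_iter):
        previous = m
        m = m @ m                    # expansion: matrix squaring
        m = np.power(m, inflation)   # inflation: element-wise power
        m = normalize_columns(m)     # renormalize in each main loop
        if np.abs(m - previous).max() < tol:
            break
    return m

def clusters_from_matrix(m, eps=1e-6):
    # Assumption: nodes whose columns share the same set of
    # non-negligible rows (attractors) belong to one cluster.
    groups = {}
    for node in range(m.shape[1]):
        support = tuple(int(i) for i in np.flatnonzero(m[:, node] > eps))
        groups.setdefault(support, []).append(node)
    return sorted(groups.values())

# Toy similarity graph: nodes 0-2 form one group, nodes 3-4 another.
toy = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)
toy_clusters = clusters_from_matrix(markov_cluster(toy))
```

On real protein similarity data the matrix is large and sparse, which is why the paper focuses on a sparse structure and a symmetric squaring shortcut; the dense version here only shows the order of operations.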
Affiliation(s)
- Sándor M Szilágyi
- Petru Maior University, Department of Informatics, Str. Nicolae Iorga Nr. 1, 540088 Tîrgu Mureş, Romania.
- László Szilágyi
- Budapest University of Technology and Economics, Department of Control Engineering and Information Technology, Magyar tudósok krt. 2, H-1117 Budapest, Hungary; Sapientia University of Transylvania, Faculty of Technical and Human Sciences, Şoseaua Sighişoarei 1/C, 540485 Tîrgu Mureş, Romania.
8
Promponas VJ, Ouzounis CA, Iliopoulos I. Experimental evidence validating the computational inference of functional associations from gene fusion events: a critical survey. Brief Bioinform 2012; 15:443-54. [PMID: 23220349 PMCID: PMC4017328 DOI: 10.1093/bib/bbs072] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Indexed: 11/16/2022] Open
Abstract
More than a decade ago, a number of methods were proposed for the inference of protein interactions, using whole-genome information from gene clusters, gene fusions and phylogenetic profiles. This structural and evolutionary view of entire genomes has provided a valuable approach for the functional characterization of proteins, especially those without sequence similarity to proteins of known function. Furthermore, this view has raised the real possibility to detect functional associations of genes and their corresponding proteins for any entire genome sequence. Yet, despite these exciting developments, there have been relatively few cases of real use of these methods outside the computational biology field, as reflected from citation analysis. These methods have the potential to be used in high-throughput experimental settings in functional genomics and proteomics to validate results with very high accuracy and good coverage. In this critical survey, we provide a comprehensive overview of 30 most prominent examples of single pairwise protein interaction cases in small-scale studies, where protein interactions have either been detected by gene fusion or yielded additional, corroborating evidence from biochemical observations. Our conclusion is that with the derivation of a validated gold-standard corpus and better data integration with big experiments, gene fusion detection can truly become a valuable tool for large-scale experimental biology.
Affiliation(s)
- Vasilis J Promponas
- Institute of Agrobiotechnology, Centre for Research & Technology Hellas (CERTH), 57001 Thessaloniki, Greece.
9
Midha M, Polavarapu R, Meetei PA, Krishnan H, Mohareer K, Vindal V. OrFin: A web tool for detection of putative orthologs. Bioinformation 2012; 8:738-9. [PMID: 23055622 PMCID: PMC3449379 DOI: 10.6026/97320630008738] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 07/13/2012] [Accepted: 07/16/2012] [Indexed: 11/23/2022] Open
Abstract
Identification of orthologs is one of the important tasks in understanding a novel genome. It helps to transfer functional annotations from one organism to another. To identify putative orthologs, the Reciprocal Best BLAST Hit (RBBH) method is known to be an efficient approach. OrFin uses the same approach to identify pairs of orthologous proteins for a given set of sequences from two species. It is a user-friendly web tool that works with user-defined parameters to search for RBBHs. Results are produced in both HTML and text format. AVAILABILITY: This web tool is freely available at http://bifl.uohyd.ac.in/orfin.
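The reciprocal-best-hit idea named in this abstract can be sketched in a few lines. This is not OrFin's code. The input format (tuples of query, subject, and bit score, as might be parsed from tabular BLAST output), the function names, and the sequence identifiers are hypothetical and chosen only for illustration.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, score) tuples; this layout
    is an assumption, not a format defined by OrFin or BLAST itself.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Hypothetical hits between species A and species B.
a_vs_b = [("A1", "B1", 250.0), ("A1", "B2", 90.0), ("A2", "B2", 180.0), ("A3", "B1", 60.0)]
b_vs_a = [("B1", "A1", 245.0), ("B2", "A2", 175.0), ("B2", "A1", 88.0), ("B3", "A3", 70.0)]
rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

In this toy data A3's best hit is B1, but B1's best hit is A1, so A3 is not reported as having a putative ortholog. In practice an E-value or coverage cutoff would normally be applied before taking best hits; the abstract mentions user-defined parameters but does not specify them.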
Affiliation(s)
- Mohit Midha
- Department of Biotechnology, School of Life Sciences, University of Hyderabad, Hyderabad 500046, India
- Raja Polavarapu
- Department of Biotechnology, School of Life Sciences, University of Hyderabad, Hyderabad 500046, India
- Hari Krishnan
- Department of Biotechnology, School of Life Sciences, University of Hyderabad, Hyderabad 500046, India
- Vaibhav Vindal
- Department of Biotechnology, School of Life Sciences, University of Hyderabad, Hyderabad 500046, India
10
Mazandu GK, Mulder NJ. Function prediction and analysis of Mycobacterium tuberculosis hypothetical proteins. Int J Mol Sci 2012; 13:7283-7302. [PMID: 22837694 PMCID: PMC3397526 DOI: 10.3390/ijms13067283] [Citation(s) in RCA: 68] [Impact Index Per Article: 5.7] [Received: 04/02/2012] [Revised: 05/28/2012] [Accepted: 06/07/2012] [Indexed: 11/16/2022] Open
Abstract
High-throughput biology technologies have yielded complete genome sequences and functional genomics data for several organisms, including crucial microbial pathogens of humans, animals and plants. However, up to 50% of genes within a genome are often labeled "unknown", "uncharacterized" or "hypothetical", limiting our understanding of virulence and pathogenicity of these organisms. Even though biological functions of proteins encoded by these genes are not known, many of them have been predicted to be involved in key processes in these organisms. In particular, for Mycobacterium tuberculosis, some of these "hypothetical" proteins, for example those belonging to the Pro-Glu or Pro-Pro-Glu (PE/PPE) family, have been suspected to play a crucial role in the intracellular lifestyle of this pathogen, and may contribute to its survival in different environments. We have generated a functional interaction network for Mycobacterium tuberculosis proteins and used this to predict functions for many of its hypothetical proteins. Here we performed functional enrichment analysis of these proteins based on their predicted biological functions to identify annotations that are statistically relevant, and analysed and compared network properties of hypothetical proteins to the known proteins. From the statistically significant annotations and network information, we have tried to derive biologically meaningful annotations related to infection and disease. This quantitative analysis provides an overview of the functional contributions of Mycobacterium tuberculosis "hypothetical" proteins to many basic cellular functions, including its adaptability in the host system and its ability to evade the host immune response.
Affiliation(s)
- Nicola J. Mulder
- Author to whom correspondence should be addressed; Tel.: +27-21-406-6058; Fax: +27-21-406-6068
11
Szilágyi L, Medvés L, Szilágyi SM. A modified Markov clustering approach to unsupervised classification of protein sequences. Neurocomputing 2010. [DOI: 10.1016/j.neucom.2010.02.023] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/27/2022]
12
Corchado JM, De Paz JF, Rodríguez S, Bajo J. Model of experts for decision support in the diagnosis of leukemia patients. Artif Intell Med 2009; 46:179-200. [DOI: 10.1016/j.artmed.2008.12.001] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.4] [Received: 06/09/2008] [Revised: 11/11/2008] [Accepted: 12/01/2008] [Indexed: 11/26/2022]
13
Tsoka S. Computational methodologies for genome evolution and functional association. Comput Chem Eng 2007. [DOI: 10.1016/j.compchemeng.2006.11.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
14
Abstract
The study of gene expression profiling of cells and tissue has become a major tool for discovery in medicine. Microarray experiments allow description of genome-wide expression changes in health and disease. The results of such experiments are expected to change the methods employed in the diagnosis and prognosis of disease in obstetrics and gynecology. Moreover, an unbiased and systematic study of gene expression profiling should allow the establishment of a new taxonomy of disease for obstetric and gynecologic syndromes. Thus, a new era is emerging in which reproductive processes and disorders could be characterized using molecular tools and fingerprinting. The design, analysis, and interpretation of microarray experiments require specialized knowledge that is not part of the standard curriculum of our discipline. This article describes the types of studies that can be conducted with microarray experiments (class comparison, class prediction, class discovery). We discuss key issues pertaining to experimental design, data preprocessing, and gene selection methods. Common types of data representation are illustrated. Potential pitfalls in the interpretation of microarray experiments, as well as the strengths and limitations of this technology, are highlighted. This article is intended to assist clinicians in appraising the quality of the scientific evidence now reported in the obstetric and gynecologic literature.
Affiliation(s)
- Adi L. Tarca
- Perinatology Research Branch, National Institute of Child Health and Human Development, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, and Detroit, MI
- Department of Computer Science, Wayne State University
- Roberto Romero
- Perinatology Research Branch, National Institute of Child Health and Human Development, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, and Detroit, MI
- Center for Molecular Medicine and Genetics, Wayne State University
- Sorin Draghici
- Department of Computer Science, Wayne State University
- Karmanos Cancer Institute, Detroit, MI
15
Yura K, Yamaguchi A, Go M. Coverage of whole proteome by structural genomics observed through protein homology modeling database. J Struct Funct Genomics 2006; 7:65-76. [PMID: 17146617 PMCID: PMC1769342 DOI: 10.1007/s10969-006-9010-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Received: 05/11/2006] [Accepted: 08/08/2006] [Indexed: 11/07/2022]
Abstract
We have been developing FAMSBASE, a protein homology-modeling database of whole ORFs predicted from genome sequences. The latest update of FAMSBASE ( http://daisy.nagahama-i-bio.ac.jp/Famsbase/ ), which is based on the protein three-dimensional (3D) structures released by November 2003, contains modeled 3D structures for 368,724 open reading frames (ORFs) derived from genomes of 276 species, namely 17 archaebacterial, 130 eubacterial, 18 eukaryotic and 111 phage genomes. Those 276 genomes are predicted to have 734,193 ORFs in total and the current FAMSBASE contains protein 3D structure of approximately 50% of the ORF products. However, cases that a modeled 3D structure covers the whole part of an ORF product are rare. When portion of an ORF with 3D structure is compared in three kingdoms of life, in archaebacteria and eubacteria, approximately 60% of the ORFs have modeled 3D structures covering almost the entire amino acid sequences, however, the percentage falls to about 30% in eukaryotes. When annual differences in the number of ORFs with modeled 3D structure are calculated, the fraction of modeled 3D structures of soluble protein for archaebacteria is increased by 5%, and that for eubacteria by 7% in the last 3 years. Assuming that this rate would be maintained and that determination of 3D structures for predicted disordered regions is unattainable, whole soluble protein model structures of prokaryotes without the putative disordered regions will be in hand within 15 years. For eukaryotic proteins, they will be in hand within 25 years. The 3D structures we will have at those times are not the 3D structure of the entire proteins encoded in single ORFs, but the 3D structures of separate structural domains. Measuring or predicting spatial arrangements of structural domains in an ORF will then be a coming issue of structural genomics.
Affiliation(s)
- Kei Yura
- Quantum Bioinformatics Team, Center for Computational Science and Engineering, Japan Atomic Energy Agency, Kyoto 619-0215, Japan.
16
Tsoka S, Simon D, Ouzounis CA. Automated metabolic reconstruction for Methanococcus jannaschii. Archaea 2005; 1:223-9. [PMID: 15810431 PMCID: PMC2685575 DOI: 10.1155/2004/324925] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.5] [Indexed: 11/18/2022]
Abstract
We present the computational prediction and synthesis of the metabolic pathways in Methanococcus jannaschii from its genomic sequence using the PathoLogic software. Metabolic reconstruction is based on a reference knowledge base of metabolic pathways and is performed with minimal manual intervention. We predict the existence of 609 metabolic reactions that are assembled in 113 metabolic pathways and an additional 17 super-pathways consisting of one or more component pathways. These assignments represent significantly improved enzyme and pathway predictions compared with previous metabolic reconstructions, and some key metabolic reactions, previously missing, have been identified. Our results, in the form of enzymatic assignments and metabolic pathway predictions, form a database (MJCyc) that is accessible over the World Wide Web for further dissemination among members of the scientific community.
Affiliation(s)
- Sophia Tsoka
- Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK.
17
Tsoka S, Ouzounis CA. Metabolic database systems for the analysis of genome-wide function. Biotechnol Bioeng 2004; 84:750-5. [PMID: 14708115 DOI: 10.1002/bit.10881] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/09/2022]
Abstract
Genome sequencing projects provide an inventory of molecular components for a wide variety of organisms. Metabolic databases integrate these functional descriptions of individual modules into a higher-level characterization of cellular metabolism. This article reviews efforts related to the development of metabolic databases and discusses how such systems have aided the delineation of genome properties. We illustrate the design features of metabolic databases and discuss the challenges facing metabolic as well as databases of other functional type.
Affiliation(s)
- Sophia Tsoka
- Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK.
|
18
|
Abstract
Classification of proteins into families is one of the main goals of functional analysis. Proteins are usually assigned to a family on the basis of the presence of family-specific patterns, domains, or structural elements. Whereas proteins belonging to the same family are generally similar to each other, the extent of similarity varies widely across families. Some families are characterized by short, well-defined motifs, whereas others contain longer, less-specific motifs. We present a simple method for visualizing such differences. We applied our method to the Arabidopsis thaliana families listed at The Arabidopsis Information Resource (TAIR) Web site and for 76% of the nontrivial families (families with more than one member), our method identifies simple similarity measures that are necessary and sufficient to cluster members of the family together. Our visualization method can be used as part of an annotation pipeline to identify potentially incorrectly defined families. We also describe how our method can be extended to identify novel families and to assign unclassified proteins into known families.
Affiliation(s)
- Vamsi Veeramachaneni
- Department of Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
|
19
|
von Mering C, Zdobnov EM, Tsoka S, Ciccarelli FD, Pereira-Leal JB, Ouzounis CA, Bork P. Genome evolution reveals biochemical networks and functional modules. Proc Natl Acad Sci U S A 2003; 100:15428-33. [PMID: 14673105 PMCID: PMC307584 DOI: 10.1073/pnas.2136809100] [Citation(s) in RCA: 120] [Impact Index Per Article: 5.7] [Indexed: 12/25/2022] Open
Abstract
The analysis of completely sequenced genomes uncovers an astonishing variability between species in terms of gene content and order. During genome history, the genes are frequently rearranged, duplicated, lost, or transferred horizontally between genomes. These events appear to be stochastic, yet they are under selective constraints resulting from the functional interactions between genes. These genomic constraints form the basis for a variety of techniques that employ systematic genome comparisons to predict functional associations among genes. The most powerful techniques to date are based on conserved gene neighborhood, gene fusion events, and common phylogenetic distributions of gene families. Here we show that these techniques, if integrated quantitatively and applied to a sufficiently large number of genomes, have reached a resolution which allows the characterization of function at a higher level than that of the individual gene: global modularity becomes detectable in a functional protein network. In Escherichia coli, the predicted modules can be benchmarked by comparison to known metabolic pathways. We found as many as 74% of the known metabolic enzymes clustering together in modules, with an average pathway specificity of at least 84%. The modules extend beyond metabolism, and have led to hundreds of reliable functional predictions both at the protein and pathway level. The results indicate that modularity in protein networks is intrinsically encoded in present-day genomes.
Affiliation(s)
- Christian von Mering
- European Molecular Biology Laboratory, Meyerhofstrasse 1, D-69117 Heidelberg, Germany
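One of the genome-context signals named in the abstract above, common phylogenetic distribution of gene families, can be illustrated with a small sketch. This is not the authors' scoring scheme: the Jaccard index, the threshold value, and the function names `jaccard` and `profile_links` are assumptions chosen for illustration, and the gene and genome labels are made up.

```python
# Illustrative sketch only: link gene families whose presence/absence
# patterns across genomes are similar. The Jaccard index and the 0.75
# threshold are assumptions, not the scoring used in the cited paper.

def jaccard(a, b):
    """Jaccard similarity of two sets of genome identifiers."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def profile_links(presence, threshold=0.75):
    """Return (family1, family2, score) for pairs with similar profiles.

    presence maps a gene-family name to the set of genomes containing it.
    """
    families = sorted(presence)
    links = []
    for i, first in enumerate(families):
        for second in families[i + 1:]:
            score = jaccard(presence[first], presence[second])
            if score >= threshold:
                links.append((first, second, score))
    return links


# Hypothetical toy data: famA and famB co-occur in the same genomes.
toy_presence = {
    "famA": {"genome1", "genome2", "genome3", "genome4"},
    "famB": {"genome1", "genome2", "genome3", "genome4"},
    "famC": {"genome5"},
}
print(profile_links(toy_presence))
```

In practice such a signal would be combined with gene neighborhood and gene fusion evidence, as the abstract describes; this sketch covers only the single-signal case.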
|
20
|
Audit B, Ouzounis CA. From genes to genomes: universal scale-invariant properties of microbial chromosome organisation. J Mol Biol 2003; 332:617-33. [PMID: 12963371 DOI: 10.1016/s0022-2836(03)00811-8] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.6] [Indexed: 10/27/2022]
Abstract
The availability of complete genome sequences for a large variety of organisms is a major advance in understanding genome structure and function. One attribute of genome structure is chromosome organisation in terms of gene localisation and orientation. For example, bacterial operons, i.e. clusters of co-oriented genes that form transcription units, enable functionally related genes to be expressed simultaneously. The description of genome organisation was pioneered with the study of the distribution of genes of the Escherichia coli partial genetic map before the full genome sequence was known. Deploying powerful techniques from circular statistics and signal processing, we revisit the issue of gene localisation and orientation using 89 complete microbial chromosomes from the eubacterial and archaeal domains. We demonstrate that there is no characteristic size pertinent to the description of chromosome structure, e.g. there does not exist any single length appropriate to describe gene clustering. Our results show that, for all 89 chromosomes, gene positions and gene orientations share a common form of scale-invariant correlations known as "long-range correlations" that we can reveal for distances from the gene length, up to the chromosome size. This observation indicates that genes tend to assemble and to co-orient over any scale of observation greater than a few kilobases. This unexpected property of chromosome structure can be portrayed as an operon-like organisation at all scales and implies that a complete scale range extending over more than three orders of magnitudes of chromosome segment lengths is necessary to properly describe prokaryotic genome organisation. We propose that this pattern results from the effects of the superhelical context on gene expression coupled with the structure and dynamics of the nucleoid, possibly accommodating the diverse gene expression profiles needed during the different stages of cellular life.
Affiliation(s)
- Benjamin Audit
- Wellcome Trust Genome Campus, Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge, CB10 1SD, UK
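The abstract above mentions circular statistics for gene localisation on microbial chromosomes. As a rough illustration of the basic primitive only (the paper's long-range correlation analysis is considerably more involved), the sketch below maps gene positions on a circular chromosome to angles and computes the mean resultant length. The function name `mean_resultant` and the toy coordinates are assumptions.

```python
import math

# Illustrative sketch only: the mean resultant length R of gene positions
# on a circular chromosome. R near 1 means positions are concentrated
# around one location; R near 0 means no single preferred location.

def mean_resultant(positions, chromosome_length):
    """Return (R, mean_position) for positions on a circular chromosome."""
    angles = [2 * math.pi * p / chromosome_length for p in positions]
    c = sum(math.cos(a) for a in angles) / len(angles)
    s = sum(math.sin(a) for a in angles) / len(angles)
    r = math.hypot(c, s)
    # The mean direction is only meaningful when R is clearly above zero.
    mean_angle = math.atan2(s, c) % (2 * math.pi)
    return r, mean_angle * chromosome_length / (2 * math.pi)


# Hypothetical coordinates on a 1000-unit circular chromosome.
r_even, _ = mean_resultant([0, 250, 500, 750], 1000)
r_tight, centre = mean_resultant([100, 100, 100], 1000)
print(r_even, r_tight, centre)
```

Working with angles rather than raw coordinates avoids the artificial break at the chromosome origin, which is the reason circular rather than linear statistics are used for such data.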
|
21
|
Peregrin-Alvarez JM, Tsoka S, Ouzounis CA. The phylogenetic extent of metabolic enzymes and pathways. Genome Res 2003; 13:422-7. [PMID: 12618373 PMCID: PMC430287 DOI: 10.1101/gr.246903] [Citation(s) in RCA: 77] [Impact Index Per Article: 3.7] [Indexed: 11/24/2022]
Abstract
The evolution of metabolic enzymes and pathways has been a subject of intense study for more than half a century. Yet, so far, previous studies have focused on a small number of enzyme families or biochemical pathways. Here, we examine the phylogenetic distribution of the full-known metabolic complement of Escherichia coli, using sequence comparison against taxa-specific databases. Half of the metabolic enzymes have homologs in all domains of life, representing families involved in some of the most fundamental cellular processes. We thus show for the first time and in a comprehensive way that metabolism is conserved at the enzyme level. In addition, our analysis suggests that despite the sequence conservation and the extensive phylogenetic distribution of metabolic enzymes, their groupings into biochemical pathways are much more variable than previously thought.
Affiliation(s)
- José Manuel Peregrin-Alvarez
- Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK
|
22
|
Abstract
Bioinformatics is the discipline that develops and applies informatics to the field of molecular biology. Although a comprehensive review of the entire field of bioinformatics is beyond the scope of this article, I review the basic tenets of the field and provide a topical sampling of the popular technologies available to clinicians and researchers. These technologies include tools and methods for sequence analysis (nucleotide and protein sequences), rendering of secondary and tertiary structures for these molecules, and protein fold prediction that can lead to rational drug design. I then discuss signaling pathways, new standards for data representation of genes and proteins, and finally the promise of merging these molecular data with the clinical world (the new science of phenomics).
Affiliation(s)
- Peter L Elkin
- Division of Area General Internal Medicine, Mayo Clinic, Rochester, Minn 55905, USA
|
23
|
Ayoubi P, Jin X, Leite S, Liu X, Martajaja J, Abduraham A, Wan Q, Yan W, Misawa E, Prade RA. PipeOnline 2.0: automated EST processing and functional data sorting. Nucleic Acids Res 2002; 30:4761-9. [PMID: 12409467 PMCID: PMC135791 DOI: 10.1093/nar/gkf585] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.0] [Indexed: 11/13/2022] Open
Abstract
Expressed sequence tags (ESTs) are generated and deposited in the public domain, as redundant, unannotated, single-pass reactions, with virtually no biological content. PipeOnline automatically analyses and transforms large collections of raw DNA-sequence data from chromatograms or FASTA files by calling the quality of bases, screening and removing vector sequences, assembling and rewriting consensus sequences of redundant input files into a unigene EST data set and finally through translation, amino acid sequence similarity searches, annotation of public databases and functional data. PipeOnline generates an annotated database, retaining the processed unigene sequence, clone/file history, alignments with similar sequences, and proposed functional classification, if available. Functional annotation is automatic and based on a novel method that relies on homology of amino acid sequence multiplicity within GenBank records. Records are examined through a function ordered browser or keyword queries with automated export of results. PipeOnline offers customization for individual projects (MyPipeOnline), automated updating and alert service. PipeOnline is available at http://stress-genomics.org.
Affiliation(s)
- Patricia Ayoubi
- Department of Microbiology and Molecular Genetics and School of Mechanical and Aerospace Engineering, Oklahoma State University, Stillwater, OK 74078, USA
|
24
|
Rigoutsos I, Huynh T, Floratos A, Parida L, Platt D. Dictionary-driven protein annotation. Nucleic Acids Res 2002; 30:3901-16. [PMID: 12202776 PMCID: PMC137405 DOI: 10.1093/nar/gkf464] [Citation(s) in RCA: 20] [Impact Index Per Article: 0.9] [Received: 01/17/2002] [Revised: 06/04/2002] [Accepted: 06/04/2002] [Indexed: 11/14/2022] Open
Abstract
Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were released publicly after we built the Bio-Dictionary that is used in our experiments. Finally, we have computed the annotations of more than 70 complete genomes and made them available on the World Wide Web at http://cbcsrv.watson.ibm.com/Annotations/.
Affiliation(s)
- Isidore Rigoutsos
- Bioinformatics and Pattern Discovery Group, IBM TJ Watson Research Center, Yorktown Heights, NY 10598, USA.
|
25
|
Affiliation(s)
- Ardeshir Bayat
- Centre for Integrated Genomic Medical Research, University of Manchester, Manchester M13 9PT.
|
26
|
Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 2002; 30:1575-84. [PMID: 11917018 PMCID: PMC101833 DOI: 10.1093/nar/30.7.1575] [Citation(s) in RCA: 2344] [Impact Index Per Article: 106.5] [Indexed: 01/18/2023] Open
Abstract
Detection of protein families in large databases is one of the principal research objectives in structural and functional genomics. Protein family classification can significantly contribute to the delineation of functional diversity of homologous proteins, the prediction of function based on domain architecture or the presence of sequence motifs as well as comparative genomics, providing valuable evolutionary insights. We present a novel approach called TRIBE-MCL for rapid and accurate clustering of protein sequences into families. The method relies on the Markov cluster (MCL) algorithm for the assignment of proteins into families based on precomputed sequence similarity information. This novel approach does not suffer from the problems that normally hinder other protein sequence clustering algorithms, such as the presence of multi-domain proteins, promiscuous domains and fragmented proteins. The method has been rigorously tested and validated on a number of very large databases, including SwissProt, InterPro, SCOP and the draft human genome. Our results indicate that the method is ideally suited to the rapid and accurate detection of protein families on a large scale. The method has been used to detect and categorise protein families within the draft human genome and the resulting families have been used to annotate a large proportion of human proteins.
Affiliation(s)
- A J Enright
- Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK.
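The Markov cluster (MCL) idea referred to in the abstract above can be sketched in a few lines. This is a minimal, unoptimised illustration and not the TRIBE-MCL implementation: the similarity values are made up, and the function names, the self-loop weight of 1.0, the inflation value of 2.0, and the convergence tolerance are all assumptions. The procedure alternates expansion (squaring a column-stochastic matrix) with inflation (raising entries to a power and renormalising columns) until the matrix stops changing.

```python
# Illustrative sketch only: a pure-Python Markov clustering loop on a
# small, symmetric similarity matrix. Not the TRIBE-MCL implementation.

def normalize_columns(m):
    """Scale each column so that it sums to 1 (column-stochastic)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion step: matrix squaring (two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, power):
    """Inflation step: element-wise power, then column renormalisation."""
    return normalize_columns([[x ** power for x in row] for row in m])


def markov_cluster(similarity, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster items given a symmetric, non-negative similarity matrix."""
    n = len(similarity)
    # Add self-loops (weight 1.0 is an assumption) before normalising.
    m = normalize_columns(
        [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)])
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break
    # Each row that retains mass lists the members of one cluster.
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6)
                for i in range(n)}
    return sorted(sorted(c) for c in clusters if c)


# Hypothetical scores: proteins 0-2 are mutually similar, as are 3-4.
toy_similarity = [
    [0.0, 0.9, 0.9, 0.0, 0.0],
    [0.9, 0.0, 0.9, 0.0, 0.0],
    [0.9, 0.9, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.9, 0.0],
]
print(markov_cluster(toy_similarity))
```

Higher inflation values generally yield finer-grained clusters; a real application would use sparse matrices and similarity scores derived from sequence comparison rather than the dense toy matrix shown here.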
|
27
|
Abstract
Annotation, the process by which structural or functional information is inferred for genes or proteins, is crucial for obtaining value from genome sequences. We define the process of annotating a previously annotated genome sequence as 're-annotation', and examine the strengths and weaknesses of current manual and automatic genome-wide re-annotation approaches.
Affiliation(s)
- Christos A Ouzounis
- Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK.
- Peter D Karp
- Bioinformatics Research Group, AI Center, SRI International, Menlo Park, CA 94025, USA.
|
28
|
Dandekar T, Du F, Schirmer RH, Schmidt S. Medical target prediction from genome sequence: combining different sequence analysis algorithms with expert knowledge and input from artificial intelligence approaches. Comput Chem 2001; 26:15-21. [PMID: 11765847 DOI: 10.1016/s0097-8485(01)00095-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/27/2022]
Abstract
By exploiting the rapid increase in available sequence data, the definition of medically relevant protein targets has been improved by a combination of: (i) differential genome analysis (target list); and (ii) analysis of individual proteins (target analysis). Fast sequence comparisons, data mining, and genetic algorithms further promote these procedures. Mycobacterium tuberculosis proteins were chosen as applied examples.
Affiliation(s)
- T Dandekar
- European Molecular Biology Laboratory, PO Box 102209, Meyerhofstrasse 1, D-69012 Heidelberg, Germany.
|
29
|
Lecompte O, Thompson JD, Plewniak F, Thierry J, Poch O. Multiple alignment of complete sequences (MACS) in the post-genomic era. Gene 2001; 270:17-30. [PMID: 11403999 DOI: 10.1016/s0378-1119(01)00461-9] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.0] [Indexed: 11/18/2022]
Abstract
Multiple alignment, since its introduction in the early seventies, has become a cornerstone of modern molecular biology. It has traditionally been used to deduce structure / function by homology, to detect conserved motifs and in phylogenetic studies. There has recently been some renewed interest in the development of multiple alignment techniques, with current opinion moving away from a single all-encompassing algorithm to iterative and / or co-operative strategies. The exploitation of multiple alignments in genome annotation projects represents a qualitative leap in the functional analysis process, opening the way to the study of the co-evolution of validated sets of proteins and to reliable phylogenomic analysis. However, the alignment of the highly complex proteins detected by today's advanced database search methods is a daunting task. In addition, with the explosion of the sequence databases and with the establishment of numerous specialized biological databases, multiple alignment programs must evolve if they are to successfully rise to the new challenges of the post-genomic era. The way forward is clearly an integrated system bringing together sequence data, knowledge-based systems and prediction methods with their inherent unreliability. The incorporation of such heterogeneous, often non-consistent, data will require major changes to the fundamental alignment algorithms used to date. Such an integrated multiple alignment system will provide an ideal workbench for the validation, propagation and presentation of this information in a format that is concise, clear and intuitive.
Affiliation(s)
- O Lecompte
- Laboratoire de Biologie et Génomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire (CNRS/INSERM/ULP), BP 163, 67404 Cedex, Illkirch, France
|
30
|
Abstract
Fold assignments for newly sequenced genomes belong to the most important and interesting applications of the booming field of protein structure prediction. We present a brief survey and a discussion of such assignments completed to date, using as an example several fold assignment projects for proteins from the Escherichia coli genome. This review focuses on steps that are necessary to go beyond the simple assignment projects and into the development of tools extending our understanding of functions of proteins in newly sequenced genomes. This paper also discusses several problems seldom addressed in the literature, such as the problem of domain prediction and complementary predictions (e.g., transmembrane regions and flexible regions) and cross-correlation of predictions from different servers. The influence of sequence and structure database growth on prediction success is also addressed. Finally, we discuss the perspectives of the field in the context of massive sequence and structure determination projects, as well as the development of novel prediction methods.
Affiliation(s)
- K Pawlowski
- AstraZeneca R&D Lund, Lund, S-221 87, Sweden
|
31
|
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2001. [PMCID: PMC2447194 DOI: 10.1002/cfg.56] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/04/2022] Open
|
32
|
Iliopoulos I, Tsoka S, Andrade MA, Janssen P, Audit B, Tramontano A, Valencia A, Leroy C, Sander C, Ouzounis CA. Genome sequences and great expectations. Genome Biol 2001; 2:INTERACTIONS0001. [PMID: 11178275 PMCID: PMC150431 DOI: 10.1186/gb-2000-2-1-interactions0001] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.5] [Indexed: 11/15/2022] Open
Abstract
To assess how automatic function assignment will contribute to genome annotation in the next five years, we have performed an analysis of 31 available genome sequences. An emerging pattern is that function can be predicted for almost two-thirds of the 73,500 genes that were analyzed. Despite progress in computational biology, there will always be a great need for large-scale experimental determination of protein function.
Affiliation(s)
- Ioannis Iliopoulos
- Computational Genomics Group, Research Programme, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge, CB10 1SD, UK
- Sophia Tsoka
- Computational Genomics Group, Research Programme, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge, CB10 1SD, UK
- Miguel A Andrade
- Biological Structures and BioComputing Programme, EMBL, Meyerhofstrasse 1, Heidelberg, Germany
- Paul Janssen
- Computational Genomics Group, Research Programme, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge, CB10 1SD, UK
- Benjamin Audit
- Computational Genomics Group, Research Programme, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge, CB10 1SD, UK
- Anna Tramontano
- Department of Computational Biology and Chemistry, IRBM, Via Pontina, Pomezia, Rome, Italy
- Alfonso Valencia
- Protein Design Group, National Center for Biotechnology, Cantoblanco, Madrid, Spain
- Christophe Leroy
- MIT Center for Genome Research, One Kendall Square, Cambridge, MA 02139, USA
- Chris Sander
- MIT Center for Genome Research, One Kendall Square, Cambridge, MA 02139, USA
- Christos A Ouzounis
- Computational Genomics Group, Research Programme, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge, CB10 1SD, UK
|
33
|
Abstract
Automated sequence technology has rendered functional biology amenable to genomic scale analysis. Among genome-wide exploratory approaches, the two-hybrid system in yeast (Y2H) has outranked other techniques because it is the system of choice to detect protein-protein interactions. Deciphering the cascade of binding events in a whole cell helps define signal transduction and metabolic pathways or enzymatic complexes. The function of proteins is eventually attributed through whole cell protein interaction maps where totally unknown proteins are partnered with fully annotated proteins belonging to the same functional category. Since its first description in the late 1980's, several versions of the Y2H have been developed in order to overcome the major limitations of the system, namely false positives and false negatives. Optimized versions have been recently applied at multi-molecular and genomic scale. These genome-wide surveys can be methodologically divided into two types of approaches: one either tests combinations of predefined polypeptides (the so-called matrix approach) using various short-cuts to speed up the process, or one screens with a given polypeptide (bait) for potential partners (preys) present in complex libraries of genomic or complementary DNA (library screening). In the former strategy, one tests what one knows, for example pair-wise interactions between full-length open reading frames from recently sequenced and annotated genomes. Although based on a one-by-one scheme, this method is reported to be amenable to large-scale genomics thanks to multicloning strategies and to the use of small robotics workstations. In the latter, highly complex cDNA or genomic libraries of protein domains can be screened to saturation with high-throughput screening systems allowing the discovery of yet unidentified proteins. Both approaches have strengths and drawbacks that will be discussed here. None yields a full proteome-wide screening since certain proteins (e.g. some transcription factors) are not usable in Y2H. Novel two-hybrid assays have been recently described in bacteria. Applications of these time- and cost-effective assays to genomic screening will be discussed and compared to the Y2H technology.
|