1
Hendargo KJ, Patel AO, Chukwudozie OS, Moreno-Hagelsieb G, Christen JA, Medrano-Soto A, Saier MH. Sequence Similarity among Structural Repeats in the Piezo Family of Mechanosensitive Ion Channels. Microb Physiol 2023; 33:49-62. [PMID: 37321192 PMCID: PMC11283329 DOI: 10.1159/000531468] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2022] [Accepted: 06/05/2023] [Indexed: 06/17/2023]
Abstract
Members of the Piezo family of mechanically activated cation channels are involved in multiple physiological processes in higher eukaryotes, including vascular development, cell differentiation, touch perception, hearing, and more, but they are also common in single-celled eukaryotic microorganisms. Mutations in these proteins in humans are associated with a variety of diseases, such as colorectal adenomatous polyposis, dehydrated hereditary stomatocytosis, and hereditary xerocytosis. Available 3D structures for Piezo proteins show nine regions of four transmembrane segments each that have the same fold. Despite the remarkable similarity among the nine characteristic structural repeats in the family, no significant sequence similarity among them has been reported. Using bioinformatics approaches and the Transporter Classification Database (TCDB) as reference, we reliably identified sequence similarity among repeats based on four lines of evidence: (1) hidden Markov model-profile similarities across repeats at the family level, (2) pairwise sequence similarities between different repeats across Piezo homologs, (3) Piezo-specific conserved sequence signatures that consistently identify the same regions across repeats, and (4) conserved residues that maintain the same orientation and location in 3D space.
Affiliation(s)
- Kevin J. Hendargo
- Department of Molecular Biology, School of Biological Sciences, University of California, San Diego, CA, USA
- Ashay O. Patel
- Department of Molecular Biology, School of Biological Sciences, University of California, San Diego, CA, USA
- Onyeka S. Chukwudozie
- Department of Molecular Biology, School of Biological Sciences, University of California, San Diego, CA, USA
- J. Andrés Christen
- Departamento de Probabilidad y Estadística, Centro de Investigación en Matemáticas, CIMAT, Guanajuato, Mexico
- Arturo Medrano-Soto
- Department of Molecular Biology, School of Biological Sciences, University of California, San Diego, CA, USA
- Milton H. Saier
- Department of Molecular Biology, School of Biological Sciences, University of California, San Diego, CA, USA
2
Tantoso E, Eisenhaber B, Eisenhaber F. Optimizing the Parametrization of Homologue Classification in the Pan-Genome Computation for a Bacterial Species: Case Study Streptococcus pyogenes. Methods Mol Biol 2022; 2449:299-324. [PMID: 35507269 DOI: 10.1007/978-1-0716-2095-3_13] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/14/2023]
Abstract
The paradigm shift associated with the introduction of the pan-genome concept has drawn attention away from single reference genomes toward the actual sequence diversity within organism populations, strain collections, clades, etc. A single genome is no longer sufficient to describe bacteria of interest; instead, the genomic repertoire of all existing strains is the key to the metabolic, evolutionary, or pathogenic potential of a species. The classification of orthologous genes derived from a collection of taxonomically related genome sequences is central to bacterial pan-genome computational analysis. In this work, we present a review of methods for computing pan-genome gene clusters, including their comparative analysis for the case of Streptococcus pyogenes strain genomes. We exhaustively scanned the parametrization space of the homologue searching procedures and found optimal parameters (sequence identity of 60% and coverage of 50-60% in the pairwise alignment) for the orthologous clustering of gene sequences. We find that the sequence identity threshold influences the number of gene families about three times more strongly than the sequence coverage threshold.
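The role of the two thresholds named in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Hit` record, the definition of coverage relative to the shorter sequence, and the single-linkage grouping via union-find are all assumptions made here for illustration. Only the default values (60% identity, 50% coverage) come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise alignment between two genes (hypothetical record layout)."""
    query: str
    subject: str
    identity: float    # percent identical positions in the alignment
    aligned_len: int   # number of aligned positions
    query_len: int
    subject_len: int


def passes_thresholds(hit, min_identity=60.0, min_coverage=50.0):
    """Return True if the hit meets both thresholds.

    Assumption: coverage is measured against the shorter of the two sequences.
    """
    shorter = min(hit.query_len, hit.subject_len)
    coverage = 100.0 * hit.aligned_len / shorter
    return hit.identity >= min_identity and coverage >= min_coverage


def gene_families(genes, hits, min_identity=60.0, min_coverage=50.0):
    """Group genes into families by single linkage over accepted hits."""
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent links to the root, compressing the path on the way.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for hit in hits:
        if passes_thresholds(hit, min_identity, min_coverage):
            parent[find(hit.query)] = find(hit.subject)

    families = {}
    for g in genes:
        families.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in families.values())


genes = ["A", "B", "C", "D"]
hits = [
    Hit("A", "B", identity=75.0, aligned_len=90, query_len=100, subject_len=100),
    Hit("B", "C", identity=55.0, aligned_len=95, query_len=100, subject_len=100),
    Hit("C", "D", identity=80.0, aligned_len=40, query_len=100, subject_len=120),
]
families_default = gene_families(genes, hits)
families_low_identity = gene_families(genes, hits, min_identity=50.0)
families_low_coverage = gene_families(genes, hits, min_coverage=40.0)
```

Re-running `gene_families` over a grid of `min_identity` and `min_coverage` values and counting the resulting families is one simple way to see how sensitive the family count is to each threshold.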
Affiliation(s)
- Erwin Tantoso
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), Singapore, Republic of Singapore
- Birgit Eisenhaber
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), Singapore, Republic of Singapore
- Genome Institute Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Republic of Singapore
- Frank Eisenhaber
- Genome Institute and Bioinformatics Institute, Singapore, Singapore.
3
Tyler D, Hendargo KJ, Medrano-Soto A, Saier MH. Discovery and Characterization of the Phospholemman/SIMP/Viroporin Superfamily. Microb Physiol 2022; 32:83-94. [PMID: 35152214 PMCID: PMC9355910 DOI: 10.1159/000521947] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2021] [Accepted: 01/11/2022] [Indexed: 11/19/2022]
Abstract
Using bioinformatic approaches, we present evidence of distant relatedness among the Ephemerovirus Viroporin family, the Rhabdoviridae Putative Viroporin U5 family, the Phospholemman family, and the Small Integral Membrane Protein family. Our approach is based on the transitivity property of homology complemented with five validation criteria: (1) significant sequence similarity and alignment coverage, (2) compatibility of topology of transmembrane segments, (3) overlap of hydropathy profiles, (4) conservation of protein domains, and (5) conservation of sequence motifs. Our results indicate that Pfam protein domains PF02038 and PF15831 can be found in or projected onto members of all four families. In addition, we identified a 26-residue motif conserved across the superfamily. This motif is characterized by hydrophobic residues that help anchor the protein to the membrane and charged residues that constitute phosphorylation sites. In addition, all members of the four families with annotated function are either responsible for or affect the transport of ions into and/or out of the cell. Taken together, these results justify the creation of the novel Phospholemman/SIMP/Viroporin superfamily. Given that transport proteins can be found not just in cells, but also in viruses, the ability to relate viroporin protein families with their eukaryotic and bacterial counterparts is an important development in this superfamily.
Affiliation(s)
- Arturo Medrano-Soto
- Corresponding Authors: Milton H. Saier, Jr. & Arturo Medrano-Soto, Department of Molecular Biology, Division of Biological Sciences, University of California, San Diego, 9500 Gilman Drive #0116, La Jolla, California 92093-0116, Tel: 858-534-4084
- Milton H. Saier
- Corresponding Authors: Milton H. Saier, Jr. & Arturo Medrano-Soto, Department of Molecular Biology, Division of Biological Sciences, University of California, San Diego, 9500 Gilman Drive #0116, La Jolla, California 92093-0116, Tel: 858-534-4084
4
Harrison PM. fLPS 2.0: rapid annotation of compositionally-biased regions in biological sequences. PeerJ 2021; 9:e12363. [PMID: 34760378 PMCID: PMC8557692 DOI: 10.7717/peerj.12363] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 07/23/2021] [Accepted: 09/30/2021] [Indexed: 12/12/2022]
Abstract
Compositionally-biased (CB) regions in biological sequences are enriched for a subset of sequence residue types. These can be shorter regions with a concentrated bias (i.e., those termed ‘low-complexity’), or longer regions that have a compositional skew. These regions comprise a prominent class of the uncharacterized ‘dark matter’ of the protein universe. Here, I report the latest version of the fLPS package for the annotation of CB regions, which includes added consideration of DNA sequences, to label the eight possible biased regions of DNA. In this version, the user is now able to restrict analysis to a specified subset of residue types, and also to filter for previously annotated domains to enable detection of discontinuous CB regions. A ‘thorough’ option has been added which enables the labelling of subtler biases, typically made from a skew for several residue types. In the output, protein CB regions are now labelled with bias classes reflecting the physico-chemical character of the biasing residues. The fLPS 2.0 package is available from: https://github.com/pmharrison/flps2 or in a Supplemental File of this paper.
Affiliation(s)
- Paul M Harrison
- Department of Biology, McGill University, Montreal, QC, Canada
5
Eisenhaber B, Sinha S, Jadalanki CK, Shitov VA, Tan QW, Sirota FL, Eisenhaber F. Conserved sequence motifs in human TMTC1, TMTC2, TMTC3, and TMTC4, new O-mannosyltransferases from the GT-C/PMT clan, are rationalized as ligand binding sites. Biol Direct 2021; 16:4. [PMID: 33436046 PMCID: PMC7801869 DOI: 10.1186/s13062-021-00291-w] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 11/11/2020] [Accepted: 01/04/2021] [Indexed: 12/05/2022]
Abstract
BACKGROUND The human proteins TMTC1, TMTC2, TMTC3 and TMTC4 have been experimentally shown to be components of a new O-mannosylation pathway. Their own mannosyl-transferase activity has been suspected, but their actual enzymatic potential has not yet been demonstrated. So far, sequence analysis of TMTCs has been compromised by evolutionary sequence divergence within their membrane-embedded N-terminal region, sequence inaccuracies in the protein databases and the difficulty of interpreting the large functional variety of known homologous proteins (mostly sugar transferases, some with known 3D structure). RESULTS Evolutionarily conserved molecular function among TMTCs is only possible with conserved membrane topology within their membrane-embedded N-terminal regions, leading to the placement of homologous long intermittent loops at the same membrane side. Using this criterion, we demonstrate that all TMTCs have 11 transmembrane regions. The sequence segment homologous to Pfam model DUF1736 is actually just a loop between TM7 and TM8 that is located in the ER lumen and that contains a small hydrophobic, but not membrane-embedded, helix. Not only do the membrane-embedded N-terminal regions of TMTCs share a common fold and 3D structural similarity with subgroups of GT-C sugar transferases, but the conservation of residues critical for catalysis, for binding of a divalent metal ion and of the phosphate group of a lipid-linked sugar moiety throughout enzymatically and structurally well-studied GT-Cs and sequences of TMTCs also indicates that TMTCs are actually sugar-transferring enzymes. We present credible 3D structural models of all four TMTCs (derived from their closest known homologues 5ezm/5f15) and find that the observed conserved sequence motifs can be rationalized as binding sites for a metal ion and for a dolichyl-phosphate-mannose moiety.
CONCLUSIONS With the results from both careful sequence analysis and structural modelling, we can conclusively say that the TMTCs are enzymatically active sugar transferases belonging to the GT-C/PMT superfamily. The DUF1736 segment, the loop between TM7 and TM8, is critical for catalysis and lipid-linked sugar moiety binding. Together with the available indirect experimental data, we conclude that the TMTCs are not only part of an O-mannosylation pathway in the endoplasmic reticulum of upper eukaryotes but are, in fact, the sought mannosyl-transferases.
Affiliation(s)
- Birgit Eisenhaber
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01 Matrix, Singapore, 138671, Republic of Singapore.
- Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), 60 Biopolis Street, Singapore, 138672, Republic of Singapore.
- Swati Sinha
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01 Matrix, Singapore, 138671, Republic of Singapore
- Chaitanya K Jadalanki
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01 Matrix, Singapore, 138671, Republic of Singapore
- Vladimir A Shitov
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01 Matrix, Singapore, 138671, Republic of Singapore
- Siberian State Medical University, Moskovskiy Trakt, 2, Tomsk, Tomsk Oblast, 634050, Russia
- Qiao Wen Tan
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01 Matrix, Singapore, 138671, Republic of Singapore
- School of Biological Science (SBS), Nanyang Technological University (NTU), 60 Nanyang Drive, Singapore, 637551, Republic of Singapore
- Fernanda L Sirota
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01 Matrix, Singapore, 138671, Republic of Singapore
- Frank Eisenhaber
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01 Matrix, Singapore, 138671, Republic of Singapore.
- Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), 60 Biopolis Street, Singapore, 138672, Republic of Singapore.
- School of Biological Science (SBS), Nanyang Technological University (NTU), 60 Nanyang Drive, Singapore, 637551, Republic of Singapore.
6
Medrano-Soto A, Ghazi F, Hendargo KJ, Moreno-Hagelsieb G, Myers S, Saier MH. Expansion of the Transporter-Opsin-G protein-coupled receptor superfamily with five new protein families. PLoS One 2020; 15:e0231085. [PMID: 32320418 PMCID: PMC7176098 DOI: 10.1371/journal.pone.0231085] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 02/13/2019] [Accepted: 03/17/2020] [Indexed: 02/06/2023]
Abstract
Here we provide bioinformatic evidence that the Organo-Arsenical Exporter (ArsP), Endoplasmic Reticulum Retention Receptor (KDELR), Mitochondrial Pyruvate Carrier (MPC), L-Alanine Exporter (AlaE), and the Lipid-linked Sugar Translocase (LST) protein families are members of the Transporter-Opsin-G Protein-coupled Receptor (TOG) Superfamily. These families share domains homologous to well-established TOG superfamily members, and their topologies of transmembranal segments (TMSs) are compatible with the basic 4-TMS repeat unit characteristic of this Superfamily. These repeat units tend to occur twice in proteins as a result of intragenic duplication events, often with subsequent gain/loss of TMSs in many superfamily members. Transporters within the ArsP family allow microbial pathogens to expel toxic arsenic compounds from the cell. Members of the KDELR family are involved in the selective retrieval of proteins that reside in the endoplasmic reticulum. Proteins of the MPC family are involved in the transport of pyruvate into mitochondria, providing the organelle with a major oxidative fuel. Members of the AlaE family excrete L-alanine from the cell. Members of the LST family are involved in the translocation of lipid-linked glucose across the membrane. These five families substantially expand the range of substrates of transport carriers in the superfamily, although KDEL receptors have no known transport function. Clustering of protein sequences reveals the relationships among families, and the resulting tree correlates well with the degrees of sequence similarity documented between families. The analyses and programs developed to detect distant relatedness provide insights into the structural, functional, and evolutionary relationships that exist between families of the TOG superfamily, and should be of value to many other investigators.
Affiliation(s)
- Arturo Medrano-Soto
- Department of Molecular Biology, Division of Biological Sciences, University of California, San Diego, La Jolla, California, United States of America
- Faezeh Ghazi
- Department of Molecular Biology, Division of Biological Sciences, University of California, San Diego, La Jolla, California, United States of America
- Kevin J. Hendargo
- Department of Molecular Biology, Division of Biological Sciences, University of California, San Diego, La Jolla, California, United States of America
- Scott Myers
- Department of Molecular Biology, Division of Biological Sciences, University of California, San Diego, La Jolla, California, United States of America
- Milton H. Saier
- Department of Molecular Biology, Division of Biological Sciences, University of California, San Diego, La Jolla, California, United States of America
7
Wang SC, Davejan P, Hendargo KJ, Javadi-Razaz I, Chou A, Yee DC, Ghazi F, Lam KJK, Conn AM, Madrigal A, Medrano-Soto A, Saier MH. Expansion of the Major Facilitator Superfamily (MFS) to include novel transporters as well as transmembrane-acting enzymes. Biochim Biophys Acta Biomembr 2020; 1862:183277. [PMID: 32205149 DOI: 10.1016/j.bbamem.2020.183277] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 12/11/2019] [Revised: 03/14/2020] [Accepted: 03/17/2020] [Indexed: 12/14/2022]
Abstract
The Major Facilitator Superfamily (MFS) is currently the largest characterized superfamily of transmembrane secondary transport proteins. Its diverse members are found in essentially all organisms in the biosphere and function by uniport, symport, and/or antiport mechanisms. In 1993 we first named and described the MFS which then consisted of 5 previously known families that had not been known to be related, and by 2012 we had identified a total of 74 families, classified phylogenetically within the MFS, all of which included only transport proteins. This superfamily has since expanded to 89 families, all included under TC# 2.A.1, and a few transporter families outside of TC# 2.A.1 were identified as members of the MFS. In this study, we assign nine previously unclassified protein families in the Transporter Classification Database (TCDB; http://www.tcdb.org) to the MFS based on multiple criteria and bioinformatic methodologies. In addition, we find integral membrane domains distantly related to partial or full-length MFS permeases in Lysyl tRNA Synthases (TC# 9.B.111), Lysylphosphatidyl Glycerol Synthases (TC# 4.H.1), and cytochrome b561 transmembrane electron carriers (TC# 5.B.2). Sequence alignments, overlap of hydropathy plots, compatibility of repeat units, similarity of complexity profiles of transmembrane segments, shared protein domains and 3D structural similarities between transport proteins were analyzed to assist in inferring homology. The MFS now includes 105 families.
Affiliation(s)
- Steven C Wang
- Department of Molecular Biology, Division of Biological Sciences, University of California at San Diego, La Jolla, CA 92093-0116, United States of America
- Pauldeen Davejan
- Department of Molecular Biology, Division of Biological Sciences, University of California at San Diego, La Jolla, CA 92093-0116, United States of America
- Kevin J Hendargo
- Department of Molecular Biology, Division of Biological Sciences, University of California at San Diego, La Jolla, CA 92093-0116, United States of America
- Ida Javadi-Razaz
- Department of Molecular Biology, Division of Biological Sciences, University of California at San Diego, La Jolla, CA 92093-0116, United States of America
- Amy Chou
- Department of Molecular Biology, Division of Biological Sciences, University of California at San Diego, La Jolla, CA 92093-0116, United States of America
- Daniel C Yee
- Department of Molecular Biology, Division of Biological Sciences, University of California at San Diego, La Jolla, CA 92093-0116, United States of America
- Faezeh Ghazi
- Department of Molecular Biology, Division of Biological Sciences, University of California at San Diego, La Jolla, CA 92093-0116, United States of America
- Katie Jing Kay Lam
- Department of Molecular Biology, Division of Biological Sciences, University of California at San Diego, La Jolla, CA 92093-0116, United States of America
- Adam M Conn
- Department of Molecular Biology, Division of Biological Sciences, University of California at San Diego, La Jolla, CA 92093-0116, United States of America
- Assael Madrigal
- Department of Molecular Biology, Division of Biological Sciences, University of California at San Diego, La Jolla, CA 92093-0116, United States of America
- Arturo Medrano-Soto
- Department of Molecular Biology, Division of Biological Sciences, University of California at San Diego, La Jolla, CA 92093-0116, United States of America
- Milton H Saier
- Department of Molecular Biology, Division of Biological Sciences, University of California at San Diego, La Jolla, CA 92093-0116, United States of America.
8
Zafar H, Saier MH. Comparative genomics of transport proteins in seven Bacteroides species. PLoS One 2018; 13:e0208151. [PMID: 30517169 PMCID: PMC6281302 DOI: 10.1371/journal.pone.0208151] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Received: 08/15/2018] [Accepted: 11/12/2018] [Indexed: 01/29/2023]
Abstract
The communities of beneficial bacteria that live in our intestines, the gut microbiome, are important for the development and function of the immune system. Bacteroides species make up a significant fraction of the human gut microbiome, and can be probiotic and pathogenic, depending upon various genetic and environmental factors. They can cause disease conditions such as intra-abdominal sepsis, appendicitis, bacteremia, endocarditis, pericarditis, skin infections, brain abscesses and meningitis. In this study, we identify the transport systems and predict their substrates within seven Bacteroides species, all shown to be probiotic; however, four of them (B. thetaiotaomicron, B. vulgatus, B. ovatus, B. fragilis) can be pathogenic (probiotic and pathogenic; PAP), while B. cellulosilyticus, B. salanitronis and B. dorei are believed to play only probiotic roles (only probiotic; OP). The transport system characteristics of the four PAP and three OP strains were identified and tabulated, and results were compared among the seven strains, and with E. coli and Salmonella strains. The Bacteroides strains studied show both similarities and differences in the numbers and types of transport proteins tabulated. Both OP and PAP strains contain similar outer membrane carbohydrate receptors, pore-forming toxins and protein secretion systems. These similarities were noteworthy, but the Bacteroides strains showed striking differences from probiotic and pathogenic enteric bacteria, particularly with respect to their high affinity outer membrane receptors and auxiliary proteins involved in complex carbohydrate utilization. The results reveal striking similarities between the PAP and OP species of Bacteroides, and suggest that OP species may possess currently unrecognized pathogenic potential.
Affiliation(s)
- Hassan Zafar
- Department of Molecular Biology, Division of Biological Sciences, University of California at San Diego, La Jolla, CA, United States of America
- Institute of Microbiology, University of Agriculture, Faisalabad, Punjab, Pakistan
- Milton H. Saier
- Department of Molecular Biology, Division of Biological Sciences, University of California at San Diego, La Jolla, CA, United States of America
9
Soluri MF, Puccio S, Caredda G, Grillo G, Licciulli VF, Consiglio A, Edomi P, Santoro C, Sblattero D, Peano C. Interactome-Seq: A Protocol for Domainome Library Construction, Validation and Selection by Phage Display and Next Generation Sequencing. J Vis Exp 2018. [PMID: 30346377 DOI: 10.3791/56981] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 01/14/2023]
Abstract
Folding reporters are proteins with easily identifiable phenotypes, such as antibiotic resistance, whose folding and function are compromised when fused to poorly folding proteins or random open reading frames. We have developed a strategy where, by using TEM-1 β-lactamase (the enzyme conferring ampicillin resistance) on a genomic scale, we can select collections of correctly folded protein domains from the coding portion of the DNA of any intronless genome. The protein fragments obtained by this approach, the so-called "domainome", will be well expressed and soluble, making them suitable for structural/functional studies. By cloning and displaying the "domainome" directly in a phage display system, we have shown that it is possible to select specific protein domains with the desired binding properties (e.g., to other proteins or to antibodies), thus providing essential experimental information for gene annotation or antigen identification. The identification of the most enriched clones in a selected polyclonal population can be achieved by using novel next-generation sequencing (NGS) technologies. For these reasons, we introduce deep sequencing analysis of the library itself and the selection outputs to provide complete information on diversity, abundance and precise mapping of each of the selected fragments. The protocols presented here show the key steps for library construction, characterization, and validation.
Affiliation(s)
- Maria Felicia Soluri
- Department of Health Sciences, Università del Piemonte Orientale & IRCAD, Novara, Italy
- Simone Puccio
- Institute of Biomedical Technologies, National Research Council, Segrate, Milan, Italy
- Giada Caredda
- Institute of Biomedical Technologies, National Research Council, Segrate, Milan, Italy
- Giorgio Grillo
- Institute of Biomedical Technologies, National Research Council, Bari, Italy
- Arianna Consiglio
- Institute of Biomedical Technologies, National Research Council, Bari, Italy
- Paolo Edomi
- Department of Life Sciences, University of Trieste, Italy
- Claudio Santoro
- Department of Health Sciences, Università del Piemonte Orientale & IRCAD, Novara, Italy
- Clelia Peano
- Institute of Genetic and Biomedical Research, National Research Council, Rozzano, Milan, Italy; Humanitas Clinical and Research Center, Rozzano, Milan, Italy
10
Olson D, Wheeler T. ULTRA: A Model Based Tool to Detect Tandem Repeats. ACM-BCB ... ... : THE ... ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE. ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE 2018; 2018:37-46. [PMID: 31080962 DOI: 10.1145/3233547.3233604] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 10/28/2022]
Abstract
In biological sequences, tandem repeats consist of tens to hundreds of residues of a repeated pattern, such as atgatgatgatgatg ('atg' repeated), often the result of replication slippage. Over time, these repeats decay so that the original sharp pattern of repetition is somewhat obscured, but even degenerate repeats pose a problem for sequence annotation: when two sequences both contain shared patterns of similar repetition, the result can be a false signal of sequence homology. We describe an implementation of a new hidden Markov model for detecting tandem repeats that shows substantially improved sensitivity to labeling decayed repetitive regions, presents low and reliable false annotation rates across a wide range of sequence composition, and produces scores that follow a stable distribution. On typical genomic sequence, the time and memory requirements of the resulting tool (ULTRA) are competitive with the most heavily used tool for repeat masking (TRF). ULTRA is released under an open source license and lays the groundwork for inclusion of the model in sequence alignment tools and annotation pipelines.
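To make the notion of a tandem repeat concrete, the following is a minimal sketch that finds exact, undecayed repeats such as the 'atg' example in this abstract. It is deliberately not the hidden Markov model that ULTRA implements and it cannot detect degenerate repeats; the function name and the `max_unit` and `min_copies` parameters are assumptions introduced here for illustration.

```python
def find_tandem_repeats(seq, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for exact tandem repeats in seq.

    A naive left-to-right scan: at each position, try every unit length up to
    max_unit, count consecutive exact copies, and keep the longest repeat.
    """
    seq = seq.lower()
    found = []
    i = 0
    while i < len(seq):
        best = None  # (unit, unit_length, copies)
        for k in range(1, max_unit + 1):
            unit = seq[i:i + k]
            if len(unit) < k:
                break  # ran off the end of the sequence
            copies = 1
            while seq[i + copies * k:i + (copies + 1) * k] == unit:
                copies += 1
            # Prefer the repeat covering the most residues; on ties keep the
            # shorter unit, which was seen first.
            if copies >= min_copies and (best is None or copies * k > best[1] * best[2]):
                best = (unit, k, copies)
        if best is not None:
            unit, k, copies = best
            found.append((i, unit, copies))
            i += k * copies  # skip past the repeat just reported
        else:
            i += 1
    return found


repeats = find_tandem_repeats("ccatgatgatgatgatgtt")
```

A single substitution inside the repeat splits or hides it under this exact-match scan, which is the kind of decay the abstract says a model-based approach is designed to tolerate.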
11
Iyer MS, Joshi AG, Sowdhamini R. Genome-wide survey of remote homologues for protein domain superfamilies of known structure reveals unequal distribution across structural classes. Mol Omics 2018; 14:266-280. [PMID: 29971307 DOI: 10.1039/c8mo00008e] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 01/08/2023]
Abstract
Domains are the basic building blocks of proteins, which can combine to give rise to different domain architectures. Annotation of domains in a sequence is the first step towards understanding the biological function. Since there are a limited number of folds and evolutionarily related proteins have a similar structure, function can be inferred through remote homology. Computational sequence searches were performed for remote homologues on genomes of ∼160 000 different organisms, starting from nearly 11 000 superfamily queries of known structure. Case studies revealed that most of the associated domains are involved in the same biological process. Using all the proteins predicted to have at least one structural domain, a coverage of 61% of Pfam families was achieved, which is higher than that of existing methods (43.36% by SIFTS). Taxonomic analysis of the proteins revealed 493 superfamilies in all the major kingdoms of life and a few lateral gene transfers between viruses and cellular organisms. The distribution of remote homologues across different classes, folds and superfamilies was studied and reveals that sequences are unequally distributed across structural classes. Finally, domain architectures were computed for the homologues and these data were compiled for each superfamily and organism.
Affiliation(s)
- Meenakshi S Iyer
- National Centre for Biological Sciences (TIFR), GKVK Campus, Bellary Road, Bangalore, Karnataka 560 065, India.
12
Eisenhaber B, Sinha S, Wong WC, Eisenhaber F. Function of a membrane-embedded domain evolutionarily multiplied in the GPI lipid anchor pathway proteins PIG-B, PIG-M, PIG-U, PIG-W, PIG-V, and PIG-Z. Cell Cycle 2018; 17:874-880. [PMID: 29764287 PMCID: PMC6056205 DOI: 10.1080/15384101.2018.1456294] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Indexed: 11/17/2022]
Abstract
Distant homology relationships among proteins with many transmembrane regions (TMs) are difficult to detect as they are clouded by the TMs’ hydrophobic compositional bias and mutational divergence in connecting loops. In the case of several GPI lipid anchor biosynthesis pathway components, the hidden evolutionary signal can be revealed with dissectHMMER, a sequence similarity search tool focusing on fold-critical, high complexity sequence segments. We find that a sequence module with 10 TMs in PIG-W, described as acyl transferase, is homologous to PIG-U, a transamidase subunit without characterized molecular function, and to mannosyltransferases PIG-B, PIG-M, PIG-V and PIG-Z. We conclude that this new, membrane-embedded domain named BindGPILA functions as the unit for recognizing, binding and stabilizing the GPI lipid anchor in a modification-competent form as this appears the only functional aspect shared among all proteins. Thus, PIG-U's likely molecular function is shuttling/presenting the anchor in a productive conformation to the transamidase complex.
Affiliation(s)
- Birgit Eisenhaber
- Bioinformatics Institute, Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01 Matrix, Singapore 138671, Republic of Singapore
- Swati Sinha
- Bioinformatics Institute, Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01 Matrix, Singapore 138671, Republic of Singapore
- Wing-Cheong Wong
- Bioinformatics Institute, Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01 Matrix, Singapore 138671, Republic of Singapore
- Frank Eisenhaber
- Bioinformatics Institute, Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01 Matrix, Singapore 138671, Republic of Singapore; School of Computer Engineering, Nanyang Technological University (NTU), 50 Nanyang Drive, Singapore 637553, Republic of Singapore
13
Medrano-Soto A, Moreno-Hagelsieb G, McLaughlin D, Ye ZS, Hendargo KJ, Saier MH. Bioinformatic characterization of the Anoctamin Superfamily of Ca2+-activated ion channels and lipid scramblases. PLoS One 2018; 13:e0192851. [PMID: 29579047 PMCID: PMC5868767 DOI: 10.1371/journal.pone.0192851] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Received: 10/31/2017] [Accepted: 01/31/2018] [Indexed: 01/01/2023]
Abstract
Our laboratory has developed bioinformatic strategies for identifying distant phylogenetic relationships and characterizing families and superfamilies of transport proteins. Results using these tools suggest that the Anoctamin Superfamily of cation and anion channels, as well as lipid scramblases, includes three functionally characterized families: the Anoctamin (ANO), Transmembrane Channel (TMC) and Ca2+-permeable Stress-gated Cation Channel (CSC) families; as well as four families of functionally uncharacterized proteins, which we refer to as the Anoctamin-like (ANO-L), Transmembrane Channel-like (TMC-L), and CSC-like (CSC-L1 and CSC-L2) families. We have constructed protein clusters and trees showing the relative relationships among the seven families. Topological analyses suggest that the members of these families have essentially the same topologies. Comparative examination of these homologous families provides insight into possible mechanisms of action, indicates the currently recognized organismal distributions of these proteins, and suggests drug design potential for the disease-related channel proteins.
Affiliation(s)
- Arturo Medrano-Soto
- Department of Molecular Biology, University of California at San Diego, La Jolla, California, United States of America
- Daniel McLaughlin
- Department of Molecular Biology, University of California at San Diego, La Jolla, California, United States of America
- Zachary S. Ye
- Department of Molecular Biology, University of California at San Diego, La Jolla, California, United States of America
- Kevin J. Hendargo
- Department of Molecular Biology, University of California at San Diego, La Jolla, California, United States of America
- Milton H. Saier
- Department of Molecular Biology, University of California at San Diego, La Jolla, California, United States of America
14
Baker JA, Wong WC, Eisenhaber B, Warwicker J, Eisenhaber F. Charged residues next to transmembrane regions revisited: "Positive-inside rule" is complemented by the "negative inside depletion/outside enrichment rule". BMC Biol 2017; 15:66. [PMID: 28738801 PMCID: PMC5525207 DOI: 10.1186/s12915-017-0404-4] [Citation(s) in RCA: 49] [Impact Index Per Article: 7.0] [Received: 07/03/2017] [Accepted: 07/07/2017] [Indexed: 11/25/2022]
Abstract
Background: Transmembrane helices (TMHs) frequently occur amongst protein architectures as means for proteins to attach to or embed into biological membranes. Physical constraints such as the membrane’s hydrophobicity and electrostatic potential apply uniform requirements to TMHs and their flanking regions; consequently, they are mirrored in their sequence patterns (in addition to TMHs being a span of generally hydrophobic residues) on top of variations enforced by the specific protein’s biological functions. Results: With statistics derived from a large body of protein sequences, we demonstrate that, in addition to the positive charge preference at the cytoplasmic inside (positive-inside rule), negatively charged residues preferentially occur or are even enriched at the non-cytoplasmic flank or, at least, they are suppressed at the cytoplasmic flank (negative-not-inside/negative-outside (NNI/NO) rule). As negative residues are generally rare within or near TMHs, the statistical significance is sensitive with regard to details of TMH alignment and residue frequency normalisation and also to dataset size; therefore, this trend was obscured in previous work. We observe variations amongst taxa as well as for organelles along the secretory pathway. The effect is most pronounced for TMHs from single-pass transmembrane (bitopic) proteins compared to those with multiple TMHs (polytopic proteins) and especially for the class of simple TMHs that evolved for the sole role as membrane anchors. Conclusions: The charged-residue flank bias is only one of the TMH sequence features with a role in the anchorage mechanisms, others apparently being the leucine intra-helix propensity skew towards the cytoplasmic side, tryptophan flanking as well as the cysteine and tyrosine inside preference. These observations will stimulate new prediction methods for TMHs and protein topology from a sequence as well as new engineering designs for artificial membrane proteins.
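The flank-charge counting that underlies the positive-inside and negative-outside statistics in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `flank_charges`, the 5-residue flank width, and the toy sequence are assumptions made only for illustration. Which flank is cytoplasmic depends on the helix orientation, so a real analysis would need a topology annotation to assign the counts to the inside or outside.

```python
POSITIVE = set("KR")  # lysine, arginine
NEGATIVE = set("DE")  # aspartate, glutamate


def flank_charges(sequence, tm_start, tm_end, flank=5):
    """Count charged residues flanking one transmembrane helix.

    tm_start and tm_end are 0-based, end-exclusive coordinates of the helix.
    The flank width of 5 residues is an illustrative assumption.
    """
    n_flank = sequence[max(0, tm_start - flank):tm_start]
    c_flank = sequence[tm_end:tm_end + flank]

    def count(segment):
        return {
            "pos": sum(res in POSITIVE for res in segment),
            "neg": sum(res in NEGATIVE for res in segment),
        }

    return {"n_flank": count(n_flank), "c_flank": count(c_flank)}


# Toy example (hypothetical sequence): a 20-residue poly-Leu helix with a
# positively charged N-terminal flank and a negatively charged C-terminal flank.
toy = "MKRKR" + "L" * 20 + "DEDSE"
counts = flank_charges(toy, tm_start=5, tm_end=25)
print(counts)
```

Aggregating such counts over many helices, separately for the cytoplasmic and non-cytoplasmic flanks, is the kind of tally on which a flank-bias statistic could be built; the normalisation and alignment details that the abstract describes as critical are not reproduced here.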
Affiliation(s)
- James Alexander Baker
- Bioinformatics Institute, Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street #07-01, Matrix, Singapore, 138671, Singapore; School of Chemistry, Manchester Institute of Biotechnology, 131 Princess Street, Manchester, M1 7DN, UK
- Wing-Cheong Wong
- Bioinformatics Institute, Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street #07-01, Matrix, Singapore, 138671, Singapore
- Birgit Eisenhaber
- Bioinformatics Institute, Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street #07-01, Matrix, Singapore, 138671, Singapore
- Jim Warwicker
- School of Chemistry, Manchester Institute of Biotechnology, 131 Princess Street, Manchester, M1 7DN, UK
- Frank Eisenhaber
- Bioinformatics Institute, Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street #07-01, Matrix, Singapore, 138671, Singapore; School of Computer Engineering (SCE), Nanyang Technological University (NTU), 50 Nanyang Drive, Singapore, 637553, Singapore
15
Saidijam M, Azizpour S, Patching SG. Comprehensive analysis of the numbers, lengths and amino acid compositions of transmembrane helices in prokaryotic, eukaryotic and viral integral membrane proteins of high-resolution structure. J Biomol Struct Dyn 2017; 36:443-464. [PMID: 28150531 DOI: 10.1080/07391102.2017.1285725] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Indexed: 12/21/2022]
Abstract
We report a comprehensive analysis of the numbers, lengths and amino acid compositions of transmembrane helices in 235 high-resolution structures of integral membrane proteins. The properties of 1551 transmembrane helices in the structures were compared with those obtained by analysis of the same amino acid sequences using topology prediction tools. Explanations for the 81 (5.2%) missing or additional transmembrane helices in the prediction results were identified. Main reasons for missing transmembrane helices were mis-identification of N-terminal signal peptides, breaks in α-helix conformation or charged residues in the middle of transmembrane helices and transmembrane helices with unusual amino acid composition. The main reason for additional transmembrane helices was mis-identification of amphipathic helices, extramembrane helices or hairpin re-entrant loops. Transmembrane helix length had an overall median of 24 residues and an average of 24.9 ± 7.0 residues and the most common length was 23 residues. The overall content of residues in transmembrane helices as a percentage of the full proteins had a median of 56.8% and an average of 55.7 ± 16.0%. Amino acid composition was analysed for the full proteins, transmembrane helices and extramembrane regions. Individual proteins or types of proteins with transmembrane helices containing extremes in contents of individual amino acids or combinations of amino acids with similar physicochemical properties were identified and linked to structure and/or function. In addition to overall median and average values, all results were analysed for proteins originating from different types of organism (prokaryotic, eukaryotic, viral) and for subgroups of receptors, channels, transporters and others.
Affiliation(s)
- Massoud Saidijam
- Department of Molecular Medicine and Genetics, Research Centre for Molecular Medicine, School of Medicine, Hamadan University of Medical Sciences, Hamadan, Iran
- Sonia Azizpour
- Department of Molecular Medicine and Genetics, Research Centre for Molecular Medicine, School of Medicine, Hamadan University of Medical Sciences, Hamadan, Iran
- Simon G Patching
- School of BioMedical Sciences and the Astbury Centre for Structural Molecular Biology, University of Leeds, Leeds, UK
16
Yap CK, Eisenhaber B, Eisenhaber F, Wong WC. xHMMER3x2: Utilizing HMMER3's speed and HMMER2's sensitivity and specificity in the glocal alignment mode for improved large-scale protein domain annotation. Biol Direct 2016; 11:63. [PMID: 27894340 PMCID: PMC5126834 DOI: 10.1186/s13062-016-0163-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 07/16/2016] [Accepted: 10/24/2016] [Indexed: 01/27/2023]
Abstract
BACKGROUND: While the local-mode HMMER3 is notable for its massive speed improvement, the slower glocal-mode HMMER2 is more exact for domain annotation by enforcing full domain-to-sequence alignments. Since a unit of domain necessarily implies a unit of function, local-mode HMMER3 alone remains insufficient for precise function annotation tasks. In addition, the incomparable E-values for the same domain model by different HMMER builds create difficulty when checking for domain annotation consistency on a large-scale basis. RESULTS: In this work, both the speed of HMMER3 and glocal-mode alignment of HMMER2 are combined within the xHMMER3x2 framework for tackling the large-scale domain annotation task. Briefly, HMMER3 is utilized for initial domain detection so that HMMER2 can subsequently perform the glocal-mode, sequence-to-full-domain alignments for the detected HMMER3 hits. An E-value calibration procedure is required to ensure that the search space by HMMER2 is sufficiently replicated by HMMER3. We find that the latter is straightforwardly possible for ~80% of the models in the Pfam domain library (release 29). However, in the case of the remaining ~20% of HMMER3 domain models, the respective HMMER2 counterparts are more sensitive. Thus, HMMER3 searches alone are insufficient to ensure sensitivity and a HMMER2-based search needs to be initiated. When tested on the set of UniProt human sequences, xHMMER3x2 can be configured to be between 7× and 201× faster than HMMER2, but with descending domain detection sensitivity from 99.8 to 95.7% with respect to HMMER2 alone; HMMER3's sensitivity was 95.7%. At extremes, xHMMER3x2 is either the slow glocal-mode HMMER2 or the fast HMMER3 with glocal-mode. Finally, the E-values to false-positive rates (FPR) mapping by xHMMER3x2 allows E-values of different model builds to be compared, so that any annotation discrepancies in a large-scale annotation exercise can be flagged for further examination by dissectHMMER.
CONCLUSION: The xHMMER3x2 workflow allows large-scale domain annotation speed to be drastically improved over HMMER2 without compromising for domain-detection with regard to sensitivity and sequence-to-domain alignment incompleteness. The xHMMER3x2 code and its webserver (for Pfam release 27, 28 and 29) are freely available at http://xhmmer3x2.bii.a-star.edu.sg/. REVIEWERS: Reviewed by Thomas Dandekar, L. Aravind, Oliviero Carugo and Shamil Sunyaev. For the full reviews, please go to the Reviewers' comments section.
Affiliation(s)
- Choon-Kong Yap
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore, 138671, Singapore
- Birgit Eisenhaber
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore, 138671, Singapore
- Frank Eisenhaber
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore, 138671, Singapore; School of Computer Engineering (SCE), Nanyang Technological University (NTU), 50 Nanyang Drive, Singapore, 637553, Singapore
- Wing-Cheong Wong
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore, 138671, Singapore
17
The Recipe for Protein Sequence-Based Function Prediction and Its Implementation in the ANNOTATOR Software Environment. Methods Mol Biol 2016; 1415:477-506. [PMID: 27115649 DOI: 10.1007/978-1-4939-3572-7_25] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 03/02/2023]
18
Abstract
Systems medicine promotes a range of approaches and strategies to study human health and disease at a systems level with the aim of improving the overall well-being of (healthy) individuals, and preventing, diagnosing, or curing disease. In this chapter we discuss how bioinformatics critically contributes to systems medicine. First, we explain the role of bioinformatics in the management and analysis of data. In particular we show the importance of publicly available biological and clinical repositories to support systems medicine studies. Second, we discuss how the integration and analysis of multiple types of omics data through integrative bioinformatics may facilitate the determination of more predictive and robust disease signatures, lead to a better understanding of (patho)physiological molecular mechanisms, and facilitate personalized medicine. Third, we focus on network analysis and discuss how gene networks can be constructed from omics data and how these networks can be decomposed into smaller modules. We discuss how the resulting modules can be used to generate experimentally testable hypotheses, provide insight into disease mechanisms, and lead to predictive models. Throughout, we provide several examples demonstrating how bioinformatics contributes to systems medicine and discuss future challenges in bioinformatics that need to be addressed to enable the advancement of systems medicine.
Affiliation(s)
- Ulf Schmitz
- Dept of Systems Biology & Bioinformatics, University of Rostock, Rostock, Germany
- Olaf Wolkenhauer
- Dept of Systems Biology & Bioinformatics, University of Rostock, Rostock, Germany
19
Ochoa A, Storey JD, Llinás M, Singh M. Beyond the E-Value: Stratified Statistics for Protein Domain Prediction. PLoS Comput Biol 2015; 11:e1004509. [PMID: 26575353 PMCID: PMC4648515 DOI: 10.1371/journal.pcbi.1004509] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 09/23/2014] [Accepted: 08/03/2015] [Indexed: 01/25/2023]
Abstract
E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for “stratified” multiple hypothesis testing problems—that is, those in which statistical tests can be partitioned naturally—controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context. We show that stratified q-value thresholds substantially outperform E-values. Contradicting our theoretical results, q-values also outperform lFDRs; however, our tests reveal a small but coherent subset of domain families, biased towards models for specific repetitive patterns, for which weaknesses in random sequence models yield notably inaccurate statistical significance measures. Usage of lFDR thresholds outperform q-values for the remaining families, which have as-expected noise, suggesting that further improvements in domain predictions can be achieved with improved modeling of random sequences. Overall, our theoretical and empirical findings suggest that the use of stratified q-values and lFDRs could result in improvements in a host of structured multiple hypothesis testing problems arising in bioinformatics, including genome-wide association studies, orthology prediction, and motif scanning. 
Despite decades of research, it remains a challenge to distinguish homologous relationships between proteins from sequence similarities arising due to chance alone. This is an increasingly important problem as sequence database sizes continue to grow, and even today many computational analyses require that the statistics of billions of sequence comparisons be assessed automatically. Here we explore statistical significance evaluation on data that is stratified—that is, naturally partitioned into subsets that may differ in their amount of signal—and find a theoretically optimal criterion for automatically setting thresholds of significance for each stratum. For the task of domain prediction, an important component of efforts to annotate protein sequences and identify remote sequence homologs, we empirically show that our stratified analysis of statistical significance greatly improves upon a combined analysis. Further, we identify weaknesses in the prevailing random sequence model for assessing statistical significance for a small subset of domain families with repetitive sequence patterns and known biological, structural, and evolutionary properties. Our theoretical findings in statistics are relevant not only for identifying protein domains, but for arbitrary stratified problems in genomics and beyond.
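The idea of stratifying significance tests by domain family can be illustrated with a short Python sketch. This is an assumption-laden illustration and not the authors' algorithm: it substitutes the Benjamini-Hochberg adjustment for the q-value estimator used in the paper (which is not described in the abstract), it takes p-values rather than E-values as input, and the names `bh_qvalues` and `stratified_qvalues` as well as the family labels are hypothetical.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


def stratified_qvalues(hits):
    """Adjust p-values separately within each stratum (here, domain family).

    hits is a list of (family, pvalue) pairs; the result keeps input order.
    """
    by_family = {}
    for idx, (family, _) in enumerate(hits):
        by_family.setdefault(family, []).append(idx)

    qvalues = [None] * len(hits)
    for indices in by_family.values():
        family_q = bh_qvalues([hits[i][1] for i in indices])
        for i, q in zip(indices, family_q):
            qvalues[i] = q
    return qvalues


# Hypothetical hits: three against family "A", one against family "B".
hits = [("A", 0.01), ("B", 0.01), ("A", 0.04), ("A", 0.03)]
print(stratified_qvalues(hits))
```

Because each family is adjusted on its own, a hit in a family with few tests is not penalised by the number of tests performed for other families, which is the intuition behind per-stratum thresholds.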
Affiliation(s)
- Alejandro Ochoa
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Center for Statistics and Machine Learning, Princeton University, Princeton, New Jersey, United States of America
- John D. Storey
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Center for Statistics and Machine Learning, Princeton University, Princeton, New Jersey, United States of America
- Manuel Llinás
- Department of Biochemistry and Molecular Biology, and the Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania, United States of America
- Mona Singh
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
20
Wong WC, Yap CK, Eisenhaber B, Eisenhaber F. dissectHMMER: a HMMER-based score dissection framework that statistically evaluates fold-critical sequence segments for domain fold similarity. Biol Direct 2015; 10:39. [PMID: 26228544 PMCID: PMC4521371 DOI: 10.1186/s13062-015-0068-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 03/26/2015] [Accepted: 07/20/2015] [Indexed: 11/10/2022]
Abstract
Background: Annotation transfer for function and structure within the sequence homology concept essentially requires protein sequence similarity for the secondary structural blocks forming the fold of a protein. A simplistic similarity approach in the case of non-globular segments (coiled coils, low complexity regions, transmembrane regions, long loops, etc.) is not justified and a pertinent source for mistaken homologies. The latter is either due to positional sequence conservation as a result of a very simple, physically induced pattern or integral sequence properties that are critical for function. Furthermore, because the number of well-studied proteins continues to grow at a slow rate, a search methodology is needed that dives deeper into the sequence similarity space to connect the unknown sequences to the well-studied ones, albeit more distant, for biological function postulations. Results: Based on our previous work of dissecting the hidden Markov model (HMMER)-based similarity score into fold-critical and non-globular contributions to improve homology inference, we propose a framework, dissectHMMER, that identifies more fold-related domain hits from standard HMMER searches. Subsequent statistical stratification of the fold-related hits into cohorts of functionally-related domains allows for the function postulation of the query sequence. Briefly, the technical problems as to how to recognize non-globular parts in the domain model, resolve contradictory HMMER2/HMMER3 results and evaluate fold-related domain hits for homology, are addressed in this work. The framework is benchmarked against a set of SCOP-to-Pfam domain models. Despite being a sequence-to-profile method, dissectHMMER performs favorably against a profile-to-profile based method, HHsuite/HHsearch.
Examples of function annotation using dissectHMMER, including the function discovery of an uncharacterized membrane protein Q9K8K1_BACHD (WP_010899149.1) as a lactose/H+ symporter, are presented. Finally, the dissectHMMER webserver is made publicly available at http://dissecthmmer.bii.a-star.edu.sg. Conclusions: The proposed framework, dissectHMMER, is faithful to the original inception of the sequence homology concept while improving upon the existing HMMER search tool through the rescue of statistically evaluated false-negative yet fold-related domain hits to the query sequence. Overall, this translates into an opportunity for any novel protein sequence to be functionally characterized. Reviewers: This article was reviewed by Masanori Arita, Shamil Sunyaev and L. Aravind.
Affiliation(s)
- Wing-Cheong Wong
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore, 138671, Singapore
- Choon-Kong Yap
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore, 138671, Singapore
- Birgit Eisenhaber
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore, 138671, Singapore
- Frank Eisenhaber
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore, 138671, Singapore; Department of Biological Sciences (DBS), National University of Singapore (NUS), 8 Medical Drive, Singapore, 117597, Singapore; School of Computer Engineering (SCE), Nanyang Technological University (NTU), 50 Nanyang Drive, Singapore, 637553, Singapore
21
Mudgal R, Sandhya S, Chandra N, Srinivasan N. De-DUFing the DUFs: Deciphering distant evolutionary relationships of Domains of Unknown Function using sensitive homology detection methods. Biol Direct 2015; 10:38. [PMID: 26228684 PMCID: PMC4520260 DOI: 10.1186/s13062-015-0069-2] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.3] [Received: 03/10/2015] [Accepted: 07/20/2015] [Indexed: 12/23/2022]
Abstract
Background: In the post-genomic era where sequences are being determined at a rapid rate, we are highly reliant on computational methods for their tentative biochemical characterization. The Pfam database currently contains 3,786 families corresponding to “Domains of Unknown Function” (DUF) or “Uncharacterized Protein Family” (UPF), of which 3,087 families have no reported three-dimensional structure, constituting almost one-fourth of the known protein families in search for both structure and function. Results: We applied a ‘computational structural genomics’ approach using five state-of-the-art remote similarity detection methods to detect the relationship between uncharacterized DUFs and domain families of known structures. The association with a structural domain family could serve as a start point in elucidating the function of a DUF. Amongst these five methods, searches in SCOP-NrichD database have been applied for the first time. Predictions were classified into high, medium and low confidence based on the consensus of results from various approaches and also annotated with enzyme and Gene ontology terms. 614 uncharacterized DUFs could be associated with a known structural domain, of which high confidence predictions, involving at least four methods, were made for 54 families. These structure-function relationships for the 614 DUF families can be accessed on-line at http://proline.biochem.iisc.ernet.in/RHD_DUFS/. For potential enzymes in this set, we assessed their compatibility with the associated fold and performed detailed structural and functional annotation by examining alignments and extent of conservation of functional residues. Detailed discussion is provided for interesting assignments for DUF3050, DUF1636, DUF1572, DUF2092 and DUF659. Conclusions: This study provides insights into the structure and potential function for nearly 20% of the DUFs.
Use of different computational approaches enables us to reliably recognize distant relationships, especially when they converge to a common assignment because the methods are often complementary. We observe that while pointers to the structural domain can offer the right clues to the function of a protein, recognition of its precise functional role is still ‘non-trivial’ with many DUF domains conserving only some of the critical residues. It is not clear whether these are functional vestiges or instances involving alternate substrates and interacting partners. Reviewers: This article was reviewed by Drs Eugene Koonin, Frank Eisenhaber and Srikrishna Subramanian.
Affiliation(s)
- Richa Mudgal
- IISc Mathematics Initiative, Indian Institute of Science, Bangalore, 560 012, India
- Sankaran Sandhya
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, 560 012, India
- Nagasuma Chandra
- Department of Biochemistry, Indian Institute of Science, Bangalore, 560 012, India
22
Goncearenco A, Berezovsky IN. Protein function from its emergence to diversity in contemporary proteins. Phys Biol 2015; 12:045002. [PMID: 26057563 DOI: 10.1088/1478-3975/12/4/045002] [Citation(s) in RCA: 43] [Impact Index Per Article: 4.8] [Indexed: 12/18/2022]
Abstract
The goal of this work is to learn from nature the rules that govern evolution and the design of protein function. The fundamental laws of physics lie in the foundation of the protein structure and all stages of the protein evolution, determining optimal sizes and shapes at different levels of structural hierarchy. We looked back into the very onset of the protein evolution with a goal to find elementary functions (EFs) that came from the prebiotic world and served as building blocks of the first enzymes. We defined the basic structural and functional units of biochemical reactions: elementary functional loops. The diversity of contemporary enzymes can be described via combinations of a limited number of elementary chemical reactions, many of which are performed by the descendants of primitive prebiotic peptides/proteins. By analyzing protein sequences we were able to identify EFs shared by seemingly unrelated protein superfamilies and folds and to unravel evolutionary relations between them. Binding and metabolic processing of the metal- and nucleotide-containing cofactors and ligands are among the most abundant ancient EFs that became indispensable in many natural enzymes. Highly designable folds provide structural scaffolds for many different biochemical reactions. We show that contemporary proteins are built from a limited number of EFs, making their analysis instrumental for establishing the rules for protein design. Evolutionary studies help us to accumulate the library of essential EFs and to establish intricate relations between different folds and functional superfamilies. Generalized sequence-structure descriptors of the EF will become useful in future design and engineering of desired enzymatic functions.
Affiliation(s)
- Alexander Goncearenco
- Computational Biology Unit and Department of Informatics, University of Bergen, N-5008 Bergen, Norway
23
Rossier BC, Baker ME, Studer RA. Epithelial sodium transport and its control by aldosterone: the story of our internal environment revisited. Physiol Rev 2015; 95:297-340. [PMID: 25540145 DOI: 10.1152/physrev.00011.2014] [Citation(s) in RCA: 155] [Impact Index Per Article: 17.2] [Indexed: 12/18/2022]
Abstract
Transcription and translation require a high concentration of potassium across the entire tree of life. The conservation of a high intracellular potassium was an absolute requirement for the evolution of life on Earth. This was achieved by the interplay of P- and V-ATPases that can set up electrochemical gradients across the cell membrane, an energetically costly process requiring the synthesis of ATP by F-ATPases. In animals, the control of an extracellular compartment was achieved by the emergence of multicellular organisms able to produce tight epithelial barriers creating a stable extracellular milieu. Finally, the adaptation to a terrestrial environment was achieved by the evolution of distinct regulatory pathways allowing salt and water conservation. In this review we emphasize the critical and dual role of Na+-K+-ATPase in the control of the ionic composition of the extracellular fluid and the renin-angiotensin-aldosterone system (RAAS) in salt and water conservation in vertebrates. The action of aldosterone on transepithelial sodium transport by activation of the epithelial sodium channel (ENaC) at the apical membrane and that of Na+-K+-ATPase at the basolateral membrane may have evolved in lungfish before the emergence of tetrapods. Finally, we discuss the implication of RAAS in the origin of the present pandemic of hypertension and its associated cardiovascular diseases.
Affiliation(s)
- Bernard C Rossier
- Department of Pharmacology and Toxicology, University of Lausanne, Lausanne, Switzerland; Division of Nephrology-Hypertension, University of California San Diego, La Jolla, California; and Institute of Structural and Molecular Biology, Division of Biosciences, University College London, London, United Kingdom
- Michael E Baker
- Department of Pharmacology and Toxicology, University of Lausanne, Lausanne, Switzerland; Division of Nephrology-Hypertension, University of California San Diego, La Jolla, California; and Institute of Structural and Molecular Biology, Division of Biosciences, University College London, London, United Kingdom
- Romain A Studer
- Department of Pharmacology and Toxicology, University of Lausanne, Lausanne, Switzerland; Division of Nephrology-Hypertension, University of California San Diego, La Jolla, California; and Institute of Structural and Molecular Biology, Division of Biosciences, University College London, London, United Kingdom
|
24
|
Kumar M. An enhanced algorithm for multiple sequence alignment of protein sequences using genetic algorithm. EXCLI Journal 2015; 14:1232-55. [PMID: 27065770 PMCID: PMC4820728 DOI: 10.17179/excli2015-302] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2015] [Accepted: 11/19/2015] [Indexed: 11/10/2022]
Abstract
One of the most fundamental operations in biological sequence analysis is multiple sequence alignment (MSA). The basic problem of multiple sequence alignment is to determine the most biologically plausible alignment of protein or DNA sequences. In this paper, an alignment method using a genetic algorithm for multiple sequence alignment is proposed. Two genetic operators, crossover and mutation, were defined and implemented with the proposed method in order to follow the evolution of the population and the quality of the aligned sequences. The proposed method is assessed on a protein benchmark dataset, e.g., BALIBASE, by comparing the obtained results to those obtained with other alignment algorithms, e.g., SAGA, RBT-GA, PRRP, HMMT, SB-PIMA, CLUSTALX, CLUSTAL W, DIALIGN and PILEUP8. Experiments on a wide range of data have shown that the proposed algorithm is much better (in terms of score) than previously proposed algorithms in its ability to achieve high alignment quality.
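To make the phrase "in terms of score" concrete, here is a minimal sketch of a sum-of-pairs score, a fitness function commonly used when alignments are evolved with a genetic algorithm. The abstract does not say which scoring scheme the paper uses, so the function names and the match, mismatch and gap values below are illustrative assumptions only.

```python
from itertools import combinations


def pair_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Score one pair of aligned symbols (scoring values are assumed, not from the paper)."""
    if a == "-" and b == "-":
        return 0  # two gaps aligned to each other carry no information
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment: list[str]) -> int:
    """Sum pair_score over every pair of rows in every column of the alignment."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned rows must have the same length")
    return sum(
        pair_score(x, y)
        for column in zip(*alignment)
        for x, y in combinations(column, 2)
    )


if __name__ == "__main__":
    print(sum_of_pairs(["AC-T", "ACGT", "A-GT"]))
```

A genetic algorithm would evaluate each candidate alignment in the population with a function of this kind and keep the higher-scoring individuals after crossover and mutation.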
Affiliation(s)
- Manish Kumar
- Department of Computer Science and Engineering, Indian School of Mines, Dhanbad, Jharkhand, India
|
25
|
Oome S, Van den Ackerveken G. Comparative and functional analysis of the widely occurring family of Nep1-like proteins. Molecular Plant-Microbe Interactions: MPMI 2014; 27:1081-94. [PMID: 25025781 DOI: 10.1094/mpmi-04-14-0118-r] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.0] [Indexed: 05/02/2023]
Abstract
Nep1-like proteins (NLP) are best known for their cytotoxic activity in dicot plants. NLP are taxonomically widespread among microbes with very different lifestyles. To learn more about this enigmatic protein family, we analyzed more than 500 available NLP protein sequences from fungi, oomycetes, and bacteria. Phylogenetic clustering showed that, besides the previously documented two types, an additional, more divergent, third NLP type could be distinguished. By closely examining the three NLP types, we identified a noncytotoxic subgroup of type 1 NLP (designated type 1a), which have substitutions in amino acids making up a cation-binding pocket that is required for cytotoxicity. Type 2 NLP were found to contain a putative calcium-binding motif, which was shown to be required for cytotoxicity. Members of both type 1 and type 2 NLP were found to possess additional cysteine residues that, based on their predicted proximity, make up potential disulfide bridges that could provide additional stability to these secreted proteins. Type 1 and type 2 NLP, although both cytotoxic to plant cells, differ in their ability to induce necrosis when artificially targeted to different cellular compartments in planta, suggesting they have different mechanisms of cytotoxicity.
|
26
|
Zhu Q, Kosoy M, Dittmar K. HGTector: an automated method facilitating genome-wide discovery of putative horizontal gene transfers. BMC Genomics 2014; 15:717. [PMID: 25159222 PMCID: PMC4155097 DOI: 10.1186/1471-2164-15-717] [Citation(s) in RCA: 95] [Impact Index Per Article: 9.5] [Received: 05/11/2014] [Accepted: 08/20/2014] [Indexed: 11/23/2022] Open
Abstract
Background: First-pass methods based on BLAST match are commonly used as an initial step to separate the different phylogenetic histories of genes in microbial genomes, and target putative horizontal gene transfer (HGT) events. This will continue to be necessary given the rapid growth of genomic data and the technical difficulties in conducting large-scale explicit phylogenetic analyses. However, these methods often produce misleading results due to their inability to resolve indirect phylogenetic links and their vulnerability to stochastic events. Results: A new computational method of rapid, exhaustive and genome-wide detection of HGT was developed, featuring the systematic analysis of BLAST hit distribution patterns in the context of a priori defined hierarchical evolutionary categories. Genes that fall beyond a series of statistically determined thresholds are identified as not adhering to the typical vertical history of the organisms in question, but instead having a putative horizontal origin. Tests on simulated genomic data suggest that this approach effectively targets atypically distributed genes that are highly likely to be HGT-derived, and exhibits robust performance compared to conventional BLAST-based approaches. This method was further tested on real genomic datasets, including Rickettsia genomes, and was compared to previous studies. Results show consistency with currently employed categories of HGT prediction methods. In-depth analysis of both simulated and real genomic data suggests that the method is notably insensitive to stochastic events such as gene loss, rate variation and database error, which are common challenges to the current methodology. An automated pipeline was created to implement this approach and was made publicly available at: https://github.com/DittmarLab/HGTector. The program is versatile, easily deployed, and has a low requirement for computational resources.
Conclusions: HGTector is an effective tool for initial or standalone large-scale discovery of candidate HGT-derived genes.
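The idea of analysing BLAST hit distributions against predefined evolutionary categories can be illustrated with a small sketch. This is not the HGTector implementation: the category labels ("self", "close", "distal"), the function names and the fixed cutoffs are assumptions made for illustration, whereas the abstract states that the real thresholds are determined statistically.

```python
from collections import defaultdict


def category_weights(hits, category_of):
    """Sum BLAST bit scores per evolutionary category.

    hits: iterable of (taxon, bit_score) pairs for one query gene.
    category_of: hypothetical mapping taxon -> "self" | "close" | "distal";
    taxa missing from the mapping are treated as "distal" (an assumption).
    """
    weights = defaultdict(float)
    for taxon, bit_score in hits:
        weights[category_of.get(taxon, "distal")] += bit_score
    return dict(weights)


def is_hgt_candidate(weights, close_cutoff, distal_cutoff):
    """Flag a gene with few close-relative hits but strong distal hits.

    The cutoffs are placeholders; a real analysis would derive them from
    the genome-wide distribution of weights.
    """
    return (
        weights.get("close", 0.0) < close_cutoff
        and weights.get("distal", 0.0) >= distal_cutoff
    )


if __name__ == "__main__":
    hits = [("Rickettsia_a", 200.0), ("Escherichia", 180.0), ("Bacillus", 150.0)]
    weights = category_weights(hits, {"Rickettsia_a": "self"})
    print(weights, is_hgt_candidate(weights, close_cutoff=50.0, distal_cutoff=100.0))
```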
Affiliation(s)
- Qiyun Zhu
- Department of Biological Sciences, University at Buffalo, State University of New York, 109 Cooke Hall, Buffalo, NY 14260, USA.
|
27
|
Wong WC, Maurer-Stroh S, Eisenhaber B, Eisenhaber F. On the necessity of dissecting sequence similarity scores into segment-specific contributions for inferring protein homology, function prediction and annotation. BMC Bioinformatics 2014; 15:166. [PMID: 24890864 PMCID: PMC4061105 DOI: 10.1186/1471-2105-15-166] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 10/25/2013] [Accepted: 05/27/2014] [Indexed: 02/01/2023] Open
Abstract
Background: Protein sequence similarities to any types of non-globular segments (coiled coils, low complexity regions, transmembrane regions, long loops, etc. where either positional sequence conservation is the result of a very simple, physically induced pattern or rather integral sequence properties are critical) are pertinent sources for mistaken homologies. Regretfully, these considerations regularly escape attention in large-scale annotation studies since, often, there is no substitute for manual handling of these cases. Quantitative criteria are required to suppress events of function annotation transfer as a result of false homology assignments. Results: The sequence homology concept is based on the similarity comparison between the structural elements, the basic building blocks for conferring the overall fold of a protein. We propose to dissect the total similarity score into fold-critical and other, remaining contributions and suggest that, for a valid homology statement, the fold-relevant score contribution should at least be significant on its own. As part of the article, we provide the DissectHMMER software program for dissecting HMMER2/3 scores into segment-specific contributions. We show that DissectHMMER reproduces HMMER2/3 scores with sufficient accuracy and that it is useful in automated decisions about homology for instructive sequence examples. To generalize the dissection concept for cases without 3D structural information, we find that a dissection based on alignment quality is an appropriate surrogate. The approach was applied to a large-scale study of SMART and PFAM domains in the space of seed sequences and in the space of UniProt/SwissProt.
Conclusions: Sequence similarity score dissection with regard to fold-critical and other contributions systematically suppresses false hits and, additionally, recovers previously obscured homology relationships such as the one between aquaporins and formate/nitrite transporters that, so far, was only supported by structure comparison.
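The dissection idea described above can be sketched in a few lines: split a total alignment score into the part contributed by fold-critical segments and the remainder, then require the fold-critical part to pass a threshold on its own. This is not DissectHMMER; the per-position scores, the segment coordinates and the plain score threshold (standing in for a proper significance test) are hypothetical inputs chosen for illustration.

```python
def dissect_score(position_scores, fold_critical):
    """Split a total score into fold-critical and remaining contributions.

    position_scores: per-alignment-position score contributions (assumed given).
    fold_critical: non-overlapping half-open (start, end) index ranges.
    Returns (fold_critical_score, remaining_score).
    """
    critical = sum(sum(position_scores[start:end]) for start, end in fold_critical)
    return critical, sum(position_scores) - critical


def supports_homology(position_scores, fold_critical, threshold):
    """Accept a hit only if the fold-critical contribution alone reaches the threshold."""
    critical, _ = dissect_score(position_scores, fold_critical)
    return critical >= threshold


if __name__ == "__main__":
    scores = [2.0, 3.0, -1.0, 0.5, 4.0, 1.0]
    print(dissect_score(scores, [(0, 2), (4, 6)]))
```

A hit whose total score is carried mostly by a transmembrane or low-complexity stretch would fail this check even when the total looks convincing.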
Affiliation(s)
- Wing-Cheong Wong
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore.
|
28
|
Eisenhaber B, Eisenhaber S, Kwang TY, Grüber G, Eisenhaber F. Transamidase subunit GAA1/GPAA1 is a M28 family metallo-peptide-synthetase that catalyzes the peptide bond formation between the substrate protein's omega-site and the GPI lipid anchor's phosphoethanolamine. Cell Cycle 2014; 13:1912-7. [PMID: 24743167 PMCID: PMC4111754 DOI: 10.4161/cc.28761] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.5] [Indexed: 11/19/2022] Open
Abstract
The transamidase subunit GAA1/GPAA1 is predicted to be the enzyme that catalyzes the attachment of the glycosylphosphatidylinositol (GPI) lipid anchor to the carbonyl intermediate of the substrate protein at the ω-site. Its ~300-amino acid residue lumenal domain is a M28 family metallo-peptide-synthetase with an α/β hydrolase fold, including a central 8-strand β-sheet and a single metal (most likely zinc) ion coordinated by 3 conserved polar residues. Phosphoethanolamine is used as an adaptor to make the non-peptide GPI lipid anchor look chemically similar to the N terminus of a peptide.
Affiliation(s)
- Birgit Eisenhaber
- Bioinformatics Institute (BII); A*STAR; Singapore, Republic of Singapore
- Stephan Eisenhaber
- Department of Physical Chemistry; University of Vienna; Wien/Vienna, Republic of Austria
- Toh Yew Kwang
- Bioinformatics Institute (BII); A*STAR; Singapore, Republic of Singapore
- Gerhard Grüber
- Bioinformatics Institute (BII); A*STAR; Singapore, Republic of Singapore; Nanyang Technological University; School of Biological Sciences; Singapore, Republic of Singapore
- Frank Eisenhaber
- Bioinformatics Institute (BII); A*STAR; Singapore, Republic of Singapore; Department of Biological Sciences (DBS); National University of Singapore (NUS); Singapore, Republic of Singapore; School of Computer Engineering (SCE); Nanyang Technological University (NTU); Singapore, Republic of Singapore
|
29
|
Joseph AP, Shingate P, Upadhyay AK, Sowdhamini R. 3PFDB+: improved search protocol and update for the identification of representatives of protein sequence domain families. Database: The Journal of Biological Databases and Curation 2014; 2014:bau026. [PMID: 24700812 PMCID: PMC3974335 DOI: 10.1093/database/bau026] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/15/2022]
Abstract
Protein domain families are usually classified on the basis of similarity of amino acid sequences. Selection of a single representative sequence for each family provides targets for structure determination or modeling and also enables fast sequence searches to associate new members to a family. Such a selection could be challenging since some of these domain families exhibit huge variation depending on the number of members in the family, the average family sequence length or the extent of sequence divergence within a family. We had earlier created the 3PFDB database as a repository of best representative sequences, selected from each PFAM domain family on the basis of high coverage. In this study, we have improved the database using more efficient strategies for the initial generation of sequence profiles and implemented two independent methods, FASSM and HMMER, for identifying family members. HMMER employs a global sequence similarity search, while FASSM relies on motif identification and matching. This improved and updated database, 3PFDB+, provides representative sequences and profiles for PFAM families, with 13 519 family representatives having more than 90% family coverage. The representative sequence is also highlighted in a two-dimensional plot, which reflects the relative divergence between family members. Representatives belonging to small families with short sequences are mainly associated with low coverage. The set of sequences not recognized by the family representative profiles highlights several potential false or weak family associations in PFAM. Partial domains and fragments dominate such cases, along with sequences that are highly diverged or different from other family members. Some of these outliers were also predicted to have different secondary structure contents, which reflect different putative structure or functional roles for these domain sequences. Database URL: http://caps.ncbs.res.in/3pfdbplus/.
Affiliation(s)
- Agnel P Joseph
- National Centre for Biological Sciences (TIFR), GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India
|
30
|
Eisenhaber F, Sung WK, Wong L. The 24th International Conference on Genome Informatics, GIW2013, in Singapore. J Bioinform Comput Biol 2013. [DOI: 10.1142/s0219720013020034] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/18/2022]
Affiliation(s)
- Frank Eisenhaber
- Bioinformatics Institute, Agency for Science, Technology and Research, 30 Biopolis Street #07-01, Matrix, Singapore 138671, Singapore
- Department of Biological Sciences, National University of Singapore, 8 Medical Drive, Singapore 117597, Singapore
- School of Computer Engineering, Nanyang Technological University, 50 Nanyang Drive, Singapore 637553, Singapore
- Wing-Kin Sung
- School of Computing, National University of Singapore, 13 Computing Drive, Singapore 117417, Singapore
- Genome Institute of Singapore, 60 Biopolis Street #02-01, Genome, Singapore 138672, Singapore
- Limsoon Wong
- School of Computing, National University of Singapore, 13 Computing Drive, Singapore 117417, Singapore
|
31
|
Hadzipasic O, Wrabl JO, Hilser VJ. A horizontal alignment tool for numerical trend discovery in sequence data: application to protein hydropathy. PLoS Comput Biol 2013; 9:e1003247. [PMID: 24130469 PMCID: PMC3794901 DOI: 10.1371/journal.pcbi.1003247] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2013] [Accepted: 07/10/2013] [Indexed: 11/19/2022] Open
Abstract
An algorithm is presented that returns the optimal pairwise gapped alignment of two sets of signed numerical sequence values. One distinguishing feature of this algorithm is a flexible comparison engine (based on both relative shape and absolute similarity measures) that does not rely on explicit gap penalties. Additionally, an empirical probability model is developed to estimate the significance of the returned alignment with respect to randomized data. The algorithm's utility for biological hypothesis formulation is demonstrated with test cases including database search and pairwise alignment of protein hydropathy. However, the algorithm and probability model could possibly be extended to accommodate other diverse types of protein or nucleic acid data, including positional thermodynamic stability and mRNA translation efficiency. The algorithm requires only numerical values as input and will readily compare data other than protein hydropathy. The tool is therefore expected to complement, rather than replace, existing sequence and structure based tools and may inform medical discovery, as exemplified by proposed similarity between a chlamydial ORFan protein and bacterial colicin pore-forming domain. The source code, documentation, and a basic web-server application are available.
Affiliation(s)
- Omar Hadzipasic
- Department of Biology, Johns Hopkins University, Baltimore, Maryland, United States of America
- James O. Wrabl
- Department of Biology, Johns Hopkins University, Baltimore, Maryland, United States of America
- Vincent J. Hilser
- Department of Biology, Johns Hopkins University, Baltimore, Maryland, United States of America
- T.C. Jenkins Department of Biophysics, Johns Hopkins University, Baltimore, Maryland, United States of America
- * E-mail:
|
32
|
Raffaele S, Perraki A, Mongrand S. The Remorin C-terminal Anchor was shaped by convergent evolution among membrane binding domains. Plant Signaling & Behavior 2013; 8:e23207. [PMID: 23299327 PMCID: PMC3676492 DOI: 10.4161/psb.23207] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 05/07/2023]
Abstract
StREM1.3 Remorin is a well-established plant raftophilic protein, predominantly associated with sterol- and sphingolipid-rich membrane rafts. We recently identified a C-terminal domain (RemCA) required and sufficient for StREM1.3 anchoring to the plasma membrane. Here, we report a search for homologs and analogs of the RemCA domain in publicly available protein sequence and structure databases. We could not identify RemCA homologous domains outside the Remorin family but we identified domains sharing bias in amino-acid composition and predicted structural fold with RemCA in bacterial, viral and animal proteins. These results suggest that RemCA emerged by convergent evolution among unrelated membrane binding domains.
Affiliation(s)
- Sylvain Raffaele
- Laboratoire des Interactions Plantes-Microorganismes (LIPM); UMR441 INRA-CNRS; Castanet-Tolosan, France
- Correspondence to: Sylvain Raffaele,
- Artemis Perraki
- Laboratoire de Biogenèse Membranaire; UMR 5200 CNRS; Université Bordeaux Segalen; INRA Bordeaux Aquitaine BP81; Villenave d'Ornon Cédex, France
- Sébastien Mongrand
- Laboratoire de Biogenèse Membranaire; UMR 5200 CNRS; Université Bordeaux Segalen; INRA Bordeaux Aquitaine BP81; Villenave d'Ornon Cédex, France
|
33
|
Rekapalli B, Wuichet K, Peterson GD, Zhulin IB. Dynamics of domain coverage of the protein sequence universe. BMC Genomics 2012; 13:634. [PMID: 23157439 PMCID: PMC3557196 DOI: 10.1186/1471-2164-13-634] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 04/29/2012] [Accepted: 11/11/2012] [Indexed: 01/14/2023] Open
Abstract
Background: The currently known protein sequence space consists of millions of sequences in public databases and is rapidly expanding. Assigning sequences to families leads to a better understanding of protein function and the nature of the protein universe. However, a large portion of the current protein space remains unassigned and is referred to as its "dark matter". Results: Here we suggest that the true size of "dark matter" is much larger than stated by current definitions. We propose an approach to reducing the size of "dark matter" by identifying and subtracting regions in protein sequences that are not likely to contain any domain. Conclusions: Recent improvements in computational domain modeling result in a decrease, albeit slowly, in the relative size of "dark matter"; however, its absolute size increases substantially with the growth of sequence data.
Affiliation(s)
- Bhanu Rekapalli
- Joint Institute for Computational Sciences, Oak Ridge National Laboratory - University of Tennessee, Oak Ridge, TN 37831, USA
|
34
|
Škunca N, Altenhoff A, Dessimoz C. Quality of computationally inferred gene ontology annotations. PLoS Comput Biol 2012; 8:e1002533. [PMID: 22693439 PMCID: PMC3364937 DOI: 10.1371/journal.pcbi.1002533] [Citation(s) in RCA: 97] [Impact Index Per Article: 8.1] [Received: 10/17/2011] [Accepted: 04/01/2012] [Indexed: 01/10/2023] Open
Abstract
Gene Ontology (GO) has established itself as the undisputed standard for protein function annotation. Most annotations are inferred electronically, i.e. without individual curator supervision, but they are widely considered unreliable. At the same time, we crucially depend on those automated annotations, as most newly sequenced genomes are non-model organisms. Here, we introduce a methodology to systematically and quantitatively evaluate electronic annotations. By exploiting changes in successive releases of the UniProt Gene Ontology Annotation database, we assessed the quality of electronic annotations in terms of specificity, reliability, and coverage. Overall, we not only found that electronic annotations have significantly improved in recent years, but also that their reliability now rivals that of annotations inferred by curators when they use evidence other than experiments from primary literature. This work provides the means to identify the subset of electronic annotations that can be relied upon, an important outcome given that >98% of all annotations are inferred without direct curation.
In the UniProt Gene Ontology Annotation database, the largest repository of functional annotations, over 98% of all function annotations are inferred in silico, without curator oversight. Yet these “electronic GO annotations” are generally perceived as unreliable; they are disregarded in many studies. In this article, we introduce novel methodology to systematically evaluate the quality of electronic annotations. We then provide the first comprehensive assessment of the reliability of electronic GO annotations. Overall, we found that electronic annotations are more reliable than generally believed, to an extent that they are competitive with annotations inferred by curators when they use evidence other than experiments from primary literature. But we also report significant variations among inference methods, types of annotations, and organisms. This work provides guidance for Gene Ontology users and lays the foundations for improving computational approaches to GO function inference.
Affiliation(s)
- Nives Škunca
- Ruđer Bošković Institute, Division of Electronics, Zagreb, Croatia
- ETH Zurich, Computer Science, Zurich, Switzerland
- Adrian Altenhoff
- ETH Zurich, Computer Science, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Zurich, Switzerland
- Christophe Dessimoz
- ETH Zurich, Computer Science, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Zurich, Switzerland
- EMBL-European Bioinformatics Institute, Hinxton, Cambridge, United Kingdom
- * E-mail:
|
35
|
A topology-based metric for measuring term similarity in the gene ontology. Adv Bioinformatics 2012; 2012:975783. [PMID: 22666244 PMCID: PMC3361142 DOI: 10.1155/2012/975783] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.3] [Received: 12/13/2011] [Revised: 02/29/2012] [Accepted: 03/13/2012] [Indexed: 12/25/2022] Open
Abstract
The wide coverage and biological relevance of the Gene Ontology (GO), confirmed through its successful use in protein function prediction, have led to the growth in its popularity. In order to exploit the extent of biological knowledge that GO offers in describing genes or groups of genes, there is a need for an efficient, scalable similarity measure for GO terms and GO-annotated proteins. While several GO similarity measures exist, none adequately addresses all issues surrounding the design and usage of the ontology. We introduce a new metric for measuring the distance between two GO terms using the intrinsic topology of the GO-DAG, thus enabling the measurement of functional similarities between proteins based on their GO annotations. We assess the performance of this metric using a ROC analysis on human protein-protein interaction datasets and correlation coefficient analysis on the selected set of protein pairs from the CESSM online tool. This metric achieves good performance compared to the existing annotation-based GO measures. We used this new metric to assess functional similarity between orthologues, and show that it is effective at determining whether orthologues are annotated with similar functions and identifying cases where annotation is inconsistent between orthologues.
|
36
|
Abstract
Transmembrane helical segments (TMs) can be classified into two groups of so-called ‘simple’ and ‘complex’ TMs. Whereas the first group represents mere hydrophobic anchors with an overrepresentation of aliphatic hydrophobic residues that are likely attributed to convergent evolution in many cases, the complex ones embody ancestral information and tend to have structural and functional roles beyond just membrane immersion. Hence, the sequence homology concept is not applicable to simple TMs. In practice, these simple TMs can attract statistically significant but evolutionarily unrelated hits during similarity searches (whether through BLAST- or HMM-based approaches). This is especially problematic for membrane proteins that contain both globular segments and TMs. As such, we have developed the transmembrane helix: simple or complex (TMSOC) webserver for the identification of simple and complex TMs. By masking simple TM segments in seed sequences prior to sequence similarity searches, the false-discovery rate decreases without sacrificing sensitivity. Therefore, TMSOC is a novel and necessary sequence analytic tool for both the experimentalists and the computational biology community working on membrane proteins. It is freely accessible at http://tmsoc.bii.a-star.edu.sg or available for download.
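The masking step mentioned above is simple to sketch once the simple TM segments are known. The snippet below assumes the segment coordinates have already been obtained from a classifier such as TMSOC; the function name, the half-open coordinate convention and the example sequence are illustrative assumptions, not part of the TMSOC interface.

```python
def mask_segments(sequence: str, segments, mask_char: str = "X") -> str:
    """Replace residues inside the given segments with a mask character.

    segments: half-open (start, end) index ranges of simple TM helices,
    assumed to come from a prior classification step.
    """
    residues = list(sequence)
    for start, end in segments:
        for i in range(max(start, 0), min(end, len(residues))):
            residues[i] = mask_char
    return "".join(residues)


if __name__ == "__main__":
    # Toy sequence with a hydrophobic stretch at positions 3..9 (made up for illustration).
    print(mask_segments("MKTLLLLAVVGAAGSDE", [(3, 10)]))
```

The masked sequence can then be used as the seed for a BLAST or HMM search, so that the hydrophobic anchor no longer contributes to the similarity score.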
Affiliation(s)
- Wing-Cheong Wong
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, School of Biological Sciences (SBS), Nanyang Technological University (NTU), 60 Nanyang Drive, Singapore 637551, Department of Biological Sciences (DBS), National University of Singapore (NUS), 8 Medical Drive, Singapore 117597 and School of Computer Engineering (SCE), Nanyang Technological University (NTU), 50 Nanyang Drive, Singapore 637553
- *To whom correspondence should be addressed. Tel: +65 64788305; Fax: +65 64789047;
- Sebastian Maurer-Stroh
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, School of Biological Sciences (SBS), Nanyang Technological University (NTU), 60 Nanyang Drive, Singapore 637551, Department of Biological Sciences (DBS), National University of Singapore (NUS), 8 Medical Drive, Singapore 117597 and School of Computer Engineering (SCE), Nanyang Technological University (NTU), 50 Nanyang Drive, Singapore 637553
- Georg Schneider
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, School of Biological Sciences (SBS), Nanyang Technological University (NTU), 60 Nanyang Drive, Singapore 637551, Department of Biological Sciences (DBS), National University of Singapore (NUS), 8 Medical Drive, Singapore 117597 and School of Computer Engineering (SCE), Nanyang Technological University (NTU), 50 Nanyang Drive, Singapore 637553
- Frank Eisenhaber
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, School of Biological Sciences (SBS), Nanyang Technological University (NTU), 60 Nanyang Drive, Singapore 637551, Department of Biological Sciences (DBS), National University of Singapore (NUS), 8 Medical Drive, Singapore 117597 and School of Computer Engineering (SCE), Nanyang Technological University (NTU), 50 Nanyang Drive, Singapore 637553
- *To whom correspondence should be addressed. Tel: +65 64788305; Fax: +65 64789047;
|
37
|
Attwood TK, Coletta A, Muirhead G, Pavlopoulou A, Philippou PB, Popov I, Romá-Mateo C, Theodosiou A, Mitchell AL. The PRINTS database: a fine-grained protein sequence annotation and analysis resource--its status in 2012. Database (Oxford) 2012; 2012:bas019. [PMID: 22508994 PMCID: PMC3326521 DOI: 10.1093/database/bas019] [Citation(s) in RCA: 106] [Impact Index Per Article: 8.8] [Received: 02/20/2012] [Accepted: 03/12/2012] [Indexed: 01/07/2023]
Abstract
The PRINTS database, now in its 21st year, houses a collection of diagnostic protein family 'fingerprints'. Fingerprints are groups of conserved motifs, evident in multiple sequence alignments, whose unique inter-relationships provide distinctive signatures for particular protein families and structural/functional domains. As such, they may be used to assign uncharacterized sequences to known families, and hence to infer tentative functional, structural and/or evolutionary relationships. The February 2012 release (version 42.0) includes 2156 fingerprints, encoding 12 444 individual motifs, covering a range of globular and membrane proteins, modular polypeptides and so on. Here, we report the current status of the database, and introduce a number of recent developments that help both to render a variety of our annotation and analysis tools easier to use and to make them more widely available. Database URL: www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/.
Affiliation(s)
- Teresa K Attwood
- Faculty of Life Sciences, The University of Manchester, Manchester M13 9PT, UK.
|
38
|
Williams AJ, Ekins S, Tkachenko V. Towards a gold standard: regarding quality in public domain chemistry databases and approaches to improving the situation. Drug Discov Today 2012; 17:685-701. [PMID: 22426180 DOI: 10.1016/j.drudis.2012.02.013] [Citation(s) in RCA: 79] [Impact Index Per Article: 6.6] [Received: 11/18/2011] [Revised: 01/17/2012] [Accepted: 02/28/2012] [Indexed: 01/25/2023]
Abstract
In recent years there has been a dramatic increase in the number of freely accessible online databases serving the chemistry community. The internet provides chemistry data that can be used for data-mining, for computer models, and for integration into systems to aid drug discovery. There is, however, a responsibility to ensure that the data are of high quality, so that time is not wasted in erroneous searches, that models are underpinned by accurate data and that improved discoverability of online resources is not marred by incorrect data. In this article we provide an overview of some of the experiences of the authors using online chemical compound databases, critique the approaches taken to assemble data and suggest approaches to deliver definitive reference data sources.
Affiliation(s)
- Antony J Williams
- Royal Society of Chemistry, US Office, 904 Tamaras Circle, Wake Forest, NC 27587, USA.
|
39
|
Neumann S, Hartmann H, Martin-Galiano AJ, Fuchs A, Frishman D. CAMPS 2.0: exploring the sequence and structure space of prokaryotic, eukaryotic, and viral membrane proteins. Proteins 2011; 80:839-57. [PMID: 22213543 DOI: 10.1002/prot.23242] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 06/29/2011] [Revised: 10/01/2011] [Accepted: 11/04/2011] [Indexed: 12/20/2022]
Abstract
Structural bioinformatics of membrane proteins is still in its infancy, and the picture of their fold space is only beginning to emerge. Because only a handful of three-dimensional structures are available, sequence comparison and structure prediction remain the main tools for investigating sequence-structure relationships in membrane protein families. Here we present a comprehensive analysis of the structural families corresponding to α-helical membrane proteins with at least three transmembrane helices. The new version of our CAMPS database (CAMPS 2.0) covers nearly 1300 eukaryotic, prokaryotic, and viral genomes. Using an advanced classification procedure, which is based on high-order hidden Markov models and considers both sequence similarity as well as the number of transmembrane helices and loop lengths, we identified 1353 structurally homogeneous clusters roughly corresponding to membrane protein folds. Only 53 clusters are associated with experimentally determined three-dimensional structures, and for these clusters CAMPS is in reasonable agreement with structure-based classification approaches such as SCOP and CATH. We therefore estimate that ∼1300 structures would need to be determined to provide a sufficient structural coverage of polytopic membrane proteins. CAMPS 2.0 is available at http://webclu.bio.wzw.tum.de/CAMPS2.0/.
Affiliation(s)
- Sindy Neumann
- Department of Genome Oriented Bioinformatics, Technische Universität München, Wissenschaftszentrum Weihenstephan, 85354 Freising, Germany
|
40
|
Sequence-divergent chordopoxvirus homologs of the O3 protein maintain functional interactions with components of the vaccinia virus entry-fusion complex. J Virol 2011; 86:1696-705. [PMID: 22114343 DOI: 10.1128/jvi.06069-11] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Indexed: 12/18/2022] Open
Abstract
Composed of 35 amino acids, O3 is the smallest characterized protein encoded by vaccinia virus (VACV) and is an integral component of the entry-fusion complex (EFC). O3 is conserved with 100% identity in all orthopoxviruses except for monkeypox viruses, whose O3 homologs have 2 to 3 amino acid substitutions. Since O3 is part of the EFC, high conservation could suggest an immutable requirement for interaction with multiple proteins. Chordopoxviruses of other genera also encode small proteins with a characteristic predicted N-terminal α-helical hydrophobic domain followed by basic amino acids and proline in the same relative genome location as that of VACV O3. However, the statistical significance of their similarity to VACV O3 is low due to the large contribution of the transmembrane domain, their small size, and their sequence diversity. Nevertheless, trans-complementation experiments demonstrated the ability of a representative O3-like protein from each chordopoxvirus genus to rescue the infectivity of a VACV mutant that was unable to express endogenous O3. Moreover, recombinant viruses expressing O3 homologs in place of O3 replicated and formed plaques as well or nearly as well as wild-type VACV. The O3 homologs expressed by the recombinant VACVs were incorporated into the membranes of mature virions and, with one exception, remained stably associated with the detergent-extracted and affinity-purified EFC. The ability of the sequence-divergent O3 homologs to coordinate function with VACV entry proteins suggests the conservation of structural motifs. Analysis of chimeras formed by swapping domains of O3 with those of other proteins indicated that the N-terminal transmembrane segment was responsible for EFC interactions and for the complementation of infectivity.
|
41
|
Wong WC, Maurer-Stroh S, Eisenhaber F. The Janus-faced E-values of HMMER2: extreme value distribution or logistic function? J Bioinform Comput Biol 2011. [DOI: 10.1142/s0219720011005264] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]
Abstract
E-value guided extrapolation of protein domain annotation from libraries such as Pfam with the HMMER suite is indispensable for hypothesizing about the function of experimentally uncharacterized protein sequences. Since the recent release of HMMER3 does not supersede all functions of HMMER2, the latter will remain relevant for ongoing research as well as for the evaluation of annotations that reside in databases and in the literature. In HMMER2, the E-value is computed from the score via a logistic function or via a domain model-specific extreme value distribution (EVD); the lower of the two is returned as the E-value for the domain hit in the query sequence. We find that, for thousands of domain models, this treatment results in switching from the EVD to the statistical model with the logistic function when scores grow (for Pfam release 23, 99% in the global mode and 75% in the fragment mode). If the score corresponding to the breakpoint results in an E-value above a user-defined threshold (e.g. 0.1), a critical score region with conflicting E-values from the logistic function (below the threshold) and from EVD (above the threshold) does exist. Thus, this switch will affect E-value guided annotation decisions in an automated mode. To emphasize, switching in the fragment mode is of no practical relevance since it occurs only at E-values far below 0.1. Unfortunately, a critical score region does exist for 185 domain models in the hmmpfam and 1,748 domain models in the hmmsearch global-search mode. For 145 out of the respective 185 models, the critical score region is indeed populated by actual sequences. In total, 24.4% of their hits have a logistic function-derived E-value < 0.1 when the EVD provides an E-value > 0.1. We provide examples of false annotations and critically discuss the appropriateness of a logistic function as an alternative to the EVD.
Affiliation(s)
- Wing-Cheong Wong
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore
- Sebastian Maurer-Stroh
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore
- School of Biological Sciences (SBS), Nanyang Technological University (NTU), 60 Nanyang Drive, Singapore 63755, Singapore
- Frank Eisenhaber
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore
- Department of Biological Sciences (DBS), National University of Singapore (NUS), 8 Medical Drive, Singapore 117597, Singapore
- School of Computer Engineering (SCE), Nanyang Technological University (NTU), 50 Nanyang Drive, Singapore 637553, Singapore
|
42
|
Parikesit AA, Stadler PF, Prohaska SJ. Evolution and quantitative comparison of genome-wide protein domain distributions. Genes (Basel) 2011; 2:912-24. [PMID: 24710298 PMCID: PMC3927604 DOI: 10.3390/genes2040912] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 08/29/2011] [Revised: 10/07/2011] [Accepted: 10/25/2011] [Indexed: 02/01/2023] Open
Abstract
The metabolic and regulatory capabilities of an organism are implicit in its protein content. This is often hard to estimate, however, due to ascertainment biases inherent in the available genome annotations. Its complement of recognizable functional protein domains and their combinations convey essentially the same information and at the same time are much more readily accessible, although protein domain models trained for one phylogenetic group frequently fail on distantly related sequences. Pooling related domain models based on their GO-annotation in combination with de novo gene prediction methods provides estimates that seem to be less affected by phylogenetic biases. We show here for 18 diverse representatives from all eukaryotic kingdoms that a pooled analysis of the tendencies for co-occurrence or avoidance of protein domains is indeed feasible. This type of analysis can reveal general large-scale patterns in the domain co-occurrence and helps to identify lineage-specific variations in the evolution of protein domains. Somewhat surprisingly, we do not find strong ubiquitous patterns governing the evolutionary behavior of specific functional classes. Instead, there are strong variations between the major groups of Eukaryotes, pointing at systematic differences in their evolutionary constraints.
Affiliation(s)
- Arli A Parikesit
- Computational EvoDevo Group, Department of Computer Science, University of Leipzig, Härtelstraße 16-18, D-04107 Leipzig, Germany.
- Peter F Stadler
- Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstraße 16-18, D-04107 Leipzig, Germany.
- Sonja J Prohaska
- Computational EvoDevo Group, Department of Computer Science, University of Leipzig, Härtelstraße 16-18, D-04107 Leipzig, Germany.
|
43
|
Wong WC, Maurer-Stroh S, Eisenhaber F. Not all transmembrane helices are born equal: Towards the extension of the sequence homology concept to membrane proteins. Biol Direct 2011; 6:57. [PMID: 22024092 PMCID: PMC3217874 DOI: 10.1186/1745-6150-6-57] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.0] [Received: 07/27/2011] [Accepted: 10/25/2011] [Indexed: 12/17/2022] Open
Abstract
BACKGROUND Sequence homology considerations widely used to transfer functional annotation to uncharacterized protein sequences require special precautions in the case of non-globular sequence segments including membrane-spanning stretches composed of non-polar residues. Simple, quantitative criteria are desirable for identifying transmembrane helices (TMs) that must be included into or should be excluded from start sequence segments in similarity searches aimed at finding distant homologues. RESULTS We found that there are two types of TMs in membrane-associated proteins. On the one hand, there are so-called simple TMs with elevated hydrophobicity, low sequence complexity and extraordinary enrichment in long aliphatic residues. They merely serve as membrane-anchoring devices. In contrast, so-called complex TMs have lower hydrophobicity, higher sequence complexity and some functional residues. These TMs have additional roles besides membrane anchoring such as intra-membrane complex formation, ligand binding or a catalytic role. Simple and complex TMs can occur both in single- and multi-membrane-spanning proteins essentially in any type of topology. Whereas simple TMs have the potential to confuse searches for sequence homologues and to generate unrelated hits with seemingly convincing statistical significance, complex TMs contain essential evolutionary information. CONCLUSION For extending the homology concept onto membrane proteins, we provide a necessary quantitative criterion to distinguish simple TMs (and a sufficient criterion for complex TMs) in query sequences prior to their usage in homology searches based on assessment of hydrophobicity and sequence complexity of the TM sequence segments.
Affiliation(s)
- Wing-Cheong Wong
- Bioinformatics Institute, Agency for Science, Technology and Research, Matrix, Singapore
|
44
|
Overton IM, Barton GJ. Computational approaches to selecting and optimising targets for structural biology. Methods 2011; 55:3-11. [PMID: 21906678 PMCID: PMC3202631 DOI: 10.1016/j.ymeth.2011.08.014] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 06/27/2011] [Revised: 08/18/2011] [Accepted: 08/22/2011] [Indexed: 11/29/2022] Open
Abstract
Selection of protein targets for study is central to structural biology and may be influenced by numerous factors. A key aim is to maximise returns for effort invested by identifying proteins with the balance of biophysical properties that are conducive to success at all stages (e.g. solubility, crystallisation) in the route towards a high resolution structural model. Selected targets can be optimised through construct design (e.g. to minimise protein disorder), switching to a homologous protein, and selection of experimental methodology (e.g. choice of expression system) to prime for efficient progress through the structural proteomics pipeline. Here we discuss computational techniques in target selection and optimisation, with more detailed focus on tools developed within the Scottish Structural Proteomics Facility (SSPF); namely XANNpred, ParCrys, OB-Score (target selection) and TarO (target optimisation). TarO runs a large number of algorithms, searching for homologues and annotating the pool of possible alternative targets. This pool of putative homologues is presented in a ranked, tabulated format and results are also visualised as an automatically generated and annotated multiple sequence alignment. The target selection algorithms each predict the propensity of a selected protein target to progress through the experimental stages leading to diffracting crystals. This single predictor approach has advantages for target selection, when compared with an approach using two or more predictors that each predict for success at a single experimental stage. The tools described here helped SSPF achieve a high (21%) success rate in progressing cloned targets to diffraction-quality crystals.
Affiliation(s)
- Ian M Overton
- MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, United Kingdom.
|
45
|
D'Angelo S, Velappan N, Mignone F, Santoro C, Sblattero D, Kiss C, Bradbury ARM. Filtering "genic" open reading frames from genomic DNA samples for advanced annotation. BMC Genomics 2011; 12 Suppl 1:S5. [PMID: 21810207 PMCID: PMC3223728 DOI: 10.1186/1471-2164-12-s1-s5] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Indexed: 11/10/2022] Open
Abstract
Background In order to carry out experimental gene annotation, DNA encoding open reading frames (ORFs) derived from real genes (termed "genic") in the correct frame is required. When genes are correctly assigned, isolation of genic DNA for functional annotation can be carried out by PCR. However, not all genes are correctly assigned, and even when correctly assigned, gene products are often incorrectly folded when expressed in heterologous hosts. This is a problem that can sometimes be overcome by the expression of protein fragments encoding domains, rather than full-length proteins. One possible method to isolate DNA encoding such domains would be to "filter" complex DNA (cDNA libraries, genomic and metagenomic DNA) for gene fragments that confer a selectable phenotype relying on correct folding, with all such domains present in a complex DNA sample, termed the “domainome”. Results In this paper we discuss the preparation of diverse genic ORF libraries from randomly fragmented genomic DNA using β-lactamase to filter out the open reading frames. By cloning DNA fragments between leader sequences and the mature β-lactamase gene, colonies can be selected for resistance to ampicillin, conferred by correct folding of the lactamase gene. Our experiments demonstrate that the majority of surviving colonies contain genic open reading frames, suggesting that β-lactamase is acting as a selectable folding reporter. Furthermore, different leaders (Sec, TAT and SRP), normally translocating different protein classes, filter different genic fragment subsets, indicating that their use increases the fraction of the “domainome” that is accessible. Conclusions The availability of ORF libraries, obtained with the filtering method described here, combined with screening methods such as phage display and protein-protein interaction studies, or with protein structure determination projects, can lead to the identification and structural determination of functional genic ORFs. ORF libraries represent, moreover, a useful tool to proceed towards high-throughput functional annotation of newly sequenced genomes.
Affiliation(s)
- Sara D'Angelo
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, USA
|
46
|
Studer RA, Person E, Robinson-Rechavi M, Rossier BC. Evolution of the epithelial sodium channel and the sodium pump as limiting factors of aldosterone action on sodium transport. Physiol Genomics 2011; 43:844-54. [PMID: 21558422 DOI: 10.1152/physiolgenomics.00002.2011] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.9] [Indexed: 12/18/2022] Open
Abstract
Despite large changes in salt intake, the mammalian kidney is able to maintain the extracellular sodium concentration and osmolarity within very narrow margins, thereby controlling blood volume and blood pressure. In the aldosterone-sensitive distal nephron (ASDN), aldosterone tightly controls the activities of epithelial sodium channel (ENaC) and Na,K-ATPase, the two limiting factors in establishing transepithelial sodium transport. It has been proposed that the ENaC/degenerin gene family is restricted to Metazoans, whereas the α- and β-subunits of Na,K-ATPase have homologous genes in prokaryotes. This raises the question of the emergence of osmolarity control. By exploring recent genomic data of diverse organisms, we found that: 1) ENaC/degenerin exists in all of the Metazoans screened, including nonbilaterians and, by extension, was already present in ancestors of Metazoa; 2) ENaC/degenerin is also present in Naegleria gruberi, a eukaryotic microbe, consistent with either a vertical inheritance from the last common ancestor of Eukaryotes or a lateral transfer between Naegleria and Metazoan ancestors; and 3) the Na,K-ATPase β-subunit is restricted to Holozoa, the taxon that includes animals and their closest single-cell relatives. Since the β-subunit of Na,K-ATPase plays a key role in targeting the α-subunit to the plasma membrane and has an additional function in the formation of cell junctions, we propose that the emergence of Na,K-ATPase, together with ENaC/degenerin, is linked to the development of multicellularity in the Metazoan kingdom. The establishment of multicellularity and the associated extracellular compartment ("internal milieu") precedes the emergence of other key elements of the aldosterone signaling pathway.
Affiliation(s)
- Romain A Studer
- Department of Ecology and Evolution, Biophore, Lausanne, Switzerland.
|
47
|
Thompson JD, Linard B, Lecompte O, Poch O. A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives. PLoS One 2011; 6:e18093. [PMID: 21483869 PMCID: PMC3069049 DOI: 10.1371/journal.pone.0018093] [Citation(s) in RCA: 129] [Impact Index Per Article: 9.9] [Received: 11/15/2010] [Accepted: 02/21/2011] [Indexed: 12/18/2022] Open
Abstract
Multiple comparison or alignment of protein sequences has become a fundamental tool in many different domains in modern molecular biology, from evolutionary studies to prediction of 2D/3D structure, molecular function and inter-molecular interactions, etc. By placing the sequence in the framework of the overall family, multiple alignments can be used to identify conserved features and to highlight differences or specificities. In this paper, we describe a comprehensive evaluation of many of the most popular methods for multiple sequence alignment (MSA), based on a new benchmark test set. The benchmark is designed to represent typical problems encountered when aligning the large protein sequence sets that result from today's high throughput biotechnologies. We show that alignment methods have significantly progressed and can now identify most of the shared sequence features that determine the broad molecular function(s) of a protein family, even for divergent sequences. However, we have identified a number of important challenges. First, the locally conserved regions, that reflect functional specificities or that modulate a protein's function in a given cellular context, are less well aligned. Second, motifs in natively disordered regions are often misaligned. Third, the badly predicted or fragmentary protein sequences, which make up a large proportion of today's databases, lead to a significant number of alignment errors. Based on this study, we demonstrate that the existing MSA methods can be exploited in combination to improve alignment accuracy, although novel approaches will still be needed to fully explore the most difficult regions. We then propose knowledge-enabled, dynamic solutions that will hopefully pave the way to enhanced alignment construction and exploitation in future evolutionary systems biology studies.
Affiliation(s)
- Julie D Thompson
- Département de Biologie Structurale et Génomique, IGBMC (Institut de Génétique et de Biologie Moléculaire et Cellulaire), CNRS/INSERM/Université de Strasbourg, Illkirch, France.
|
48
|
Abstract
Biological sequences are often analyzed by detecting homologous regions between them. Homology search is confounded by simple repeats, which give rise to strong similarities that are not homologies. Standard repeat-masking methods fail to eliminate this problem, and they are especially ill-suited to AT-rich DNA such as malaria and slime-mould genomes. We present a new repeat-masking method, tantan, which is motivated by the mechanisms that create simple repeats. This method thoroughly eliminates spurious homology predictions for DNA–DNA, protein–protein and DNA–protein comparisons. Moreover, it enables accurate homology search for non-coding DNA with extreme A + T composition.
Affiliation(s)
- Martin C Frith
- Computational Biology Research Center, Institute for Advanced Industrial Science and Technology, Sequence Analysis Team, 2-4-7 Aomi, Koto-ku, Tokyo 135-0064, Japan.
|