1
Stingl C, VanDuijn MM, Dejoie T, Sillevis Smitt PAE, Luider TM. Improved detection of tryptic immunoglobulin variable region peptides by chromatographic and gas-phase fractionation techniques. Cell Reports Methods 2024; 4:100795. [PMID: 38861989 PMCID: PMC11228375 DOI: 10.1016/j.crmeth.2024.100795] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2023] [Revised: 03/30/2024] [Accepted: 05/20/2024] [Indexed: 06/13/2024]
Abstract
The polyclonal repertoire of circulating antibodies potentially holds valuable information about an individual's humoral immune state. While bottom-up proteomics is well suited for serum proteomics, the vast number of antibodies and dynamic range of serum challenge this analysis. To acquire the serum proteome more comprehensively, we incorporated high-field asymmetric waveform ion-mobility spectrometry (FAIMS) or two-dimensional chromatography into standard trypsin-based bottom-up proteomics. Thereby, the number of variable region (VR)-related spectra increased 1.7-fold with FAIMS and 10-fold with chromatography fractionation. To match antibody VRs to spectra, we combined de novo searching and BLAST alignment. Validation of this approach showed that, as peptide length increased, the de novo accuracy decreased and BLAST performance increased. Through in silico calculations on antibody repository sequences, we determined the uniqueness of tryptic VR peptides and their suitability as antibody surrogate. Approximately one-third of these peptides were unique, and about one-third of all antibodies contained at least one unique peptide.
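The in silico uniqueness calculation described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the cleavage rule (after K or R, but not before P), the minimum peptide length, the function names, and the three toy sequences are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tryptic_peptides(sequence, min_length=6):
    """Digest a protein in silico: cleave after K or R unless the next residue is P.

    The cleavage rule and min_length are assumptions for this sketch.
    """
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return [p for p in pieces if len(p) >= min_length]


def peptide_uniqueness(antibodies, min_length=6):
    """Return (peptides found in exactly one antibody, antibodies owning at least one such peptide)."""
    owners = defaultdict(set)
    for name, seq in antibodies.items():
        for pep in tryptic_peptides(seq, min_length):
            owners[pep].add(name)
    unique = {pep for pep, names in owners.items() if len(names) == 1}
    covered = {next(iter(owners[pep])) for pep in unique}
    return unique, covered


# Hypothetical toy sequences, not taken from any antibody repository.
TOY_ANTIBODIES = {
    "ab1": "EVQLVESGGGLVKPGGSLRLSCAASGFTFSK",
    "ab2": "EVQLVESGGGLVKPGGSLRLSCAASGYTFTR",
    "ab3": "EVQLVESGGGLVKPGGSLR",
}

if __name__ == "__main__":
    unique, covered = peptide_uniqueness(TOY_ANTIBODIES)
    print(sorted(unique), sorted(covered))
```

In the toy data the shared framework peptide is not unique, so the third sequence has no surrogate peptide, which mirrors the abstract's point that only some antibodies contain a unique tryptic peptide.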
Affiliation(s)
- Christoph Stingl
  - Clinical and Cancer Proteomics, Department of Neurology, Erasmus MC, 3000 CA Rotterdam, the Netherlands
- Martijn M VanDuijn
  - Clinical and Cancer Proteomics, Department of Neurology, Erasmus MC, 3000 CA Rotterdam, the Netherlands
- Thomas Dejoie
  - Laboratoire de Biochimie, Centre Hospitalier Universitaire (CHU), 44000 Nantes, France
- Peter A E Sillevis Smitt
  - Clinical and Cancer Proteomics, Department of Neurology, Erasmus MC, 3000 CA Rotterdam, the Netherlands
- Theo M Luider
  - Clinical and Cancer Proteomics, Department of Neurology, Erasmus MC, 3000 CA Rotterdam, the Netherlands
2
Price MN, Arkin AP. A fast comparative genome browser for diverse bacteria and archaea. PLoS One 2024; 19:e0301871. [PMID: 38593165 PMCID: PMC11003636 DOI: 10.1371/journal.pone.0301871] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2023] [Accepted: 03/22/2024] [Indexed: 04/11/2024]
Abstract
Genome sequencing has revealed an incredible diversity of bacteria and archaea, but there are no fast and convenient tools for browsing across these genomes. It is cumbersome to view the prevalence of homologs for a protein of interest, or the gene neighborhoods of those homologs, across the diversity of the prokaryotes. We developed a web-based tool, fast.genomics, that uses two strategies to support fast browsing across the diversity of prokaryotes. First, the database of genomes is split up. The main database contains one representative from each of the 6,377 genera that have a high-quality genome, and additional databases for each taxonomic order contain up to 10 representatives of each species. Second, homologs of proteins of interest are identified quickly by using accelerated searches, usually in a few seconds. Once homologs are identified, fast.genomics can quickly show their prevalence across taxa, view their neighboring genes, or compare the prevalence of two different proteins. Fast.genomics is available at https://fast.genomics.lbl.gov.
Affiliation(s)
- Morgan N. Price
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Lab, Berkeley, California, United States of America
- Adam P. Arkin
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Lab, Berkeley, California, United States of America
3
Glidden-Handgis G, Wheeler TJ. WAS IT A MATch I SAW? Approximate palindromes lead to overstated false match rates in benchmarks using reversed sequences. Bioinformatics Advances 2024; 4:vbae052. [PMID: 38764475 PMCID: PMC11099658 DOI: 10.1093/bioadv/vbae052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2023] [Revised: 03/31/2024] [Accepted: 04/04/2024] [Indexed: 05/21/2024]
Abstract
Background Software for labeling biological sequences typically produces a theory-based statistic for each match (the E-value) that indicates the likelihood of seeing that match's score by chance. E-values accurately predict false match rate for comparisons of random (shuffled) sequences, and thus provide a reasoned mechanism for setting score thresholds that enable high sensitivity with low expected false match rate. This threshold-setting strategy is challenged by real biological sequences, which contain regions of local repetition and low sequence complexity that cause excess matches between non-homologous sequences. Knowing this, tool developers often develop benchmarks that use realistic-seeming decoy sequences to explore empirical tradeoffs between sensitivity and false match rate. A recent trend has been to employ reversed biological sequences as realistic decoys, because these preserve the distribution of letters and the existence of local repeats, while disrupting the original sequence's functional properties. However, we and others have observed that sequences appear to produce high scoring alignments to their reversals with surprising frequency, leading to overstatement of false match risk that may negatively affect downstream analysis. Results We demonstrate that an alignment between a sequence S and its (possibly mutated) reversal tends to produce higher scores than alignment between truly unrelated sequences, even when S is a shuffled string with no notable repetitive or low-complexity regions. This phenomenon is due to the unintuitive fact that (even randomly shuffled) sequences contain palindromes that are on average longer than the longest common substrings (LCS) shared between permuted variants of the same sequence. 
Though the expected palindrome length is only slightly larger than the expected LCS, the distribution of alignment scores involving reversed sequences is strongly right-shifted, leading to greatly increased frequency of high-scoring alignments to reversed sequences. Impact Overestimates of false match risk can motivate unnecessarily high score thresholds, leading to potentially reduced true match sensitivity. Also, when tool sensitivity is only reported up to the score of the first matched decoy sequence, a large decoy set consisting of reversed sequences can obscure sensitivity differences between tools. As a result of these observations, we advise that reversed biological sequences be used as decoys only when care is taken to remove positive matches in the original (un-reversed) sequences, or when overstatement of false labeling is not a concern. Though the primary focus of the analysis is on sequence annotation, we also demonstrate that the prevalence of internal palindromes may lead to an overstatement of the rate of false labels in protein identification with mass spectrometry.
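The effect described in this abstract can be explored with a small simulation. The sketch below is an assumption-laden stand-in for the paper's analysis: it uses exact longest common substrings rather than scored alignments, draws uniform random protein-like strings, and all function names and parameters are hypothetical. Any palindrome in S appears verbatim in the reversal of S, so the longest exact match between S and its reversal is at least as long as the longest palindrome in S.

```python
import random


def longest_common_substring(a, b):
    """Length of the longest contiguous substring shared by a and b (dynamic programming)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def compare_decoys(length=200, trials=20, alphabet="ACDEFGHIKLMNPQRSTVWY", seed=0):
    """Average longest exact match of S against its reversal and against a shuffle of S."""
    rng = random.Random(seed)
    reversed_total = 0
    shuffled_total = 0
    for _ in range(trials):
        s = "".join(rng.choice(alphabet) for _ in range(length))
        letters = list(s)
        rng.shuffle(letters)
        reversed_total += longest_common_substring(s, s[::-1])
        shuffled_total += longest_common_substring(s, "".join(letters))
    return reversed_total / trials, shuffled_total / trials


if __name__ == "__main__":
    print(compare_decoys())
```

Comparing the two averages over many trials gives a rough feel for how reversed decoys differ from shuffled ones; it does not reproduce the alignment-score distributions reported in the paper.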
Affiliation(s)
- Travis J Wheeler
  - R. Ken Coit College of Pharmacy, University of Arizona, Tucson, AZ 85721, United States
4
Hauswedell H, Hetzel S, Gottlieb SG, Kretzmer H, Meissner A, Reinert K. Lambda3: homology search for protein, nucleotide, and bisulfite-converted sequences. Bioinformatics 2024; 40:btae097. [PMID: 38485699 PMCID: PMC10955267 DOI: 10.1093/bioinformatics/btae097] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Revised: 12/22/2023] [Accepted: 03/13/2024] [Indexed: 03/22/2024]
Abstract
MOTIVATION Local alignments of query sequences in large databases represent a core part of metagenomic studies and facilitate homology search. Following the development of NCBI Blast, many applications aimed to provide faster and equally sensitive local alignment frameworks. Most applications focus on protein alignments, while only few also facilitate DNA-based searches. None of the established programs allow searching DNA sequences from bisulfite sequencing experiments commonly used for DNA methylation profiling, for which specific alignment strategies need to be implemented. RESULTS Here, we introduce Lambda3, a new version of the local alignment application Lambda. Lambda3 is the first solution that enables the search of protein, nucleotide as well as bisulfite-converted nucleotide query sequences. Its protein mode achieves comparable performance to that of the highly optimized protein alignment application Diamond, while the nucleotide mode consistently outperforms established local nucleotide aligners. Combined, Lambda3 presents a universal local alignment framework that enables fast and sensitive homology searches for a wide range of use-cases. AVAILABILITY AND IMPLEMENTATION Lambda3 is free and open-source software publicly available at https://github.com/seqan/lambda/.
Affiliation(s)
- Sara Hetzel
  - Department of Genome Regulation, Max Planck Institute for Molecular Genetics, Berlin 14195, Germany
- Simon G Gottlieb
  - Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin 14195, Germany
  - Institute for Bio- and Geosciences, Forschungszentrum Jülich GmbH, Jülich 52428, Germany
- Helene Kretzmer
  - Department of Genome Regulation, Max Planck Institute for Molecular Genetics, Berlin 14195, Germany
- Alexander Meissner
  - Department of Genome Regulation, Max Planck Institute for Molecular Genetics, Berlin 14195, Germany
  - Department of Biology, Chemistry and Pharmacy, Freie Universität Berlin, Berlin 14195, Germany
- Knut Reinert
  - Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin 14195, Germany
  - Efficient Algorithms for Omics Data Group, Max Planck Institute for Molecular Genetics, Berlin 14195, Germany
5
Kouros CE, Makri V, Ouzounis CA, Chasapi A. Disease association and comparative genomics of compositional bias in human proteins. F1000Res 2023; 12:198. [PMID: 37082000 PMCID: PMC10111144 DOI: 10.12688/f1000research.129929.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 02/02/2023] [Indexed: 02/22/2023]
Abstract
Background: The evolutionary rate of disordered proteins varies greatly due to the lack of structural constraints. So far, few studies have investigated the presence/absence patterns of intrinsically disordered regions (IDRs) across phylogenies in conjunction with human disease. In this study, we report a genome-wide analysis of compositional bias association with disease in human proteins and their taxonomic distribution. Methods: The human genome protein set provided by the Ensembl database was annotated and analysed with respect to both disease associations and the detection of compositional bias. The UniProt Reference Proteome dataset, containing 11297 proteomes, was used as the target dataset for the comparative genomics of a well-defined subset of the human genome, including 100 characteristic, compositionally biased proteins, some linked to disease. Results: Cross-evaluation of compositional bias and disease association in the human genome reveals a significant bias towards low complexity regions in disease-associated genes, with charged, hydrophilic amino acids appearing as over-represented. The phylogenetic profiling of 17 disease-associated, low complexity proteins across 11297 proteomes captures characteristic taxonomic distribution patterns. Conclusions: This is the first time that a combined genome-wide analysis of low complexity, disease association and taxonomic distribution of human proteins is reported, covering structural, functional, and evolutionary properties. The reported framework can form the basis for large-scale follow-up projects encompassing the entire human genome and all known gene-disease associations.
Affiliation(s)
- Christos E. Kouros
  - BCCB-AIIA, School of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Vasiliki Makri
  - BCCB-AIIA, School of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Christos A. Ouzounis
  - BCCB-AIIA, School of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece
  - BCPL, Chemical Process & Energy Resources Institute, Centre for Research & Technology Hellas (CERTH), Thessaloniki, Greece
- Anastasia Chasapi
  - BCPL, Chemical Process & Energy Resources Institute, Centre for Research & Technology Hellas (CERTH), Thessaloniki, Greece
6
Kouros CE, Makri V, Ouzounis CA, Chasapi A. Disease association and comparative genomics of compositional bias in human proteins. F1000Res 2023; 12:198. [PMID: 37082000 PMCID: PMC10111144.2 DOI: 10.12688/f1000research.129929.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 04/12/2023] [Indexed: 04/25/2023]
Abstract
Background: The evolutionary rate of disordered protein regions varies greatly due to the lack of structural constraints. So far, few studies have investigated the presence/absence patterns of compositional bias, indicative of disorder, across phylogenies in conjunction with human disease. In this study, we report a genome-wide analysis of compositional bias association with disease in human proteins and their taxonomic distribution. Methods: The human genome protein set provided by the Ensembl database was annotated and analysed with respect to both disease associations and the detection of compositional bias. The UniProt Reference Proteome dataset, containing 11297 proteomes, was used as the target dataset for the comparative genomics of a well-defined subset of the human genome, including 100 characteristic, compositionally biased proteins, some linked to disease. Results: Cross-evaluation of compositional bias and disease association in the human genome reveals a significant enrichment of biased regions in disease-associated genes, with charged, hydrophilic amino acids appearing as over-represented. The phylogenetic profiling of 17 disease-associated proteins with compositional bias across 11297 proteomes captures characteristic taxonomic distribution patterns. Conclusions: This is the first time that a combined genome-wide analysis of compositional bias, disease association and taxonomic distribution of human proteins is reported, covering structural, functional, and evolutionary properties. The reported framework can form the basis for large-scale follow-up projects encompassing the entire human genome and all known gene-disease associations.
Affiliation(s)
- Christos E. Kouros
  - BCCB-AIIA, School of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Vasiliki Makri
  - BCCB-AIIA, School of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Christos A. Ouzounis
  - BCCB-AIIA, School of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece
  - BCPL, Chemical Process & Energy Resources Institute, Centre for Research & Technology Hellas (CERTH), Thessaloniki, Greece
- Anastasia Chasapi
  - BCPL, Chemical Process & Energy Resources Institute, Centre for Research & Technology Hellas (CERTH), Thessaloniki, Greece
7
TwinCons: Conservation score for uncovering deep sequence similarity and divergence. PLoS Comput Biol 2021; 17:e1009541. [PMID: 34714829 PMCID: PMC8580257 DOI: 10.1371/journal.pcbi.1009541] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/16/2021] [Revised: 11/10/2021] [Accepted: 10/06/2021] [Indexed: 11/19/2022]
Abstract
We have developed the program TwinCons, to detect noisy signals of deep ancestry of proteins or nucleic acids. As input, the program uses a composite alignment containing pre-defined groups, and mathematically determines a 'cost' of transforming one group to the other at each position of the alignment. The output distinguishes conserved, variable and signature positions. A signature is conserved within groups but differs between groups. The method automatically detects continuous characteristic stretches (segments) within alignments. TwinCons provides a convenient representation of conserved, variable and signature positions as a single score, enabling the structural mapping and visualization of these characteristics. Structure is more conserved than sequence. TwinCons highlights alternative sequences of conserved structures. Using TwinCons, we detected highly similar segments between proteins from the translation and transcription systems. TwinCons detects conserved residues within regions of high functional importance for the ribosomal RNA (rRNA) and demonstrates that signatures are not confined to specific regions but are distributed across the rRNA structure. The ability to evaluate both nucleic acid and protein alignments allows TwinCons to be used in combined sequence and structural analysis of signatures and conservation in rRNA and in ribosomal proteins (rProteins). TwinCons detects a strong sequence conservation signal between bacterial and archaeal rProteins related by circular permutation. This conserved sequence is structurally colocalized with conserved rRNA, indicated by TwinCons scores of rRNA alignments of bacterial and archaeal groups. This combined analysis revealed deep co-evolution of rRNA and rProtein buried within the deepest branching points in the tree of life.
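The distinction between conserved, variable and signature positions can be illustrated with a deliberately simplified rule. The program described above computes a continuous per-position cost; the sketch below replaces that with a discrete three-way label, so it is not the TwinCons score. The function name and the toy alignment are hypothetical.

```python
def classify_columns(group_a, group_b):
    """Label each column of a two-group alignment.

    'conserved': one residue shared by both groups
    'signature': each group is internally invariant, but the groups differ
    'variable' : at least one group shows more than one residue
    """
    length = len(group_a[0])
    if any(len(seq) != length for seq in group_a + group_b):
        raise ValueError("all aligned sequences must have the same length")
    labels = []
    for i in range(length):
        col_a = {seq[i] for seq in group_a}
        col_b = {seq[i] for seq in group_b}
        if len(col_a) == 1 and len(col_b) == 1:
            labels.append("conserved" if col_a == col_b else "signature")
        else:
            labels.append("variable")
    return labels


if __name__ == "__main__":
    # Hypothetical toy alignment with two pre-defined groups.
    bacteria = ["MKLV", "MKIV", "MKLV"]
    archaea = ["MRLA", "MRVA", "MRLA"]
    print(classify_columns(bacteria, archaea))
```

A score-based method can grade partial conservation and similar-but-not-identical residues, which this all-or-nothing rule cannot.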
8
Carey KM, Patterson G, Wheeler TJ. Transposable element subfamily annotation has a reproducibility problem. Mob DNA 2021; 12:4. [PMID: 33485368 PMCID: PMC7827986 DOI: 10.1186/s13100-021-00232-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 10/08/2020] [Accepted: 01/11/2021] [Indexed: 11/24/2022]
Abstract
Background Transposable element (TE) sequences are classified into families based on the reconstructed history of replication, and into subfamilies based on more fine-grained features that are often intended to capture family history. We evaluate the reliability of annotation with common subfamilies by assessing the extent to which subfamily annotation is reproducible in replicate copies created by segmental duplications in the human genome, and in homologous copies shared by human and chimpanzee. Results We find that standard methods annotate over 10% of replicates as belonging to different subfamilies, despite the fact that they are expected to be annotated as belonging to the same subfamily. Point mutations and homologous recombination appear to be responsible for some of this discordant annotation (particularly in the young Alu family), but are unlikely to fully explain the annotation unreliability. Conclusions The surprisingly high level of disagreement in subfamily annotation of homologous sequences highlights a need for further research into definition of TE subfamilies, methods for representing subfamily annotation confidence of TE instances, and approaches to better utilizing such nuanced annotation data in downstream analysis.
Affiliation(s)
- Kaitlin M Carey
  - Department of Computer Science, University of Montana, 32 Campus Drive, Missoula, MT, USA
- Gilia Patterson
  - Department of Computer Science, University of Montana, 32 Campus Drive, Missoula, MT, USA
  - Institute of Ecology and Evolution, University of Oregon, 272 Onyx Bridge, Eugene, OR, USA
- Travis J Wheeler
  - Department of Computer Science, University of Montana, 32 Campus Drive, Missoula, MT, USA
9
Trivedi R, Nagarajaram HA. Substitution scoring matrices for proteins - An overview. Protein Sci 2020; 29:2150-2163. [PMID: 32954566 DOI: 10.1002/pro.3954] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 08/17/2020] [Revised: 09/17/2020] [Accepted: 09/18/2020] [Indexed: 01/17/2023]
Abstract
Sequence analysis is the primary and simplest approach to discover structural, functional and evolutionary details of related proteins. All alignment-based approaches to sequence analysis make use of amino acid substitution matrices, and the accuracy of the results largely depends on the type of scoring matrix used to perform alignment tasks. An amino acid substitution matrix is a 20 × 20 matrix in which the individual elements encapsulate the rates at which each of the 20 amino acid residues in proteins is substituted by other amino acid residues over time. In contrast to most globular/ordered proteins, whose amino acid composition is considered standard, there are several classes of proteins (e.g., transmembrane proteins) in which certain types of amino acid (e.g., hydrophobic residues) are enriched. These compositional differences among various classes of proteins are manifested in their underlying residue substitution frequencies. Therefore, each compositionally distinct class of proteins or protein segments should be studied using specific scoring matrices that reflect its distinct residue substitution pattern. In this review, we describe the development and application of various substitution scoring matrices specific to proteins with standard and biased compositions. Along with the most commonly used standard matrices (PAM, BLOSUM, MD and VTML) that act as default parameters in various homolog search and alignment tools, different substitution scoring matrices specific to compositionally distinct classes of proteins are discussed in detail.
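To make the link between substitution frequencies and matrix entries concrete, the sketch below derives log-odds scores of the form s(a, b) = scale × log2(q_ab / (p_a p_b)), the general form used by BLOSUM-style matrices. The three-letter alphabet and every frequency in it are invented for illustration, and the function name is hypothetical.

```python
import math


def log_odds_matrix(pair_freq, scale=2.0):
    """Build rounded log-odds scores from ordered aligned-pair frequencies.

    pair_freq[(a, b)] is the observed frequency of residue a aligned to b;
    the table must be symmetric and sum to 1 over all ordered pairs.
    """
    residues = sorted({a for a, _ in pair_freq})
    # Background frequency of each residue is the marginal of the pair table.
    background = {a: sum(pair_freq[(a, b)] for b in residues) for a in residues}
    return {
        (a, b): round(scale * math.log2(pair_freq[(a, b)] / (background[a] * background[b])))
        for a in residues
        for b in residues
    }


# Invented frequencies for a toy alphabet: L and I substitute often, D rarely.
TOY_PAIR_FREQ = {
    ("L", "L"): 0.25, ("I", "I"): 0.15, ("D", "D"): 0.32,
    ("L", "I"): 0.12, ("I", "L"): 0.12,
    ("L", "D"): 0.01, ("D", "L"): 0.01,
    ("I", "D"): 0.01, ("D", "I"): 0.01,
}

if __name__ == "__main__":
    for pair, score in sorted(log_odds_matrix(TOY_PAIR_FREQ).items()):
        print(pair, score)
```

Feeding in pair frequencies counted from a compositionally biased protein class instead of a general set is, in outline, how a class-specific matrix would differ from a standard one.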
Affiliation(s)
- Rakesh Trivedi
  - Laboratory of Computational Biology, Centre for DNA Fingerprinting and Diagnostics, Uppal, Hyderabad, Telangana, India
  - Graduate School, Manipal Academy of Higher Education, Manipal, Karnataka, India
- Hampapathalu Adimurthy Nagarajaram
  - Laboratory of Computational Biology, Department of Systems and Computational Biology, School of Life Sciences, University of Hyderabad, Hyderabad, Telangana, India
  - Centre for Modelling, Simulation and Design, University of Hyderabad, Hyderabad, Telangana, India
10
ProtPCV: A Fixed Dimensional Numerical Representation of Protein Sequence to Significantly Reduce Sequence Search Time. Interdiscip Sci 2020; 12:276-287. [PMID: 32524529 DOI: 10.1007/s12539-020-00380-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2019] [Revised: 05/19/2020] [Accepted: 06/02/2020] [Indexed: 10/24/2022]
Abstract
Protein sequences hold a wealth of experimental information that has yet to be fully exploited to identify protein homologues. Published dynamic programming, heuristic, and HMM profile-based alignment techniques, as well as alignment-free techniques, do not directly use the ordered profile of physicochemical properties of a protein to identify its homologue. These works also lack crucial benchmarking or validation, without which their incorporation into search engines may appear questionable. This research offers a fixed-dimensional numerical representation of protein sequences, extending the concept of the periodicity count value of nucleotide types (2017) so that Euclidean distance can serve as a direct similarity measure between two proteins. Rather than benchmarking against BLAST and PSI-BLAST only, this new similarity measure was also compared with Needleman-Wunsch and Smith-Waterman. To strengthen the comparison, this work introduces for the first time two novel benchmarking methods based on correlation of "similarity scores" and "proximity of ranked outputs from a standard sequence alignment method" between all possible pairs of search techniques, including the new one presented in this paper. The novel numerical representation of a protein is found to reduce the computational complexity of protein sequence search to O(log(n)). It may also enable various other similarity-based operations, such as clustering, phylogenetic analysis and classification of proteins, on the basis of the properties used to build this numerical representation.
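The general idea of mapping each protein to a fixed-length vector and comparing vectors by Euclidean distance can be sketched as below. The abstract does not define the periodicity count value, so this example swaps in plain amino acid composition as the vector; it is therefore not ProtPCV, and the function names and toy sequences are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence):
    """Fixed 20-dimensional vector of residue fractions (a stand-in, not ProtPCV)."""
    n = len(sequence)
    return [sequence.count(aa) / n for aa in AMINO_ACIDS]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def nearest(query, database):
    """Name of the database sequence whose vector is closest to the query's."""
    query_vec = composition_vector(query)
    return min(
        database,
        key=lambda name: euclidean(query_vec, composition_vector(database[name])),
    )


# Hypothetical toy database.
TOY_DB = {"p1": "MKKLLVAGG", "p2": "DDEEDDEED"}

if __name__ == "__main__":
    print(nearest("MKKLLVAGA", TOY_DB))
```

This linear scan costs O(n) per query. Because every protein has the same vector length, the vectors could instead be placed in a spatial index, which is the kind of structure that makes sub-linear search plausible.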
11
Frith MC. How sequence alignment scores correspond to probability models. Bioinformatics 2019; 36:408-415. [PMID: 31329241 PMCID: PMC9883716 DOI: 10.1093/bioinformatics/btz576] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 03/27/2019] [Revised: 05/31/2019] [Accepted: 07/17/2019] [Indexed: 02/03/2023]
Abstract
MOTIVATION Sequence alignment remains fundamental in bioinformatics. Pair-wise alignment is traditionally based on ad hoc scores for substitutions, insertions and deletions, but can also be based on probability models (pair hidden Markov models: PHMMs). PHMMs enable us to: fit the parameters to each kind of data, calculate the reliability of alignment parts and measure sequence similarity integrated over possible alignments. RESULTS This study shows how multiple models correspond to one set of scores. Scores can be converted to probabilities by partition functions with a 'temperature' parameter: for any temperature, this corresponds to some PHMM. There is a special class of models with balanced length probability, i.e. no bias toward either longer or shorter alignments. The best way to score alignments and assess their significance depends on the aim: judging whether whole sequences are related versus finding related parts. This clarifies the statistical basis of sequence alignment. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
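The conversion of scores to probabilities through a partition function with a temperature parameter can be illustrated on a short list of candidate alignment scores: P(a) = exp(S_a / T) / Z, with Z the sum of exp(S_b / T) over all candidates. This is only a sketch under simplifying assumptions: real tools sum over all possible alignments with a forward-style recursion rather than an explicit list, and the function name and scores are hypothetical.

```python
import math


def alignment_probabilities(scores, temperature=1.0):
    """Boltzmann-weight a list of alignment scores: P(a) = exp(S_a / T) / Z."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(scores)
    # Subtracting the maximum keeps exp() from overflowing; it cancels in the ratio.
    weights = [math.exp((s - top) / temperature) for s in scores]
    partition = sum(weights)
    return [w / partition for w in weights]


if __name__ == "__main__":
    candidate_scores = [10, 8, 3]  # hypothetical scores of three alternative alignments
    print(alignment_probabilities(candidate_scores, temperature=1.0))
    print(alignment_probabilities(candidate_scores, temperature=10.0))
```

Raising the temperature flattens the distribution, so more of the probability mass moves to lower-scoring alternatives; lowering it concentrates mass on the single best alignment.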
12
Bode K, O'Halloran DM. NCX-DB: a unified resource for integrative analysis of the sodium calcium exchanger super-family. BMC Neurosci 2018; 19:19. [PMID: 29649983 PMCID: PMC5898058 DOI: 10.1186/s12868-018-0423-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 10/03/2017] [Accepted: 03/28/2018] [Indexed: 12/20/2022]
Abstract
Na+/Ca2+ exchangers are low-affinity high-capacity transporters that mediate Ca2+ extrusion by coupling Ca2+ efflux to the influx of Na+ ions. The Na+/Ca2+ exchangers form a super-family comprising three branches, each differing in ion-substrate selectivity: Na+/Ca2+ exchangers (NCX), Na+/Ca2+/K+ exchangers, and Ca2+/cation exchangers. Their primary function is to maintain Ca2+ homeostasis, and they play a particularly important role in excitable cells that experience transient Ca2+ fluxes. Research into the role and activity of Na+/Ca2+ exchangers has focused extensively on the cardio-vascular system. However, growing evidence suggests that Na+/Ca2+ exchangers play a key role in neuronal processes such as memory formation, learning, oligodendrocyte differentiation, neuroprotection during brain ischemia and axon guidance. They have also been implicated in pathologies such as Alzheimer’s disease, Parkinson’s disease, Multiple Sclerosis and Epilepsy, but a clear understanding of their mechanism during disease is lacking. To date, there has never been a central resource or database for Na+/Ca2+ exchangers. With clear disease relevance and ever-increasing research on Na+/Ca2+ exchangers from both model and non-model species, a database that unifies the data on Na+/Ca2+ exchangers is needed for future research. NCX-DB is a publicly available database with a web interface that enables users to explore various Na+/Ca2+ exchangers, perform cross-species sequence comparison, identify new exchangers, and stay up to date with recent literature. NCX-DB is available on the web via an interactive user interface with an intuitive design, which is applicable for the identification and comparison of Na+/Ca2+ exchanger proteins across diverse species.
Affiliation(s)
- Katrin Bode
  - Department of Biological Sciences, The George Washington University, Science and Engineering Hall 6000, 800 22nd St. N.W., Washington, DC, 20052, USA
  - Institute for Neuroscience, The George Washington University, 636A Ross Hall, 2300 I St. N.W., Washington, DC, 20052, USA
- Damien M O'Halloran
  - Department of Biological Sciences, The George Washington University, Science and Engineering Hall 6000, 800 22nd St. N.W., Washington, DC, 20052, USA
  - Institute for Neuroscience, The George Washington University, 636A Ross Hall, 2300 I St. N.W., Washington, DC, 20052, USA
13
Affiliation(s)
- Rahul Kaushik
  - Kusuma School of Biological Sciences, Indian Institute of Technology, Delhi, India
  - Supercomputing Facility for Bioinformatics & Computational Biology, Indian Institute of Technology, Delhi, India
- Ankita Singh
  - Supercomputing Facility for Bioinformatics & Computational Biology, Indian Institute of Technology, Delhi, India
  - Department of Bioinformatics, Banasthali Vidyapith, Banasthali, India
- B. Jayaram
  - Kusuma School of Biological Sciences, Indian Institute of Technology, Delhi, India
  - Supercomputing Facility for Bioinformatics & Computational Biology, Indian Institute of Technology, Delhi, India
  - Department of Chemistry, Indian Institute of Technology, Delhi, India
14
Pearson WR, Li W, Lopez R. Query-seeded iterative sequence similarity searching improves selectivity 5-20-fold. Nucleic Acids Res 2017; 45:e46. [PMID: 27923999 PMCID: PMC5605230 DOI: 10.1093/nar/gkw1207] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 09/23/2016] [Accepted: 11/18/2016] [Indexed: 11/13/2022]
Abstract
Iterative similarity search programs, like psiblast, jackhmmer, and psisearch, are much more sensitive than pairwise similarity search methods like blast and ssearch because they build a position specific scoring model (a PSSM or HMM) that captures the pattern of sequence conservation characteristic to a protein family. But models are subject to contamination; once an unrelated sequence has been added to the model, homologs of the unrelated sequence will also produce high scores, and the model can diverge from the original protein family. Examination of alignment errors during psiblast PSSM contamination suggested a simple strategy for dramatically reducing PSSM contamination. psiblast PSSMs are built from the query-based multiple sequence alignment (MSA) implied by the pairwise alignments between the query model (PSSM, HMM) and the subject sequences in the library. When the original query sequence residues are inserted into gapped positions in the aligned subject sequence, the resulting PSSM rarely produces alignment over-extensions or alignments to unrelated sequences. This simple step, which tends to anchor the PSSM to the original query sequence and slightly increase target percent identity, can reduce the frequency of false-positive alignments more than 20-fold compared with psiblast and jackhmmer, with little loss in search sensitivity.
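The query-seeding step described above, in which original query residues are inserted into gapped positions of the aligned subject sequence, can be sketched for a single pairwise alignment. The function name and the toy alignment are hypothetical, and dropping columns where the query itself has a gap is an assumption made here so that the output follows query coordinates, as in a query-based MSA.

```python
def seed_with_query(aligned_query, aligned_subject, gap="-"):
    """Return the subject row of a query-anchored MSA with its gaps filled by query residues.

    Columns where the query has a gap (insertions in the subject) are dropped,
    so the result has one residue per query position.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    row = []
    for q, s in zip(aligned_query, aligned_subject):
        if q == gap:
            continue  # subject insertion: no query column to place it in
        row.append(q if s == gap else s)
    return "".join(row)


if __name__ == "__main__":
    # Hypothetical pairwise alignment between a query and one library sequence.
    print(seed_with_query("MKT-AYIAK", "MR-GAY--K"))
```

Rows built this way would then be counted into the position-specific scoring matrix in place of the gapped subject rows, which is what anchors the model to the original query.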
Affiliation(s)
- William R Pearson
- Dept. of Biochemistry and Molecular Genetics, University of Virginia, School of Medicine, Charlottesville, VA 22908, USA
- Weizhong Li
- European Bioinformatics Institute, EMBL Outstation, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Rodrigo Lopez
- European Bioinformatics Institute, EMBL Outstation, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
|
15
|
Barlowe S, Coan HB, Youker RT. SubVis: an interactive R package for exploring the effects of multiple substitution matrices on pairwise sequence alignment. PeerJ 2017; 5:e3492. [PMID: 28674656 PMCID: PMC5490468 DOI: 10.7717/peerj.3492] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/16/2017] [Accepted: 05/27/2017] [Indexed: 01/13/2023] Open
Abstract
Understanding how proteins mutate is critical to solving a host of biological problems. Mutations occur when an amino acid is substituted for another in a protein sequence. The set of likelihoods for amino acid substitutions is stored in a matrix and input to alignment algorithms. The quality of the resulting alignment is used to assess the similarity of two or more sequences and can vary according to assumptions modeled by the substitution matrix. Substitution strategies with minor parameter variations are often grouped together in families. For example, the BLOSUM and PAM matrix families are commonly used because they provide a standard, predefined way of modeling substitutions. However, researchers often do not know if a given matrix family or any individual matrix within a family is the most suitable. Furthermore, predefined matrix families may inaccurately reflect a particular hypothesis that a researcher wishes to model or otherwise result in unsatisfactory alignments. In these cases, the ability to compare the effects of one or more custom matrices may be needed. This laborious process is often performed manually because the ability to simultaneously load multiple matrices and then compare their effects on alignments is not readily available in current software tools. This paper presents SubVis, an interactive R package for loading and applying multiple substitution matrices to pairwise alignments. Users can simultaneously explore alignments resulting from multiple predefined and custom substitution matrices. SubVis utilizes several of the alignment functions found in R, a common language among protein scientists. Functions are tied together with the Shiny platform which allows the modification of input parameters. Information regarding alignment quality and individual amino acid substitutions is displayed with the JavaScript language which provides interactive visualizations for revealing both high-level and low-level alignment information.
Affiliation(s)
- Scott Barlowe
- Department of Mathematics and Computer Science, Western Carolina University, Cullowhee, NC, United States of America
- Heather B Coan
- Department of Biology, Western Carolina University, Cullowhee, NC, United States of America
- Robert T Youker
- Department of Biology, Western Carolina University, Cullowhee, NC, United States of America
|
16
|
Vences-Guzmán MÁ, Goetting-Minesky MP, Guan Z, Castillo-Ramirez S, Córdoba-Castro LA, López-Lara IM, Geiger O, Sohlenkamp C, Fenno JC. 1,2-Diacylglycerol choline phosphotransferase catalyzes the final step in the unique Treponema denticola phosphatidylcholine biosynthesis pathway. Mol Microbiol 2017; 103:896-912. [PMID: 28009086 DOI: 10.1111/mmi.13596] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Accepted: 12/03/2016] [Indexed: 01/09/2023]
Abstract
Treponema denticola synthesizes phosphatidylcholine through a licCA-dependent CDP-choline pathway identified only in the genus Treponema. However, the mechanism of conversion of CDP-choline to phosphatidylcholine remained unclear. We report here characterization of TDE0021 (herein designated cpt) encoding a 1,2-diacylglycerol choline phosphotransferase homologous to choline phosphotransferases that catalyze the final step of the highly conserved Kennedy pathway for phosphatidylcholine synthesis in eukaryotes. T. denticola Cpt catalyzed in vitro phosphatidylcholine formation from CDP-choline and diacylglycerol, and full activity required divalent manganese. Allelic replacement mutagenesis of cpt in T. denticola resulted in abrogation of phosphatidylcholine synthesis. T. denticola Cpt complemented a Saccharomyces cerevisiae CPT1 mutant, and expression of the entire T. denticola LicCA-Cpt pathway in E. coli resulted in phosphatidylcholine biosynthesis. Our findings show that T. denticola possesses a unique phosphatidylcholine synthesis pathway combining conserved prokaryotic choline kinase and CTP:phosphocholine cytidylyltransferase activities with a 1,2-diacylglycerol choline phosphotransferase that is common in eukaryotes. Other than in a subset of mammalian host-associated Treponema that includes T. pallidum, this pathway is found in neither bacteria nor Archaea. Molecular dating analysis of the Cpt gene family suggests that a horizontal gene transfer event introduced this gene into an ancestral Treponema well after its divergence from other spirochetes.
Affiliation(s)
- Miguel Ángel Vences-Guzmán
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Av. Universidad s/n, Apdo. Postal 565-A, Cuernavaca, Morelos, CP62210, Mexico
- M Paula Goetting-Minesky
- Department of Biologic and Materials Sciences, University of Michigan School of Dentistry, Ann Arbor, Michigan, 48109, USA
- Ziqiang Guan
- Department of Biochemistry, Duke University Medical Center, Durham, North Carolina, 27710, USA
- Santiago Castillo-Ramirez
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Av. Universidad s/n, Apdo. Postal 565-A, Cuernavaca, Morelos, CP62210, Mexico
- Luz América Córdoba-Castro
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Av. Universidad s/n, Apdo. Postal 565-A, Cuernavaca, Morelos, CP62210, Mexico
- Isabel M López-Lara
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Av. Universidad s/n, Apdo. Postal 565-A, Cuernavaca, Morelos, CP62210, Mexico
- Otto Geiger
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Av. Universidad s/n, Apdo. Postal 565-A, Cuernavaca, Morelos, CP62210, Mexico
- Christian Sohlenkamp
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Av. Universidad s/n, Apdo. Postal 565-A, Cuernavaca, Morelos, CP62210, Mexico
- J Christopher Fenno
- Department of Biologic and Materials Sciences, University of Michigan School of Dentistry, Ann Arbor, Michigan, 48109, USA
|
17
|
Zhang J, Misra S, Wang H, Feng WC. muBLASTP: database-indexed protein sequence search on multicore CPUs. BMC Bioinformatics 2016; 17:443. [PMID: 27809763 PMCID: PMC5096327 DOI: 10.1186/s12859-016-1302-4] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 02/09/2016] [Accepted: 10/21/2016] [Indexed: 11/16/2022] Open
Abstract
Background: The Basic Local Alignment Search Tool (BLAST) is a fundamental program in the life sciences that searches databases for sequences that are most similar to a query sequence. Currently, the BLAST algorithm utilizes a query-indexed approach. Although many approaches suggest that sequence search with a database index can achieve much higher throughput (e.g., BLAT, SSAHA, and CAFE), they cannot deliver the same level of sensitivity as the query-indexed BLAST, i.e., NCBI BLAST, or they can only support nucleotide sequence search, e.g., MegaBLAST. Because query indexing and database indexing have different challenges and characteristics, the existing techniques for query-indexed search cannot be applied to database-indexed search. Results: muBLASTP, a novel database-indexed BLAST for protein sequence search, delivers hits identical to those returned by NCBI BLAST. On Intel Haswell multicore CPUs, for a single query, the single-threaded muBLASTP achieves up to a 4.41-fold speedup for alignment stages, and up to a 1.75-fold end-to-end speedup over single-threaded NCBI BLAST. For a batch of queries, the multithreaded muBLASTP achieves up to a 5.7-fold speedup for alignment stages, and up to a 4.56-fold end-to-end speedup over multithreaded NCBI BLAST. Conclusions: With a newly designed index structure for protein databases and associated optimizations in the BLASTP algorithm, we refactored the BLASTP algorithm for modern multicore processors, achieving much higher throughput with an acceptable memory footprint for the database index.
Affiliation(s)
- Jing Zhang
- Department of Computer Science, Virginia Tech, 225 Stanger Street, Blacksburg, 24060, VA, USA
- Sanchit Misra
- Parallel Computing Lab, Intel Corporation, Bengaluru, Karnataka, 560102, India
- Hao Wang
- Department of Computer Science, Virginia Tech, 225 Stanger Street, Blacksburg, 24060, VA, USA
- Wu-Chun Feng
- Department of Computer Science, Virginia Tech, 225 Stanger Street, Blacksburg, 24060, VA, USA
|
18
|
Mapping the Geometric Evolution of Protein Folding Motor. PLoS One 2016; 11:e0163993. [PMID: 27716851 PMCID: PMC5055333 DOI: 10.1371/journal.pone.0163993] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 06/22/2016] [Accepted: 09/19/2016] [Indexed: 11/19/2022] Open
Abstract
A polypeptide chain has an invariant main-chain and a variant side-chain sequence. How the side-chain sequence determines fold in terms of its chemical constitution has been scrutinized extensively and verified periodically. However, a focussed investigation of the directive effect of side-chain geometry may provide important insights supplementing existing algorithms in mapping the geometrical evolution of protein chains and their structural preferences. Geometrically, folding of a protein structure may be envisaged as the evolution of its geometric variables: the ϕ and ψ dihedral angles of the polypeptide main-chain, directed by χ1 and χ2 of the side chain. In this work, the protein molecule is metaphorically modelled as a machine with 4 rotors, ϕ, ψ, χ1 and χ2, with its evolution to the functional fold directed by combinations of its rotor directions. We observe that differential rotor motions lead to different secondary structure formations and that the combinatorial pattern is unique and consistent for a particular secondary structure type. Further, we found that the combination of rotor geometries of each amino acid is unique, which partly explains how different amino acid sequence combinations have unique structural evolution and functional adaptation. Quantification of these amino acid rotor preferences resulted in the generation of 3 substitution matrices, which were later plugged into the BLAST tool to evaluate their efficiency in aligning sequences. We have employed BLOSUM62 and PAM30 as standards for primary evaluation. Generation of substitution matrices is a logical extension of the conceptual framework we attempted to build during the development of this work. Optimization of the matrices following the conventional routines and possible application to biologically relevant data sets are beyond the scope of this manuscript, though they are part of the larger project design.
|
19
|
Margelevičius M. Bayesian nonparametrics in protein remote homology search. Bioinformatics 2016; 32:2744-52. [DOI: 10.1093/bioinformatics/btw213] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 11/11/2015] [Accepted: 04/14/2016] [Indexed: 11/14/2022] Open
|
20
|
Chrysostomou C, Seker H. Novel protein weight matrix generated from amino acid indices. Annu Int Conf IEEE Eng Med Biol Soc 2016; 2015:8181-4. [PMID: 26738193 DOI: 10.1109/embc.2015.7320293] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
Abstract
In recent years, numerous protein weight matrices have been developed that include physical characteristics of proteins, such as local sequence-structure information, alpha-helix information, secondary structure information and solvent accessibility states. These protein weight matrices have been shown to generally improve protein sequence alignments over classical protein weight matrices, like Point Accepted Mutation (PAM), Blocks of Amino Acid Substitution (BLOSUM), and GONNET matrices, for which important limitations have been observed in recent works. In this paper, a novel protein weight matrix is constructed and presented. This protein weight matrix is not based on the mutation rate, like PAM or BLOSUM matrices, but on the physicochemical properties of each amino acid. In the literature, over 500 amino acid indices exist, each one representing a unique biological protein feature. For this study, 25 amino acid indices were selected. These amino acid indices represent general and widely accepted features of the amino acids. The proposed protein weight matrix offers the following advantages over the classical protein weight matrices. It is not biased to specific groups of protein sequences, as the values are calculated from the amino acid indices and not from the protein sequences. Additionally, the same matrix can be used regardless of the homology of the protein sequences to be aligned or the mutation rate presented. A correlation to the physical characterisations of the amino acids from which the protein weight matrix is derived can be achieved. Different similarity matrices can be generated when different physical characterisations of amino acids are considered.
|
21
|
Wong WC, Yap CK, Eisenhaber B, Eisenhaber F. dissectHMMER: a HMMER-based score dissection framework that statistically evaluates fold-critical sequence segments for domain fold similarity. Biol Direct 2015; 10:39. [PMID: 26228544 PMCID: PMC4521371 DOI: 10.1186/s13062-015-0068-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 03/26/2015] [Accepted: 07/20/2015] [Indexed: 11/10/2022] Open
Abstract
Background: Annotation transfer for function and structure within the sequence homology concept essentially requires protein sequence similarity for the secondary structural blocks forming the fold of a protein. A simplistic similarity approach in the case of non-globular segments (coiled coils, low complexity regions, transmembrane regions, long loops, etc.) is not justified and is a pertinent source of mistaken homologies. The latter is either due to positional sequence conservation as a result of a very simple, physically induced pattern or integral sequence properties that are critical for function. Furthermore, because the number of well-studied proteins continues to grow at a slow rate, a search methodology is needed that dives deeper into the sequence similarity space to connect unknown sequences to well-studied, albeit more distant, ones for biological function postulations. Results: Based on our previous work of dissecting the hidden Markov model (HMMER) based similarity score into fold-critical and non-globular contributions to improve homology inference, we propose a framework, dissectHMMER, that identifies more fold-related domain hits from standard HMMER searches. Subsequent statistical stratification of the fold-related hits into cohorts of functionally-related domains allows for the function postulation of the query sequence. Briefly, the technical problems as to how to recognize non-globular parts in the domain model, resolve contradictory HMMER2/HMMER3 results and evaluate fold-related domain hits for homology are addressed in this work. The framework is benchmarked against a set of SCOP-to-Pfam domain models. Despite being a sequence-to-profile method, dissectHMMER performs favorably against a profile-to-profile based method, HHsuite/HHsearch. Examples of function annotation using dissectHMMER, including the function discovery of an uncharacterized membrane protein Q9K8K1_BACHD (WP_010899149.1) as a lactose/H+ symporter, are presented. Finally, the dissectHMMER webserver is made publicly available at http://dissecthmmer.bii.a-star.edu.sg. Conclusions: The proposed framework, dissectHMMER, is faithful to the original inception of the sequence homology concept while improving upon the existing HMMER search tool through the rescue of statistically evaluated false-negative yet fold-related domain hits to the query sequence. Overall, this translates into an opportunity for any novel protein sequence to be functionally characterized. Reviewers: This article was reviewed by Masanori Arita, Shamil Sunyaev and L. Aravind.
Affiliation(s)
- Wing-Cheong Wong
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore, 138671, Singapore
- Choon-Kong Yap
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore, 138671, Singapore
- Birgit Eisenhaber
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore, 138671, Singapore
- Frank Eisenhaber
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore, 138671, Singapore; Department of Biological Sciences (DBS), National University of Singapore (NUS), 8 Medical Drive, Singapore, 117597, Singapore; School of Computer Engineering (SCE), Nanyang Technological University (NTU), 50 Nanyang Drive, Singapore, 637553, Singapore
|
22
|
Rios S, Fernandez MF, Caltabiano G, Campillo M, Pardo L, Gonzalez A. GPCRtm: An amino acid substitution matrix for the transmembrane region of class A G Protein-Coupled Receptors. BMC Bioinformatics 2015; 16:206. [PMID: 26134144 PMCID: PMC4489126 DOI: 10.1186/s12859-015-0639-4] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 03/20/2015] [Accepted: 06/06/2015] [Indexed: 01/08/2023] Open
Abstract
Background: Protein sequence alignments and database search methods use standard scoring matrices calculated from amino acid substitution frequencies in general sets of proteins. These general-purpose matrices are not optimal for accurately aligning sequences with marked compositional biases, such as the hydrophobic transmembrane regions found in membrane proteins. In this work, an amino acid substitution matrix (GPCRtm) is calculated for the membrane spanning segments of the G protein-coupled receptor (GPCR) rhodopsin family, one of the largest transmembrane protein families in humans, with great importance in health and disease. Results: The GPCRtm matrix reveals the amino acid compositional bias distinctive of the GPCR rhodopsin family and differs from other standard substitution matrices. These membrane receptors, as expected, are characterized by a high content of hydrophobic residues relative to globular proteins. On the other hand, the presence of polar and charged residues is higher than in average membrane proteins, and these residues display high frequencies of replacement among themselves. Conclusions: Analysis of amino acid frequencies and values obtained from the GPCRtm matrix reveals patterns of residue replacements different from other standard substitution matrices. GPCRs prioritize the reactivity properties of the amino acids over their bulkiness in the transmembrane regions. A distinctive feature is that charged and polar residues seem to evolve at different rates than other amino acids. This observation is related to the role of the transmembrane bundle in the binding of ligands, which in many cases involves electrostatic and hydrogen bond interactions. This new matrix can be useful in database search and for the construction of more accurate sequence alignments of GPCRs.
Affiliation(s)
- Santiago Rios
- Laboratori de Medicina Computacional, Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona, 08193, Bellaterra, Barcelona, Spain
- Marta F Fernandez
- Laboratori de Medicina Computacional, Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona, 08193, Bellaterra, Barcelona, Spain
- Gianluigi Caltabiano
- Laboratori de Medicina Computacional, Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona, 08193, Bellaterra, Barcelona, Spain
- Mercedes Campillo
- Laboratori de Medicina Computacional, Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona, 08193, Bellaterra, Barcelona, Spain
- Leonardo Pardo
- Laboratori de Medicina Computacional, Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona, 08193, Bellaterra, Barcelona, Spain
- Angel Gonzalez
- Laboratori de Medicina Computacional, Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona, 08193, Bellaterra, Barcelona, Spain
|
23
|
Frith MC, Kawaguchi R. Split-alignment of genomes finds orthologies more accurately. Genome Biol 2015; 16:106. [PMID: 25994148 PMCID: PMC4464727 DOI: 10.1186/s13059-015-0670-9] [Citation(s) in RCA: 65] [Impact Index Per Article: 7.2] [Received: 04/06/2015] [Accepted: 05/08/2015] [Indexed: 04/29/2023] Open
Abstract
We present a new pair-wise genome alignment method, based on a simple concept of finding an optimal set of local alignments. It gains accuracy by not masking repeats, and by using a statistical model to quantify the (un)ambiguity of each alignment part. Compared to previous animal genome alignments, it aligns thousands of locations differently and with much higher similarity, strongly suggesting that the previous alignments are non-orthologous. The previous methods suffer from an overly-strong assumption of long un-rearranged blocks. The new alignments should help find interesting and unusual features, such as fast-evolving elements and micro-rearrangements, which are confounded by alignment errors.
Affiliation(s)
- Martin C Frith
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan
- Risa Kawaguchi
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan; Department of Computational Biology, Faculty of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba, 277-8561, Japan
|
24
|
Sonnhammer ELL, Östlund G. InParanoid 8: orthology analysis between 273 proteomes, mostly eukaryotic. Nucleic Acids Res 2014; 43:D234-9. [PMID: 25429972 PMCID: PMC4383983 DOI: 10.1093/nar/gku1203] [Citation(s) in RCA: 345] [Impact Index Per Article: 34.5] [Indexed: 11/20/2022] Open
Abstract
The InParanoid database (http://InParanoid.sbc.su.se) provides a user interface to orthologs inferred by the InParanoid algorithm. As there are now international efforts to curate and standardize complete proteomes, we have switched to using these resources rather than gathering and curating the proteomes ourselves. InParanoid release 8 is based on the 66 reference proteomes that the ‘Quest for Orthologs’ community has agreed on using, plus 207 additional proteomes from the UniProt complete proteomes—in total 273 species. These represent 246 eukaryotes, 20 bacteria and seven archaea. Compared to the previous release, this increases the number of species by 173% and the number of pairwise species comparisons by 650%. In turn, the number of ortholog groups has increased by 423%. We present the contents and usages of InParanoid 8, and a detailed analysis of how the proteome content has changed since the previous release.
Affiliation(s)
- Erik L L Sonnhammer
- Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden
- Gabriel Östlund
- Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden
|
25
|
Frith MC, Noé L. Improved search heuristics find 20,000 new alignments between human and mouse genomes. Nucleic Acids Res 2014; 42:e59. [PMID: 24493737 PMCID: PMC3985675 DOI: 10.1093/nar/gku104] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Indexed: 11/17/2022] Open
Abstract
Sequence similarity search is a fundamental way of analyzing nucleotide sequences. Despite decades of research, this is not a solved problem because there exist many similarities that are not found by current methods. Search methods are typically based on a seed-and-extend approach, which has many variants (e.g. spaced seeds, transition seeds), and it remains unclear how to optimize this approach. This study designs and tests seeding methods for inter-mammal and inter-insect genome comparison. By considering substitution patterns of real genomes, we design sets of multiple complementary transition seeds, which have better performance (sensitivity per run time) than previous seeding strategies. Often the best seed patterns have more transition positions than those used previously. We also point out that recent computer memory sizes (e.g. 60 GB) make it feasible to use multiple (e.g. eight) seeds for whole mammal genomes. Interestingly, the most sensitive settings achieve diminishing returns for human–dog and melanogaster–pseudoobscura comparisons, but not for human–mouse, which suggests that we still miss many human–mouse alignments. Our optimized heuristics find ∼20 000 new human–mouse alignments that are missing from the standard UCSC alignments. We tabulate seed patterns and parameters that work well so they can be used in future research.
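The idea of a transition seed mentioned in this abstract can be sketched in a few lines of Python. The symbol convention used here (`1` requires an exact match, `0` ignores the position, `T` accepts a match or a transition A↔G / C↔T) and the pattern `1T10T1` are assumptions chosen for illustration; they are not the seed patterns tabulated by the authors.

```python
# Purine-purine and pyrimidine-pyrimidine substitutions count as transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def seed_hit(a: str, b: str, pattern: str) -> bool:
    """Return True if two equal-length DNA words match under a seed pattern.

    Hypothetical helper: '1' = must match, '0' = don't care,
    'T' = match or transition allowed.
    """
    if not (len(a) == len(b) == len(pattern)):
        raise ValueError("words and pattern must have equal length")
    for x, y, p in zip(a, b, pattern):
        if p == "1":
            if x != y:
                return False
        elif p == "T":
            if x != y and (x, y) not in TRANSITIONS:
                return False
        elif p != "0":
            raise ValueError(f"unknown seed symbol: {p!r}")
    return True


# C->T and A->G are transitions, and the T/A mismatch falls on a '0' position.
print(seed_hit("ACGTAC", "ATGAGC", "1T10T1"))
# C->A is a transversion at a 'T' position, so this word pair is not a hit.
print(seed_hit("ACGTAC", "AAGTAC", "1T10T1"))
```

A seed-and-extend search would use such hits only as starting points and then extend each one into a full alignment; using several complementary patterns raises the chance that a true similarity contains at least one hit.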
Affiliation(s)
- Martin C Frith
- Computational Biology Research Center, AIST, 2-4-7 Aomi, Koto-ku, Tokyo 135-0064, Japan and LIFL (UMR 8022)/INRIA Lille Nord-Europe, bat M3 Ext., Université Lille1 59655 Villeneuve d'Ascq Cedex, France
|
26
|
Arnold R, Goldenberg F, Mewes HW, Rattei T. SIMAP--the database of all-against-all protein sequence similarities and annotations with new interfaces and increased coverage. Nucleic Acids Res 2013; 42:D279-84. [PMID: 24165881 PMCID: PMC3965014 DOI: 10.1093/nar/gkt970] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Indexed: 12/21/2022] Open
Abstract
The Similarity Matrix of Proteins (SIMAP, http://mips.gsf.de/simap/) database has been designed to massively accelerate computationally expensive protein sequence analysis tasks in bioinformatics. It provides pre-calculated sequence similarities interconnecting the entire known protein sequence universe, complemented by pre-calculated protein features and domains, similarity clusters and functional annotations. SIMAP covers all major public protein databases as well as many consistently re-annotated metagenomes from different repositories. As of September 2013, SIMAP contains >163 million proteins corresponding to ∼70 million non-redundant sequences. SIMAP uses the sensitive FASTA search heuristics, the Smith-Waterman alignment algorithm, the InterPro database of protein domain models and the BLAST2GO functional annotation algorithm. SIMAP assists biologists by facilitating the interactive exploration of the protein sequence universe. Web-Service and DAS interfaces allow connecting SIMAP with any other bioinformatic tool and resource. All-against-all protein sequence similarity matrices of project-specific protein collections are generated on request. Recent improvements allow SIMAP to cover the rapidly growing sequenced protein sequence universe. New Web-Service interfaces enhance the connectivity of SIMAP. Novel tools for interactive extraction of protein similarity networks have been added. Open access to SIMAP is provided through the web portal; the portal also contains instructions and links for software access and flat file downloads.
Affiliation(s)
- Roland Arnold
- Terrence Donnelly Centre for Cellular and Biomolecular Research, Kim Lab, University of Toronto, Toronto, ON M5S 3E1, Canada, CUBE-Division of Computational Systems Biology, Department of Microbiology and Ecosystem Science, University of Vienna, 1090 Vienna, Austria and Institute of Bioinformatics and Systems Biology, Helmholtz Zentrum München, Technische Universität München, Wissenschaftszentrum Weihenstephan, 85764 Neuherberg, Germany
|
27
|
Abstract
Background: The adaptive immune response is antigen-specific and triggered by pathogen recognition through T cells. Although the interactions and mechanisms of TCR-peptide-MHC (TCR-pMHC) have been studied for over three decades, the biological basis for these processes remains controversial. Given the increasing number of high-throughput binding epitopes and available TCR-pMHC complex structures, fast genome-wide structural modelling of TCR-pMHC interactions is an emerging task for understanding immune interactions and developing peptide vaccines. Results: We first constructed the PPI matrices and iMatrix, using 621 non-redundant PPI interfaces and 398 non-redundant antigen-antibody interfaces, respectively, for modelling the MHC-peptide and TCR-peptide interfaces, respectively. The iMatrix consists of four knowledge-based scoring matrices to evaluate the hydrogen bonds and van der Waals forces between sidechains or backbones, respectively. The predicted energies of iMatrix are highly correlated (Pearson's correlation coefficient is 0.6) with 70 experimental free energies on antigen-antibody interfaces. To further investigate iMatrix and the PPI matrices, we inferred 701,897 statistically significant potential peptide antigens from 389 pathogen genomes and modelled the TCR-pMHC interactions using available TCR-pMHC complex structures. These identified peptide antigens keep hydrogen-bond energies and consensus interactions, and our TCR-pMHC models can provide detailed interacting models and crucial binding regions. Conclusions: Experimental results demonstrate that our method can achieve high precision for predicting binding affinity and potential peptide antigens. We believe that iMatrix and our template-based method can be useful for understanding the binding mechanisms of TCR-pMHC complexes and for peptide vaccine design.
28
Sateriale A, Bessoff K, Sarkar IN, Huston CD. Drug repurposing: mining protozoan proteomes for targets of known bioactive compounds. J Am Med Inform Assoc 2013; 21:238-44. [PMID: 23757409 DOI: 10.1136/amiajnl-2013-001700] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Indexed: 11/03/2022]
Abstract
OBJECTIVE To identify potential opportunities for drug repurposing by developing an automated approach to pre-screen the predicted proteomes of any organism against databases of known drug targets using only freely available resources. MATERIALS AND METHODS We employed a combination of Ruby scripts that leverage data from the DrugBank and ChEMBL databases, MySQL, and BLAST to predict potential drugs and their targets from 13 published genomes. Results from a previous cell-based screen to identify inhibitors of Cryptosporidium parvum growth were used to validate our in-silico prediction method. RESULTS In-vitro validation of these results, using a cell-based C parvum growth assay, showed that the predicted inhibitors were significantly more likely than expected by chance to have confirmed activity, with 8.9-15.6% of predicted inhibitors confirmed depending on the drug target database used. This method was then used to predict inhibitors for 13 disease-causing protozoan parasites: C parvum, Entamoeba histolytica, Giardia intestinalis, Leishmania braziliensis, Leishmania donovani, Leishmania major, Naegleria gruberi (in proxy of Naegleria fowleri), Plasmodium falciparum, Plasmodium vivax, Toxoplasma gondii, Trichomonas vaginalis, Trypanosoma brucei and Trypanosoma cruzi. CONCLUSIONS Although proteome-wide screens for drug targets have disadvantages, in-silico methods can be developed that are fast, broad, inexpensive, and effective. In-vitro validation of our results for C parvum indicates that the method presented here can be used to construct a library for more directed small molecule screening, or pipelined into structural modeling and docking programs to facilitate target-based drug development.
Affiliation(s)
- Adam Sateriale
- Cell, Molecular, and Biomedical Sciences Graduate Program, University of Vermont College of Medicine, Burlington, Vermont, USA
29
Jeong CS, Kim D. Reliable and robust detection of coevolving protein residues. Protein Eng Des Sel 2012; 25:705-13. [DOI: 10.1093/protein/gzs081] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Indexed: 11/14/2022]
30
Joseph AP, Valadié H, Srinivasan N, de Brevern AG. Local structural differences in homologous proteins: specificities in different SCOP classes. PLoS One 2012; 7:e38805. [PMID: 22745680 PMCID: PMC3382195 DOI: 10.1371/journal.pone.0038805] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 07/26/2011] [Accepted: 05/10/2012] [Indexed: 11/19/2022]
Abstract
The constant increase in the number of solved protein structures is of great help in understanding the basic principles behind protein folding and evolution. 3-D structural knowledge is valuable in designing and developing methods for comparison, modelling and prediction of protein structures. These approaches for structure analysis are directly relevant to studying protein function and to drug design. The backbone of a protein structure favours certain local conformations, which include α-helices, β-strands and turns. Libraries of a limited number of local conformations (Structural Alphabets) were developed in the past to obtain a useful categorization of backbone conformation. Protein Block (PB) is one such Structural Alphabet that gave a reasonable structure approximation of 0.42 Å. In this study, we use the PB description of local structures to analyse conformations that are preferred sites for structural variations and insertions among groups of related folds. This knowledge can be utilized in improving tools for structure comparison that work by analysing local structure similarities. Conformational differences between homologous proteins are known to occur often in the regions comprising turns and loops. Interestingly, these differences are found to have specific preferences depending upon the structural classes of proteins. Such class-specific preferences are mainly seen in the all-β class, with changes involving short helical conformations and hairpin turns. A test carried out on a benchmark dataset also indicates that the use of knowledge on the class-specific variations can improve the performance of a PB-based structure comparison approach. The preference for indel sites also seems to be confined to a few backbone conformations involving β-turns and helix C-caps. These are mainly associated with short loops joining the regular secondary structures that mediate a reversal in the chain direction.
Rare β-turns of type I’ and II’ are also identified as preferred sites for insertions.
Affiliation(s)
- Agnel Praveen Joseph
- INSERM, UMR-S 665, Dynamique des Structures et Interactions des Macromolécules Biologiques (DSIMB), Paris, France
- Univ Paris Diderot, Sorbonne Paris Cité, UMR 665, Paris, France
- Institut National de la Transfusion Sanguine (INTS), Paris, France
- Hélène Valadié
- INSERM UMR-S 726, DSIMB, Université Paris Diderot - Paris 7, Paris, France
- Alexandre G. de Brevern
- INSERM, UMR-S 665, Dynamique des Structures et Interactions des Macromolécules Biologiques (DSIMB), Paris, France
- Univ Paris Diderot, Sorbonne Paris Cité, UMR 665, Paris, France
- Institut National de la Transfusion Sanguine (INTS), Paris, France
31
Zhang Y, Misra S, Agrawal A, Patwary MMA, Liao WK, Qin Z, Choudhary A. Accelerating pairwise statistical significance estimation for local alignment by harvesting GPU's power. BMC Bioinformatics 2012; 13 Suppl 5:S3. [PMID: 22537007 PMCID: PMC3318904 DOI: 10.1186/1471-2105-13-s5-s3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/27/2022]
Abstract
Background Pairwise statistical significance has been recognized as able to accurately identify related sequences, which is a cornerstone procedure in numerous bioinformatics applications. However, it is both computationally and data intensive, which poses a big challenge in terms of performance and scalability. Results We present a GPU implementation to accelerate pairwise statistical significance estimation of local sequence alignment using standard substitution matrices. By carefully studying the algorithm's data access characteristics, we developed a tile-based scheme that produces contiguous data access in GPU global memory and sustains a large number of threads to achieve high GPU occupancy. We further extend the parallelization technique to estimate pairwise statistical significance using position-specific substitution matrices, which has earlier demonstrated significantly better sequence comparison accuracy than using standard substitution matrices. The implementation is also extended to take advantage of dual GPUs. We observe end-to-end speedups of nearly 250× (370×) using a single Tesla C2050 GPU (dual Tesla C2050 GPUs) over the CPU implementation running on an Intel Core i7 920 processor. Conclusions Harvesting the high performance of modern GPUs is a promising approach to accelerate pairwise statistical significance estimation for local sequence alignment.
Affiliation(s)
- Yuhong Zhang
- School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China.
32
Lemaitre C, Barré A, Citti C, Tardy F, Thiaucourt F, Sirand-Pugnet P, Thébault P. A novel substitution matrix fitted to the compositional bias in Mollicutes improves the prediction of homologous relationships. BMC Bioinformatics 2011; 12:457. [PMID: 22115330 PMCID: PMC3248887 DOI: 10.1186/1471-2105-12-457] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 07/19/2011] [Accepted: 11/24/2011] [Indexed: 11/10/2022]
Abstract
Background Substitution matrices are key parameters for the alignment of two protein sequences, and consequently for most comparative genomics studies. The composition of biological sequences can vary substantially between species and groups of species, and classical matrices such as those in the BLOSUM series fail to accurately estimate alignment scores and statistical significance for sequences sharing marked compositional biases. Results We present a general and simple methodology to build matrices that are specifically fitted to the compositional bias of proteins. Our approach is inspired by the one used to build the BLOSUM matrices and is based on learning substitution and amino acid frequencies from real sequences with the corresponding compositional bias. We applied it to the large-scale comparison of AT-rich Mollicute genomes. The new matrix, MOLLI60, was used to predict pairwise orthology relationships, as well as homolog families, among 24 Mollicute genomes. We show that this new matrix enables better discrimination between true and false orthologs and improves the clustering of homologous proteins, compared with the classical matrix BLOSUM62. Conclusions We show in this paper that well-fitted matrices can improve the predictions of orthologous and homologous relationships among proteins with a similar compositional bias. With the ever-increasing number of sequenced genomes, our approach could prove valuable in numerous comparative studies focusing on atypical genomes.
Affiliation(s)
- Claire Lemaitre
- Université de Bordeaux, Centre de Bioinformatique et Génomique Fonctionnelle Bordeaux, F-33000 Bordeaux, France.
33
Wong WC, Maurer-Stroh S, Eisenhaber F. The Janus-faced E-values of HMMER2: extreme value distribution or logistic function? J Bioinform Comput Biol 2011. [DOI: 10.1142/s0219720011005264] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]
Abstract
E-value guided extrapolation of protein domain annotation from libraries such as Pfam with the HMMER suite is indispensable for hypothesizing about the function of experimentally uncharacterized protein sequences. Since the recent release of HMMER3 does not supersede all functions of HMMER2, the latter will remain relevant for ongoing research as well as for the evaluation of annotations that reside in databases and in the literature. In HMMER2, the E-value is computed from the score via a logistic function or via a domain model-specific extreme value distribution (EVD); the lower of the two is returned as the E-value for the domain hit in the query sequence. We find that, for thousands of domain models, this treatment results in switching from the EVD to the statistical model with the logistic function when scores grow (for Pfam release 23, 99% in the global mode and 75% in the fragment mode). If the score corresponding to the breakpoint results in an E-value above a user-defined threshold (e.g. 0.1), a critical score region exists with conflicting E-values from the logistic function (below the threshold) and from the EVD (above the threshold). Thus, this switch will affect E-value guided annotation decisions in an automated mode. Notably, switching in the fragment mode is of no practical relevance since it occurs only at E-values far below 0.1. Unfortunately, a critical score region does exist for 185 domain models in the hmmpfam and 1,748 domain models in the hmmsearch global-search mode. For 145 out of the respective 185 models, the critical score region is indeed populated by actual sequences. In total, 24.4% of their hits have a logistic function-derived E-value < 0.1 when the EVD provides an E-value > 0.1. We provide examples of false annotations and critically discuss the appropriateness of a logistic function as an alternative to the EVD.
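The "lower of the two" rule and the resulting critical score region can be illustrated with a minimal Python sketch. It assumes a logistic P-value of the form 1/(1 + 2^score) for a bit score and a Gumbel-type EVD with location `mu` and scale `lam`; these functional forms, the function names, and all parameter values are illustrative assumptions, not values taken from the paper or from any Pfam model.

```python
import math

def logistic_pvalue(score):
    """Assumed logistic P-value for a bit score: 1 / (1 + 2^score)."""
    return 1.0 / (1.0 + 2.0 ** score)

def evd_pvalue(score, mu, lam):
    """Upper-tail probability of a Gumbel (extreme value) distribution.

    Uses expm1 so that very small tail probabilities keep their precision.
    """
    return -math.expm1(-math.exp(-lam * (score - mu)))

def reported_evalue(score, mu, lam, n_sequences):
    """Return (E-value, model name), keeping the lower of the two estimates."""
    e_log = n_sequences * logistic_pvalue(score)
    e_evd = n_sequences * evd_pvalue(score, mu, lam)
    return (e_evd, "EVD") if e_evd < e_log else (e_log, "logistic")

def in_critical_region(score, mu, lam, n_sequences, threshold=0.1):
    """True when the logistic E-value passes the threshold but the EVD one does not."""
    e_log = n_sequences * logistic_pvalue(score)
    e_evd = n_sequences * evd_pvalue(score, mu, lam)
    return e_log < threshold < e_evd

# Hypothetical EVD parameters and database size, chosen only for illustration.
MU, LAM, N = -10.0, 0.3, 1000

for s in (0, 10, 15, 20, 25):
    e, model = reported_evalue(s, MU, LAM, N)
    flag = in_critical_region(s, MU, LAM, N)
    print(f"score={s:>3}  E={e:.3g}  model={model:<8}  critical={flag}")
```

With these made-up parameters the reported model switches from the EVD to the logistic function as the score grows, and a band of intermediate scores falls into the conflicting region described in the abstract.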
Affiliation(s)
- Wing-Cheong Wong
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore
- Sebastian Maurer-Stroh
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore
- School of Biological Sciences (SBS), Nanyang Technological University (NTU), 60 Nanyang Drive, Singapore 63755, Singapore
- Frank Eisenhaber
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore
- Department of Biological Sciences (DBS), National University of Singapore (NUS), 8 Medical Drive, Singapore 117597, Singapore
- School of Computer Engineering (SCE), Nanyang Technological University (NTU), 50 Nanyang Drive, Singapore 637553, Singapore
34
Hamada M, Wijaya E, Frith MC, Asai K. Probabilistic alignments with quality scores: an application to short-read mapping toward accurate SNP/indel detection. Bioinformatics 2011; 27:3085-92. [PMID: 21976422 DOI: 10.1093/bioinformatics/btr537] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 01/20/2023]
Abstract
MOTIVATION Recent studies have revealed the importance of considering quality scores of reads generated by next-generation sequence (NGS) platforms in various downstream analyses. It is also known that probabilistic alignments based on marginal probabilities (e.g. aligned-column and/or gap probabilities) provide more accurate alignment than conventional maximum score-based alignment. There exists, however, no study about probabilistic alignment that considers quality scores explicitly, although the method is expected to be useful in SNP/indel callers and bisulfite mapping, because accurate estimation of aligned columns or gaps is important in those analyses. RESULTS In this study, we propose methods of probabilistic alignment that consider quality scores of (one of) the sequences as well as a usual score matrix. The method is based on posterior decoding techniques in which various marginal probabilities are computed from a probabilistic model of alignments with quality scores, and can arbitrarily trade-off sensitivity and positive predictive value (PPV) of prediction (aligned columns and gaps). The method is directly applicable to read mapping (alignment) toward accurate detection of SNPs and indels. Several computational experiments indicated that probabilistic alignments can estimate aligned columns and gaps accurately, compared with other mapping algorithms e.g. SHRiMP2, Stampy, BWA and Novoalign. The study also suggested that our approach yields favorable precision for SNP/indel calling.
Affiliation(s)
- Michiaki Hamada
- Graduate School of Frontier Sciences, University of Tokyo, Kashiwa 277-8562, Japan.
35
Ye X, Yu YK, Altschul SF. Compositional adjustment of Dirichlet mixture priors. J Comput Biol 2011; 17:1607-20. [PMID: 21128852 DOI: 10.1089/cmb.2010.0117] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/12/2022]
Abstract
Dirichlet mixture priors provide a Bayesian formalism for scoring alignments of protein profiles to individual sequences, which can be generalized to constructing scores for multiple-alignment columns. A Dirichlet mixture is a probability distribution over multinomial space, each of whose components can be thought of as modeling a type of protein position. Applied to the simplest case of pairwise sequence alignment, a Dirichlet mixture is equivalent to an implied symmetric substitution matrix. For alphabets of even size L, Dirichlet mixtures with L/2 components and symmetric substitution matrices have an identical number of free parameters. Although this suggests the possibility of a one-to-one mapping between the two formalisms, we show that there are some symmetric matrices no Dirichlet mixture can imply, and others implied by many distinct Dirichlet mixtures. Dirichlet mixtures are derived empirically from curated sets of multiple alignments. They imply "background" amino acid frequencies characteristic of these sets, and should thus be non-optimal for comparing proteins with non-standard composition. Given a mixture Θ, we seek an adjusted Θ' that implies the desired composition, but that minimizes an appropriate relative-entropy-based distance function. To render the problem tractable, we fix the mixture parameter as well as the sum of the Dirichlet parameters for each component, allowing only its center of mass to vary. This linearizes the constraints on the remaining parameters. An approach to finding Θ' may be based on small consecutive parameter adjustments. The relative entropy of two Dirichlet distributions separated by a small change in their parameter values implies a quadratic cost function for such changes. For a small change in implied background frequencies, this function can be minimized using the Lagrange-Newton method. 
We have implemented this method, and can compositionally adjust to good precision a 20-component Dirichlet mixture prior for proteins in under half a second on a standard workstation.
Affiliation(s)
- Xugang Ye
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
36
Wolfsheimer S, Herms I, Rahmann S, Hartmann AK. Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling. BMC Bioinformatics 2011; 12:47. [PMID: 21291566 PMCID: PMC3042914 DOI: 10.1186/1471-2105-12-47] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 02/26/2010] [Accepted: 02/03/2011] [Indexed: 12/03/2022]
Abstract
Background Molecular database search tools need statistical models to assess the significance of the resulting hits. In the classical approach, one asks how probable it is that a certain score is observed by pure chance. Asymptotic theories for such questions are available for two random i.i.d. sequences. Some effort has been made to include the effects of finite sequence lengths and to account for specific compositions of the sequences. In many applications, such as a large-scale database homology search for transmembrane proteins, these models are not the most appropriate ones. Search sensitivity and specificity benefit from position-dependent scoring schemes or the use of Hidden Markov Models. Additionally, one may wish to go beyond the assumption that the sequences are i.i.d. Despite their practical importance, the statistical properties of these settings have not yet been well investigated. Results In this paper, we discuss an efficient and general method to compute the score distribution to any desired accuracy. The general approach may be applied to different sequence models and various similarity measures that satisfy a few weak assumptions. We have access to the low-probability region ("tail") of the distribution, where scores are larger than expected by pure chance and therefore relevant for practical applications. Our method uses recent ideas from rare-event simulations, combining Markov chain Monte Carlo simulations with importance sampling and generalized ensembles. We present results for the score statistics of fixed and random queries against random sequences. In a second step, we extend the approach to a model of transmembrane proteins, which can hardly be described as i.i.d. sequences. For this case, we compare the statistical properties of a fixed query model as well as a hidden Markov sequence model in connection with a position-based scoring scheme against the classical approach.
Conclusions The results illustrate that the sensitivity and specificity strongly depend on the underlying scoring and sequence model. A specific ROC analysis for the case of transmembrane proteins supports our observation.
Affiliation(s)
- Stefan Wolfsheimer
- Laboratoire MAP5 (UMR CNRS 8145), Université Paris Descartes, Paris, France.
37
Agrawal A, Huang X. Pairwise statistical significance of local sequence alignment using sequence-specific and position-specific substitution matrices. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2011; 8:194-205. [PMID: 21071807 DOI: 10.1109/tcbb.2009.69] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Indexed: 05/30/2023]
Abstract
Pairwise sequence alignment is a central problem in bioinformatics, which forms the basis of various other applications. Two related sequences are expected to have a high alignment score, but relatedness is usually judged by statistical significance rather than by alignment score. Recently, it was shown that pairwise statistical significance gives promising results as an alternative to database statistical significance for getting individual significance estimates of pairwise alignment scores. The improvement was mainly attributed to making the statistical significance estimation process more sequence-specific and database-independent. In this paper, we use sequence-specific and position-specific substitution matrices to derive the estimates of pairwise statistical significance, which is expected to use more sequence-specific information in estimating pairwise statistical significance. Experiments on a benchmark database with sequence-specific substitution matrices at different levels of sequence-specific contribution were conducted, and results confirm that using sequence-specific substitution matrices for estimating pairwise statistical significance is significantly better than using a standard matrix like BLOSUM62, and than database statistical significance estimates reported by popular database search programs like BLAST, PSI-BLAST (without pretrained PSSMs), and SSEARCH on a benchmark database, but with pretrained PSSMs, PSI-BLAST results are significantly better. Further, using position-specific substitution matrices for estimating pairwise statistical significance gives significantly better results even than PSI-BLAST using pretrained PSSMs.
Affiliation(s)
- Ankit Agrawal
- Department of Computer Science, Iowa State University, 226 Atanasoff Hall, Ames, IA 50011-1041, USA.
38
Machado-Lima A, Kashiwabara AY, Durham AM. Decreasing the number of false positives in sequence classification. BMC Genomics 2010; 11 Suppl 5:S10. [PMID: 21210966 PMCID: PMC3045793 DOI: 10.1186/1471-2164-11-s5-s10] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/29/2022]
Abstract
Background A large number of probabilistic models used in sequence analysis assign non-zero probability values to most input sequences. The most common way to decide whether a given probability is sufficient is Bayesian binary classification, where the probability of the model characterizing the sequence family of interest is compared to that of an alternative probability model. A null model can serve as the alternative model. This is the scoring technique used by sequence analysis tools such as HMMER, SAM and INFERNAL. The most prevalent null models are position-independent residue distributions, which include the uniform distribution, the genomic distribution, the family-specific distribution and the target sequence distribution. This paper presents a study to evaluate the impact of the choice of a null model on the final result of classifications. In particular, we are interested in minimizing the number of false predictions in a classification. This is a crucial issue for reducing the costs of biological validation. Results For all the tests, the target null model presented the lowest number of false positives when using random sequences as a test. The study was performed on DNA sequences using GC content as the measure of content bias, but the results should also be valid for protein sequences. To broaden the application of the results, the study was performed using randomly generated sequences. Previous studies were performed on amino acid sequences, using only one probabilistic model (HMM) and a specific benchmark, and lack more general conclusions about the performance of null models. Finally, a benchmark test with P. falciparum confirmed these results. Conclusions Of the evaluated models, the best suited for classification are the uniform model and the target model. However, the use of the uniform model presents a GC bias that can cause more false positives for candidate sequences with extreme compositional bias, a characteristic not described in previous studies.
In these cases the target model is more dependable for biological validation due to its higher specificity.
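The effect of the null model on a log-odds classification score can be sketched in a few lines of Python. This is a deliberately simplified illustration: the family model here is a position-independent residue distribution rather than an HMM or covariance model, and the function names, the pseudocount, the example distribution `gc_rich_family`, and the candidate sequence are all hypothetical choices made for this sketch, not material from the paper.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def log_odds(seq, model, null):
    """Log2-odds of seq under a position-independent model versus a null model."""
    return sum(math.log2(model[c] / null[c]) for c in seq)

def uniform_null():
    """Null model in which every residue is equally likely."""
    return {c: 1.0 / len(ALPHABET) for c in ALPHABET}

def target_null(seq, pseudocount=1.0):
    """Null model estimated from the composition of the target sequence itself."""
    counts = Counter(seq)
    total = len(seq) + pseudocount * len(ALPHABET)
    return {c: (counts[c] + pseudocount) / total for c in ALPHABET}

# Hypothetical GC-rich family model and an unrelated GC-rich candidate sequence.
gc_rich_family = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}
candidate = "GCGCGGCCGCGC"

score_uniform = log_odds(candidate, gc_rich_family, uniform_null())
score_target = log_odds(candidate, gc_rich_family, target_null(candidate))

print(f"uniform null: {score_uniform:+.2f} bits")
print(f"target null:  {score_target:+.2f} bits")
```

In this toy case the uniform null rewards the candidate merely for being GC-rich, whereas the target null absorbs the compositional bias and the score drops below zero, which is the kind of behaviour the abstract attributes to the two models.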
Affiliation(s)
- Ariane Machado-Lima
- Escola de Artes, Ciências e Humanidades, Universidade de São Paulo, Rua Arlindo Béttio, 1000, 03828-000, São Paulo, SP, Brazil
39
Cao MD, Dix TI, Allison L. A genome alignment algorithm based on compression. BMC Bioinformatics 2010; 11:599. [PMID: 21159205 PMCID: PMC3022628 DOI: 10.1186/1471-2105-11-599] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 02/08/2010] [Accepted: 12/16/2010] [Indexed: 11/26/2022]
Abstract
BACKGROUND Traditional genome alignment methods consider sequence alignment as a variation of the string edit distance problem, and perform alignment by matching characters of the two sequences. They are often computationally expensive and unable to deal with low information regions. Furthermore, they lack a well-principled objective function to measure the performance of sets of parameters. Since genomic sequences carry genetic information, this article proposes that the information content of each nucleotide in a position should be considered in sequence alignment. An information-theoretic approach for pairwise genome local alignment, namely XMAligner, is presented. Instead of comparing sequences at the character level, XMAligner considers a pair of nucleotides from two sequences to be related if their mutual information in context is significant. The information content of nucleotides in sequences is measured by a lossless compression technique. RESULTS Experiments on both simulated data and real data show that XMAligner is superior to conventional methods especially on distantly related sequences and statistically biased data. XMAligner can align sequences of eukaryote genome size with only a modest hardware requirement. Importantly, the method has an objective function which can obviate the need to choose parameter values for high quality alignment. The alignment results from XMAligner can be integrated into a visualisation tool for viewing purpose. CONCLUSIONS The information-theoretic approach for sequence alignment is shown to overcome the mentioned problems of conventional character matching alignment methods. The article shows that, as genomic sequences are meant to carry information, considering the information content of nucleotides is helpful for genomic sequence alignment. AVAILABILITY Downloadable binaries, documentation and data can be found at ftp://ftp.infotech.monash.edu.au/software/DNAcompress-XM/XMAligner/.
Affiliation(s)
- Minh Duc Cao
- Clayton School of Information Technology, Monash University, Clayton 3800, Australia
- Trevor I Dix
- Clayton School of Information Technology, Monash University, Clayton 3800, Australia
- Lloyd Allison
- Clayton School of Information Technology, Monash University, Clayton 3800, Australia
40
Abstract
Biological sequences are often analyzed by detecting homologous regions between them. Homology search is confounded by simple repeats, which give rise to strong similarities that are not homologies. Standard repeat-masking methods fail to eliminate this problem, and they are especially ill-suited to AT-rich DNA such as malaria and slime-mould genomes. We present a new repeat-masking method, tantan, which is motivated by the mechanisms that create simple repeats. This method thoroughly eliminates spurious homology predictions for DNA–DNA, protein–protein and DNA–protein comparisons. Moreover, it enables accurate homology search for non-coding DNA with extreme A + T composition.
Affiliation(s)
- Martin C Frith
- Computational Biology Research Center, Institute for Advanced Industrial Science and Technology, Sequence Analysis Team, 2-4-7 Aomi, Koto-ku, Tokyo 135-0064, Japan.
41
Altschul SF, Wootton JC, Zaslavsky E, Yu YK. The construction and use of log-odds substitution scores for multiple sequence alignment. PLoS Comput Biol 2010; 6:e1000852. [PMID: 20657661 PMCID: PMC2904766 DOI: 10.1371/journal.pcbi.1000852] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.8] [Received: 10/30/2009] [Accepted: 06/03/2010] [Indexed: 01/18/2023]
Abstract
Most pairwise and multiple sequence alignment programs seek alignments with optimal scores. Central to defining such scores is selecting a set of substitution scores for aligned amino acids or nucleotides. For local pairwise alignment, substitution scores are implicitly of log-odds form. We now extend the log-odds formalism to multiple alignments, using Bayesian methods to construct "BILD" ("Bayesian Integral Log-odds") substitution scores from prior distributions describing columns of related letters. This approach has been used previously only to define scores for aligning individual sequences to sequence profiles, but it has much broader applicability. We describe how to calculate BILD scores efficiently, and illustrate their uses in Gibbs sampling optimization procedures, gapped alignment, and the construction of hidden Markov model profiles. BILD scores enable automated selection of optimal motif and domain model widths, and can inform the decision of whether to include a sequence in a multiple alignment, and the selection of insertion and deletion locations. Other applications include the classification of related sequences into subfamilies, and the definition of profile-profile alignment scores. Although a fully realized multiple alignment program must rely upon more than substitution scores, many existing multiple alignment programs can be modified to employ BILD scores. We illustrate how simple BILD score based strategies can enhance the recognition of DNA binding domains, including the Api-AP2 domain in Toxoplasma gondii and Plasmodium falciparum.
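The pairwise log-odds form mentioned in this abstract, s(a, b) = log2(q(a, b) / (p(a) p(b))), can be illustrated with a small Python sketch. It covers only the standard pairwise case, not the BILD scores for multiple alignment columns that the paper introduces. The two-letter alphabet, the target frequencies `q`, and the function names are hypothetical values chosen for illustration.

```python
import math

def background_from_targets(q):
    """Marginal (background) letter frequencies implied by pair target frequencies."""
    p = {}
    for (a, _b), freq in q.items():
        p[a] = p.get(a, 0.0) + freq
    return p

def log_odds_scores(q):
    """Pairwise log-odds scores in bits: log2(q_ab / (p_a * p_b))."""
    p = background_from_targets(q)
    return {(a, b): math.log2(freq / (p[a] * p[b])) for (a, b), freq in q.items()}

# Hypothetical symmetric target frequencies for a two-letter alphabet (sum to 1).
q = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}

p = background_from_targets(q)
s = log_odds_scores(q)

# Expected score per aligned pair among related sequences (the relative entropy)
# and among unrelated sequences drawn from the background frequencies.
expected_related = sum(q[k] * s[k] for k in q)
expected_unrelated = sum(p[a] * p[b] * s[(a, b)] for (a, b) in q)

for pair, score in sorted(s.items()):
    print(pair, f"{score:+.3f}")
print(f"expected (related):   {expected_related:+.3f} bits")
print(f"expected (unrelated): {expected_unrelated:+.3f} bits")
```

Identities score positively and mismatches negatively, and the expected score is positive for related pairs but negative for unrelated pairs, which is the property that makes such scores usable for local alignment.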
Affiliation(s)
- Stephen F Altschul
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America.
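As a rough illustration of the pairwise log-odds form mentioned in the abstract above (not the BILD construction itself), a substitution score can be written as the scaled logarithm of the ratio between the frequency with which two letters are aligned in related sequences and the frequency expected by chance. The function name, the half-bit scaling, and the frequencies below are assumptions chosen for illustration only.

```python
import math


def log_odds_score(q_ij: float, p_i: float, p_j: float, scale: float = 2.0) -> float:
    """Return scale * log2(q_ij / (p_i * p_j)).

    q_ij : target frequency of the aligned pair (i, j) in related sequences
    p_i, p_j : background frequencies of the two letters
    scale : 2.0 gives half-bit units (an assumed, commonly used convention)
    """
    if min(q_ij, p_i, p_j) <= 0:
        raise ValueError("frequencies must be positive")
    return scale * math.log2(q_ij / (p_i * p_j))


# Hypothetical frequencies, not taken from any published matrix.
enriched = log_odds_score(q_ij=0.02, p_i=0.1, p_j=0.1)   # pair seen more often than chance
depleted = log_odds_score(q_ij=0.005, p_i=0.1, p_j=0.1)  # pair seen less often than chance
print(enriched, depleted)
```

A pair aligned more often than chance receives a positive score and a pair aligned less often receives a negative one, which is why such scores can be summed along an alignment.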
|
42
|
Considering scores between unrelated proteins in the search database improves profile comparison. BMC Bioinformatics 2009; 10:399. [PMID: 19961610 PMCID: PMC3087343 DOI: 10.1186/1471-2105-10-399] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/13/2009] [Accepted: 12/04/2009] [Indexed: 12/02/2022]
Abstract
BACKGROUND Profile-based comparison of multiple sequence alignments is a powerful methodology for the detection of remote protein sequence similarity, which is essential for the inference and analysis of protein structure, function, and evolution. Accurate estimation of the statistical significance of detected profile similarities is essential for further development of this methodology. Here we analyze a novel approach to estimating the statistical significance of profile similarity: the explicit consideration of background score distributions for each database template (subject). RESULTS Using a simple scheme to combine and analytically approximate query- and subject-based distributions, we show that (i) inclusion of background distributions for the subjects increases the quality of homology detection; (ii) this increase is higher when the distributions are based on the scores to all known non-homologs of the subject rather than a small calibration subset of the database representatives; and (iii) these distributions of scores to all known non-homologs of the subject make the dominant contribution to the improved performance: adding the calibration distribution of the query has a negligible additional effect. CONCLUSION The construction of distributions based on the complete sets of non-homologs for each subject is particularly relevant in the setting of structure prediction, where the database consists of proteins with solved 3D structure (PDB, SCOP, CATH, etc.) and therefore structural relationships between proteins are known. These results point to a potential new direction in the development of more powerful methods for remote homology detection.
|
43
|
Park Y, Sheetlin S, Spouge JL. Estimating the Gumbel scale parameter for local alignment of random sequences by importance sampling with stopping times. Ann Stat 2009; 37:3697. [PMID: 20148197 DOI: 10.1214/08-aos663] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
Abstract
The gapped local alignment score of two random sequences follows a Gumbel distribution. If computers could estimate the parameters of the Gumbel distribution within one second, the use of arbitrary alignment scoring schemes could increase the sensitivity of searching biological sequence databases over the web. Accordingly, this article gives a novel equation for the scale parameter of the relevant Gumbel distribution. We speculate that the equation is exact, although present numerical evidence is limited. The equation involves ascending ladder variates in the global alignment of random sequences. In global alignment simulations, the ladder variates yield stopping times specifying random sequence lengths. Because of the random lengths, and because our trial distribution for importance sampling occurs on a different sample space from our target distribution, our study led to a mapping theorem, which led naturally in turn to an efficient dynamic programming algorithm for the importance sampling weights. Numerical studies using several popular alignment scoring schemes then examined the efficiency and accuracy of the resulting simulations.
Affiliation(s)
- Yonil Park
- National Center for Biotechnology Information National Library of Medicine National Institutes of Health 8600 Rockville Pike Bethesda, Maryland 20894 USA
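To make the role of the Gumbel scale parameter in the abstract above concrete, the sketch below evaluates the widely used Karlin-Altschul form of the Gumbel tail, P(S ≥ x) = 1 − exp(−K·m·n·e^(−λx)), for sequences of lengths m and n. It does not implement the importance-sampling estimator described in the article; the function name and the numerical values of λ and K are assumptions for illustration only.

```python
import math


def gumbel_tail_probability(score: float, lam: float, k: float, m: int, n: int) -> float:
    """Probability that the best local alignment score of two random
    sequences of lengths m and n reaches at least `score`, assuming a
    Gumbel distribution with scale parameter `lam` and pre-factor `k`.
    """
    # Expected number of chance alignments scoring at least `score`.
    expected_hits = k * m * n * math.exp(-lam * score)
    # 1 - exp(-E), computed with expm1 for accuracy when E is tiny.
    return -math.expm1(-expected_hits)


# Illustrative parameter values only; real values depend on the scoring scheme.
LAM, K = 0.267, 0.041
for s in (50, 75, 100):
    print(s, gumbel_tail_probability(s, LAM, K, m=300, n=1_000_000))
```

Because λ sits in the exponent, a small error in its estimate changes the reported significance by orders of magnitude at high scores, which is why fast and accurate estimation of this parameter matters for arbitrary scoring schemes.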
|
44
|
Ostlund G, Schmitt T, Forslund K, Köstler T, Messina DN, Roopra S, Frings O, Sonnhammer ELL. InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res 2009; 38:D196-203. [PMID: 19892828 PMCID: PMC2808972 DOI: 10.1093/nar/gkp931] [Citation(s) in RCA: 461] [Impact Index Per Article: 30.7] [Indexed: 12/04/2022]
Abstract
The InParanoid project gathers proteomes of completely sequenced eukaryotic species plus Escherichia coli and calculates pairwise ortholog relationships among them. The new release 7.0 of the database has grown by an order of magnitude over the previous version and now includes 100 species and their collective 1.3 million proteins organized into 42.7 million pairwise ortholog groups. The InParanoid algorithm itself has been revised and is now both more specific and sensitive. Based on results from our recent benchmarking of low-complexity filters in homology assignment, a two-pass BLAST approach was developed that makes use of high-precision compositional score matrix adjustment, but avoids the alignment truncation that sometimes follows. We have also updated the InParanoid web site (http://InParanoid.sbc.su.se). Several features have been added, the response times have been improved and the site now sports a new, clearer look. As the number of ortholog databases has grown, it has become difficult to compare among these resources due to a lack of standardized source data and incompatible representations of ortholog relationships. To facilitate data exchange and comparisons among ortholog databases, we have developed and are making available two XML schemas: SeqXML for the input sequences and OrthoXML for the output ortholog clusters.
Affiliation(s)
- Gabriel Ostlund
- Department of Biochemistry and Biophysics, Stockholm Bioinformatics Centre, AlbaNova University Centre, Stockholm University, SE-10691 Stockholm, Sweden.
|
45
|
Nguyen VH, Lavenier D. PLAST: parallel local alignment search tool for database comparison. BMC Bioinformatics 2009; 10:329. [PMID: 19821978 PMCID: PMC2770072 DOI: 10.1186/1471-2105-10-329] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.9] [Received: 02/18/2009] [Accepted: 10/12/2009] [Indexed: 11/10/2022]
Abstract
BACKGROUND Sequence similarity searching is an important and challenging task in molecular biology and next-generation sequencing should further strengthen the need for faster algorithms to process such vast amounts of data. At the same time, the internal architecture of current microprocessors is tending towards more parallelism, leading to the use of chips with two, four and more cores integrated on the same die. The main purpose of this work was to design an effective algorithm to fit with the parallel capabilities of modern microprocessors. RESULTS A parallel algorithm for comparing large genomic banks and targeting middle-range computers has been developed and implemented in PLAST software. The algorithm exploits two key parallel features of existing and future microprocessors: the SIMD programming model (SSE instruction set) and the multithreading concept (multicore). Compared to multithreaded BLAST software, tests performed on an 8-processor server have shown speedup ranging from 3 to 6 with a similar level of accuracy. CONCLUSION A parallel algorithmic approach driven by the knowledge of the internal microprocessor architecture allows significant speedup to be obtained while preserving standard sensitivity for similarity search problems.
Affiliation(s)
- Van Hoa Nguyen
- INRIA/IRISA, Campus de Beaulieu, 35042 Rennes Cedex, France.
|
46
|
Forslund K, Sonnhammer ELL. Benchmarking homology detection procedures with low complexity filters. Bioinformatics 2009; 25:2500-5. [PMID: 19620098 DOI: 10.1093/bioinformatics/btp446] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Indexed: 11/14/2022]
Abstract
BACKGROUND Low-complexity sequence regions present a common problem in finding true homologs to a protein query sequence. Several solutions to this have been suggested, but a detailed comparison between these on challenging data has so far been lacking. A common benchmark for homology detection procedures is to use SCOP/ASTRAL domain sequences belonging to the same or different superfamilies, but these contain almost no low complexity sequences. RESULTS We here introduce an alternative benchmarking strategy based around Pfam domains and clans on whole-proteome data sets. This gives a realistic level of low complexity sequences. We used it to evaluate all six built-in BLAST low complexity filter settings as well as a range of settings in the MSPcrunch post-processing filter. The effect on alignment length was also assessed. CONCLUSION Score matrix adjustment methods provide a low false positive rate at a relatively small loss in sensitivity relative to no filtering, across the range of test conditions we apply. MSPcrunch achieved even less loss in sensitivity, but at a higher false positive rate. A drawback of the score matrix adjustment methods is however that the alignments often become truncated. AVAILABILITY Perl scripts for MSPcrunch BLAST filtering and for generating the benchmark dataset are available at http://sonnhammer.sbc.su.se/download/software/MSPcrunch+Blixem/benchmark.tar.gz
Affiliation(s)
- Kristoffer Forslund
- Stockholm Bioinformatics Center, Stockholm University, SE-10691 Stockholm, Sweden.
|
47
|
Jimenez-Morales D, Adamian L, Liang J. Detecting remote homologues using scoring matrices calculated from the estimation of amino acid substitution rates of beta-barrel membrane proteins. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2009; 2008:1347-50. [PMID: 19162917 DOI: 10.1109/iembs.2008.4649414] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/05/2022]
Abstract
Beta-barrel membrane proteins (MP) are found in Gram-negative bacteria, mitochondria and chloroplasts. They play important roles in the metabolism of bacteria, where they are involved in the transport of solutes in and out of the cell. Beta-barrel proteins may also act as proteases and lipases, and may be important for cell-cell adhesion. Currently, there are about 30 non-redundant solved structures of beta-barrels. Although the number of beta-barrel folds is fairly small, it is possible to expand the amount of available structural information by homology modeling using existing structures as templates. The scope of structure prediction may be widened by finding remote homologues of the existing structures. To improve the sensitivity of the database searches and the quality of sequence alignments, we first study the evolutionary history of transmembrane segments of 7 beta-barrel membrane proteins by estimating substitution rates with a Bayesian Monte Carlo approach. Next, we calculate amino acid substitution matrices, beta-barrel Transmembrane scoring Matrices (bbTM), specifically tuned for TM regions, which can be used to detect remote homologues. We then test bbTM matrices by comparing their performance with the membrane-protein-derived scoring matrices PHAT and SLIM. Our results demonstrate that bbTM matrices have higher selectivity towards transmembrane beta-barrel proteins and may be used with higher confidence in database searches for remote homologues of this class of proteins.
|
48
|
Poleksic A. Island method for estimating the statistical significance of profile-profile alignment scores. BMC Bioinformatics 2009; 10:112. [PMID: 19379500 PMCID: PMC2678096 DOI: 10.1186/1471-2105-10-112] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 12/25/2008] [Accepted: 04/20/2009] [Indexed: 11/10/2022]
Abstract
BACKGROUND In the last decade, a significant improvement in detecting remote similarity between protein sequences has been made by utilizing alignment profiles in place of amino-acid strings. Unfortunately, no analytical theory is available for estimating the significance of a gapped alignment of two profiles. Many experiments suggest that the distribution of local profile-profile alignment scores is of the Gumbel form. However, estimating distribution parameters by random simulations turns out to be computationally very expensive. RESULTS We demonstrate that the background distribution of profile-profile alignment scores heavily depends on profiles' composition and thus the distribution parameters must be estimated independently, for each pair of profiles of interest. We also show that accurate estimates of statistical parameters can be obtained using the "island statistics" for profile-profile alignments. CONCLUSION The island statistics can be generalized to profile-profile alignments to provide an efficient method for the alignment score normalization. Since multiple island scores can be extracted from a single comparison of two profiles, the island method has a clear speed advantage over the direct shuffling method for comparable accuracy in parameter estimates.
Affiliation(s)
- Aleksandar Poleksic
- Department of Computer Science, University of Northern Iowa, Cedar Falls, IA 50614, USA.
|
49
|
Abstract
We introduce a geometric framework suitable for studying the relationships among biological sequences. In contrast to previous works, our formulation allows asymmetric distances (quasi-metrics), originating from uneven weighting of strings, which may induce non-trivial partial orders on sets of biosequences. The distances considered are more general than traditional generalized string edit distances. In particular, our framework enables non-trivial conversion between sequence similarities, both local and global, and distances. Our constructions apply to a wide class of scoring schemes and require much less restrictive gap penalties than the ones regularly used. Numerous examples are provided to illustrate the concepts introduced and their potential applications.
Affiliation(s)
- Aleksandar Stojmirović
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
|
50
|
Agrawal A, Huang X. PSIBLAST_PairwiseStatSig: reordering PSI-BLAST hits using pairwise statistical significance. Bioinformatics 2009; 25:1082-3. [DOI: 10.1093/bioinformatics/btp089] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 11/14/2022]
|