Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Yu YK, Altschul SF. The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions. Bioinformatics 2004;21:902-11. [PMID: 15509610 DOI: 10.1093/bioinformatics/bti070] [Citation(s) in RCA: 71] [Impact Index Per Article: 3.6] [Indexed: 11/12/2022]

Cited by Other Article(s)

Stingl C, VanDuijn MM, Dejoie T, Sillevis Smitt PAE, Luider TM. Improved detection of tryptic immunoglobulin variable region peptides by chromatographic and gas-phase fractionation techniques. Cell Rep Methods 2024;4:100795. [PMID: 38861989 PMCID: PMC11228375 DOI: 10.1016/j.crmeth.2024.100795] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2023] [Revised: 03/30/2024] [Accepted: 05/20/2024] [Indexed: 06/13/2024]

Price MN, Arkin AP. A fast comparative genome browser for diverse bacteria and archaea. PLoS One 2024;19:e0301871. [PMID: 38593165 PMCID: PMC11003636 DOI: 10.1371/journal.pone.0301871] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2023] [Accepted: 03/22/2024] [Indexed: 04/11/2024]

Glidden-Handgis G, Wheeler TJ. WAS IT A MATch I SAW? Approximate palindromes lead to overstated false match rates in benchmarks using reversed sequences. Bioinform Adv 2024;4:vbae052. [PMID: 38764475 PMCID: PMC11099658 DOI: 10.1093/bioadv/vbae052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2023] [Revised: 03/31/2024] [Accepted: 04/04/2024] [Indexed: 05/21/2024]

Abstract

Background

Software for labeling biological sequences typically produces a theory-based statistic for each match (the E-value) that indicates the likelihood of seeing that match's score by chance. E-values accurately predict false match rate for comparisons of random (shuffled) sequences, and thus provide a reasoned mechanism for setting score thresholds that enable high sensitivity with low expected false match rate. This threshold-setting strategy is challenged by real biological sequences, which contain regions of local repetition and low sequence complexity that cause excess matches between non-homologous sequences. Knowing this, tool developers often develop benchmarks that use realistic-seeming decoy sequences to explore empirical tradeoffs between sensitivity and false match rate. A recent trend has been to employ reversed biological sequences as realistic decoys, because these preserve the distribution of letters and the existence of local repeats, while disrupting the original sequence's functional properties. However, we and others have observed that sequences appear to produce high scoring alignments to their reversals with surprising frequency, leading to overstatement of false match risk that may negatively affect downstream analysis.

Results

We demonstrate that an alignment between a sequence S and its (possibly mutated) reversal tends to produce higher scores than alignment between truly unrelated sequences, even when S is a shuffled string with no notable repetitive or low-complexity regions. This phenomenon is due to the unintuitive fact that (even randomly shuffled) sequences contain palindromes that are on average longer than the longest common substrings (LCS) shared between permuted variants of the same sequence. Though the expected palindrome length is only slightly larger than the expected LCS, the distribution of alignment scores involving reversed sequences is strongly right-shifted, leading to greatly increased frequency of high-scoring alignments to reversed sequences.
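The comparison described in the Results paragraph above can be explored with a small simulation. The following Python sketch is illustrative only and is not the authors' code: the function names, the 20-letter alphabet, the sequence length, and the trial count are all assumptions chosen for the example. It measures the longest exact palindromic substring of a shuffled sequence and the longest common substring between that sequence and an independent shuffle of it; it does not model mutated reversals or alignment scores.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet for the example


def longest_palindrome_length(s: str) -> int:
    """Length of the longest palindromic substring (expand around each center)."""
    best = 0
    n = len(s)
    for center in range(2 * n - 1):
        lo, hi = center // 2, (center + 1) // 2
        while lo >= 0 and hi < n and s[lo] == s[hi]:
            lo -= 1
            hi += 1
        best = max(best, hi - lo - 1)
    return best


def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest substring shared by a and b (row-by-row DP)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def simulate(trials: int = 200, length: int = 300, seed: int = 0):
    """Return (mean longest palindrome, mean longest common substring).

    Each trial draws a random sequence, then compares it with an
    independent permutation of the same letters.
    """
    rng = random.Random(seed)
    pal_total = 0
    lcs_total = 0
    for _ in range(trials):
        chars = [rng.choice(AMINO_ACIDS) for _ in range(length)]
        original = "".join(chars)
        rng.shuffle(chars)  # permuted variant with identical composition
        permuted = "".join(chars)
        pal_total += longest_palindrome_length(original)
        lcs_total += longest_common_substring_length(original, permuted)
    return pal_total / trials, lcs_total / trials


if __name__ == "__main__":
    mean_pal, mean_lcs = simulate()
    print(f"mean longest palindrome: {mean_pal:.2f}")
    print(f"mean longest common substring: {mean_lcs:.2f}")
```

Because a palindrome of length k in S is also a length-k exact match between S and its reversal, the palindrome measurement is a lower bound on what a reversed-decoy comparison can find by exact matching alone.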

Impact

Overestimates of false match risk can motivate unnecessarily high score thresholds, leading to potentially reduced true match sensitivity. Also, when tool sensitivity is only reported up to the score of the first matched decoy sequence, a large decoy set consisting of reversed sequences can obscure sensitivity differences between tools. As a result of these observations, we advise that reversed biological sequences be used as decoys only when care is taken to remove positive matches in the original (un-reversed) sequences, or when overstatement of false labeling is not a concern. Though the primary focus of the analysis is on sequence annotation, we also demonstrate that the prevalence of internal palindromes may lead to an overstatement of the rate of false labels in protein identification with mass spectrometry.

Hauswedell H, Hetzel S, Gottlieb SG, Kretzmer H, Meissner A, Reinert K. Lambda3: homology search for protein, nucleotide, and bisulfite-converted sequences. Bioinformatics 2024;40:btae097. [PMID: 38485699 PMCID: PMC10955267 DOI: 10.1093/bioinformatics/btae097] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Revised: 12/22/2023] [Accepted: 03/13/2024] [Indexed: 03/22/2024]

Kouros CE, Makri V, Ouzounis CA, Chasapi A. Disease association and comparative genomics of compositional bias in human proteins. F1000Res 2023;12:198. [PMID: 37082000 PMCID: PMC10111144 DOI: 10.12688/f1000research.129929.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 02/02/2023] [Indexed: 02/22/2023]

Kouros CE, Makri V, Ouzounis CA, Chasapi A. Disease association and comparative genomics of compositional bias in human proteins. F1000Res 2023;12:198. [PMID: 37082000 PMCID: PMC10111144.2 DOI: 10.12688/f1000research.129929.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 04/12/2023] [Indexed: 04/25/2023]

TwinCons: Conservation score for uncovering deep sequence similarity and divergence. PLoS Comput Biol 2021;17:e1009541. [PMID: 34714829 PMCID: PMC8580257 DOI: 10.1371/journal.pcbi.1009541] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/16/2021] [Revised: 11/10/2021] [Accepted: 10/06/2021] [Indexed: 11/19/2022]

Carey KM, Patterson G, Wheeler TJ. Transposable element subfamily annotation has a reproducibility problem. Mob DNA 2021;12:4. [PMID: 33485368 PMCID: PMC7827986 DOI: 10.1186/s13100-021-00232-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 10/08/2020] [Accepted: 01/11/2021] [Indexed: 11/24/2022]

Trivedi R, Nagarajaram HA. Substitution scoring matrices for proteins - An overview. Protein Sci 2020;29:2150-2163. [PMID: 32954566 DOI: 10.1002/pro.3954] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 08/17/2020] [Revised: 09/17/2020] [Accepted: 09/18/2020] [Indexed: 01/17/2023]

ProtPCV: A Fixed Dimensional Numerical Representation of Protein Sequence to Significantly Reduce Sequence Search Time. Interdiscip Sci 2020;12:276-287. [PMID: 32524529 DOI: 10.1007/s12539-020-00380-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2019] [Revised: 05/19/2020] [Accepted: 06/02/2020] [Indexed: 10/24/2022]

Frith MC. How sequence alignment scores correspond to probability models. Bioinformatics 2019;36:408-415. [PMID: 31329241 PMCID: PMC9883716 DOI: 10.1093/bioinformatics/btz576] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 03/27/2019] [Revised: 05/31/2019] [Accepted: 07/17/2019] [Indexed: 02/03/2023]

Bode K, O'Halloran DM. NCX-DB: a unified resource for integrative analysis of the sodium calcium exchanger super-family. BMC Neurosci 2018;19:19. [PMID: 29649983 PMCID: PMC5898058 DOI: 10.1186/s12868-018-0423-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 10/03/2017] [Accepted: 03/28/2018] [Indexed: 12/20/2022]

Kaushik R, Singh A, Jayaram B. Where Informatics Lags Chemistry Leads. Biochemistry 2017;57:503-506. [DOI: 10.1021/acs.biochem.7b01073] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Indexed: 11/28/2022]

Pearson WR, Li W, Lopez R. Query-seeded iterative sequence similarity searching improves selectivity 5-20-fold. Nucleic Acids Res 2017;45:e46. [PMID: 27923999 PMCID: PMC5605230 DOI: 10.1093/nar/gkw1207] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 09/23/2016] [Accepted: 11/18/2016] [Indexed: 11/13/2022]

Barlowe S, Coan HB, Youker RT. SubVis: an interactive R package for exploring the effects of multiple substitution matrices on pairwise sequence alignment. PeerJ 2017;5:e3492. [PMID: 28674656 PMCID: PMC5490468 DOI: 10.7717/peerj.3492] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/16/2017] [Accepted: 05/27/2017] [Indexed: 01/13/2023]

Abstract

Understanding how proteins mutate is critical to solving a host of biological problems. Mutations occur when an amino acid is substituted for another in a protein sequence. The set of likelihoods for amino acid substitutions is stored in a matrix and input to alignment algorithms. The quality of the resulting alignment is used to assess the similarity of two or more sequences and can vary according to assumptions modeled by the substitution matrix. Substitution strategies with minor parameter variations are often grouped together in families. For example, the BLOSUM and PAM matrix families are commonly used because they provide a standard, predefined way of modeling substitutions. However, researchers often do not know if a given matrix family or any individual matrix within a family is the most suitable. Furthermore, predefined matrix families may inaccurately reflect a particular hypothesis that a researcher wishes to model or otherwise result in unsatisfactory alignments. In these cases, the ability to compare the effects of one or more custom matrices may be needed. This laborious process is often performed manually because the ability to simultaneously load multiple matrices and then compare their effects on alignments is not readily available in current software tools. This paper presents SubVis, an interactive R package for loading and applying multiple substitution matrices to pairwise alignments. Users can simultaneously explore alignments resulting from multiple predefined and custom substitution matrices. SubVis utilizes several of the alignment functions found in R, a common language among protein scientists. Functions are tied together with the Shiny platform which allows the modification of input parameters. Information regarding alignment quality and individual amino acid substitutions is displayed with the JavaScript language which provides interactive visualizations for revealing both high-level and low-level alignment information.

Vences-Guzmán MÁ, Paula Goetting-Minesky M, Guan Z, Castillo-Ramirez S, Córdoba-Castro LA, López-Lara IM, Geiger O, Sohlenkamp C, Christopher Fenno J. 1,2-Diacylglycerol choline phosphotransferase catalyzes the final step in the unique Treponema denticola phosphatidylcholine biosynthesis pathway. Mol Microbiol 2017;103:896-912. [PMID: 28009086 DOI: 10.1111/mmi.13596] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Accepted: 12/03/2016] [Indexed: 01/09/2023]

Zhang J, Misra S, Wang H, Feng WC. muBLASTP: database-indexed protein sequence search on multicore CPUs. BMC Bioinformatics 2016;17:443. [PMID: 27809763 PMCID: PMC5096327 DOI: 10.1186/s12859-016-1302-4] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 02/09/2016] [Accepted: 10/21/2016] [Indexed: 11/16/2022]

Mapping the Geometric Evolution of Protein Folding Motor. PLoS One 2016;11:e0163993. [PMID: 27716851 PMCID: PMC5055333 DOI: 10.1371/journal.pone.0163993] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 06/22/2016] [Accepted: 09/19/2016] [Indexed: 11/19/2022]

Abstract

A polypeptide chain has an invariant main chain and a variant side-chain sequence. How the side-chain sequence determines fold in terms of its chemical constitution has been scrutinized extensively and verified periodically. However, a focussed investigation of the directive effect of side-chain geometry may provide important insights supplementing existing algorithms in mapping the geometrical evolution of protein chains and their structural preferences. Geometrically, folding of a protein structure may be envisaged as the evolution of its geometric variables: the ϕ and ψ dihedral angles of the polypeptide main chain, directed by χ1 and χ2 of the side chain. In this work, the protein molecule is metaphorically modelled as a machine with 4 rotors, ϕ, ψ, χ1 and χ2, with its evolution to the functional fold directed by combinations of its rotor directions. We observe that differential rotor motions lead to different secondary structure formations and that the combinatorial pattern is unique and consistent for a particular secondary structure type. Further, we found that the combination of rotor geometries of each amino acid is unique, which partly explains how different amino acid sequence combinations have unique structural evolution and functional adaptation. Quantification of these amino acid rotor preferences resulted in the generation of 3 substitution matrices, which were later plugged into the BLAST tool to evaluate their efficiency in aligning sequences. We have employed BLOSUM62 and PAM30 as standards for primary evaluation. Generation of substitution matrices is a logical extension of the conceptual framework we attempted to build during the development of this work. Optimization of matrices following the conventional routines and possible application to biologically relevant data sets are beyond the scope of this manuscript, though they are part of the larger project design.

Margelevičius M. Bayesian nonparametrics in protein remote homology search. Bioinformatics 2016;32:2744-52. [DOI: 10.1093/bioinformatics/btw213] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 11/11/2015] [Accepted: 04/14/2016] [Indexed: 11/14/2022]

Chrysostomou C, Seker H. Novel protein weight matrix generated from amino acid indices. Annu Int Conf IEEE Eng Med Biol Soc 2016;2015:8181-4. [PMID: 26738193 DOI: 10.1109/embc.2015.7320293] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]

Wong WC, Yap CK, Eisenhaber B, Eisenhaber F. dissectHMMER: a HMMER-based score dissection framework that statistically evaluates fold-critical sequence segments for domain fold similarity. Biol Direct 2015;10:39. [PMID: 26228544 PMCID: PMC4521371 DOI: 10.1186/s13062-015-0068-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 03/26/2015] [Accepted: 07/20/2015] [Indexed: 11/10/2022]

Abstract

Background

Annotation transfer for function and structure within the sequence homology concept essentially requires protein sequence similarity for the secondary structural blocks forming the fold of a protein. A simplistic similarity approach in the case of non-globular segments (coiled coils, low complexity regions, transmembrane regions, long loops, etc.) is not justified and is a pertinent source of mistaken homologies. The latter is due either to positional sequence conservation as a result of a very simple, physically induced pattern or to integral sequence properties that are critical for function. Furthermore, because the number of well-studied proteins continues to grow only slowly, a search methodology is needed that dives deeper into the sequence similarity space to connect unknown sequences to well-studied, albeit more distant, ones for biological function postulation.

Results

Based on our previous work of dissecting the hidden Markov model (HMMER)-based similarity score into fold-critical and non-globular contributions to improve homology inference, we propose a framework, dissectHMMER, that identifies more fold-related domain hits from standard HMMER searches. Subsequent statistical stratification of the fold-related hits into cohorts of functionally related domains allows for the function postulation of the query sequence. Briefly, the technical problems of how to recognize non-globular parts in the domain model, resolve contradictory HMMER2/HMMER3 results and evaluate fold-related domain hits for homology are addressed in this work. The framework is benchmarked against a set of SCOP-to-Pfam domain models. Despite being a sequence-to-profile method, dissectHMMER performs favorably against a profile-to-profile method, HHsuite/HHsearch. Examples of function annotation using dissectHMMER, including the function discovery of an uncharacterized membrane protein Q9K8K1_BACHD (WP_010899149.1) as a lactose/H+ symporter, are presented. Finally, the dissectHMMER webserver is made publicly available at http://dissecthmmer.bii.a-star.edu.sg.

Conclusions

The proposed framework, dissectHMMER, is faithful to the original inception of the sequence homology concept while improving upon the existing HMMER search tool through the rescue of statistically evaluated false-negative yet fold-related domain hits to the query sequence. Overall, this translates into an opportunity for any novel protein sequence to be functionally characterized.

Reviewers

This article was reviewed by Masanori Arita, Shamil Sunyaev and L. Aravind.

Electronic supplementary material

The online version of this article (doi:10.1186/s13062-015-0068-3) contains supplementary material, which is available to authorized users.

Rios S, Fernandez MF, Caltabiano G, Campillo M, Pardo L, Gonzalez A. GPCRtm: An amino acid substitution matrix for the transmembrane region of class A G Protein-Coupled Receptors. BMC Bioinformatics 2015;16:206. [PMID: 26134144 PMCID: PMC4489126 DOI: 10.1186/s12859-015-0639-4] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 03/20/2015] [Accepted: 06/06/2015] [Indexed: 01/08/2023]

Abstract

Background

Protein sequence alignments and database search methods use standard scoring matrices calculated from amino acid substitution frequencies in general sets of proteins. These general-purpose matrices are not optimal for accurately aligning sequences with marked compositional biases, such as the hydrophobic transmembrane regions found in membrane proteins. In this work, an amino acid substitution matrix (GPCRtm) is calculated for the membrane-spanning segments of the G protein-coupled receptor (GPCR) rhodopsin family, one of the largest transmembrane protein families in humans, with great importance in health and disease.

Results

The GPCRtm matrix reveals the amino acid compositional bias distinctive of the GPCR rhodopsin family and differs from other standard substitution matrices. These membrane receptors, as expected, are characterized by a high content of hydrophobic residues with regard to globular proteins. On the other hand, the presence of polar and charged residues is higher than in average membrane proteins, displaying high frequencies of replacement within themselves.

Conclusions

Analysis of amino acid frequencies and values obtained from the GPCRtm matrix reveals patterns of residue replacements different from other standard substitution matrices. GPCRs prioritize the reactivity properties of the amino acids over their bulkiness in the transmembrane regions. A distinctive feature is that charged and polar residues seem to evolve at different rates than other amino acids. This observation is related to the role of the transmembrane bundle in the binding of ligands, which in many cases involves electrostatic and hydrogen bond interactions. This new matrix can be useful in database search and for the construction of more accurate sequence alignments of GPCRs.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-015-0639-4) contains supplementary material, which is available to authorized users.

Frith MC, Kawaguchi R. Split-alignment of genomes finds orthologies more accurately. Genome Biol 2015;16:106. [PMID: 25994148 PMCID: PMC4464727 DOI: 10.1186/s13059-015-0670-9] [Citation(s) in RCA: 65] [Impact Index Per Article: 7.2] [Received: 04/06/2015] [Accepted: 05/08/2015] [Indexed: 04/29/2023]

Sonnhammer ELL, Östlund G. InParanoid 8: orthology analysis between 273 proteomes, mostly eukaryotic. Nucleic Acids Res 2014;43:D234-9. [PMID: 25429972 PMCID: PMC4383983 DOI: 10.1093/nar/gku1203] [Citation(s) in RCA: 345] [Impact Index Per Article: 34.5] [Indexed: 11/20/2022]

Frith MC, Noé L. Improved search heuristics find 20,000 new alignments between human and mouse genomes. Nucleic Acids Res 2014;42:e59. [PMID: 24493737 PMCID: PMC3985675 DOI: 10.1093/nar/gku104] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Indexed: 11/17/2022]

Arnold R, Goldenberg F, Mewes HW, Rattei T. SIMAP--the database of all-against-all protein sequence similarities and annotations with new interfaces and increased coverage. Nucleic Acids Res 2013;42:D279-84. [PMID: 24165881 PMCID: PMC3965014 DOI: 10.1093/nar/gkt970] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Indexed: 12/21/2022]

Liu IH, Lo YS, Yang JM. Genome-wide structural modelling of TCR-pMHC interactions. BMC Genomics 2013;14 Suppl 5:S5. [PMID: 24564684 PMCID: PMC3852114 DOI: 10.1186/1471-2164-14-s5-s5] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 01/05/2023]

Sateriale A, Bessoff K, Sarkar IN, Huston CD. Drug repurposing: mining protozoan proteomes for targets of known bioactive compounds. J Am Med Inform Assoc 2013;21:238-44. [PMID: 23757409 DOI: 10.1136/amiajnl-2013-001700] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Indexed: 11/03/2022]

Jeong CS, Kim D. Reliable and robust detection of coevolving protein residues. Protein Eng Des Sel 2012;25:705-13. [DOI: 10.1093/protein/gzs081] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Indexed: 11/14/2022]

Joseph AP, Valadié H, Srinivasan N, de Brevern AG. Local structural differences in homologous proteins: specificities in different SCOP classes. PLoS One 2012;7:e38805. [PMID: 22745680 PMCID: PMC3382195 DOI: 10.1371/journal.pone.0038805] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 07/26/2011] [Accepted: 05/10/2012] [Indexed: 11/19/2022]

Abstract

The constant increase in the number of solved protein structures is of great help in understanding the basic principles behind protein folding and evolution. 3-D structural knowledge is valuable in designing and developing methods for comparison, modelling and prediction of protein structures. These approaches for structure analysis can be directly applied in studying protein function and in drug design. The backbone of a protein structure favours certain local conformations, which include α-helices, β-strands and turns. Libraries of a limited number of local conformations (Structural Alphabets) were developed in the past to obtain a useful categorization of backbone conformation. Protein Block (PB) is one such Structural Alphabet that gave a reasonable structure approximation of 0.42 Å. In this study, we use the PB description of local structures to analyse conformations that are preferred sites for structural variations and insertions among groups of related folds. This knowledge can be utilized in improving tools for structure comparison that work by analysing local structure similarities. Conformational differences between homologous proteins are known to occur often in the regions comprising turns and loops. Interestingly, these differences are found to have specific preferences depending upon the structural classes of proteins. Such class-specific preferences are mainly seen in the all-β class, with changes involving short helical conformations and hairpin turns. A test carried out on a benchmark dataset also indicates that the use of knowledge of the class-specific variations can improve the performance of a PB-based structure comparison approach. The preference for the indel sites also seems to be confined to a few backbone conformations involving β-turns and helix C-caps. These are mainly associated with short loops joining the regular secondary structures that mediate a reversal in the chain direction. Rare β-turns of type I′ and II′ are also identified as preferred sites for insertions.

Zhang Y, Misra S, Agrawal A, Patwary MMA, Liao WK, Qin Z, Choudhary A. Accelerating pairwise statistical significance estimation for local alignment by harvesting GPU's power. BMC Bioinformatics 2012;13 Suppl 5:S3. [PMID: 22537007 PMCID: PMC3318904 DOI: 10.1186/1471-2105-13-s5-s3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/27/2022]

Lemaitre C, Barré A, Citti C, Tardy F, Thiaucourt F, Sirand-Pugnet P, Thébault P. A novel substitution matrix fitted to the compositional bias in Mollicutes improves the prediction of homologous relationships. BMC Bioinformatics 2011;12:457. [PMID: 22115330 PMCID: PMC3248887 DOI: 10.1186/1471-2105-12-457] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 07/19/2011] [Accepted: 11/24/2011] [Indexed: 11/10/2022]

Wong WC, Maurer-Stroh S, Eisenhaber F. The Janus-faced E-values of HMMER2: extreme value distribution or logistic function? J Bioinform Comput Biol 2011. [DOI: 10.1142/s0219720011005264] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]

Abstract

E-value guided extrapolation of protein domain annotation from libraries such as Pfam with the HMMER suite is indispensable for hypothesizing about the function of experimentally uncharacterized protein sequences. Since the recent release of HMMER3 does not supersede all functions of HMMER2, the latter will remain relevant for ongoing research as well as for the evaluation of annotations that reside in databases and in the literature. In HMMER2, the E-value is computed from the score via a logistic function or via a domain model-specific extreme value distribution (EVD); the lower of the two is returned as E-value for the domain hit in the query sequence. We find that, for thousands of domain models, this treatment results in switching from the EVD to the statistical model with the logistic function when scores grow (for Pfam release 23, 99% in the global mode and 75% in the fragment mode). If the score corresponding to the breakpoint results in an E-value above a user-defined threshold (e.g. 0.1), a critical score region with conflicting E-values from the logistic function (below the threshold) and from EVD (above the threshold) does exist. Thus, this switch will affect E-value guided annotation decisions in an automated mode. To emphasize, switching in the fragment mode is of no practical relevance since it occurs only at E-values far below 0.1. Unfortunately, a critical score region does exist for 185 domain models in the hmmpfam and 1,748 domain models in the hmmsearch global-search mode. For 145 out of the respective 185 models, the critical score region is indeed populated by actual sequences. In total, 24.4% of their hits have a logistic function-derived E-value < 0.1 when the EVD provides an E-value > 0.1. We provide examples of false annotations and critically discuss the appropriateness of a logistic function as alternative to the EVD.
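The "lower of the two" rule and the resulting critical score region can be illustrated with a short Python sketch. This is a toy model under stated assumptions, not HMMER2's implementation: the logistic form 1/(1 + 2^score), the Gumbel-type EVD tail, the parameter values for mu and lambda, and all function names are assumptions made for the example.

```python
import math


def logistic_pvalue(score_bits: float) -> float:
    """Assumed logistic P-value for a bit score: 1 / (1 + 2**score)."""
    return 1.0 / (1.0 + 2.0 ** score_bits)


def evd_pvalue(score_bits: float, mu: float, lam: float) -> float:
    """Gumbel-type tail probability P(S >= score) with location mu, scale lam."""
    # -expm1(-x) computes 1 - exp(-x) without losing precision for small x.
    return -math.expm1(-math.exp(-lam * (score_bits - mu)))


def combined_evalue(score_bits: float, mu: float, lam: float, n_seqs: int):
    """Return (E-value, source), taking the lower of the two P-values."""
    p_log = logistic_pvalue(score_bits)
    p_evd = evd_pvalue(score_bits, mu, lam)
    if p_log < p_evd:
        return n_seqs * p_log, "logistic"
    return n_seqs * p_evd, "evd"


def is_conflicting(score_bits: float, mu: float, lam: float,
                   n_seqs: int, threshold: float = 0.1) -> bool:
    """True when the logistic E-value passes the threshold but the EVD one fails."""
    e_log = n_seqs * logistic_pvalue(score_bits)
    e_evd = n_seqs * evd_pvalue(score_bits, mu, lam)
    return e_log < threshold < e_evd


if __name__ == "__main__":
    mu, lam, n_seqs = 0.0, 0.1, 1  # hypothetical calibration values
    for score in (-5.0, 0.0, 5.0, 10.0, 20.0):
        evalue, source = combined_evalue(score, mu, lam, n_seqs)
        flag = is_conflicting(score, mu, lam, n_seqs)
        print(f"score={score:6.1f}  E={evalue:.4g}  from={source}  conflict={flag}")
```

With these hypothetical parameters the reported E-value switches from the EVD to the logistic function as the score grows, and scores where `is_conflicting` returns true correspond to the critical region the abstract describes.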

Hamada M, Wijaya E, Frith MC, Asai K. Probabilistic alignments with quality scores: an application to short-read mapping toward accurate SNP/indel detection. Bioinformatics 2011;27:3085-92. [PMID: 21976422 DOI: 10.1093/bioinformatics/btr537] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 01/20/2023]

Ye X, Yu YK, Altschul SF. Compositional adjustment of Dirichlet mixture priors. J Comput Biol 2011;17:1607-20. [PMID: 21128852 DOI: 10.1089/cmb.2010.0117] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/12/2022] Open

Abstract

Dirichlet mixture priors provide a Bayesian formalism for scoring alignments of protein profiles to individual sequences, which can be generalized to constructing scores for multiple-alignment columns. A Dirichlet mixture is a probability distribution over multinomial space, each of whose components can be thought of as modeling a type of protein position. Applied to the simplest case of pairwise sequence alignment, a Dirichlet mixture is equivalent to an implied symmetric substitution matrix. For alphabets of even size L, Dirichlet mixtures with L/2 components and symmetric substitution matrices have an identical number of free parameters. Although this suggests the possibility of a one-to-one mapping between the two formalisms, we show that there are some symmetric matrices no Dirichlet mixture can imply, and others implied by many distinct Dirichlet mixtures. Dirichlet mixtures are derived empirically from curated sets of multiple alignments. They imply "background" amino acid frequencies characteristic of these sets, and should thus be non-optimal for comparing proteins with non-standard composition. Given a mixture Θ, we seek an adjusted Θ' that implies the desired composition, but that minimizes an appropriate relative-entropy-based distance function. To render the problem tractable, we fix the mixture parameter as well as the sum of the Dirichlet parameters for each component, allowing only its center of mass to vary. This linearizes the constraints on the remaining parameters. An approach to finding Θ' may be based on small consecutive parameter adjustments. The relative entropy of two Dirichlet distributions separated by a small change in their parameter values implies a quadratic cost function for such changes. For a small change in implied background frequencies, this function can be minimized using the Lagrange-Newton method. We have implemented this method, and can compositionally adjust to good precision a 20-component Dirichlet mixture prior for proteins in under half a second on a standard workstation.

Wolfsheimer S, Herms I, Rahmann S, Hartmann AK. Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling. BMC Bioinformatics 2011;12:47. [PMID: 21291566 PMCID: PMC3042914 DOI: 10.1186/1471-2105-12-47] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 02/26/2010] [Accepted: 02/03/2011] [Indexed: 12/03/2022] Open

Abstract

Background

Molecular database search tools need statistical models to assess the significance of the resulting hits. In the classical approach one asks how probable it is that a certain score is observed by pure chance. Asymptotic theories for such questions are available for two random i.i.d. sequences. Some effort has been made to include effects of finite sequence lengths and to account for specific compositions of the sequences. In many applications, such as a large-scale database homology search for transmembrane proteins, these models are not the most appropriate ones. Search sensitivity and specificity benefit from position-dependent scoring schemes or the use of Hidden Markov Models. Additionally, one may wish to go beyond the assumption that the sequences are i.i.d. Despite their practical importance, the statistical properties of these settings have not been well investigated yet.

Results

In this paper, we discuss an efficient and general method to compute the score distribution to any desired accuracy. The general approach may be applied to different sequence models and various similarity measures that satisfy a few weak assumptions. We have access to the low-probability region ("tail") of the distribution where scores are larger than expected by pure chance and therefore relevant for practical applications. Our method uses recent ideas from rare-event simulations, combining Markov chain Monte Carlo simulations with importance sampling and generalized ensembles. We present results for the score statistics of fixed and random queries against random sequences. In a second step, we extend the approach to a model of transmembrane proteins, which can hardly be described as i.i.d. sequences. For this case, we compare the statistical properties of a fixed query model as well as a hidden Markov sequence model in connection with a position-based scoring scheme against the classical approach.

Conclusions

The results illustrate that the sensitivity and specificity strongly depend on the underlying scoring and sequence model. A specific ROC analysis for the case of transmembrane proteins supports our observation.

Agrawal A, Huang X. Pairwise statistical significance of local sequence alignment using sequence-specific and position-specific substitution matrices. IEEE/ACM Trans Comput Biol Bioinform 2011;8:194-205. [PMID: 21071807 DOI: 10.1109/tcbb.2009.69] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Indexed: 05/30/2023]

Machado-Lima A, Kashiwabara AY, Durham AM. Decreasing the number of false positives in sequence classification. BMC Genomics 2010;11 Suppl 5:S10. [PMID: 21210966 PMCID: PMC3045793 DOI: 10.1186/1471-2164-11-s5-s10] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/29/2022] Open

Abstract

Background

A large number of probabilistic models used in sequence analysis assign non-zero probability values to most input sequences. The most common way to decide when a given probability is sufficient is Bayesian binary classification, where the probability of the model characterizing the sequence family of interest is compared to that of an alternative probability model. A null model can be used as the alternative model. This is the scoring technique used by sequence analysis tools such as HMMER, SAM and INFERNAL. The most prevalent null models are position-independent residue distributions that include the uniform distribution, genomic distribution, family-specific distribution and the target sequence distribution. This paper presents a study to evaluate the impact of the choice of a null model on the final result of classifications. In particular, we are interested in minimizing the number of false predictions in a classification. This is a crucial issue for reducing the costs of biological validation.

Results

For all the tests, the target null model presented the lowest number of false positives when using random sequences as a test. The study was performed on DNA sequences using GC content as the measure of content bias, but the results should also be valid for protein sequences. To broaden the application of the results, the study was performed using randomly generated sequences. Previous studies were performed on amino acid sequences, using only one probabilistic model (HMM) and a specific benchmark, and lack more general conclusions about the performance of null models. Finally, a benchmark test with P. falciparum confirmed these results.

Conclusions

Of the evaluated models the best suited for classification are the uniform model and the target model. However, the use of the uniform model presents a GC bias that can cause more false positives for candidate sequences with extreme compositional bias, a characteristic not described in previous studies. In these cases the target model is more dependable for biological validation due to its higher specificity.

Cao MD, Dix TI, Allison L. A genome alignment algorithm based on compression. BMC Bioinformatics 2010;11:599. [PMID: 21159205 PMCID: PMC3022628 DOI: 10.1186/1471-2105-11-599] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 02/08/2010] [Accepted: 12/16/2010] [Indexed: 11/26/2022] Open

Abstract

Background

Traditional genome alignment methods consider sequence alignment as a variation of the string edit distance problem, and perform alignment by matching characters of the two sequences. They are often computationally expensive and unable to deal with low information regions. Furthermore, they lack a well-principled objective function to measure the performance of sets of parameters. Since genomic sequences carry genetic information, this article proposes that the information content of each nucleotide in a position should be considered in sequence alignment. An information-theoretic approach for pairwise genome local alignment, namely XMAligner, is presented. Instead of comparing sequences at the character level, XMAligner considers a pair of nucleotides from two sequences to be related if their mutual information in context is significant. The information content of nucleotides in sequences is measured by a lossless compression technique.

Results

Experiments on both simulated data and real data show that XMAligner is superior to conventional methods, especially on distantly related sequences and statistically biased data. XMAligner can align sequences of eukaryote genome size with only a modest hardware requirement. Importantly, the method has an objective function which can obviate the need to choose parameter values for high quality alignment. The alignment results from XMAligner can be integrated into a visualisation tool for viewing.

Conclusions

The information-theoretic approach for sequence alignment is shown to overcome the mentioned problems of conventional character matching alignment methods. The article shows that, as genomic sequences are meant to carry information, considering the information content of nucleotides is helpful for genomic sequence alignment.

Availability

Downloadable binaries, documentation and data can be found at ftp://ftp.infotech.monash.edu.au/software/DNAcompress-XM/XMAligner/.

Frith MC. A new repeat-masking method enables specific detection of homologous sequences. Nucleic Acids Res 2010;39:e23. [PMID: 21109538 PMCID: PMC3045581 DOI: 10.1093/nar/gkq1212] [Citation(s) in RCA: 112] [Impact Index Per Article: 8.0] [Indexed: 01/19/2023] Open

Altschul SF, Wootton JC, Zaslavsky E, Yu YK. The construction and use of log-odds substitution scores for multiple sequence alignment. PLoS Comput Biol 2010;6:e1000852. [PMID: 20657661 PMCID: PMC2904766 DOI: 10.1371/journal.pcbi.1000852] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.8] [Received: 10/30/2009] [Accepted: 06/03/2010] [Indexed: 01/18/2023] Open

Considering scores between unrelated proteins in the search database improves profile comparison. BMC Bioinformatics 2009;10:399. [PMID: 19961610 PMCID: PMC3087343 DOI: 10.1186/1471-2105-10-399] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/13/2009] [Accepted: 12/04/2009] [Indexed: 12/02/2022] Open

Park Y, Sheetlin S, Spouge JL. Estimating the Gumbel scale parameter for local alignment of random sequences by importance sampling with stopping times. Ann Stat 2009;37:3697. [PMID: 20148197 DOI: 10.1214/08-aos663] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]

Ostlund G, Schmitt T, Forslund K, Köstler T, Messina DN, Roopra S, Frings O, Sonnhammer ELL. InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res 2009;38:D196-203. [PMID: 19892828 PMCID: PMC2808972 DOI: 10.1093/nar/gkp931] [Citation(s) in RCA: 461] [Impact Index Per Article: 30.7] [Indexed: 12/04/2022] Open

Nguyen VH, Lavenier D. PLAST: parallel local alignment search tool for database comparison. BMC Bioinformatics 2009;10:329. [PMID: 19821978 PMCID: PMC2770072 DOI: 10.1186/1471-2105-10-329] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.9] [Received: 02/18/2009] [Accepted: 10/12/2009] [Indexed: 11/10/2022] Open

Forslund K, Sonnhammer ELL. Benchmarking homology detection procedures with low complexity filters. Bioinformatics 2009;25:2500-5. [PMID: 19620098 DOI: 10.1093/bioinformatics/btp446] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Indexed: 11/14/2022]

Jimenez-Morales D, Adamian L, Liang J. Detecting remote homologues using scoring matrices calculated from the estimation of amino acid substitution rates of beta-barrel membrane proteins. Annu Int Conf IEEE Eng Med Biol Soc 2009;2008:1347-50. [PMID: 19162917 DOI: 10.1109/iembs.2008.4649414] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/05/2022]

Poleksic A. Island method for estimating the statistical significance of profile-profile alignment scores. BMC Bioinformatics 2009;10:112. [PMID: 19379500 PMCID: PMC2678096 DOI: 10.1186/1471-2105-10-112] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 12/25/2008] [Accepted: 04/20/2009] [Indexed: 11/10/2022] Open

Stojmirović A, Yu YK. Geometric aspects of biological sequence comparison. J Comput Biol 2009;16:579-610. [PMID: 19361329 DOI: 10.1089/cmb.2008.0100] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/12/2022] Open

Agrawal A, Huang X. PSIBLAST_PairwiseStatSig: reordering PSI-BLAST hits using pairwise statistical significance. Bioinformatics 2009;25:1082-3. [DOI: 10.1093/bioinformatics/btp089] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 11/14/2022] Open