1
Vicedomini R, Bouly JP, Laine E, Falciatore A, Carbone A. Multiple profile models extract features from protein sequence data and resolve functional diversity of very different protein families. Mol Biol Evol 2022; 39:6556147. [PMID: 35353898 PMCID: PMC9016551 DOI: 10.1093/molbev/msac070] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 11/14/2022]
Abstract
Functional classification of proteins from sequences alone has become a critical bottleneck in understanding the myriad of protein sequences that accumulate in our databases. The great diversity of homologous sequences hides, in many cases, a variety of functional activities that cannot be anticipated. Their identification appears critical for a fundamental understanding of the evolution of living organisms and for biotechnological applications. ProfileView is a sequence-based computational method, designed to functionally classify sets of homologous sequences. It relies on two main ideas: the use of multiple profile models whose construction explores evolutionary information in available databases, and a novel definition of a representation space in which to analyse sequences with multiple profile models combined together. ProfileView classifies protein families by enriching known functional groups with new sequences and discovering new groups and subgroups. We validate ProfileView on seven classes of widespread proteins involved in the interaction with nucleic acids, amino acids and small molecules, and in a large variety of functions and enzymatic reactions. ProfileView agrees with the large set of functional data collected for these proteins from the literature regarding the organisation into functional subgroups and residues that characterise the functions. In addition, ProfileView resolves undefined functional classifications and extracts the molecular determinants underlying protein functional diversity, showing its potential to select sequences towards accurate experimental design and discovery of novel biological functions. On protein families with complex domain architecture, ProfileView functional classification reconciles domain combinations, unlike phylogenetic reconstruction.
ProfileView proves to outperform the functional classification approach PANTHER, the two k-mer based methods CUPP and eCAMI and a neural network approach based on Restricted Boltzmann Machines. It overcomes time complexity limitations of the latter.
Affiliation(s)
- R Vicedomini
- Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative - UMR 7238, 4 place Jussieu, 75005 Paris, France
- Sorbonne Université, Institut des Sciences du Calcul et des Données
- J P Bouly
- Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative - UMR 7238, 4 place Jussieu, 75005 Paris, France
- CNRS, Sorbonne Université Institut de Biologie Physico-Chimique, Laboratory of Chloroplast Biology and Light Sensing in Microalgae - UMR7141, Paris, France
- E Laine
- Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative - UMR 7238, 4 place Jussieu, 75005 Paris, France
- A Falciatore
- Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative - UMR 7238, 4 place Jussieu, 75005 Paris, France
- CNRS, Sorbonne Université Institut de Biologie Physico-Chimique, Laboratory of Chloroplast Biology and Light Sensing in Microalgae - UMR7141, Paris, France
- A Carbone
- Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative - UMR 7238, 4 place Jussieu, 75005 Paris, France
- Institut Universitaire de France, Paris 75005, France
2
Tang H, Thomas PD. Tools for Predicting the Functional Impact of Nonsynonymous Genetic Variation. Genetics 2016; 203:635-47. [PMID: 27270698 PMCID: PMC4896183 DOI: 10.1534/genetics.116.190033] [Citation(s) in RCA: 75] [Impact Index Per Article: 9.4] [Received: 09/04/2015] [Accepted: 04/01/2016] [Indexed: 01/09/2023]
Abstract
As personal genome sequencing becomes a reality, understanding the effects of genetic variants on phenotype (particularly the impact of germline variants on disease risk and the impact of somatic variants on cancer development and treatment) continues to increase in importance. Because of their clear potential for affecting phenotype, nonsynonymous genetic variants (variants that cause a change in the amino acid sequence of a protein encoded by a gene) have long been the target of efforts to predict the effects of genetic variation. Whole-genome sequencing is identifying large numbers of nonsynonymous variants in each genome, intensifying the need for computational methods that accurately predict which of these are likely to impact disease phenotypes. This review focuses on nonsynonymous variant prediction with two aims in mind: (1) to review the prioritization methods that have been developed to date and the principles on which they are based and (2) to discuss the challenges to further improving these methods.
Affiliation(s)
- Haiming Tang
- Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, California 90033
- Paul D Thomas
- Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, California 90033
3
Ochoa A, Storey JD, Llinás M, Singh M. Beyond the E-Value: Stratified Statistics for Protein Domain Prediction. PLoS Comput Biol 2015; 11:e1004509. [PMID: 26575353 PMCID: PMC4648515 DOI: 10.1371/journal.pcbi.1004509] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 09/23/2014] [Accepted: 08/03/2015] [Indexed: 01/25/2023]
Abstract
E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for “stratified” multiple hypothesis testing problems—that is, those in which statistical tests can be partitioned naturally—controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context. We show that stratified q-value thresholds substantially outperform E-values. Contradicting our theoretical results, q-values also outperform lFDRs; however, our tests reveal a small but coherent subset of domain families, biased towards models for specific repetitive patterns, for which weaknesses in random sequence models yield notably inaccurate statistical significance measures. Usage of lFDR thresholds outperform q-values for the remaining families, which have as-expected noise, suggesting that further improvements in domain predictions can be achieved with improved modeling of random sequences. Overall, our theoretical and empirical findings suggest that the use of stratified q-values and lFDRs could result in improvements in a host of structured multiple hypothesis testing problems arising in bioinformatics, including genome-wide association studies, orthology prediction, and motif scanning. 
Despite decades of research, it remains a challenge to distinguish homologous relationships between proteins from sequence similarities arising due to chance alone. This is an increasingly important problem as sequence database sizes continue to grow, and even today many computational analyses require that the statistics of billions of sequence comparisons be assessed automatically. Here we explore statistical significance evaluation on data that is stratified—that is, naturally partitioned into subsets that may differ in their amount of signal—and find a theoretically optimal criterion for automatically setting thresholds of significance for each stratum. For the task of domain prediction, an important component of efforts to annotate protein sequences and identify remote sequence homologs, we empirically show that our stratified analysis of statistical significance greatly improves upon a combined analysis. Further, we identify weaknesses in the prevailing random sequence model for assessing statistical significance for a small subset of domain families with repetitive sequence patterns and known biological, structural, and evolutionary properties. Our theoretical findings in statistics are relevant not only for identifying protein domains, but for arbitrary stratified problems in genomics and beyond.
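The stratified thresholding described in this abstract can be illustrated with a minimal sketch. It assumes each prediction already carries a p-value and a stratum label (for example, a domain family), and it uses the standard Benjamini-Hochberg procedure as a stand-in q-value estimator; the paper's own FDR-estimating algorithms are not reproduced here. The function names and the `PF_A`/`PF_B` labels are hypothetical.

```python
from collections import defaultdict


def bh_qvalues(pvalues):
    """Benjamini-Hochberg q-values for one list of p-values (input order kept)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each q-value is the minimum
    # of p * m / rank over its own rank and all higher ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


def stratified_qvalues(hits):
    """hits: list of (stratum, p_value) pairs.

    Returns q-values in input order, each computed only against the
    other hits that share the same stratum.
    """
    by_stratum = defaultdict(list)
    for i, (stratum, p) in enumerate(hits):
        by_stratum[stratum].append((i, p))
    result = [None] * len(hits)
    for members in by_stratum.values():
        qs = bh_qvalues([p for _, p in members])
        for (i, _), q in zip(members, qs):
            result[i] = q
    return result


# Hypothetical hits from two strata (labels are made up for illustration).
hits = [("PF_A", 0.01), ("PF_A", 0.02), ("PF_A", 0.03), ("PF_A", 0.5),
        ("PF_B", 0.2), ("PF_B", 0.9)]
qvals = stratified_qvalues(hits)
```

Because each stratum is corrected separately, a stratum with strong signal is not penalized by the number of tests performed in a noisier one, which is the intuition behind thresholding per family rather than over all families combined.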
Affiliation(s)
- Alejandro Ochoa
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Center for Statistics and Machine Learning, Princeton University, Princeton, New Jersey, United States of America
- John D. Storey
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Center for Statistics and Machine Learning, Princeton University, Princeton, New Jersey, United States of America
- Manuel Llinás
- Department of Biochemistry and Molecular Biology, and the Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania, United States of America
- Mona Singh
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- * E-mail:
4
Liu M, Watson LT, Zhang L. Quantitative prediction of the effect of genetic variation using hidden Markov models. BMC Bioinformatics 2014; 15:5. [PMID: 24405700 PMCID: PMC3893606 DOI: 10.1186/1471-2105-15-5] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 09/02/2013] [Accepted: 01/02/2014] [Indexed: 11/10/2022]
Abstract
Background: With the development of sequencing technologies, more and more sequence variants are available for investigation. Different classes of variants in the human genome have been identified, including single nucleotide substitutions, insertions and deletions, and large structural variations such as duplications and deletions. Insertion and deletion (indel) variants comprise a major proportion of human genetic variation. However, little is known about their effects on humans. This lack of understanding is largely due to the lack of both biological data and computational resources. Results: This paper presents a new indel functional prediction method, HMMvar, based on HMM profiles, which capture the conservation information in sequences. The results demonstrate that a scoring strategy based on HMM profiles can achieve good performance in identifying deleterious or neutral variants for different data sets, and can predict the protein functional effects of both single and multiple mutations. Conclusions: This paper proposes a quantitative prediction method, HMMvar, to predict the effect of genetic variation using hidden Markov models. The HMM-based pipeline program implementing HMMvar is freely available at https://bioinformatics.cs.vt.edu/zhanglab/hmm.
Affiliation(s)
- Liqing Zhang
- Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA.
5
Pleiotropic functions of catabolite control protein CcpA in Butanol-producing Clostridium acetobutylicum. BMC Genomics 2012; 13:349. [PMID: 22846451 PMCID: PMC3507653 DOI: 10.1186/1471-2164-13-349] [Citation(s) in RCA: 54] [Impact Index Per Article: 4.5] [Received: 01/12/2012] [Accepted: 06/28/2012] [Indexed: 12/24/2022]
Abstract
Background Clostridium acetobutylicum has been used to produce butanol in industry. Catabolite control protein A (CcpA), known to mediate carbon catabolite repression (CCR) in low GC gram-positive bacteria, has been identified and characterized in C. acetobutylicum by our previous work (Ren, C. et al. 2010, Metab Eng 12:446–54). To further dissect its regulatory function in C. acetobutylicum, CcpA was investigated using DNA microarray followed by phenotypic, genetic and biochemical validation. Results CcpA controls not only genes in carbon metabolism, but also those genes in solvent production and sporulation of the life cycle in C. acetobutylicum: i) CcpA directly repressed transcription of genes related to transport and metabolism of non-preferred carbon sources such as d-xylose and l-arabinose, and activated expression of genes responsible for d-glucose PTS system; ii) CcpA is involved in positive regulation of the key solventogenic operon sol (adhE1-ctfA-ctfB) and negative regulation of acidogenic gene bukII; and iii) transcriptional alterations were observed for several sporulation-related genes upon ccpA inactivation, which may account for the lower sporulation efficiency in the mutant, suggesting CcpA may be necessary for efficient sporulation of C. acetobutylicum, an important trait adversely affecting the solvent productivity. Conclusions This study provided insights to the pleiotropic functions that CcpA displayed in butanol-producing C. acetobutylicum. The information could be valuable for further dissecting its pleiotropic regulatory mechanism in C. acetobutylicum, and for genetic modification in order to obtain more effective butanol-producing Clostridium strains.
6
Ochoa A, Llinás M, Singh M. Using context to improve protein domain identification. BMC Bioinformatics 2011; 12:90. [PMID: 21453511 PMCID: PMC3090354 DOI: 10.1186/1471-2105-12-90] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Received: 10/25/2010] [Accepted: 03/31/2011] [Indexed: 11/10/2022]
Abstract
BACKGROUND Identifying domains in protein sequences is an important step in protein structural and functional annotation. Existing domain recognition methods typically evaluate each domain prediction independently of the rest. However, the majority of proteins are multidomain, and pairwise domain co-occurrences are highly specific and non-transitive. RESULTS Here, we demonstrate how to exploit domain co-occurrence to boost weak domain predictions that appear in previously observed combinations, while penalizing higher confidence domains if such combinations have never been observed. Our framework, Domain Prediction Using Context (dPUC), incorporates pairwise "context" scores between domains, along with traditional domain scores and thresholds, and improves domain prediction across a variety of organisms from bacteria to protozoa and metazoa. Among the genomes we tested, dPUC is most successful at improving predictions for the poorly-annotated malaria parasite Plasmodium falciparum, for which over 38% of the genome is currently unannotated. Our approach enables high-confidence annotations in this organism and the identification of orthologs to many core machinery proteins conserved in all eukaryotes, including those involved in ribosomal assembly and other RNA processing events, which surprisingly had not been previously known. CONCLUSIONS Overall, our results demonstrate that this new context-based approach will provide significant improvements in domain and function prediction, especially for poorly understood genomes for which the need for additional annotations is greatest. Source code for the algorithm is available under a GPL open source license at http://compbio.cs.princeton.edu/dpuc/. Pre-computed results for our test organisms and a web server are also available at that location.
Affiliation(s)
- Alejandro Ochoa
- Department of Molecular Biology, Princeton University, Princeton, NJ, USA
7
Machado-Lima A, Kashiwabara AY, Durham AM. Decreasing the number of false positives in sequence classification. BMC Genomics 2010; 11 Suppl 5:S10. [PMID: 21210966 PMCID: PMC3045793 DOI: 10.1186/1471-2164-11-s5-s10] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/29/2022]
Abstract
Background: A large number of probabilistic models used in sequence analysis assign non-zero probability values to most input sequences. The most common way to decide whether a given probability is sufficient is Bayesian binary classification, where the probability of the model characterizing the sequence family of interest is compared to that of an alternative probability model. A null model can serve as this alternative model. This is the scoring technique used by sequence analysis tools such as HMMER, SAM and INFERNAL. The most prevalent null models are position-independent residue distributions that include: the uniform distribution, genomic distribution, family-specific distribution and the target sequence distribution. This paper presents a study to evaluate the impact of the choice of a null model on the final result of classifications. In particular, we are interested in minimizing the number of false predictions in a classification. This is a crucial issue for reducing the costs of biological validation. Results: For all the tests, the target null model presented the lowest number of false positives when using random sequences as a test. The study was performed on DNA sequences using GC content as the measure of content bias, but the results should also be valid for protein sequences. To broaden the application of the results, the study was performed using randomly generated sequences. Previous studies were performed on amino acid sequences, using only one probabilistic model (HMM) and on a specific benchmark, and lack more general conclusions about the performance of null models. Finally, a benchmark test with P. falciparum confirmed these results. Conclusions: Of the evaluated models, the best suited for classification are the uniform model and the target model. However, the use of the uniform model presents a GC bias that can cause more false positives for candidate sequences with extreme compositional bias, a characteristic not described in previous studies. In these cases the target model is more dependable for biological validation due to its higher specificity.
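The effect of the null model on a log-odds score can be shown with a small sketch. It is a toy illustration only: the "family model" is a hypothetical position-independent GC-rich distribution rather than an HMM, and the two null models are the uniform distribution and the target-sequence composition named in the abstract. All names and numbers below are assumptions made for the example.

```python
import math
from collections import Counter


def log_odds(sequence, model_probs, null_probs):
    """Log-odds score in bits of a sequence under a model versus a null model."""
    return sum(math.log2(model_probs[b] / null_probs[b]) for b in sequence)


def uniform_null(alphabet="ACGT"):
    """Uniform null model: every symbol is equally likely."""
    return {b: 1.0 / len(alphabet) for b in alphabet}


def target_null(sequence):
    """Target null model: symbol frequencies of the scored sequence itself."""
    counts = Counter(sequence)
    return {b: c / len(sequence) for b, c in counts.items()}


# Hypothetical GC-rich family model (position-independent, for illustration only).
gc_rich_model = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}

# A candidate whose only "signal" is its extreme GC content.
candidate = "GCGCGCGG"

score_uniform = log_odds(candidate, gc_rich_model, uniform_null())
score_target = log_odds(candidate, gc_rich_model, target_null(candidate))
```

In this toy case the uniform null rewards the candidate purely for its composition (positive score), while the target null absorbs the composition and the score turns negative, which mirrors the GC-bias concern raised in the conclusions.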
Affiliation(s)
- Ariane Machado-Lima
- Escola de Artes, Ciências e Humanidades, Universidade de São Paulo, Rua Arlindo Béttio, 1000, 03828-000, São Paulo, SP, Brazil
8
Riley T, Yu X, Sontag E, Levine A. The p53HMM algorithm: using profile hidden markov models to detect p53-responsive genes. BMC Bioinformatics 2009; 10:111. [PMID: 19379484 PMCID: PMC2685388 DOI: 10.1186/1471-2105-10-111] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 01/25/2008] [Accepted: 04/20/2009] [Indexed: 12/03/2022]
Abstract
Background: A computational method (called p53HMM) is presented that utilizes Profile Hidden Markov Models (PHMMs) to estimate the relative binding affinities of putative p53 response elements (REs), both p53 single-sites and cluster-sites. These models incorporate a novel "Corresponded Baum-Welch" training algorithm that provides increased predictive power by exploiting the redundancy of information found in the repeated, palindromic p53-binding motif. The predictive accuracy of these new models is compared against other predictive models, including position specific score matrices (PSSMs, or weight matrices). We also present a new dynamic acceptance threshold, dependent upon a putative binding site's distance from the Transcription Start Site (TSS) and its estimated binding affinity. This new criterion for classifying putative p53-binding sites increases predictive accuracy by reducing the false positive rate. Results: Training a Profile Hidden Markov Model with corresponding positions matching a combined-palindromic p53-binding motif creates the best p53-RE predictive model. The p53HMM algorithm is available on-line. Conclusion: Using Profile Hidden Markov Models with training methods that exploit the redundant information of the homotetramer p53 binding site provides better predictive models than weight matrices (PSSMs). These methods may also boost performance when applied to other transcription factor binding sites.
Affiliation(s)
- Todd Riley
- The Institute for Advanced Study, Princeton, NJ, USA.
9
Stojmirović A, Gertz EM, Altschul SF, Yu YK. The effectiveness of position- and composition-specific gap costs for protein similarity searches. Bioinformatics 2008; 24:i15-23. [PMID: 18586708 PMCID: PMC2718649 DOI: 10.1093/bioinformatics/btn171] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/20/2022]
Abstract
Motivation: The flexibility in gap cost enjoyed by hidden Markov models (HMMs) is expected to afford them better retrieval accuracy than position-specific scoring matrices (PSSMs). We attempt to quantify the effect of more general gap parameters by separately examining the influence of position- and composition-specific gap scores, as well as by comparing the retrieval accuracy of the PSSMs constructed using an iterative procedure to that of the HMMs provided by Pfam and SUPERFAMILY, curated ensembles of multiple alignments. Results: We found that position-specific gap penalties have an advantage over uniform gap costs. We did not explore optimizing distinct uniform gap costs for each query. For Pfam, PSSMs iteratively constructed from seeds based on HMM consensus sequences perform equivalently to HMMs that were adjusted to have constant gap transition probabilities, albeit with much greater variance. We observed no effect of composition-specific gap costs on retrieval performance. These results suggest possible improvements to the PSI-BLAST protein database search program. Availability: The scripts for performing evaluations are available upon request from the authors. Contact:yyu@ncbi.nlm.nih.gov
Affiliation(s)
- Aleksandar Stojmirović
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
10
Eddy SR. A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput Biol 2008; 4:e1000069. [PMID: 18516236 PMCID: PMC2396288 DOI: 10.1371/journal.pcbi.1000069] [Citation(s) in RCA: 229] [Impact Index Per Article: 14.3] [Received: 12/05/2007] [Accepted: 03/26/2008] [Indexed: 11/19/2022]
Abstract
Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (λ) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty (“Forward” scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores (“Viterbi” scores) are Gumbel-distributed with constant λ = log 2, and the high scoring tail of Forward scores is exponential with the same constant λ. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile-hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments. Sequence database searches are a fundamental tool of molecular biology, enabling researchers to identify related sequences in other organisms, which often provides invaluable clues to the function and evolutionary history of genes. The power of database searches to detect more and more remote evolutionary relationships – essentially, to look back deeper in time – has improved steadily, with the adoption of more complex and realistic models. However, database searches require not just a realistic scoring model, but also the ability to distinguish good scores from bad ones – the ability to calculate the statistical significance of scores. For many models and scoring schemes, accurate statistical significance calculations have either involved expensive computational simulations, or not been feasible at all. 
Here, I introduce a probabilistic model of local sequence alignment that has readily predictable score statistics for position-specific profile scoring systems, and not just for traditional optimal alignment scores, but also for more powerful log-likelihood ratio scores derived in a full probabilistic inference framework. These results remove one of the main obstacles that have impeded the use of more powerful and biologically realistic statistical inference methods in sequence homology searches.
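The two conjectured score distributions translate directly into P-value and E-value formulas, sketched below. The constant λ = log 2 is taken from the abstract; the location parameters `mu` (Gumbel) and `tau` (exponential tail) are assumptions here, treated as values fitted separately for each profile, and the function names are hypothetical.

```python
import math

# Conjectured slope for bit scores: lambda = log 2 (natural logarithm of 2).
LAMBDA = math.log(2)


def viterbi_pvalue(bit_score, mu):
    """P(S >= bit_score) for a Gumbel-distributed Viterbi score.

    mu is the Gumbel location parameter (assumed to be fitted per profile).
    Uses expm1 so that very small P-values keep their precision.
    """
    return -math.expm1(-math.exp(-LAMBDA * (bit_score - mu)))


def forward_pvalue(bit_score, tau):
    """P(S >= bit_score) for the exponential high-scoring tail of Forward scores.

    tau is the tail location parameter (assumed to be fitted per profile).
    The approximation is meant only for the high-scoring tail, so the
    result is capped at 1.
    """
    return min(1.0, math.exp(-LAMBDA * (bit_score - tau)))


def evalue(pvalue, n_sequences):
    """Expected number of hits at least this good in a database of n_sequences."""
    return pvalue * n_sequences
```

With λ fixed at log 2, every additional bit of score halves the tail probability, so only the location parameter remains to be estimated for each profile.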
Affiliation(s)
- Sean R Eddy
- Howard Hughes Medical Institute, Janelia Farm Research Campus, Ashburn, Virginia, United States of America.
11
Poleksic A, Fienup M. Optimizing the size of the sequence profiles to increase the accuracy of protein sequence alignments generated by profile-profile algorithms. Bioinformatics 2008; 24:1145-53. [PMID: 18337259 DOI: 10.1093/bioinformatics/btn097] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/12/2022]
Abstract
MOTIVATION Profile-based protein homology detection algorithms are valuable tools in genome annotation and protein classification. By utilizing information present in the sequences of homologous proteins, profile-based methods are often able to detect extremely weak relationships between protein sequences, as evidenced by the large-scale benchmarking experiments such as CASP and LiveBench. RESULTS We study the relationship between the sensitivity of a profile-profile method and the size of the sequence profile, which is defined as the average number of different residue types observed at the profile's positions. We also demonstrate that improvements in the sensitivity of a profile-profile method can be made by incorporating a profile-dependent scoring scheme, such as position-specific background frequencies. The techniques presented in this article are implemented in an alignment algorithm UNI-FOLD. When tested against other well-established methods for fold recognition, UNI-FOLD shows increased sensitivity and specificity in detecting remote relationships between protein sequences. AVAILABILITY UNI-FOLD web server can be accessed at http://blackhawk.cs.uni.edu
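The profile size defined in this abstract (the average number of different residue types observed at the profile's positions) can be computed from a multiple alignment as in the sketch below. Treating `-` and `.` as gaps and ignoring them is an assumption of this example, as is working from raw alignment columns rather than from weighted profile frequencies; `profile_size` is a hypothetical name, not a UNI-FOLD function.

```python
def profile_size(alignment, gap_chars="-."):
    """Average number of distinct residue types per alignment column.

    alignment: list of equal-length aligned sequences.
    Gap characters are ignored (an assumption of this sketch).
    """
    if not alignment or not alignment[0]:
        raise ValueError("alignment must contain at least one non-empty row")
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    total_distinct = 0
    for column in zip(*alignment):
        residues = {c.upper() for c in column if c not in gap_chars}
        total_distinct += len(residues)
    return total_distinct / length


# Toy alignment: columns hold {A}, {C, G} and {D, E}.
toy_size = profile_size(["ACD", "ACE", "AGE"])
```

A profile built from many divergent homologs yields a larger value than one built from a few near-identical sequences, which is the quantity the study relates to sensitivity.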
Affiliation(s)
- Aleksandar Poleksic
- Department of Computer Science, University of Northern Iowa, Cedar Falls, IA 50614, USA.
12
Kann MG, Sheetlin SL, Park Y, Bryant SH, Spouge JL. The identification of complete domains within protein sequences using accurate E-values for semi-global alignment. Nucleic Acids Res 2007; 35:4678-85. [PMID: 17596268 PMCID: PMC1950549 DOI: 10.1093/nar/gkm414] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Indexed: 11/14/2022]
Abstract
The sequencing of complete genomes has created a pressing need for automated annotation of gene function. Because domains are the basic units of protein function and evolution, a gene can be annotated from a domain database by aligning domains to the corresponding protein sequence. Ideally, complete domains are aligned to protein subsequences, in a ‘semi-global alignment’. Local alignment, which aligns pieces of domains to subsequences, is common in high-throughput annotation applications, however. It is a mature technique, with the heuristics and accurate E-values required for screening large databases and evaluating the screening results. Hidden Markov models (HMMs) provide an alternative theoretical framework for semi-global alignment, but their use is limited because they lack heuristic acceleration and accurate E-values. Our new tool, GLOBAL, overcomes some limitations of previous semi-global HMMs: it has accurate E-values and the possibility of the heuristic acceleration required for high-throughput applications. Moreover, according to a standard of truth based on protein structure, two semi-global HMM alignment tools (GLOBAL and HMMer) had comparable performance in identifying complete domains, but distinctly outperformed two tools based on local alignment. When searching for complete protein domains, therefore, GLOBAL avoids disadvantages commonly associated with HMMs, yet maintains their superior retrieval performance.
Affiliation(s)
- John L. Spouge
- *To whom correspondence should be addressed.301 402 9310301 480 2484
13
Parsons M, Worthey EA, Ward PN, Mottram JC. Comparative analysis of the kinomes of three pathogenic trypanosomatids: Leishmania major, Trypanosoma brucei and Trypanosoma cruzi. BMC Genomics 2005; 6:127. [PMID: 16164760 PMCID: PMC1266030 DOI: 10.1186/1471-2164-6-127] [Citation(s) in RCA: 273] [Impact Index Per Article: 14.4] [Received: 07/15/2005] [Accepted: 09/15/2005] [Indexed: 12/27/2022]
Abstract
Background: The trypanosomatids Leishmania major, Trypanosoma brucei and Trypanosoma cruzi cause some of the most debilitating diseases of humankind: cutaneous leishmaniasis, African sleeping sickness, and Chagas disease. These protozoa possess complex life cycles that involve development in mammalian and insect hosts, and a tightly coordinated cell cycle ensures propagation of the highly polarized cells. However, the ways in which the parasites respond to their environment and coordinate intracellular processes are poorly understood. As a part of an effort to understand parasite signaling functions, we report the results of a genome-wide analysis of protein kinases (PKs) of these three trypanosomatids.
Results: Bioinformatic searches of the trypanosomatid genomes for eukaryotic PKs (ePKs) and atypical PKs (aPKs) revealed a total of 176 PKs in T. brucei, 190 in T. cruzi and 199 in L. major, most of which are orthologous across the three species. This is approximately 30% of the number in the human host and double that of the malaria parasite, Plasmodium falciparum. The representation of various groups of ePKs differs significantly as compared to humans: trypanosomatids lack receptor-linked tyrosine and tyrosine kinase-like kinases, although they do possess dual-specificity kinases. A relative expansion of the CMGC, STE and NEK groups has occurred. A large number of unique ePKs show no strong affinity to any known group. The trypanosomatids possess few ePKs with predicted transmembrane domains, suggesting that receptor ePKs are rare. Accessory Pfam domains, which are frequently present in human ePKs, are uncommon in trypanosomatid ePKs.
Conclusion: Trypanosomatids possess a large set of PKs, comprising approximately 2% of each genome, suggesting a key role for phosphorylation in parasite biology. Whilst it was possible to place most of the trypanosomatid ePKs into the seven established groups using bioinformatic analyses, it has not been possible to ascribe function based solely on sequence similarity. Hence the connection of stimuli to protein phosphorylation networks remains enigmatic. The presence of numerous PKs with significant sequence similarity to known drug targets, as well as a large number of unusual kinases that might represent novel targets, strongly argues for functional analysis of these molecules.
Affiliation(s)
- Marilyn Parsons
- Seattle Biomedical Research Institute, 307 Westlake Ave. N., Seattle, WA, 98109 USA
- Department of Pathobiology, University of Washington, Seattle, WA, 98195 USA
- Elizabeth A Worthey
- Seattle Biomedical Research Institute, 307 Westlake Ave. N., Seattle, WA, 98109 USA
- Pauline N Ward
- Wellcome Centre for Molecular Parasitology, The Anderson College, University of Glasgow, Glasgow G11 6NU, UK
- Jeremy C Mottram
- Wellcome Centre for Molecular Parasitology, The Anderson College, University of Glasgow, Glasgow G11 6NU, UK
|
14
|
Wistrand M, Sonnhammer ELL. Improved profile HMM performance by assessment of critical algorithmic features in SAM and HMMER. BMC Bioinformatics 2005; 6:99. [PMID: 15831105 PMCID: PMC1097716 DOI: 10.1186/1471-2105-6-99] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.1] [Received: 02/01/2005] [Accepted: 04/15/2005] [Indexed: 11/24/2022]
Abstract
Background: Profile hidden Markov model (HMM) techniques are among the most powerful methods for protein homology detection. Yet, the critical features for successful modelling are not fully known. In the present work we approached this by using two of the most popular HMM packages: SAM and HMMER. The programs' abilities to build models and score sequences were compared on a SCOP/Pfam based test set. The comparison was done separately for local and global HMM scoring.
Results: Using default settings, SAM was overall more sensitive. SAM's model estimation was superior, while HMMER's model scoring was more accurate. Critical features for model building were then analysed by comparing the two packages' algorithmic choices and parameters. The weighting between prior probabilities and multiple alignment counts held the primary explanation why SAM's model building was superior. Our analysis suggests that HMMER gives too much weight to the sequence counts. SAM's emission prior probabilities were also shown to be more sensitive. The relative sequence weighting schemes are different in the two packages but performed equivalently.
Conclusion: SAM model estimation was more sensitive, while HMMER model scoring was more accurate. By combining the best algorithmic features from both packages the accuracy was substantially improved compared to their default performance.
Affiliation(s)
- Markus Wistrand
- Center for Genomics and Bioinformatics, Karolinska Institutet, S-17177 Stockholm, Sweden
- Erik LL Sonnhammer
- Center for Genomics and Bioinformatics, Karolinska Institutet, S-17177 Stockholm, Sweden
|
15
|
Abstract
Motivation: Protein homology detection and sequence alignment are at the basis of protein structure prediction, function prediction and evolution.
Results: We have generalized the alignment of protein sequences with a profile hidden Markov model (HMM) to the case of pairwise alignment of profile HMMs. We present a method for detecting distant homologous relationships between proteins based on this approach. The method (HHsearch) is benchmarked together with BLAST, PSI-BLAST, HMMER and the profile-profile comparison tools PROF_SIM and COMPASS, in an all-against-all comparison of a database of 3691 protein domains from SCOP 1.63 with pairwise sequence identities below 20%.
Sensitivity: When the predicted secondary structure is included in the HMMs, HHsearch is able to detect between 2.7 and 4.2 times more homologs than PSI-BLAST or HMMER and between 1.44 and 1.9 times more than COMPASS or PROF_SIM for a rate of false positives of 10%. Approximately half of the improvement over the profile-profile comparison methods is attributable to the use of profile HMMs in place of simple profiles.
Alignment quality: Higher sensitivity is mirrored by an increased alignment quality. HHsearch produced 1.2, 1.7 and 3.3 times more good alignments ('balanced' score >0.3) than the next best method (COMPASS), and 1.6, 2.9 and 9.4 times more than PSI-BLAST, at the family, superfamily and fold level, respectively.
Speed: HHsearch scans a query of 200 residues against 3691 domains in 33 s on an AMD64 2 GHz PC. This is 10 times faster than PROF_SIM and 17 times faster than COMPASS.
Affiliation(s)
- Johannes Söding
- Department of Protein Evolution, Max-Planck-Institute for Developmental Biology, Spemannstrasse 35, D-72076 Tübingen, Germany.
|
16
|
Paredes CJ, Rigoutsos I, Papoutsakis ET. Transcriptional organization of the Clostridium acetobutylicum genome. Nucleic Acids Res 2004; 32:1973-81. [PMID: 15060177 PMCID: PMC390361 DOI: 10.1093/nar/gkh509] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.5] [Indexed: 11/14/2022]
Abstract
Prokaryotic genes are frequently organized in multicistronic operons (or transcriptional units, TUs), and usually the regulatory motifs for the whole TU are located upstream of the first TU gene. Although the number of sequenced genomes has increased dramatically, experimental information on TU organization is extremely limited. Even for organisms as extensively studied as Escherichia coli and Bacillus subtilis, TU annotation is far from complete. It therefore becomes imperative to rely on computational approaches to complement experimental information. Here we present a TU map for the obligate anaerobe Clostridium acetobutylicum ATCC 824. This map is largely based on the distance between pairs of consecutive genes but enhanced and refined by predictions of several types of promoters (sigmaA, sigmaE and sigmaF/G) and rho-independent terminator structures. Based on the set of known C. acetobutylicum TUs, the presented TU map offers an 88% prediction accuracy.
Affiliation(s)
- Carlos J Paredes
- Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL 60208, USA
|
17
|
Churches T, Christen P, Lim K, Zhu JX. Preparation of name and address data for record linkage using hidden Markov models. BMC Med Inform Decis Mak 2002; 2:9. [PMID: 12482326 PMCID: PMC140019 DOI: 10.1186/1472-6947-2-9] [Citation(s) in RCA: 19] [Impact Index Per Article: 0.9] [Received: 10/29/2002] [Accepted: 12/13/2002] [Indexed: 11/10/2022]
Abstract
Background: Record linkage refers to the process of joining records that relate to the same entity or event in one or more data collections. In the absence of a shared, unique key, record linkage involves the comparison of ensembles of partially-identifying, non-unique data items between pairs of records. Data items with variable formats, such as names and addresses, need to be transformed and normalised in order to validly carry out these comparisons. Traditionally, deterministic rule-based data processing systems have been used to carry out this pre-processing, which is commonly referred to as "standardisation". This paper describes an alternative approach to standardisation, using a combination of lexicon-based tokenisation and probabilistic hidden Markov models (HMMs).
Methods: HMMs were trained to standardise typical Australian name and address data drawn from a range of health data collections. The accuracy of the results was compared to that produced by rule-based systems.
Results: Training of HMMs was found to be quick and did not require any specialised skills. For addresses, HMMs produced equal or better standardisation accuracy than a widely-used rule-based system. However, accuracy was worse when used with simpler name data. Possible reasons for this poorer performance are discussed.
Conclusion: Lexicon-based tokenisation and HMMs provide a viable and effort-effective alternative to rule-based systems for pre-processing more complex variably formatted data such as addresses. Further work is required to improve the performance of this approach with simpler data such as names. Software which implements the methods described in this paper is freely available under an open source license for other researchers to use and improve.
Affiliation(s)
- Tim Churches
- Centre for Epidemiology and Research, Public Health Division, New South Wales Department of Health, Locked Mail Bag 961, North Sydney 2059, Australia
- Peter Christen
- Department of Computer Science, Australian National University, Canberra, Australia
- Kim Lim
- Centre for Epidemiology and Research, Public Health Division, New South Wales Department of Health, Locked Mail Bag 961, North Sydney 2059, Australia
- Justin Xi Zhu
- Department of Computer Science, Australian National University, Canberra, Australia
|
18
|
Shlapatska LM, Mikhalap SV, Berdova AG, Zelensky OM, Yun TJ, Nichols KE, Clark EA, Sidorenko SP. CD150 association with either the SH2-containing inositol phosphatase or the SH2-containing protein tyrosine phosphatase is regulated by the adaptor protein SH2D1A. J Immunol 2001; 166:5480-7. [PMID: 11313386 DOI: 10.4049/jimmunol.166.9.5480] [Citation(s) in RCA: 175] [Impact Index Per Article: 7.6] [Indexed: 11/19/2022]
Abstract
CD150 (SLAM/IPO-3) is a cell surface receptor that, like the B cell receptor, CD40, and CD95, can transmit positive or negative signals. CD150 can associate with the SH2-containing inositol phosphatase (SHIP), the SH2-containing protein tyrosine phosphatase (SHP-2), and the adaptor protein SH2 domain protein 1A (SH2D1A/DSHP/SAP, also called Duncan's disease SH2-protein (DSHP) or SLAM-associated protein (SAP)). Mutations in SH2D1A are found in X-linked lymphoproliferative syndrome and non-Hodgkin's lymphomas. Here we report that SH2D1A is expressed in tonsillar B cells and in some B lymphoblastoid cell lines, where CD150 coprecipitates with SH2D1A and SHIP. However, in SH2D1A-negative B cell lines, including B cell lines from X-linked lymphoproliferative syndrome patients, CD150 associates only with SHP-2. SH2D1A protein levels are up-regulated by CD40 cross-linking and down-regulated by B cell receptor ligation. Using GST-fusion proteins with single replacements of tyrosine at Y269F, Y281F, Y307F, or Y327F in the CD150 cytoplasmic tail, we found that the same phosphorylated Y281 and Y327 are essential for both SHP-2 and SHIP binding. The presence of SH2D1A facilitates binding of SHIP to CD150. Apparently, SH2D1A may function as a regulator of alternative interactions of CD150 with SHP-2 or SHIP via a novel TxYxxV/I motif (immunoreceptor tyrosine-based switch motif (ITSM)). Multiple sequence alignments revealed the presence of this TxYxxV/I motif not only in CD2 subfamily members but also in the cytoplasmic domains of the members of the SHP-2 substrate 1, sialic acid-binding Ig-like lectin, carcinoembryonic Ag, and leukocyte-inhibitory receptor families.
MESH Headings
- Amino Acid Sequence
- Antigens, CD
- B-Lymphocytes/enzymology
- B-Lymphocytes/metabolism
- Carrier Proteins/biosynthesis
- Carrier Proteins/metabolism
- Carrier Proteins/physiology
- Cell Line, Transformed
- Cells, Cultured
- Cytoplasm/immunology
- Cytoplasm/metabolism
- Glycoproteins/genetics
- Glycoproteins/metabolism
- Humans
- Immunoglobulins/genetics
- Immunoglobulins/metabolism
- Intracellular Signaling Peptides and Proteins
- Jurkat Cells
- Models, Molecular
- Molecular Sequence Data
- Mutagenesis, Site-Directed
- Peptide Fragments/immunology
- Peptide Fragments/metabolism
- Phosphatidylinositol-3,4,5-Trisphosphate 5-Phosphatases
- Phosphoric Monoester Hydrolases/metabolism
- Protein Binding/genetics
- Protein Binding/immunology
- Protein Tyrosine Phosphatase, Non-Receptor Type 11
- Protein Tyrosine Phosphatase, Non-Receptor Type 6
- Protein Tyrosine Phosphatases/metabolism
- Receptors, Antigen, B-Cell/metabolism
- Receptors, Cell Surface
- Recombinant Fusion Proteins/metabolism
- SH2 Domain-Containing Protein Tyrosine Phosphatases
- Signaling Lymphocytic Activation Molecule Associated Protein
- Signaling Lymphocytic Activation Molecule Family Member 1
- Tumor Cells, Cultured
- Tyrosine/genetics
- src Homology Domains/immunology
Affiliation(s)
- L M Shlapatska
- Kavetsky Institute of Experimental Pathology, Oncology and Radiobiology, National Academy of Sciences of the Ukraine, Kiev, Ukraine
|
19
|
Moser MJ, Holley WR, Chatterjee A, Mian IS. The proofreading domain of Escherichia coli DNA polymerase I and other DNA and/or RNA exonuclease domains. Nucleic Acids Res 1997; 25:5110-8. [PMID: 9396823 PMCID: PMC147149 DOI: 10.1093/nar/25.24.5110] [Citation(s) in RCA: 193] [Impact Index Per Article: 7.1] [Indexed: 02/05/2023]
Abstract
Prior sequence analysis studies have suggested that bacterial ribonuclease (RNase) Ds comprise a complete domain that is found also in Homo sapiens polymyositis-scleroderma overlap syndrome 100 kDa autoantigen and Werner syndrome protein. This RNase D 3'-->5' exoribonuclease domain was predicted to have a structure and mechanism of action similar to the 3'-->5' exodeoxyribonuclease (proofreading) domain of DNA polymerases. Here, hidden Markov model (HMM) and phylogenetic studies have been used to identify and characterise other sequences that may possess this exonuclease domain. Results indicate that it is also present in the RNase T family; Borrelia burgdorferi P93 protein, an immunodominant antigen in Lyme disease; bacteriophage T4 dexA and Escherichia coli exonuclease I, processive 3'-->5' exodeoxyribonucleases that degrade single-stranded DNA; Bacillus subtilis dinG, a probable helicase involved in DNA repair and possibly replication, and peptide synthase 1; Saccharomyces cerevisiae Pab1p-dependent poly(A) nuclease PAN2 subunit, required for shortening mRNA poly(A) tails; Caenorhabditis elegans and Mus musculus CAF1, transcription factor CCR4-associated factor 1; Xenopus laevis XPMC2, prevention of mitotic catastrophe in fission yeast; Drosophila melanogaster egalitarian, oocyte specification and axis determination, and exuperantia, establishment of oocyte polarity; H. sapiens HEM45, expressed in tumour cell lines and uterus and regulated by oestrogen; and 31 open reading frames including one in Methanococcus jannaschii. Examination of a multiple sequence alignment and two three-dimensional structures of proofreading domains has allowed definition of the core sequence, structural and functional elements of this exonuclease domain.
Affiliation(s)
- M J Moser
- Life Sciences Division (Mail Stop 29-100), Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
|
20
|
Dalgaard JZ, Klar AJ, Moser MJ, Holley WR, Chatterjee A, Mian IS. Statistical modeling and analysis of the LAGLIDADG family of site-specific endonucleases and identification of an intein that encodes a site-specific endonuclease of the HNH family. Nucleic Acids Res 1997; 25:4626-38. [PMID: 9358175 PMCID: PMC147097 DOI: 10.1093/nar/25.22.4626] [Citation(s) in RCA: 156] [Impact Index Per Article: 5.8] [Indexed: 02/05/2023]
Abstract
The LAGLIDADG and HNH families of site-specific DNA endonucleases encoded by viruses, bacteriophages as well as archaeal, eucaryotic nuclear and organellar genomes are characterized by the sequence motifs 'LAGLIDADG' and 'HNH', respectively. These endonucleases have been shown to occur in different environments: LAGLIDADG endonucleases are found in inteins, archaeal and group I introns and as free standing open reading frames (ORFs); HNH endonucleases occur in group I and group II introns and as ORFs. Here, statistical models (hidden Markov models, HMMs) that encompass both the conserved motifs and more variable regions of these families have been created and employed to characterize known and potential new family members. A number of new, putative LAGLIDADG and HNH endonucleases have been identified including an intein-encoded HNH sequence. Analysis of an HMM-generated multiple alignment of 130 LAGLIDADG family members and the three-dimensional structure of the I-CreI endonuclease has enabled definition of the core elements of the repeated domain (approximately 90 residues) that is present in this family of proteins. A conserved negatively charged residue is proposed to be involved in catalysis. Phylogenetic analysis of the two families indicates a lack of exchange of endonucleases between different mobile elements (environments) and between hosts from different phylogenetic kingdoms. However, there does appear to have been considerable exchange of endonuclease domains amongst elements of the same type. Such events are suggested to be important for the formation of elements of new specificity.
Affiliation(s)
- J Z Dalgaard
- NCI-Frederick Cancer Research and Development Center, ABL-Basic Research Program, PO Box B, Building 549, Room 154, Frederick, MD 21702-1202, USA.
|
21
|
Abstract
Escherichia coli ribonucleases (RNases) HII, III, II, PH and D have been used to characterise new and known viral, bacterial, archaeal and eucaryotic sequences similar to these endo- (HII and III) and exoribonucleases (II, PH and D). Statistical models, hidden Markov models (HMMs), were created for the RNase HII, III, II, PH and D families as well as a double-stranded RNA binding domain present in RNase III. Results suggest that the RNase D family, which includes Werner syndrome protein and the 100 kDa antigenic component of the human polymyositis scleroderma (PMSCL) autoantigen, is a 3'-->5' exoribonuclease structurally and functionally related to the 3'-->5' exodeoxyribonuclease domain of DNA polymerases. Polynucleotide phosphorylases and the RNase PH family, which includes the 75 kDa PMSCL autoantigen, possess a common domain suggesting similar structures and mechanisms of action for these 3'-->5' phosphorolytic enzymes. Examination of HMM-generated multiple sequence alignments for each family suggests amino acids that may be important for their structure, substrate binding and/or catalysis.
Affiliation(s)
- I S Mian
- Sinsheimer Laboratories, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
|