1
Neuwald AF, Altschul SF. Inference of Functionally-Relevant N-acetyltransferase Residues Based on Statistical Correlations. PLoS Comput Biol 2016; 12:e1005294. [PMID: 28002465 PMCID: PMC5225019 DOI: 10.1371/journal.pcbi.1005294] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 05/27/2016] [Revised: 01/10/2017] [Accepted: 12/08/2016] [Indexed: 11/25/2022] Open
Abstract
Over evolutionary time, members of a superfamily of homologous proteins sharing a common structural core diverge into subgroups filling various functional niches. At the sequence level, such divergence appears as correlations that arise from residue patterns distinct to each subgroup. Such a superfamily may be viewed as a population of sequences corresponding to a complex, high-dimensional probability distribution. Here we model this distribution as hierarchical interrelated hidden Markov models (hiHMMs), which describe these sequence correlations implicitly. By characterizing such correlations one may hope to obtain information regarding functionally-relevant properties that have thus far evaded detection. To do so, we infer a hiHMM distribution from sequence data using Bayes’ theorem and Markov chain Monte Carlo (MCMC) sampling, which is widely recognized as the most effective approach for characterizing a complex, high dimensional distribution. Other routines then map correlated residue patterns to available structures with a view to hypothesis generation. When applied to N-acetyltransferases, this reveals sequence and structural features indicative of functionally important, yet generally unknown biochemical properties. Even for sets of proteins for which nothing is known beyond unannotated sequences and structures, this can lead to helpful insights. We describe, for example, a putative coenzyme-A-induced-fit substrate binding mechanism mediated by arginine residue switching between salt bridge and π-π stacking interactions. A suite of programs implementing this approach is available (psed.igs.umaryland.edu). Protein sequence data, when gathered in great quantity, contain important but implicit biological information manifest as statistical correlations. Here we describe an approach to access this information by comprehensively modeling and characterizing the distribution of sequences belonging to a major protein superfamily. 
This approach takes as input a large set of unaligned sequences belonging to the superfamily. By applying the minimum description length principle, it seeks the statistical model that best explains the sequences while avoiding over-fitting the data. It concurrently aligns the sequences and, to model evolutionary divergence, partitions them into subgroups that are hierarchically-arranged based upon correlated residue patterns. Auxiliary routines create PyMOL scripts to visualize the locations of correlated residues within available structures. Because these correlations likely arise from structural and biochemical constraints, they can help elucidate protein properties important for functional specificity. Comparing and contrasting sequence and structural features in this way may therefore suggest, in the light of published studies, plausible biological hypotheses for experimental investigation. We illustrate this approach with N-acetyltransferases.
Affiliation(s)
- Andrew F. Neuwald
- Institute for Genome Sciences and Department of Biochemistry & Molecular Biology, University of Maryland School of Medicine, BioPark II, Room 617, Baltimore, MD, United States of America
- Stephen F. Altschul
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States of America
2
Neuwald AF, Altschul SF. Bayesian Top-Down Protein Sequence Alignment with Inferred Position-Specific Gap Penalties. PLoS Comput Biol 2016; 12:e1004936. [PMID: 27192614 PMCID: PMC4871425 DOI: 10.1371/journal.pcbi.1004936] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 08/06/2015] [Accepted: 04/24/2016] [Indexed: 11/19/2022] Open
Abstract
We describe a Bayesian Markov chain Monte Carlo (MCMC) sampler for protein multiple sequence alignment (MSA) that, as implemented in the program GISMO and applied to large numbers of diverse sequences, is more accurate than the popular MSA programs MUSCLE, MAFFT, Clustal-Ω and Kalign. Features of GISMO central to its performance are: (i) It employs a "top-down" strategy with a favorable asymptotic time complexity that first identifies regions generally shared by all the input sequences, and then realigns closely related subgroups in tandem. (ii) It infers position-specific gap penalties that favor insertions or deletions (indels) within each sequence at alignment positions in which indels are invoked in other sequences. This favors the placement of insertions between conserved blocks, which can be understood as making up the proteins' structural core. (iii) It uses a Bayesian statistical measure of alignment quality based on the minimum description length principle and on Dirichlet mixture priors. Consequently, GISMO aligns sequence regions only when statistically justified. This is unlike methods based on the ad hoc, but widely used, sum-of-the-pairs scoring system, which will align random sequences. (iv) It defines a system for exploring alignment space that provides natural avenues for further experimentation through the development of new sampling strategies for more efficiently escaping from suboptimal traps. GISMO's superior performance is illustrated using 408 protein sets containing, on average, 235 sequences. These sets correspond to NCBI Conserved Domain Database alignments, which have been manually curated in the light of available crystal structures, and thus provide a means to assess alignment accuracy. GISMO fills a different niche than other MSA programs, namely identifying and aligning a conserved domain present within a large, diverse set of full length sequences. The GISMO program is available at http://gismo.igs.umaryland.edu/.
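The idea in point (ii), that a position where other sequences already carry indels should be cheaper to gap, can be illustrated with a minimal sketch. This is not GISMO's inference procedure: the function name, the pseudocount, and the choice of a negative log smoothed gap frequency as the penalty are all assumptions made for illustration.

```python
import math

def position_specific_gap_penalties(alignment, pseudocount=1.0):
    """Return one penalty per column: -log of the smoothed gap frequency.

    Hypothetical helper: columns where many sequences already have a gap
    receive a small penalty, and fully occupied columns receive a large one.
    """
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    penalties = []
    for col in range(n_cols):
        gaps = sum(1 for seq in alignment if seq[col] == "-")
        # Smooth so that a gap-free column still has a finite penalty.
        freq = (gaps + pseudocount) / (n_seqs + 2.0 * pseudocount)
        penalties.append(-math.log(freq))
    return penalties

# Toy alignment (made-up sequences); columns 3 and 4 are indel-prone.
alignment = [
    "MKV--LA",
    "MKVG-LA",
    "MRV--LA",
    "MKI-ALA",
]
penalties = position_specific_gap_penalties(alignment)
```

In this toy alignment the two indel-prone columns get the lowest penalties, so further insertions would tend to be placed between the conserved blocks rather than inside them.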
Affiliation(s)
- Andrew F. Neuwald
- Institute for Genome Sciences and Department of Biochemistry & Molecular Biology, University of Maryland School of Medicine, Baltimore, Maryland, United States of America
- Stephen F. Altschul
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
3
Neuwald AF. Rapid detection, classification and accurate alignment of up to a million or more related protein sequences. Bioinformatics 2009; 25:1869-75. [PMID: 19505947 DOI: 10.1093/bioinformatics/btp342] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.4] [Indexed: 11/12/2022]
Abstract
MOTIVATION: The patterns of sequence similarity and divergence present within functionally diverse, evolutionarily related proteins contain implicit information about corresponding biochemical similarities and differences. A first step toward accessing such information is to statistically analyze these patterns, which, in turn, requires that one first identify and accurately align a very large set of protein sequences. Ideally, the set should include many distantly related, functionally divergent subgroups. Because it is extremely difficult, if not impossible, for fully automated methods to align such sequences correctly, researchers often resort to manual curation based on detailed structural and biochemical information. However, multiply-aligning vast numbers of sequences in this way is clearly impractical. RESULTS: This problem is addressed using Multiply-Aligned Profiles for Global Alignment of Protein Sequences (MAPGAPS). The MAPGAPS program uses a set of multiply-aligned profiles both as a query to detect and classify related sequences and as a template to multiply-align the sequences. It relies on Karlin-Altschul statistics for sensitivity and on PSI-BLAST (and other) heuristics for speed. Using as input a carefully curated multiple-profile alignment for P-loop GTPases, MAPGAPS correctly aligned weakly conserved sequence motifs within 33 distantly related GTPases of known structure. By comparison, the sequence- and structurally based alignment methods hmmalign and PROMALS3D misaligned at least 11 and 23 of these regions, respectively. When applied to a dataset of 65 million protein sequences, MAPGAPS identified, classified and aligned (with comparable accuracy) nearly half a million putative P-loop GTPase sequences. AVAILABILITY: A C++ implementation of MAPGAPS is available at http://mapgaps.igs.umaryland.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
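For readers unfamiliar with the Karlin-Altschul statistics mentioned above, the standard relations are E = K·m·n·exp(−λS) for the expected number of chance hits and S' = (λS − ln K)/ln 2 for the bit score. The sketch below only evaluates these two formulas; it is not taken from MAPGAPS, and the function names and the numeric values of λ, K, the score, and the lengths are illustrative assumptions.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalized score in bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2.0)

def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

# Illustrative placeholder parameters (assumptions, not values from the paper).
LAMBDA, K = 0.318, 0.134
QUERY_LEN, DB_LEN = 250, 2_000_000_000  # hypothetical lengths in residues
RAW_SCORE = 120

bits = bit_score(RAW_SCORE, LAMBDA, K)
e_value = expect_value(RAW_SCORE, QUERY_LEN, DB_LEN, LAMBDA, K)
```

The two quantities are tied together by E = m·n·2^(−S'), which is why bit scores can be compared across scoring systems while raw scores cannot.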
Affiliation(s)
- Andrew F Neuwald
- Department of Biochemistry & Molecular Biology and The Institute for Genome Sciences, University of Maryland, School of Medicine, BioPark II, Baltimore, MD 21201, USA.
4
Siddharthan R, Siggia ED, van Nimwegen E. PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny. PLoS Comput Biol 2005; 1:e67. [PMID: 16477324 PMCID: PMC1309704 DOI: 10.1371/journal.pcbi.0010067] [Citation(s) in RCA: 176] [Impact Index Per Article: 9.3] [Received: 07/27/2005] [Accepted: 10/28/2005] [Indexed: 12/27/2022] Open
Abstract
A central problem in the bioinformatics of gene regulation is to find the binding sites for regulatory proteins. One of the most promising approaches toward identifying these short and fuzzy sequence patterns is the comparative analysis of orthologous intergenic regions of related species. This analysis is complicated by various factors. First, one needs to take the phylogenetic relationship between the species into account in order to distinguish conservation that is due to the occurrence of functional sites from spurious conservation that is due to evolutionary proximity. Second, one has to deal with the complexities of multiple alignments of orthologous intergenic regions, and one has to consider the possibility that functional sites may occur outside of conserved segments. Here we present a new motif sampling algorithm, PhyloGibbs, that runs on arbitrary collections of multiple local sequence alignments of orthologous sequences. The algorithm searches over all ways in which an arbitrary number of binding sites for an arbitrary number of transcription factors (TFs) can be assigned to the multiple sequence alignments. These binding site configurations are scored by a Bayesian probabilistic model that treats aligned sequences by a model for the evolution of binding sites and "background" intergenic DNA. This model takes the phylogenetic relationship between the species in the alignment explicitly into account. The algorithm uses simulated annealing and Monte Carlo Markov-chain sampling to rigorously assign posterior probabilities to all the binding sites that it reports. In tests on synthetic data and real data from five Saccharomyces species our algorithm performs significantly better than four other motif-finding algorithms, including algorithms that also take phylogeny into account. Our results also show that, in contrast to the other algorithms, PhyloGibbs can make realistic estimates of the reliability of its predictions. 
Our tests suggest that, running on the five-species multiple alignment of a single gene's upstream region, PhyloGibbs on average recovers over 50% of all binding sites in S. cerevisiae at a specificity of about 50%, and 33% of all binding sites at a specificity of about 85%. We also tested PhyloGibbs on collections of multiple alignments of intergenic regions that were recently annotated, based on ChIP-on-chip data, to contain binding sites for the same TF. We compared PhyloGibbs's results with the previous analysis of these data using six other motif-finding algorithms. For 16 of 21 TFs for which all other motif-finding methods failed to find a significant motif, PhyloGibbs did recover a motif that matches the literature consensus. In 11 cases where there was disagreement in the results we compiled lists of known target genes from the literature, and found that running PhyloGibbs on their regulatory regions yielded a binding motif matching the literature consensus in all but one of the cases. Interestingly, these literature gene lists had little overlap with the targets annotated based on the ChIP-on-chip data. The PhyloGibbs code can be downloaded from http://www.biozentrum.unibas.ch/~nimwegen/cgi-bin/phylogibbs.cgi or http://www.imsc.res.in/~rsidd/phylogibbs. The full set of predicted sites from our tests on yeast are available at http://www.swissregulon.unibas.ch.
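As background for the sampling step described in this abstract, here is a minimal sketch of a plain Gibbs site sampler that assumes exactly one motif occurrence per sequence. It is deliberately not PhyloGibbs: it ignores phylogeny, multiple motifs, simulated annealing, and posterior tracking, and every name, the uniform background, and the pseudocount are assumptions chosen for illustration.

```python
import random

ALPHABET = "ACGT"

def column_counts(sites, width, pseudocount=0.5):
    """Per-column base counts (with pseudocounts) over the current sites."""
    counts = [{b: pseudocount for b in ALPHABET} for _ in range(width)]
    for site in sites:
        for i, base in enumerate(site):
            counts[i][base] += 1.0
    return counts

def site_weight(window, counts, background=0.25):
    """Likelihood ratio of a window under the motif model versus background."""
    weight = 1.0
    for i, base in enumerate(window):
        total = sum(counts[i].values())
        weight *= (counts[i][base] / total) / background
    return weight

def gibbs_site_sampler(seqs, width, sweeps=200, seed=0):
    """Return one sampled motif start per sequence after the given sweeps."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):
        for k, seq in enumerate(seqs):
            # Build the motif model from every sequence except the k-th.
            others = [s[p:p + width]
                      for j, (s, p) in enumerate(zip(seqs, starts)) if j != k]
            counts = column_counts(others, width)
            # Resample the k-th start in proportion to its likelihood ratio.
            weights = [site_weight(seq[p:p + width], counts)
                       for p in range(len(seq) - width + 1)]
            starts[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Toy input (made-up sequences sharing the planted word TGACGTCA).
seqs = ["TTGACGTCAAT", "GGTGACGTCAC", "ATGACGTCATT"]
starts = gibbs_site_sampler(seqs, width=8, sweeps=50, seed=1)
```

Because each start is drawn from a distribution rather than chosen greedily, repeated runs visit alternative site configurations, which is the property a sampler needs in order to attach probabilities to its predictions.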
Affiliation(s)
- Rahul Siddharthan
- Center for Studies in Physics and Biology, The Rockefeller University, New York, New York, United States of America
- Institute of Mathematical Sciences, Taramani, Chennai, India
- Eric D Siggia
- Center for Studies in Physics and Biology, The Rockefeller University, New York, New York, United States of America
- Erik van Nimwegen
- Center for Studies in Physics and Biology, The Rockefeller University, New York, New York, United States of America
- Division of Bioinformatics, Biozentrum, University of Basel, Basel, Switzerland
5
Mazza C. Strand separation in negatively supercoiled DNA. J Math Biol 2005; 51:198-216. [PMID: 15868197 DOI: 10.1007/s00285-005-0320-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/18/2003] [Revised: 02/02/2005] [Indexed: 10/25/2022]
Abstract
We consider Benham's model for strand separation in negatively supercoiled circular DNA, and study denaturation as a function of the linking difference density kappa < 0. We propose a statistical version of this model, based on Bayesian segmentation methods in current use in bioinformatics; this leads to new algorithms with priors adapted to supercoiled DNA, taking into account the random nature of the free energies needed to denature base pairs.
Affiliation(s)
- Christian Mazza
- Section de Mathématiques, 2-4 Rue du Lièvre, CP 64 CH-1211, Genève 4, Switzerland.
6
Abstract
The goal of disease-related proteogenomic research is a complete description of the unfolding of the disease process from its origin to its cure. With a properly selected patient cohort and correctly collected, processed, and analyzed data, large-scale proteomic spectra may be able to provide much of the information necessary for achieving this goal. Protein spectra, which are one way of representing protein expression, can be extremely useful clinically since they can be generated from blood rather than from diseased tissue. At the same time, the analysis of circulating proteins in blood presents unique challenges because of their heterogeneity: blood contains a large number of proteins of different abundance generated by tissues throughout the body. Another challenge is that protein spectra are massively parallel information. One can choose to perform top-down analysis, where the entire spectrum is examined and candidate peaks are selected for further assessment. Or one can choose a bottom-up analysis, where, via hypothesis testing, individual proteins are identified in the spectra and related to the disease process. Each approach has advantages and disadvantages that must be understood if protein spectral data are to be properly analyzed. With either approach, several levels of information must be integrated into a predictive model. This model will allow us to detect disease and it will allow us to discover therapeutic interventions that reduce the risk of disease in at-risk individuals and effectively treat newly diagnosed disease.
Affiliation(s)
- Harry B Burke
- Medicine, Biochemistry and Molecular Biology, McCormick Genomics Center, George Washington University School of Medicine
7
Liang S, Samanta MP, Biegel BA. cWINNOWER algorithm for finding fuzzy DNA motifs. J Bioinform Comput Biol 2004; 2:47-60. [PMID: 15272432 DOI: 10.1142/s0219720004000466] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.8] [Received: 08/25/2003] [Revised: 11/24/2003] [Accepted: 12/09/2003] [Indexed: 11/18/2022]
Abstract
The cWINNOWER algorithm detects fuzzy motifs in DNA sequences rich in protein-binding signals. A signal is defined as any short nucleotide pattern having up to d mutations differing from a motif of length l. The algorithm finds such motifs if a clique consisting of a sufficiently large number of mutated copies of the motif (i.e., the signals) is present in the DNA sequence. The cWINNOWER algorithm substantially improves the sensitivity of the winnower method of Pevzner and Sze by imposing a consensus constraint, enabling it to detect much weaker signals. We studied the minimum detectable clique size qc as a function of sequence length N for random sequences. We found that qc increases linearly with N for a fast version of the algorithm based on counting three-member sub-cliques. Imposing consensus constraints reduces qc by a factor of three in this case, which makes the algorithm dramatically more sensitive. Our most sensitive algorithm, which counts four-member sub-cliques, needs a minimum of only 13 signals to detect motifs in a sequence of length N = 12,000 for (l, d) = (15, 4).
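The graph underlying this kind of clique search can be sketched as follows. Two signals that each differ from the same motif in at most d positions differ from each other in at most 2d positions (triangle inequality on Hamming distance), so candidate windows are linked whenever their distance is at most 2d. This is only the graph construction under simplifying assumptions (a single sequence, non-overlapping windows); it does not implement the sub-clique counting or the consensus constraint of cWINNOWER, and the function names are hypothetical.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def signal_graph(sequence, l, d):
    """Edges (i, j) between non-overlapping l-mers within Hamming distance 2d."""
    windows = [(i, sequence[i:i + l]) for i in range(len(sequence) - l + 1)]
    edges = set()
    for (i, a), (j, b) in combinations(windows, 2):
        # Require j - i >= l so the two windows do not overlap.
        if j - i >= l and hamming(a, b) <= 2 * d:
            edges.add((i, j))
    return edges

# Toy sequence: two mutated copies of the made-up motif ACGTACGT (l = 8, d = 1).
sequence = "ACGTACGA" + "CCCC" + "TCGTACGT"
edges = signal_graph(sequence, l=8, d=1)
```

A set of windows that are all mutually linked forms a clique, and a sufficiently large clique is the evidence for a motif that the abstract refers to.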
Affiliation(s)
- S Liang
- NASA Ames Research Center, NASA Advanced Supercomputing Division, Moffett Field, CA 94035, USA.
8
Zhou Q, Wong WH. CisModule: de novo discovery of cis-regulatory modules by hierarchical mixture modeling. Proc Natl Acad Sci U S A 2004; 101:12114-9. [PMID: 15297614 PMCID: PMC514443 DOI: 10.1073/pnas.0402858101] [Citation(s) in RCA: 147] [Impact Index Per Article: 7.4] [Indexed: 11/18/2022] Open
Abstract
The regulatory information for a eukaryotic gene is encoded in cis-regulatory modules. The binding sites for a set of interacting transcription factors have the tendency to colocalize to the same modules. Current de novo motif discovery methods do not take advantage of this knowledge. We propose a hierarchical mixture approach to model the cis-regulatory module structure. Based on the model, a new de novo motif-module discovery algorithm, CisModule, is developed for the Bayesian inference of module locations and within-module motif sites. Dynamic programming-like recursions are developed to reduce the computational complexity from exponential to linear in sequence length. By using both simulated and real data sets, we demonstrate that CisModule is not only accurate in predicting modules but also more sensitive in detecting motif patterns and binding sites than standard motif discovery methods are.
Affiliation(s)
- Qing Zhou
- Department of Statistics, Harvard University, 1 Oxford Street, Cambridge, MA 02138, USA
9
Fearnhead P, Meligkotsidou L. Exact filtering for partially observed continuous time models. J R Stat Soc Series B Stat Methodol 2004. [DOI: 10.1111/j.1467-9868.2004.05561.x] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.9] [Indexed: 11/29/2022]
10
Jensen ST, Liu XS, Zhou Q, Liu JS. Computational Discovery of Gene Regulatory Binding Motifs: A Bayesian Perspective. Stat Sci 2004. [DOI: 10.1214/088342304000000107] [Citation(s) in RCA: 55] [Impact Index Per Article: 2.8] [Indexed: 11/19/2022]
11
12
Shaw E, McCue LA, Lawrence CE, Dordick JS. Identification of a novel class in the alpha/beta hydrolase fold superfamily: the N-myc differentiation-related proteins. Proteins 2002; 47:163-8. [PMID: 11933063 DOI: 10.1002/prot.10083] [Citation(s) in RCA: 73] [Impact Index Per Article: 3.3] [Indexed: 12/11/2022]
Abstract
The alpha/beta hydrolases constitute a large protein superfamily that mainly consists of enzymes that catalyze a diverse range of reactions. These proteins exhibit the alpha/beta hydrolase fold, the essential features of which have recently been delineated: the presence of at least five parallel beta-strands, a catalytic triad in a specific order (nucleophile-acid-histidine), and a nucleophilic elbow. Because of the difficulties in identifying protein structures experimentally, we have used a Bayesian computational algorithm (PROBE) to identify the members of this superfamily based on distant sequence relationships. We found that the presence of five sequence motifs, which contain residues important for substrate binding and stabilization of the fold, is required for membership in this superfamily. The superfamily consists of at least 909 members, including the N-myc downstream regulated proteins, which are believed to be involved in cell differentiation. Unlike most of the other superfamily members, the N-myc downstream regulated proteins have never been proposed to possess the alpha/beta hydrolase fold and do not appear to be hydrolases.
Affiliation(s)
- Eudean Shaw
- Department of Chemical Engineering, Rensselaer Polytechnic Institute, Troy, New York, USA
13
Purkayastha A, McCue LA, McDonough KA. Identification of a Mycobacterium tuberculosis putative classical nitroreductase gene whose expression is coregulated with that of the acr gene within macrophages, in standing versus shaking cultures, and under low oxygen conditions. Infect Immun 2002; 70:1518-29. [PMID: 11854240 PMCID: PMC127740 DOI: 10.1128/iai.70.3.1518-1529.2002] [Citation(s) in RCA: 64] [Impact Index Per Article: 2.9] [Indexed: 11/20/2022] Open
Abstract
Tuberculosis remains a leading killer worldwide, and new approaches for its treatment and prevention are urgently needed. This effort will benefit greatly from a better understanding of gene regulation in Mycobacterium tuberculosis, particularly with respect to this pathogen's response to its host environment. We examined the behavior of two promoters from the divergently transcribed M. tuberculosis genes acr/hspX/Rv2031c (alpha-crystallin homolog) and Rv2032/acg (acr-coregulated gene) by using a promoter-GFP fusion assay in Mycobacterium bovis BCG. We found that Rv2032 is a novel macrophage-induced gene whose expression is coregulated with that of acr. Relative levels of intracellular induction for both promoters were significantly affected by shallow standing versus shaking bacterial culture conditions prior to macrophage infection, and both promoters were strongly induced under low oxygen conditions. Deletion analyses showed that DNA sequences within a 43-bp region were required for expression of these promoters under all conditions. Multiple sequence alignment and database searches performed with PROBE indicated that Rv2032 is one of eight M. tuberculosis genes of previously unknown function that belong to an unusual superfamily of classical nitroreductases, which may have a role for bacteria within the host environment. These findings show that mycobacterial culture conditions can greatly influence the results and interpretation of subsequent gene regulation experiments. We propose that these differences might be exploited for dissection of the regulatory factors that affect mycobacterial gene expression within the host.
Affiliation(s)
- Anjan Purkayastha
- Department of Biomedical Sciences, University of Albany School of Public Health, Albany, New York 12201-2002, USA
14
15
16
Abstract
To refine the location of a disease gene within the bounds provided by linkage analysis, many scientists use the pattern of linkage disequilibrium between the disease allele and alleles at nearby markers. We describe a method that seeks to refine location by analysis of "disease" and "normal" haplotypes, thereby using multivariate information about linkage disequilibrium. Under the assumption that the disease mutation occurs in a specific gap between adjacent markers, the method first combines parsimony and likelihood to build an evolutionary tree of disease haplotypes, with each node (haplotype) separated, by a single mutational or recombinational step, from its parent. If required, latent nodes (unobserved haplotypes) are incorporated to complete the tree. Once the tree is built, its likelihood is computed from probabilities of mutation and recombination. When each gap between adjacent markers is evaluated in this fashion and these results are combined with prior information, they yield a posterior probability distribution to guide the search for the disease mutation. We show, by evolutionary simulations, that an implementation of these methods, called "FineMap," yields substantial refinement and excellent coverage for the true location of the disease mutation. Moreover, by analysis of hereditary hemochromatosis haplotypes, we show that FineMap can be robust to genetic heterogeneity.
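The final step described here, combining a per-gap tree likelihood with prior information to obtain a posterior over gaps, is an application of Bayes' theorem: the posterior for gap i is proportional to prior_i × likelihood_i. The sketch below shows only that normalization, working in log space for numerical stability. It is not the FineMap implementation; the function name and the example log-likelihoods and priors are made up for illustration.

```python
import math

def gap_posterior(log_likelihoods, priors):
    """Posterior probability of each marker gap given log-likelihoods and priors."""
    log_joint = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    # Subtract the maximum before exponentiating to avoid underflow.
    shift = max(log_joint)
    unnormalized = [math.exp(v - shift) for v in log_joint]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical tree log-likelihoods for four gaps, with a uniform prior.
log_likelihoods = [-120.4, -118.9, -115.2, -119.7]
priors = [0.25, 0.25, 0.25, 0.25]
posterior = gap_posterior(log_likelihoods, priors)
best_gap = max(range(len(posterior)), key=posterior.__getitem__)
```

With a uniform prior the ranking follows the likelihoods alone; a non-uniform prior (for example, one reflecting physical distances between markers) would shift probability toward the favored gaps.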
Affiliation(s)
- Johnny C. Lam
- Department of Statistics, Carnegie Mellon University, and Department of Psychiatry, University of Pittsburgh, Pittsburgh
- Kathryn Roeder
- Department of Statistics, Carnegie Mellon University, and Department of Psychiatry, University of Pittsburgh, Pittsburgh
- B. Devlin
- Department of Statistics, Carnegie Mellon University, and Department of Psychiatry, University of Pittsburgh, Pittsburgh