51
Hossain KSMT, Patnaik D, Laxman S, Jain P, Bailey-Kellogg C, Ramakrishnan N. Improved multiple sequence alignments using coupled pattern mining. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:1098-1112. [PMID: 24384701 DOI: 10.1109/tcbb.2013.36] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 06/03/2023]
Abstract
We present alignment refinement by mining coupled residues (ARMiCoRe), a novel approach to a classical bioinformatics problem, viz., multiple sequence alignment (MSA) of gene and protein sequences. Aligning multiple biological sequences is a key step in elucidating evolutionary relationships, annotating newly sequenced segments, and understanding the relationship between biological sequences and functions. Classical MSA algorithms are designed to primarily capture conservations in sequences whereas couplings, or correlated mutations, are well known as an additional important aspect of sequence evolution. (Two sequence positions are coupled when mutations in one are accompanied by compensatory mutations in another). As a result, better exposition of couplings is sometimes one of the reasons for hand-tweaking of MSAs by practitioners. ARMiCoRe introduces a distinctly pattern mining approach to improving MSAs: using frequent episode mining as a foundational basis, we define the notion of a coupled pattern and demonstrate how the discovery and tiling of coupled patterns using a max-flow approach can yield MSAs that are better than conservation-based alignments. Although we were motivated to improve MSAs for the sake of better exposing couplings, we demonstrate that our MSAs are also improvements in terms of traditional metrics of assessment. We demonstrate the effectiveness of ARMiCoRe on a large collection of data sets.
52
Parker AS, Choi Y, Griswold KE, Bailey-Kellogg C. Structure-guided deimmunization of therapeutic proteins. J Comput Biol 2013; 20:152-65. [PMID: 23384000 DOI: 10.1089/cmb.2012.0251] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Indexed: 02/07/2023]
Abstract
Therapeutic proteins continue to yield revolutionary new treatments for a growing spectrum of human disease, but the development of these powerful drugs requires solving a unique set of challenges. For instance, it is increasingly apparent that mitigating potential anti-therapeutic immune responses, driven by molecular recognition of a therapeutic protein's peptide fragments, may be best accomplished early in the drug development process. One may eliminate immunogenic peptide fragments by mutating the cognate amino acid sequences, but deimmunizing mutations are constrained by the need for a folded, stable, and functional protein structure. These two concerns may be competing, as the mutations that are best at reducing immunogenicity often involve amino acids that are substantially different physicochemically. We develop a novel approach, called EpiSweep, that simultaneously optimizes both concerns. Our algorithm identifies sets of mutations making such Pareto optimal trade-offs between structure and immunogenicity, embodied by a molecular mechanics energy function and a T-cell epitope predictor, respectively. EpiSweep integrates structure-based protein design, sequence-based protein deimmunization, and algorithms for finding the Pareto frontier of a design space. While structure-based protein design is NP-hard, we employ integer programming techniques that are efficient in practice. Furthermore, EpiSweep only invokes the optimizer once per identified Pareto optimal design. We show that EpiSweep designs of regions of the therapeutics erythropoietin and staphylokinase are predicted to outperform previous experimental efforts. We also demonstrate EpiSweep's capacity for deimmunization of the entire proteins, case analyses involving dozens of predicted epitopes, and tens of thousands of unique side-chain interactions. 
Ultimately, EpiSweep is a powerful protein design tool that guides the protein engineer toward the most promising immunotolerant biotherapeutic candidates.
53
He L, Vandin F, Pandurangan G, Bailey-Kellogg C. Ballast: a ball-based algorithm for structural motifs. J Comput Biol 2013; 20:137-51. [PMID: 23383999 DOI: 10.1089/cmb.2012.0246] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022]
Abstract
Structural motifs encapsulate local sequence-structure-function relationships characteristic of related proteins, enabling the prediction of functional characteristics of new proteins, providing molecular-level insights into how those functions are performed, and supporting the development of variants specifically maintaining or perturbing function in concert with other properties. Numerous computational methods have been developed to search through databases of structures for instances of specified motifs. However, it remains an open problem how best to leverage the local geometric and chemical constraints underlying structural motifs in order to develop motif-finding algorithms that are both theoretically and practically efficient. We present a simple, general, efficient approach, called Ballast (ball-based algorithm for structural motifs), to match given structural motifs to given structures. Ballast combines the best properties of previously developed methods, exploiting the composition and local geometry of a structural motif and its possible instances in order to effectively filter candidate matches. We show that on a wide range of motif-matching problems, Ballast efficiently and effectively finds good matches, and we provide theoretical insights into why it works well. By supporting generic measures of compositional and geometric similarity, Ballast provides a powerful substrate for the development of motif-matching algorithms.
54
Gutierrez A, Bailey-Kellogg C, Moise L, Terry F, Abdel Hady K, Leng Q, Losikoff P, Verberkmoes N, Martin W, Rothman A, De Groot A. The two-faced T cell epitope: examining the host-microbe interface with JanusMatrix (P4399). THE JOURNAL OF IMMUNOLOGY 2013. [DOI: 10.4049/jimmunol.190.supp.205.9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/05/2023]
Abstract
To explore the intersection of commensals, pathogens, and the human genome at the T cell epitope level, we leveraged the vast genomic sequence information available in databases of microorganisms to which humans are exposed. We developed the JanusMatrix immunoinformatic tool that identifies potentially cross-reactive T cell epitopes from both HLA binding and TCR-facing sides to allow comparison across large genome sequence databases including common human pathogens (HP), the human gut microbiome (HGM), the human genome (HG), and the human plasma proteome (HPP). Initial studies reveal different levels of HPP/HG, HGM, and HP cross-reactivity (XR) for known Treg and Teff epitopes. In Hand Foot Mouth Disease (HFMD), extensive XR with HGM seems to predict immunodominance; more limited XR with enteroviruses (e.g., polio) may protect against severe HFMD. For common Teff epitopes, HPP/HG XR is more limited than HGM XR. For Treg epitopes defined in HCV disease and for Tregitopes (De Groot et al, Blood, 2008), HPP/HG XR is more extensive. Overall, greater XR with HPP/HG compared to HGM seems to distinguish known Treg and Teff epitopes. While predicting all influences on immune responses may be impossible, the vast availability of human pathogen and commensal organism sequences now allows T cell epitope comparisons in these large datasets. Startling discoveries relevant to vaccine development and T cell response phenotype understanding are emerging as we apply this powerful technology.
55
Moise L, Gutierrez AH, Bailey-Kellogg C, Terry F, Leng Q, Abdel Hady KM, VerBerkmoes NC, Sztein MB, Losikoff PT, Martin WD, Rothman AL, De Groot AS. The two-faced T cell epitope: examining the host-microbe interface with JanusMatrix. Hum Vaccin Immunother 2013; 9:1577-86. [PMID: 23584251 PMCID: PMC3974887 DOI: 10.4161/hv.24615] [Citation(s) in RCA: 77] [Impact Index Per Article: 7.0] [Indexed: 02/07/2023]
Abstract
Advances in the field of T cell immunology have contributed to the understanding that cross-reactivity is an intrinsic characteristic of the T cell receptor (TCR), and that each TCR can potentially interact with many different T cell epitopes. To better define the potential for TCR cross-reactivity between epitopes derived from the human genome, the human microbiome, and human pathogens, we developed a new immunoinformatics tool, JanusMatrix, that represents an extension of the validated T cell epitope mapping tool, EpiMatrix. Initial explorations, summarized in this synopsis, have uncovered what appear to be important differences in the TCR cross-reactivity of selected regulatory and effector T cell epitopes with other epitopes in the human genome, human microbiome, and selected human pathogens. In addition to exploring the T cell epitope relationships between human self, commensal and pathogen, JanusMatrix may also be useful to explore some aspects of heterologous immunity and to examine T cell epitope relatedness between pathogens to which humans are exposed (Dengue serotypes, or HCV and Influenza, for example). In Hand-Foot-Mouth disease (HFMD) for example, extensive enterovirus and human microbiome cross-reactivity (and limited cross-reactivity with the human genome) seemingly predicts immunodominance. In contrast, more extensive cross-reactivity with proteins contained in the human genome as compared to the human microbiome was observed for selected Treg epitopes. While it may be impossible to predict all immune response influences, the availability of sequence data from the human genome, the human microbiome, and an array of human pathogens and vaccines has made computationally–driven exploration of the effects of T cell epitope cross-reactivity now possible. 
This is the first description of JanusMatrix, an algorithm that assesses TCR cross-reactivity that may contribute to a means of predicting the phenotype of T cells responding to selected T cell epitopes. Whether used for explorations of T cell phenotype or for evaluating cross-conservation between related viral strains at the TCR face of viral epitopes, further JanusMatrix studies may contribute to developing safer, more effective vaccines.
56
Ackerman ME, Crispin M, Yu X, Baruah K, Boesch AW, Harvey DJ, Dugast AS, Heizen EL, Ercan A, Choi I, Streeck H, Nigrovic PA, Bailey-Kellogg C, Scanlan C, Alter G. Natural variation in Fc glycosylation of HIV-specific antibodies impacts antiviral activity. J Clin Invest 2013; 123:2183-92. [PMID: 23563315 DOI: 10.1172/jci65708] [Citation(s) in RCA: 274] [Impact Index Per Article: 24.9] [Received: 07/09/2012] [Accepted: 02/07/2013] [Indexed: 12/24/2022]
Abstract
While the induction of a neutralizing antibody response against HIV remains a daunting goal, data from both natural infection and vaccine-induced immune responses suggest that it may be possible to induce antibodies with enhanced Fc effector activity and improved antiviral control via vaccination. However, the specific features of naturally induced HIV-specific antibodies that allow for the potent recruitment of antiviral activity and the means by which these functions are regulated are poorly defined. Because antibody effector functions are critically dependent on antibody Fc domain glycosylation, we aimed to define the natural glycoforms associated with robust Fc-mediated antiviral activity. We demonstrate that spontaneous control of HIV and improved antiviral activity are associated with a dramatic shift in the global antibody-glycosylation profile toward agalactosylated glycoforms. HIV-specific antibodies exhibited an even greater frequency of agalactosylated, afucosylated, and asialylated glycans. These glycoforms were associated with enhanced Fc-mediated reduction of viral replication and enhanced Fc receptor binding and were consistent with transcriptional profiling of glycosyltransferases in peripheral B cells. These data suggest that B cell programs tune antibody glycosylation actively in an antigen-specific manner, potentially contributing to antiviral control during HIV infection.
57
Choi Y, Griswold KE, Bailey-Kellogg C. Structure-based redesign of proteins for minimal T-cell epitope content. J Comput Chem 2013; 34:879-91. [PMID: 23299435 PMCID: PMC3763725 DOI: 10.1002/jcc.23213] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 09/08/2012] [Revised: 11/16/2012] [Accepted: 11/28/2012] [Indexed: 12/31/2022]
Abstract
The protein universe displays a wealth of therapeutically relevant activities, but T-cell driven immune responses to non-"self" biological agents present a major impediment to harnessing the full diversity of these molecular functions. Mutagenic T-cell epitope deletion seeks to mitigate the immune response, but can typically address only a small number of epitopes. Here, we pursue a "bottom-up" approach that redesigns an entire protein to remain native-like but contain few if any immunogenic epitopes. We do so by extending the Rosetta flexible-backbone protein design software with an epitope scoring mechanism and appropriate constraints. The method is benchmarked with a diverse panel of proteins and applied to three targets of therapeutic interest. We show that the deimmunized designs indeed have minimal predicted epitope content and are native-like in terms of various quality measures, and moreover that they display levels of native sequence recovery comparable to those of non-deimmunized designs.
58
Paul S, Friedman AM, Bailey-Kellogg C, Craig BA. Bayesian reconstruction of P(r) directly from two-dimensional detector images via a Markov chain Monte Carlo method. J Appl Crystallogr 2013; 46:404-414. [PMID: 23596342 DOI: 10.1107/s002188981300109x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 06/26/2012] [Accepted: 01/11/2013] [Indexed: 11/10/2022]
Abstract
The interatomic distance distribution, P(r), is a valuable tool for evaluating the structure of a molecule in solution and represents the maximum structural information that can be derived from solution scattering data without further assumptions. Most current instrumentation for scattering experiments (typically CCD detectors) generates a finely pixelated two-dimensional image. In continuation of the standard practice with earlier one-dimensional detectors, these images are typically reduced to a one-dimensional profile of scattering intensities, I(q), by circular averaging of the two-dimensional image. Indirect Fourier transformation methods are then used to reconstruct P(r) from I(q). Substantial advantages in data analysis, however, could be achieved by directly estimating the P(r) curve from the two-dimensional images. This article describes a Bayesian framework, using a Markov chain Monte Carlo method, for estimating the parameters of the indirect transform, and thus P(r), directly from the two-dimensional images. Using simulated detector images, it is demonstrated that this method yields P(r) curves nearly identical to the reference P(r). Furthermore, an approach for evaluating spatially correlated errors (such as those that arise from a detector point spread function) is evaluated. Accounting for these errors further improves the precision of the P(r) estimation. Experimental scattering data, where no ground truth reference P(r) is available, are used to demonstrate that this method yields a scattering and detector model that more closely reflects the two-dimensional data, as judged by smaller residuals in cross-validation, than P(r) obtained by indirect transformation of a one-dimensional profile.
Finally, the method allows concurrent estimation of the beam center and Dmax, the longest interatomic distance in P(r), as part of the Bayesian Markov chain Monte Carlo method, reducing experimental effort and providing a well defined protocol for these parameters while also allowing estimation of the covariance among all parameters. This method provides parameter estimates of greater precision from the experimental data. The observed improvement in precision for the traditionally problematic Dmax is particularly noticeable.
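For readers unfamiliar with the conventional reduction step that this abstract contrasts itself with, the following is a minimal sketch of circular averaging: pixel intensities are averaged over concentric rings around an assumed beam center to give a one-dimensional profile. It is not the authors' Bayesian method. The function name `circular_average`, the nested-list image representation, and the equal-width ring binning are all illustrative assumptions; radii are in pixel units rather than q, and the image is assumed to be larger than a single pixel.

```python
import math


def circular_average(image, center, n_bins):
    """Average a 2D image (list of rows) over concentric rings.

    Hypothetical helper for illustration only. Returns (radii, profile):
    ring mid-radii in pixels and the mean intensity in each ring
    (None for rings that contain no pixels).
    """
    c_row, c_col = center
    # Distance of every pixel from the assumed beam center.
    dist = [[math.hypot(r - c_row, c - c_col) for c in range(len(row))]
            for r, row in enumerate(image)]
    r_max = max(max(row) for row in dist)
    width = r_max / n_bins  # equal-width rings (an assumption)
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for row_vals, row_dist in zip(image, dist):
        for value, d in zip(row_vals, row_dist):
            # The pixel at exactly r_max is folded into the outermost ring.
            k = min(int(d / width), n_bins - 1)
            sums[k] += value
            counts[k] += 1
    radii = [(k + 0.5) * width for k in range(n_bins)]
    profile = [s / n if n else None for s, n in zip(sums, counts)]
    return radii, profile


# Usage: a flat image averages to the same constant in every ring.
flat = [[3.0] * 21 for _ in range(21)]
radii, profile = circular_average(flat, center=(10, 10), n_bins=5)
```

Because each ring collapses many pixels into a single mean, per-pixel information such as spatially correlated detector errors is lost at this step, which is the motivation the abstract gives for working directly with the two-dimensional image.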
59
Brown EP, Licht AF, Dugast AS, Choi I, Bailey-Kellogg C, Alter G, Ackerman ME. High-throughput, multiplexed IgG subclassing of antigen-specific antibodies from clinical samples. J Immunol Methods 2012; 386:117-23. [PMID: 23023091 DOI: 10.1016/j.jim.2012.09.007] [Citation(s) in RCA: 170] [Impact Index Per Article: 14.2] [Received: 03/13/2012] [Revised: 09/18/2012] [Accepted: 09/19/2012] [Indexed: 11/27/2022]
Abstract
In vivo, the activity of antibodies relies critically on properties of both the variable domain, responsible for antigen recognition, and the constant domain, responsible for innate immune recognition. Here, we describe a flexible, microsphere-based array format for capturing information about both functional ends of disease-specific antibodies from complex, polyclonal clinical serum samples. Using minimal serum, we demonstrate IgG subclass profiling of multiple antibody specificities. We further capture and determine the subclass of epitope-specific antibodies. The data generated in this array provide a profile of the humoral immune response with multi-dimensional metrics regarding properties of both variable and constant IgG domains. Significantly, these properties are assessed simultaneously, and therefore information about the relationship between variable and constant domain characteristics is captured, and can be used to predict functions such as antibody effector activity.
60
Osipovitch DC, Parker AS, Makokha CD, Desrosiers J, Kett WC, Moise L, Bailey-Kellogg C, Griswold KE. Design and analysis of immune-evading enzymes for ADEPT therapy. Protein Eng Des Sel 2012; 25:613-23. [PMID: 22898588 DOI: 10.1093/protein/gzs044] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.9] [Indexed: 12/12/2022]
Abstract
The unparalleled specificity and activity of therapeutic proteins have reshaped many aspects of modern clinical practice, and aggressive development of new protein drugs promises a continued revolution in disease therapy. As a result of their biological origins, however, therapeutic proteins present unique design challenges for the biomolecular engineer. For example, protein drugs are subject to immune surveillance within the patient's body; this anti-drug immune response can compromise therapeutic efficacy and even threaten patient safety. Thus, there is a growing demand for broadly applicable protein deimmunization strategies. We have recently developed optimization algorithms that integrate computational prediction of T-cell epitopes and bioinformatics-based assessment of the structural and functional consequences of epitope-deleting mutations. Here, we describe the first experimental validation of our deimmunization algorithms using Enterobacter cloacae P99 β-lactamase, a component of antibody-directed enzyme prodrug cancer therapies. Compared with wild-type or a previously deimmunized variant, our computationally optimized sequences exhibited significantly less in vitro binding to human type II major histocompatibility complex immune molecules. At the same time, our globally optimal design exhibited wild-type catalytic proficiency. We conclude that our deimmunization algorithms guide the protein engineer towards promising immunoevasive candidates and thereby have the potential to streamline biotherapeutic development.
61
Wang T, Kettenbach AN, Gerber SA, Bailey-Kellogg C. Response to 'Comments on "MMFPh: A Maximal Motif Finder for Phosphoproteomics Datasets"'. Bioinformatics 2012. [DOI: 10.1093/bioinformatics/bts347] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/12/2022]
62
Wang T, Kettenbach AN, Gerber SA, Bailey-Kellogg C. MMFPh: a maximal motif finder for phosphoproteomics datasets. Bioinformatics 2012; 28:1562-70. [PMID: 22531218 PMCID: PMC3371830 DOI: 10.1093/bioinformatics/bts195] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Indexed: 01/23/2023]
Abstract
MOTIVATION Protein phosphorylation, driven by specific recognition of substrates by kinases and phosphatases, plays central roles in a variety of important cellular processes such as signaling and enzyme activation. Mass spectrometry enables the determination of phosphorylated peptides (and thereby proteins) in scenarios ranging from targeted in vitro studies to in vivo cell lysates under particular conditions. The characterization of commonalities among identified phosphopeptides provides insights into the specificities of the kinases involved in a study. Several algorithms have been developed to uncover linear motifs representing position-specific amino acid patterns in sets of phosphopeptides. To more fully capture the available information, reduce sensitivity to both parameter choices and natural experimental variation, and develop more precise characterizations of kinase specificities, it is necessary to determine all statistically significant motifs represented in a dataset. RESULTS We have developed MMFPh (Maximal Motif Finder for Phosphoproteomics datasets), which extends the approach of the popular phosphorylation motif software Motif-X (Schwartz and Gygi, 2005) to identify all statistically significant motifs and return the maximal ones (those not subsumed by motifs with more fixed amino acids). In tests with both synthetic and experimental data, we show that MMFPh finds important motifs missed by the greedy approach of Motif-X, while also finding more motifs that are more characteristic of the dataset relative to the background proteome. Thus MMFPh is in some sense both more sensitive and more specific in characterizing the involved kinases. We also show that MMFPh compares favorably to other recent methods for finding phosphorylation motifs. Furthermore, MMFPh is less dependent on parameter choices. 
We support this powerful new approach with a web interface so that it may become a useful tool for studies of kinase specificity and phosphorylation site prediction. AVAILABILITY A web server is at www.cs.dartmouth.edu/~cbk/.
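The notion of a "maximal" motif (one not subsumed by a motif with more fixed amino acids) can be illustrated with a small sketch. This is not the MMFPh implementation: it assumes the statistically significant motifs have already been found by some other means, and it assumes a hypothetical representation in which motifs are equal-length, aligned strings with `.` marking unconstrained positions. The names `subsumes` and `maximal_motifs` are illustrative.

```python
def subsumes(specific, general, wildcard="."):
    """True if `specific` fixes every residue that `general` fixes, plus at least one more.

    Both motifs are assumed to be aligned strings of equal length.
    """
    if len(specific) != len(general) or specific == general:
        return False
    # Every fixed position of `general` must carry the same residue in `specific`.
    return all(g == wildcard or g == s for s, g in zip(specific, general))


def maximal_motifs(motifs):
    """Keep only motifs that no other motif in the collection subsumes."""
    return [m for m in motifs if not any(subsumes(other, m) for other in motifs)]


# Usage: hypothetical motifs centered on a phosphoserine.
significant = ["...S.P..", "..RS.P..", "...S....", "...SD..."]
print(maximal_motifs(significant))  # ['..RS.P..', '...SD...']
```

Here `...S....` and `...S.P..` are dropped because `..RS.P..` fixes the same positions and more, while `...SD...` survives because no other motif in the set extends it.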
63
Kettenbach AN, Wang T, Faherty BK, Madden DR, Knapp S, Bailey-Kellogg C, Gerber SA. Rapid determination of multiple linear kinase substrate motifs by mass spectrometry. CHEMISTRY & BIOLOGY 2012; 19:608-18. [PMID: 22633412 PMCID: PMC3366114 DOI: 10.1016/j.chembiol.2012.04.011] [Citation(s) in RCA: 58] [Impact Index Per Article: 4.8] [Received: 09/10/2011] [Revised: 04/06/2012] [Accepted: 04/10/2012] [Indexed: 01/02/2023]
Abstract
Kinase-substrate recognition depends on the chemical properties of the phosphorylatable residue as well as the surrounding linear sequence motif. Detailed knowledge of these characteristics increases the confidence of linking identified phosphorylation sites to kinases, predicting phosphorylation sites, and designing optimal peptide substrates. Here, we present a mass spectrometry-based approach for determining linear kinase substrate motifs by elaborating the positional and chemical preference of the kinase for a phosphorylatable residue using libraries of naturally-occurring peptides that are amenable to peptide identification by commonly used proteomics platforms. We applied this approach to a structurally and functionally diverse set of purified kinases, which recapitulated their previously described substrate motifs and discovered additional ones, including preferences of certain kinases for phosphorylatable residues adjacent to peptide termini. Furthermore, we identify specific and distinguishable motif elements for the four members of the polo-like kinase (Plk) family and verify members of these motif elements for Plk1 in vivo.
64
Abstract
BACKGROUND DNA shuffling generates combinatorial libraries of chimeric genes by stochastically recombining parent genes. The resulting libraries are subjected to large-scale genetic selection or screening to identify those chimeras with favorable properties (e.g., enhanced stability or enzymatic activity). While DNA shuffling has been applied quite successfully, it is limited by its homology-dependent, stochastic nature. Consequently, it is used only with parents of sufficient overall sequence identity, and provides no control over the resulting chimeric library. RESULTS This paper presents efficient methods to extend the scope of DNA shuffling to handle significantly more diverse parents and to generate more predictable, optimized libraries. Our CODNS (cross-over optimization for DNA shuffling) approach employs polynomial-time dynamic programming algorithms to select codons for the parental amino acids, allowing for zero or a fixed number of conservative substitutions. We first present efficient algorithms to optimize the local sequence identity or the nearest-neighbor approximation of the change in free energy upon annealing, objectives that were previously optimized by computationally-expensive integer programming methods. We then present efficient algorithms for more powerful objectives that seek to localize and enhance the frequency of recombination by producing "runs" of common nucleotides either overall or according to the sequence diversity of the resulting chimeras. We demonstrate the effectiveness of CODNS in choosing codons and allocating substitutions to promote recombination between parents targeted in earlier studies: two GAR transformylases (41% amino acid sequence identity), two very distantly related DNA polymerases, Pol X and β (15%), and beta-lactamases of varying identity (26-47%). 
CONCLUSIONS Our methods provide the protein engineer with a new approach to DNA shuffling that supports substantially more diverse parents, is more deterministic, and generates more predictable and more diverse chimeric libraries.
65
He L, Friedman AM, Bailey-Kellogg C. A divide-and-conquer approach to determine the Pareto frontier for optimization of protein engineering experiments. Proteins 2012; 80:790-806. [PMID: 22180081 PMCID: PMC4939273 DOI: 10.1002/prot.23237] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.7] [Received: 07/07/2011] [Revised: 10/06/2011] [Accepted: 10/21/2011] [Indexed: 01/07/2023]
Abstract
In developing improved protein variants by site-directed mutagenesis or recombination, there are often competing objectives that must be considered in designing an experiment (selecting mutations or breakpoints): stability versus novelty, affinity versus specificity, activity versus immunogenicity, and so forth. Pareto optimal experimental designs make the best trade-offs between competing objectives. Such designs are not "dominated"; that is, no other design is better than a Pareto optimal design for one objective without being worse for another objective. Our goal is to produce all the Pareto optimal designs (the Pareto frontier), to characterize the trade-offs and suggest designs most worth considering, but to avoid explicitly considering the large number of dominated designs. To do so, we develop a divide-and-conquer algorithm, Protein Engineering Pareto FRontier (PEPFR), that hierarchically subdivides the objective space, using appropriate dynamic programming or integer programming methods to optimize designs in different regions. This divide-and-conquer approach is efficient in that the number of divisions (and thus calls to the optimizer) is directly proportional to the number of Pareto optimal designs. We demonstrate PEPFR with three protein engineering case studies: site-directed recombination for stability and diversity via dynamic programming, site-directed mutagenesis of interacting proteins for affinity and specificity via integer programming, and site-directed mutagenesis of a therapeutic protein for activity and immunogenicity via integer programming. We show that PEPFR is able to effectively produce all the Pareto optimal designs, discovering many more designs than previous methods. The characterization of the Pareto frontier provides additional insights into the local stability of design choices as well as global trends leading to trade-offs between competing criteria.
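The dominance definition in this abstract can be made concrete with a short sketch. The code below is a naive filter that compares every design against every other one, which is exactly the explicit enumeration that PEPFR is designed to avoid; it is shown only to illustrate what "Pareto optimal" means, not how PEPFR computes the frontier. It assumes each design is summarized as a tuple of objective scores where lower is better, and the function names and example scores are hypothetical.

```python
def dominates(a, b):
    """True if design `a` is no worse than `b` on every objective and
    strictly better on at least one (lower scores are better)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_frontier(designs):
    """Return the designs that no other design dominates (brute force, O(n^2))."""
    return [d for d in designs if not any(dominates(o, d) for o in designs)]


# Usage: made-up (energy, epitope score) pairs for six candidate designs.
designs = [(1.0, 9.0), (2.0, 7.0), (3.0, 8.0), (4.0, 4.0), (5.0, 5.0), (6.0, 1.0)]
print(pareto_frontier(designs))
# [(1.0, 9.0), (2.0, 7.0), (4.0, 4.0), (6.0, 1.0)]
```

In this toy set, (3.0, 8.0) is dominated by (2.0, 7.0) and (5.0, 5.0) by (4.0, 4.0); the remaining four designs each represent a distinct trade-off between the two objectives.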
66
Xiong F, Friedman AM, Bailey-Kellogg C. Planning combinatorial disulfide cross-links for protein fold determination. BMC Bioinformatics 2011; 12 Suppl 12:S5. [PMID: 22168447 PMCID: PMC3247086 DOI: 10.1186/1471-2105-12-s12-s5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Fold recognition techniques take advantage of the limited number of overall structural organizations, and have become increasingly effective at identifying the fold of a given target sequence. However, in the absence of sufficient sequence identity, it remains difficult for fold recognition methods to always select the correct model. While a native-like model is often among a pool of highly ranked models, it is not necessarily the highest-ranked one, and the model rankings depend sensitively on the scoring function used. Structure elucidation methods can then be employed to decide among the models based on relatively rapid biochemical/biophysical experiments.
RESULTS: This paper presents an integrated computational-experimental method to determine the fold of a target protein by probing it with a set of planned disulfide cross-links. We start with predicted structural models obtained by standard fold recognition techniques. In a first stage, we characterize the fold-level differences between the models in terms of topological (contact) patterns of secondary structure elements (SSEs), and select a small set of SSE pairs that differentiate the folds. In a second stage, we determine a set of residue-level cross-links to probe the selected SSE pairs. Each stage employs an information-theoretic planning algorithm to maximize information gain while minimizing experimental complexity, along with a Bayes error plan assessment framework to characterize the probability of making a correct decision once data for the plan are collected. By focusing on overall topological differences and planning cross-linking experiments to probe them, our fold determination approach is robust to noise and uncertainty in the models (e.g., threading misalignment) and in the actual structure (e.g., flexibility). We demonstrate the effectiveness of our approach in case studies for a number of CASP targets, showing that the optimized plans have low risk of error while testing only a small portion of the quadratic number of possible cross-link candidates. Simulation studies with these plans further show that they do a very good job of selecting the correct model, according to cross-links simulated from the actual crystal structures.
CONCLUSIONS: Fold determination can overcome scoring limitations in purely computational fold recognition methods, while requiring less experimental effort than traditional protein structure determination approaches.
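As a rough illustration of the information-gain criterion mentioned in this abstract, the following Python sketch scores hypothetical contact experiments by how much each would reduce uncertainty over a set of candidate fold models. It assumes noise-free binary outcomes and a uniform prior over four models; the experiment names, the predicted contacts, and the functions `entropy` and `information_gain` are illustrative assumptions, not the paper's implementation, which also accounts for experimental noise and complexity.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(prior, predictions):
    """Expected entropy reduction over models from one yes/no experiment.

    prior: probability of each candidate model.
    predictions: for each model, whether it predicts a contact (True/False).
    Assumes a noise-free experiment (an assumption made for this sketch).
    """
    gain = entropy(prior)
    for outcome in (True, False):
        mass = sum(p for p, pred in zip(prior, predictions) if pred == outcome)
        if mass > 0:
            posterior = [p / mass for p, pred in zip(prior, predictions)
                         if pred == outcome]
            gain -= mass * entropy(posterior)
    return gain

# Four hypothetical models with a uniform prior.
prior = [0.25, 0.25, 0.25, 0.25]

# Hypothetical SSE-pair experiments and the contact each model predicts.
experiments = {
    "SSE1-SSE3": [True, True, False, False],   # splits the models evenly
    "SSE2-SSE4": [True, False, False, False],  # singles out one model
    "SSE2-SSE5": [True, True, True, True],     # cannot distinguish models
}

best = max(experiments, key=lambda name: information_gain(prior, experiments[name]))
print(best)
```

Under these assumptions, the experiment that splits the models most evenly carries the most information, while one on which all models agree carries none.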
|
67
|
Martin JW, Yan AK, Bailey-Kellogg C, Zhou P, Donald BR. A geometric arrangement algorithm for structure determination of symmetric protein homo-oligomers from NOEs and RDCs. J Comput Biol 2011; 18:1507-23. [PMID: 22035328 DOI: 10.1089/cmb.2011.0173] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 10/15/2022]
Abstract
Nuclear magnetic resonance (NMR) spectroscopy is a primary tool to perform structural studies of proteins in physiologically-relevant solution conditions. Restraints on distances between pairs of nuclei in the protein, derived from the nuclear Overhauser effect (NOE), provide information about the structure of the protein in its folded state. NMR studies of symmetric protein homo-oligomers present a unique challenge. Using X-filtered NOESY experiments, it is possible to determine whether an NOE restrains a pair of protons across different subunits or within a single subunit, but current experimental techniques are unable to determine in which subunits the restrained protons lie. Consequently, it is difficult to assign NOEs to particular pairs of subunits with certainty, thus hindering the structural analysis of the oligomeric state. Computational approaches are needed to address this subunit ambiguity, but traditional solutions often rely on stochastic search coupled with simulated annealing and simulations of simplified molecular dynamics, which have many tunable parameters that must be chosen carefully and can also fail to report structures consistent with the experimental restraints. In addition, these traditional approaches rarely provide guarantees on running time or solution quality. We reduce the structure determination of homo-oligomers with cyclic symmetry to computing geometric arrangements of unions of annuli in a plane. Our algorithm, DISCO, runs in expected O(n²) time, where n is the number of distance restraints, potentially assigned ambiguously. DISCO is guaranteed to report the exact set of oligomer structures consistent with the distance restraints and also with orientational restraints from residual dipolar couplings (RDCs). We demonstrate our method using two symmetric protein complexes: the trimeric E. coli diacylglycerol kinase (DAGK) and a dimeric mutant of the immunoglobulin-binding domain B1 of streptococcal protein G (GB1). In both cases, DISCO computes oligomer structures with high precision and also finds distance restraints that are either mutually inconsistent or inconsistent with the RDCs. The entire DISCO protocol has been completely automated in a software package that is freely available and open-source at www.cs.duke.edu/donaldlab/software.php.
|
68
|
Parker AS, Griswold KE, Bailey-Kellogg C. Optimization of combinatorial mutagenesis. J Comput Biol 2011; 18:1743-56. [PMID: 21923411 DOI: 10.1089/cmb.2011.0152] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Indexed: 11/13/2022]
Abstract
Protein engineering by combinatorial site-directed mutagenesis evaluates a portion of the sequence space near a target protein, seeking variants with improved properties (e.g., stability, activity, immunogenicity). In order to improve the hit-rate of beneficial variants in such mutagenesis libraries, we develop methods to select optimal positions and corresponding sets of the mutations that will be used, in all combinations, in constructing a library for experimental evaluation. Our approach, OCoM (Optimization of Combinatorial Mutagenesis), encompasses both degenerate oligonucleotides and specified point mutations, and can be directed accordingly by requirements of experimental cost and library size. It evaluates the quality of the resulting library by one- and two-body sequence potentials, averaged over the variants. To ensure that it is not simply recapitulating extant sequences, it balances the quality of a library with an explicit evaluation of the novelty of its members. We show that, despite dealing with a combinatorial set of variants, in our approach the resulting library optimization problem is actually isomorphic to single-variant optimization. By the same token, this means that the two-body sequence potential results in an NP-hard optimization problem. We present an efficient dynamic programming algorithm for the one-body case and a practically-efficient integer programming approach for the general two-body case. We demonstrate the effectiveness of our approach in designing libraries for three different case study proteins targeted by previous combinatorial libraries--a green fluorescent protein, a cytochrome P450, and a beta lactamase. We found that OCoM worked quite efficiently in practice, requiring only 1 hour even for the massive design problem of selecting 18 mutations to generate 10⁷ variants of a 443-residue P450. 
We demonstrate the general ability of OCoM in enabling the protein engineer to explore and evaluate trade-offs between quality and novelty as well as library construction technique, and identify optimal libraries for experimental evaluation.
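One way to see why averaging a one-body potential over a combinatorial library stays tractable is that the library average decomposes into a sum of per-position averages. The Python sketch below checks this identity by brute force on a tiny example. The positions, residue choices, and potential values are made up for illustration, and the function names are hypothetical; this is not the OCoM algorithm itself.

```python
from itertools import product

def library_average_bruteforce(choices, phi):
    """Average one-body score over every variant in the combinatorial library."""
    variants = list(product(*choices))
    total = sum(sum(phi[i][aa] for i, aa in enumerate(variant))
                for variant in variants)
    return total / len(variants)

def library_average_decomposed(choices, phi):
    """Same average, computed position by position without enumeration."""
    return sum(sum(phi[i][aa] for aa in options) / len(options)
               for i, options in enumerate(choices))

# Hypothetical library: allowed amino acids at three positions.
choices = [["A", "V"], ["L"], ["K", "R", "E"]]

# Hypothetical one-body potential: one score per (position, amino acid).
phi = [
    {"A": 1.0, "V": 3.0},
    {"L": 0.5},
    {"K": 2.0, "R": 4.0, "E": 0.0},
]

brute = library_average_bruteforce(choices, phi)
fast = library_average_decomposed(choices, phi)
print(brute, fast)
```

The brute-force version enumerates all 2 x 1 x 3 variants, while the decomposed version touches each position once, so its cost grows with the number of positions rather than with the library size.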
|
69
|
Parker AS, Griswold KE, Bailey-Kellogg C. Optimization of therapeutic proteins to delete T-cell epitopes while maintaining beneficial residue interactions. J Bioinform Comput Biol 2011; 9:207-29. [PMID: 21523929 DOI: 10.1142/s0219720011005471] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Received: 02/01/2011] [Revised: 02/28/2011] [Accepted: 03/01/2011] [Indexed: 11/18/2022]
Abstract
Exogenous enzymes, signaling peptides, and other classes of nonhuman proteins represent a potentially massive but largely untapped pool of biotherapeutic agents. Adapting a foreign protein for therapeutic use poses numerous design challenges. We focus here on one significant problem: modifying the protein to mitigate the immune response mounted against "non-self" proteins, while not adversely affecting the protein's stability or therapeutic activity. In order to propose such variants suitable for experimental evaluation, this paper develops a computational method to select sets of mutations predicted to delete immunogenic T-cell epitopes, as evaluated by a 9-mer potential, while simultaneously maintaining important residues and residue interactions, as evaluated by one- and two-body potentials. While this design problem is NP-hard, we develop an integer programming approach that works very well in practice. We demonstrate the effectiveness of our approach by developing plans for biotherapeutic proteins that, in previous studies, have been partially deimmunized via extensive experimental characterization and modification of limited segments. In contrast, our global optimization technique considers an entire protein and accounts for all residues, residue interactions, and epitopes in proposing candidates worth subjecting to experimental evaluation.
|
70
|
Chandola H, Yan AK, Potluri S, Donald BR, Bailey-Kellogg C. NMR structural inference of symmetric homo-oligomers. J Comput Biol 2011; 18:1757-75. [PMID: 21718128 DOI: 10.1089/cmb.2010.0327] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/13/2022]
Abstract
Symmetric homo-oligomers represent a majority of proteins, and determining their structures helps elucidate important biological processes, including ion transport, signal transduction, and transcriptional regulation. In order to account for the noise and sparsity in the distance restraints used in Nuclear Magnetic Resonance (NMR) structure determination of cyclic (C_n) symmetric homo-oligomers, and the resulting uncertainty in the determined structures, we develop a Bayesian structural inference approach. In contrast to traditional NMR structure determination methods, which identify a small set of low-energy conformations, the inferential approach characterizes the entire posterior distribution of conformations. Unfortunately, traditional stochastic techniques for inference may under-sample the rugged landscape of the posterior, missing important contributions from high-quality individual conformations and not accounting for the possible aggregate effects on inferred quantities from numerous unsampled conformations. However, by exploiting the geometry of symmetric homo-oligomers, we develop an algorithm that provides provable guarantees for the posterior distribution and the inferred mean atomic coordinates. Using experimental restraints for three proteins, we demonstrate that our approach is able to objectively characterize the structural diversity supported by the data. By simulating spurious and missing restraints, we further demonstrate that our approach is robust, degrading smoothly with noise and sparsity.
|
71
|
Martin JW, Yan AK, Bailey-Kellogg C, Zhou P, Donald BR. A graphical method for analyzing distance restraints using residual dipolar couplings for structure determination of symmetric protein homo-oligomers. Protein Sci 2011; 20:970-85. [PMID: 21413097 DOI: 10.1002/pro.620] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 11/01/2010] [Revised: 02/22/2011] [Accepted: 02/23/2011] [Indexed: 11/09/2022]
Abstract
High-resolution structure determination of homo-oligomeric protein complexes remains a daunting task for NMR spectroscopists. Although isotope-filtered experiments allow separation of intermolecular NOEs from intramolecular NOEs and determination of the structure of each subunit within the oligomeric state, degenerate chemical shifts of equivalent nuclei from different subunits make it difficult to assign intermolecular NOEs to nuclei from specific pairs of subunits with certainty, hindering structural analysis of the oligomeric state. Here, we introduce a graphical method, DISCO, for the analysis of intermolecular distance restraints and structure determination of symmetric homo-oligomers using residual dipolar couplings. Based on knowledge that the symmetry axis of an oligomeric complex must be parallel to an eigenvector of the alignment tensor of residual dipolar couplings, we can represent distance restraints as annuli in a plane encoding the parameters of the symmetry axis. Oligomeric protein structures with the best restraint satisfaction correspond to regions of this plane with the greatest number of overlapping annuli. This graphical analysis yields a technique to characterize the complete set of oligomeric structures satisfying the distance restraints and to quantitatively evaluate the contribution of each distance restraint. We demonstrate our method for the trimeric E. coli diacylglycerol kinase, addressing the challenges in obtaining subunit assignments for distance restraints. We also demonstrate our method on a dimeric mutant of the immunoglobulin-binding domain B1 of streptococcal protein G to show the resilience of our method to ambiguous atom assignments. In both studies, DISCO computed oligomer structures with high accuracy despite using ambiguously assigned distance restraints.
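The idea that the best-satisfying structures correspond to regions covered by the most annuli can be illustrated with a crude grid scan. The Python sketch below counts, for each grid point, how many annuli contain it and reports the maximum. The annulus centers and radii are invented for illustration, and sampling on a grid is a simplification assumed here; it is not how DISCO itself characterizes the overlap regions.

```python
import math

def annuli_containing(point, annuli):
    """Count annuli (cx, cy, r_inner, r_outer) that contain the given point."""
    x, y = point
    count = 0
    for cx, cy, r_inner, r_outer in annuli:
        distance = math.hypot(x - cx, y - cy)
        if r_inner <= distance <= r_outer:
            count += 1
    return count

def max_overlap_on_grid(annuli, lo, hi, steps):
    """Scan a (steps+1) x (steps+1) grid over [lo, hi]^2 for the best coverage."""
    best_count, best_point = -1, None
    for i in range(steps + 1):
        for j in range(steps + 1):
            point = (lo + (hi - lo) * i / steps, lo + (hi - lo) * j / steps)
            count = annuli_containing(point, annuli)
            if count > best_count:
                best_count, best_point = count, point
    return best_count, best_point

# Three hypothetical distance restraints, each drawn as an annulus in the plane.
annuli = [
    (0.0, 0.0, 1.0, 2.0),
    (2.0, 0.0, 1.0, 2.0),
    (0.0, 2.0, 1.0, 2.0),
]

best_count, best_point = max_overlap_on_grid(annuli, -1.0, 3.0, 8)
print(best_count, best_point)
```

A point covered by all three annuli satisfies every restraint in this toy setup, while a point covered by fewer annuli indicates which restraints would be violated there.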
|
72
|
Zheng W, Griswold KE, Bailey-Kellogg C. Protein fragment swapping: a method for asymmetric, selective site-directed recombination. J Comput Biol 2010; 17:459-75. [PMID: 20377457 DOI: 10.1089/cmb.2009.0189] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022]
Abstract
This article presents a new approach to site-directed recombination, swapping combinations of selected discontiguous fragments from a source protein in place of corresponding fragments of a target protein. By being both asymmetric (differentiating source and target) and selective (swapping discontiguous fragments), our method focuses experimental effort on a more restricted portion of sequence space, constructing hybrids that are more likely to have the properties that are the objective of the experiment. Furthermore, since the source and target need to be structurally homologous only locally (rather than overall), our method supports swapping fragments from functionally important regions of a source into a target "scaffold" (for example, to humanize an exogenous therapeutic protein). A protein fragment swapping plan is defined by the residue position boundaries of the fragments to be swapped; it is assessed by an average potential score over the resulting hybrid library, with singleton and pairwise terms evaluating the importance and fit of the swapped residues. While we prove that it is NP-hard to choose an optimal set of fragments under such a potential score, we develop an integer programming approach, which we call Swagmer, that works very well in practice. We demonstrate the effectiveness of our method in three swapping problems: selective recombination between beta-lactamases, activity swapping between glutathione transferases, and activity swapping between carboxylases and mutases in the purE family. We show that the selective recombination approach generates better plans (in terms of resulting potential score) than traditional site-directed recombination approaches. We also show that in all cases the optimized experiments are significantly better than ones that would result from stochastic methods.
|
73
|
Kavathekar PA, Craig BA, Friedman AM, Bailey-Kellogg C, Balkcom DJ. Characterizing the space of interatomic distance distribution functions consistent with solution scattering data. J Bioinform Comput Biol 2010; 8:315-35. [PMID: 20401948 DOI: 10.1142/s0219720010004781] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 07/21/2009] [Revised: 11/01/2009] [Accepted: 11/01/2009] [Indexed: 11/18/2022]
Abstract
Scattering of neutrons and X-rays from molecules in solution offers alternative approaches to the study of a wide range of macromolecular structures in their solution state without crystallization. We study one part of the problem of elucidating three-dimensional structure from solution scattering data, determining the distribution of interatomic distances, P(r), where r is the distance between two atoms in the protein molecule. This problem is known to be ill-conditioned: for a single observed diffraction pattern, there may be many consistent distance distribution functions, and there is a risk of overfitting the observed scattering data. We propose a new approach to avoiding this problem: accepting the validity of multiple alternative P(r) curves rather than seeking a single "best." We place linear constraints to ensure that a computed P(r) is consistent with the experimental data. The constraints enforce smoothness in the P(r) curve, ensure that the P(r) curve is a probability distribution, and allow for experimental error. We use these constraints to precisely describe the space of all consistent P(r) curves as a polytope of histogram values or Fourier coefficients. We develop a linear programming approach to sampling the space of consistent, realistic P(r) curves. On both experimental and simulated scattering data, our approach efficiently generates ensembles of such curves that display substantial diversity.
|
74
|
Parker AS, Zheng W, Griswold KE, Bailey-Kellogg C. Optimization algorithms for functional deimmunization of therapeutic proteins. BMC Bioinformatics 2010; 11:180. [PMID: 20380721 PMCID: PMC2873530 DOI: 10.1186/1471-2105-11-180] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.3] [Received: 07/28/2009] [Accepted: 04/09/2010] [Indexed: 11/10/2022]
Abstract
Background: To develop protein therapeutics from exogenous sources, it is necessary to mitigate the risks of eliciting an anti-biotherapeutic immune response. A key aspect of the response is the recognition and surface display by antigen-presenting cells of epitopes, short peptide fragments derived from the foreign protein. Thus, developing minimal-epitope variants represents a powerful approach to deimmunizing protein therapeutics. Critically, mutations selected to reduce immunogenicity must not interfere with the protein's therapeutic activity.
Results: This paper develops methods to improve the likelihood of simultaneously reducing the anti-biotherapeutic immune response while maintaining therapeutic activity. A dynamic programming approach identifies optimal and near-optimal sets of conservative point mutations to minimize the occurrence of predicted T-cell epitopes in a target protein. In contrast with existing methods, those described here integrate analysis of immunogenicity and stability/activity, are broadly applicable to any protein class, guarantee global optimality, and provide sufficient flexibility for users to limit the total number of mutations and target MHC alleles of interest. The input is simply the primary amino acid sequence of the therapeutic candidate, although crystal structures and protein family sequence alignments may also be input when available. The output is a scored list of sets of point mutations predicted to reduce the protein's immunogenicity while maintaining structure and function. We demonstrate the effectiveness of our approach in a number of case study applications, showing that, in general, our best variants are predicted to be better than those produced by previous deimmunization efforts in terms of either immunogenicity or stability, or both factors.
Conclusions: By developing global optimization algorithms leveraging well-established immunogenicity and stability prediction techniques, we provide the protein engineer with a mechanism for exploring the favorable sequence space near a targeted protein therapeutic. Our mechanism not only helps identify designs more likely to be effective, but also provides insights into the interrelated implications of design choices.
|
75
|
Zheng W, Friedman AM, Bailey-Kellogg C. Algorithms for joint optimization of stability and diversity in planning combinatorial libraries of chimeric proteins. J Comput Biol 2009; 16:1151-68. [PMID: 19645597 DOI: 10.1089/cmb.2009.0090] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/12/2022]
Abstract
In engineering protein variants by constructing and screening combinatorial libraries of chimeric proteins, two complementary and competing goals are desired: the new proteins must be similar enough to the evolutionarily-selected wild-type proteins to be stably folded, and they must be different enough to display functional variation. We present here the first method, Staversity, to simultaneously optimize stability and diversity in selecting sets of breakpoint locations for site-directed recombination. Our goal is to uncover all "undominated" breakpoint sets, for which no other breakpoint set is better in both factors. Our first algorithm finds the undominated sets serving as the vertices of the lower envelope of the two-dimensional (stability and diversity) convex hull containing all possible breakpoint sets. Our second algorithm identifies additional breakpoint sets in the concavities that are either undominated or dominated only by undiscovered breakpoint sets within a distance bound computed by the algorithm. Both algorithms are efficient, requiring only time polynomial in the numbers of residues and breakpoints, while characterizing a space defined by an exponential number of possible breakpoint sets. We applied Staversity to identify 2-10 breakpoint plans for different sets of parent proteins taken from the purE family, as well as for parent proteins TEM-1 and PSE-4 from the beta-lactamase family. The average normalized distance between our plans and the lower bound for optimal plans is around 2%. Our plans dominate most (60-90% on average for each parent set) of the plans found by other possible approaches, random sampling or explicit optimization for stability with implicit optimization for diversity. The identified breakpoint sets provide a compact representation of good plans, enabling a protein engineer to understand and account for the trade-offs between two key considerations in combinatorial chimeragenesis.
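The notion of an "undominated" breakpoint set can be made concrete with a small filter. The Python sketch below keeps every plan for which no other plan is strictly better in both stability and diversity. The plan names and scores are invented, and the convention that higher scores are better for both factors is an assumption made for this sketch; the brute-force comparison shown here is not the Staversity algorithm, which avoids enumerating the exponential space of breakpoint sets.

```python
def undominated(plans):
    """Return names of plans that no other plan beats in both scores.

    plans: list of (name, stability, diversity) tuples.
    Assumes higher is better for both scores (a convention chosen here).
    """
    keep = []
    for name, stability, diversity in plans:
        dominated = any(s > stability and d > diversity for _, s, d in plans)
        if not dominated:
            keep.append(name)
    return keep

# Hypothetical breakpoint plans with made-up (stability, diversity) scores.
plans = [
    ("A", 0.9, 0.2),
    ("B", 0.6, 0.6),
    ("C", 0.5, 0.5),  # worse than B in both factors
    ("D", 0.2, 0.9),
]

result = undominated(plans)
print(result)
```

In this toy data set, plan C is dropped because plan B is better in both factors, while A, B, and D each represent a different trade-off between the two goals.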
|