1
Guo J, Lin LF, Oraskovich SV, Rivera de Jesús JA, Listgarten J, Schaffer DV. Computationally guided AAV engineering for enhanced gene delivery. Trends Biochem Sci 2024:S0968-0004(24)00054-9. [PMID: 38531696 DOI: 10.1016/j.tibs.2024.03.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2023] [Revised: 02/22/2024] [Accepted: 03/01/2024] [Indexed: 03/28/2024]
Abstract
Gene delivery vehicles based on adeno-associated viruses (AAVs) are enabling increasing success in human clinical trials, and they offer the promise of treating a broad spectrum of both genetic and non-genetic disorders. However, delivery efficiency and targeting must be improved to enable safe and effective therapies. In recent years, considerable effort has been invested in creating AAV variants with improved delivery, and computational approaches have been increasingly harnessed for AAV engineering. In this review, we discuss how computationally designed AAV libraries are enabling directed evolution. Specifically, we highlight approaches that harness sequences outputted by next-generation sequencing (NGS) coupled with machine learning (ML) to generate new functional AAV capsids and related regulatory elements, pushing the frontier of what vector engineering and gene therapy may achieve.
Affiliation(s)
- Jingxuan Guo
  - California Institute for Quantitative Biosciences (QB3), University of California, Berkeley, CA 94720, USA
- Li F Lin
  - Department of Chemical and Biomolecular Engineering, University of California, Berkeley, CA 94720, USA
- Sydney V Oraskovich
  - Department of Bioengineering, University of California, Berkeley, CA 94720, USA; Graduate Program in Bioengineering, University of California, San Francisco and University of California, Berkeley, CA 94720, USA
- Julio A Rivera de Jesús
  - Department of Bioengineering, University of California, Berkeley, CA 94720, USA; Graduate Program in Bioengineering, University of California, San Francisco and University of California, Berkeley, CA 94720, USA; Department of Neurological Surgery, University of California, San Francisco, CA 94143, USA
- Jennifer Listgarten
  - Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720, USA
- David V Schaffer
  - California Institute for Quantitative Biosciences (QB3), University of California, Berkeley, CA 94720, USA; Department of Chemical and Biomolecular Engineering, University of California, Berkeley, CA 94720, USA; Department of Bioengineering, University of California, Berkeley, CA 94720, USA; Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720, USA.
2
Listgarten J. The perpetual motion machine of AI-generated data and the distraction of ChatGPT as a 'scientist'. Nat Biotechnol 2024; 42:371-373. [PMID: 38273064 DOI: 10.1038/s41587-023-02103-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/27/2024]
Affiliation(s)
- Jennifer Listgarten
  - Electrical Engineering and Computer Science, University of California, Berkeley, Berkeley, CA, USA.
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA.
3
Affiliation(s)
- Chloe Hsu
  - University of California, Berkeley, Berkeley, CA, USA.
4
Fannjiang C, Listgarten J. Is Novelty Predictable? Cold Spring Harb Perspect Biol 2024; 16:a041469. [PMID: 38052497 PMCID: PMC10835614 DOI: 10.1101/cshperspect.a041469] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/07/2023]
Abstract
Machine learning-based design has gained traction in the sciences, most notably in the design of small molecules, materials, and proteins, with societal applications ranging from drug development and plastic degradation to carbon sequestration. When designing objects to achieve novel property values with machine learning, one faces a fundamental challenge: how to push past the frontier of current knowledge, distilled from the training data into the model, in a manner that rationally controls the risk of failure. If one trusts learned models too much in extrapolation, one is likely to design rubbish. In contrast, if one does not extrapolate, one cannot find novelty. Herein, we ponder how one might strike a useful balance between these two extremes. We focus in particular on designing proteins with novel property values, although much of our discussion is relevant to machine learning-based design more broadly.
Affiliation(s)
- Clara Fannjiang
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California 94720, USA
- Jennifer Listgarten
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California 94720, USA
5
Zhu D, Brookes DH, Busia A, Carneiro A, Fannjiang C, Popova G, Shin D, Donohue KC, Lin LF, Miller ZM, Williams ER, Chang EF, Nowakowski TJ, Listgarten J, Schaffer DV. Optimal trade-off control in machine learning-based library design, with application to adeno-associated virus (AAV) for gene therapy. Sci Adv 2024; 10:eadj3786. [PMID: 38266077 PMCID: PMC10807795 DOI: 10.1126/sciadv.adj3786] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2023] [Accepted: 12/22/2023] [Indexed: 01/26/2024]
Abstract
Adeno-associated viruses (AAVs) hold tremendous promise as delivery vectors for gene therapies. AAVs have been successfully engineered (for instance, for more efficient and/or cell-specific delivery to numerous tissues) by creating large, diverse starting libraries and selecting for desired properties. However, these starting libraries often contain a high proportion of variants unable to assemble or package their genomes, a prerequisite for any gene delivery goal. Here, we present and showcase a machine learning (ML) method for designing AAV peptide insertion libraries that achieve fivefold higher packaging fitness than the standard NNK library with negligible reduction in diversity. To demonstrate our ML-designed library's utility for downstream engineering goals, we show that it yields approximately 10-fold more successful variants than the NNK library after selection for infection of human brain tissue, leading to a promising glial-specific variant. Moreover, our design approach can be applied to other types of libraries for AAV and beyond.
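For readers unfamiliar with the term, the "standard NNK library" named in the abstract uses a degenerate codon scheme in which N is any of A, C, G, T and K is G or T. The Python sketch below is an illustration added here, not code from the paper: it enumerates the NNK codons and summarizes what they encode under the standard genetic code. The function names and the `insert_length` parameter (with an illustrative default of 7) are hypothetical.

```python
from itertools import product

# Standard genetic code in the compact TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def nnk_codons():
    """Return every codon matching the pattern NNK (N = A/C/G/T, K = G/T)."""
    return ["".join(c) for c in product("ACGT", "ACGT", "GT")]


def summarize_nnk(insert_length=7):
    """Summarize the NNK scheme; insert_length is a hypothetical peptide length."""
    codons = nnk_codons()
    stops = sorted(c for c in codons if CODON_TABLE[c] == "*")
    amino_acids = sorted({CODON_TABLE[c] for c in codons} - {"*"})
    # Probability that a uniformly random NNK insert contains no stop codon.
    stop_free = (1 - len(stops) / len(codons)) ** insert_length
    return {
        "n_codons": len(codons),
        "n_amino_acids": len(amino_acids),
        "stop_codons": stops,
        "stop_free_fraction": stop_free,
    }


if __name__ == "__main__":
    print(summarize_nnk())
```

The scheme keeps all 20 amino acids reachable while leaving only one stop codon, but it says nothing about whether a capsid assembles or packages, which is the gap the abstract's ML-designed library targets.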
Affiliation(s)
- Danqing Zhu
  - California Institute for Quantitative Biosciences, University of California, Berkeley, Berkeley, CA 94720, USA
- David H. Brookes
  - Biophysics Graduate Group, University of California, Berkeley, Berkeley, CA 94720, USA
- Akosua Busia
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720, USA
- Ana Carneiro
  - Department of Chemical and Biomolecular Engineering, University of California, Berkeley, Berkeley, CA 94720, USA
- Galina Popova
  - Department of Anatomy, University of California San Francisco, San Francisco, CA 94143, USA
  - Department of Psychiatry and Behavioural Sciences, University of California San Francisco, San Francisco, CA 94143, USA
  - Eli and Edythe Broad Center for Regeneration Medicine and Stem Cell Research, University of California San Francisco, San Francisco, CA 94143, USA
- David Shin
  - Department of Anatomy, University of California San Francisco, San Francisco, CA 94143, USA
  - Department of Psychiatry and Behavioural Sciences, University of California San Francisco, San Francisco, CA 94143, USA
  - Eli and Edythe Broad Center for Regeneration Medicine and Stem Cell Research, University of California San Francisco, San Francisco, CA 94143, USA
- Kevin C. Donohue
  - Department of Psychiatry and Behavioural Sciences, University of California San Francisco, San Francisco, CA 94143, USA
  - School of Medicine, University of California San Francisco, San Francisco, CA 94143, USA
  - Kavli Institute of Fundamental Neuroscience, University of California San Francisco, San Francisco, CA 94143, USA
  - Weill Institute for Neurosciences, University of California San Francisco, San Francisco, CA 94143, USA
- Li F. Lin
  - Department of Chemical and Biomolecular Engineering, University of California, Berkeley, Berkeley, CA 94720, USA
- Zachary M. Miller
  - Department of Chemistry, University of California, Berkeley, Berkeley, CA 94720, USA
- Evan R. Williams
  - Department of Chemistry, University of California, Berkeley, Berkeley, CA 94720, USA
- Edward F. Chang
  - Department of Neurological Surgery, University of California San Francisco, San Francisco, CA 94143, USA
- Tomasz J. Nowakowski
  - Department of Anatomy, University of California San Francisco, San Francisco, CA 94143, USA
  - Department of Psychiatry and Behavioural Sciences, University of California San Francisco, San Francisco, CA 94143, USA
  - Eli and Edythe Broad Center for Regeneration Medicine and Stem Cell Research, University of California San Francisco, San Francisco, CA 94143, USA
  - Weill Institute for Neurosciences, University of California San Francisco, San Francisco, CA 94143, USA
  - Department of Neurological Surgery, University of California San Francisco, San Francisco, CA 94143, USA
- Jennifer Listgarten
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720, USA
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA 94720, USA
- David V. Schaffer
  - California Institute for Quantitative Biosciences, University of California, Berkeley, Berkeley, CA 94720, USA
  - Department of Chemical and Biomolecular Engineering, University of California, Berkeley, Berkeley, CA 94720, USA
  - Department of Bioengineering, University of California, Berkeley, Berkeley, CA 94720, USA
  - Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA 94720, USA
  - Helen Wills Neuroscience Institute, University of California, Berkeley, Berkeley, CA 94720, USA
  - Innovative Genomics Institute (IGI), University of California, Berkeley, Berkeley, CA 94720, USA
6
Nisonoff H, Wang Y, Listgarten J. Coherent Blending of Biophysics-Based Knowledge with Bayesian Neural Networks for Robust Protein Property Prediction. ACS Synth Biol 2023; 12:3242-3251. [PMID: 37888887 DOI: 10.1021/acssynbio.3c00217] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/28/2023]
Abstract
Predicting properties of proteins is of interest for basic biological understanding and protein engineering alike. Increasingly, machine learning (ML) approaches are being used for this task. However, the accuracy of such ML models typically degrades as test proteins stray further from the training data distribution. On the other hand, models that are more data-free, such as biophysics-based models, are typically uniformly accurate over all of the protein space, even if inferior for test points close to the training distribution. Consequently, being able to cohesively blend these two types of information within one model, as appropriate in different parts of the protein space, will improve overall performance. Herein, we tackle just this problem to yield a simple, practical, and scalable approach that can be easily implemented. In particular, we use a Bayesian formulation to integrate biophysical knowledge into neural networks. However, in doing so, a technical challenge arises: Bayesian neural networks (BNNs) enable the user to specify prior information only on the neural network weight parameters, rather than on the function values given to us from a typical biophysics-based model. Consequently, we devise a principled probabilistic method to overcome this challenge. Our approach yields intuitively pleasing results: predictions rely more heavily on the biophysical prior information when the BNN epistemic uncertainty (uncertainty arising from a lack of training data rather than sensor noise) is large and more heavily on the neural network when the epistemic uncertainty is small. We demonstrate this approach on an illustrative synthetic example, on two examples of protein property prediction (fluorescence and binding), and for generality on one small molecule property prediction problem.
Affiliation(s)
- Hunter Nisonoff
  - Center for Computational Biology, University of California, Berkeley, Berkeley, California 94720-3220, United States
- Yixin Wang
  - Department of Statistics, University of Michigan, Ann Arbor, Michigan 48109-1107, United States
- Jennifer Listgarten
  - Center for Computational Biology, University of California, Berkeley, Berkeley, California 94720-3220, United States
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, California 94720-1776, United States
7
Busia A, Listgarten J. MBE: model-based enrichment estimation and prediction for differential sequencing data. Genome Biol 2023; 24:218. [PMID: 37784130 PMCID: PMC10544408 DOI: 10.1186/s13059-023-03058-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2023] [Accepted: 09/14/2023] [Indexed: 10/04/2023]
Abstract
Characterizing differences in sequences between two conditions, such as with and without drug exposure, using high-throughput sequencing data is a prevalent problem involving quantifying changes in sequence abundances, and predicting such differences for unobserved sequences. A key shortcoming of current approaches is their extremely limited ability to share information across related but non-identical reads. Consequently, they cannot use sequencing data effectively, nor be directly applied in many settings of interest. We introduce model-based enrichment (MBE) to overcome this shortcoming. We evaluate MBE using both simulated and real data. Overall, MBE improves accuracy compared to current differential analysis methods.
Affiliation(s)
- Akosua Busia
  - Department of Electrical Engineering & Computer Science, University of California, Berkeley, Berkeley, 94720, CA, USA.
- Jennifer Listgarten
  - Department of Electrical Engineering & Computer Science, University of California, Berkeley, Berkeley, 94720, CA, USA.
8
Hsu C, Nisonoff H, Fannjiang C, Listgarten J. Learning protein fitness models from evolutionary and assay-labeled data. Nat Biotechnol 2022; 40:1114-1122. [PMID: 35039677 DOI: 10.1038/s41587-021-01146-5] [Citation(s) in RCA: 50] [Impact Index Per Article: 25.0] [Received: 04/09/2021] [Accepted: 11/02/2021] [Indexed: 01/27/2023]
Abstract
Machine learning-based models of protein fitness typically learn from either unlabeled, evolutionarily related sequences or variant sequences with experimentally measured labels. For regimes where only limited experimental data are available, recent work has suggested methods for combining both sources of information. Toward that goal, we propose a simple combination approach that is competitive with, and on average outperforms, more sophisticated methods. Our approach uses ridge regression on site-specific amino acid features combined with one probability density feature from modeling the evolutionary data. Within this approach, we find that a variational autoencoder-based probability density model showed the best overall performance, although any evolutionary density model can be used. Moreover, our analysis highlights the importance of systematic evaluations and sufficient baselines.
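The combination described in the abstract (ridge regression on site-specific amino acid features plus one probability density feature) can be sketched roughly as follows. This is a minimal illustration added here, not the authors' implementation: the one-hot encoding, the closed-form solver, the single uniform penalty `lam`, and all function names are assumptions, and `density_scores` stands in for whatever score an evolutionary density model would supply for each sequence.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def one_hot(seqs):
    """One-hot encode equal-length sequences as site-specific features."""
    index = {aa: i for i, aa in enumerate(ALPHABET)}
    length = len(seqs[0])
    X = np.zeros((len(seqs), length * len(ALPHABET)))
    for n, seq in enumerate(seqs):
        for pos, aa in enumerate(seq):
            X[n, pos * len(ALPHABET) + index[aa]] = 1.0
    return X


def build_features(seqs, density_scores):
    """Append one density feature (hypothetical input) to the one-hot features."""
    density = np.asarray(density_scores, dtype=float)[:, None]
    return np.hstack([one_hot(seqs), density])


def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


def predict(X, w, b):
    """Predict fitness for a feature matrix."""
    return X @ w + b
```

A typical call would be `w, b = fit_ridge(build_features(train_seqs, train_density), train_labels)` followed by `predict(build_features(test_seqs, test_density), w, b)`, where the variable names are placeholders. In practice the penalty strength would be chosen by cross-validation.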
Affiliation(s)
- Chloe Hsu
  - Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA.
- Hunter Nisonoff
  - Center for Computational Biology, University of California, Berkeley, USA
- Clara Fannjiang
  - Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA
- Jennifer Listgarten
  - Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA.
  - Center for Computational Biology, University of California, Berkeley, USA.
9
Aghazadeh A, Nisonoff H, Ocal O, Brookes DH, Huang Y, Koyluoglu OO, Listgarten J, Ramchandran K. Epistatic Net allows the sparse spectral regularization of deep neural networks for inferring fitness functions. Nat Commun 2021; 12:5225. [PMID: 34471113 PMCID: PMC8410946 DOI: 10.1038/s41467-021-25371-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 12/28/2020] [Accepted: 07/27/2021] [Indexed: 11/18/2022]
Abstract
Despite recent advances in high-throughput combinatorial mutagenesis assays, the number of labeled sequences available to predict molecular functions has remained small for the vastness of the sequence space combined with the ruggedness of many fitness functions. While deep neural networks (DNNs) can capture high-order epistatic interactions among the mutational sites, they tend to overfit to the small number of labeled sequences available for training. Here, we developed Epistatic Net (EN), a method for spectral regularization of DNNs that exploits evidence that epistatic interactions in many fitness functions are sparse. We built a scalable extension of EN, usable for larger sequences, which enables spectral regularization using fast sparse recovery algorithms informed by coding theory. Results on several biological landscapes show that EN consistently improves the prediction accuracy of DNNs and enables them to outperform competing models which assume other priors. EN estimates the higher-order epistatic interactions of DNNs trained on massive sequence spaces, a computational problem that otherwise takes years to solve.
Affiliation(s)
- Amirali Aghazadeh
  - Department of Electrical Engineering and Computer Sciences, Berkeley, CA, USA
- Orhan Ocal
  - Department of Electrical Engineering and Computer Sciences, Berkeley, CA, USA
- David H Brookes
  - Biophysics Graduate Group, University of California, Berkeley, CA, USA
- Yijie Huang
  - Department of Electrical Engineering and Computer Sciences, Berkeley, CA, USA
- O Ozan Koyluoglu
  - Department of Electrical Engineering and Computer Sciences, Berkeley, CA, USA
- Jennifer Listgarten
  - Department of Electrical Engineering and Computer Sciences, Berkeley, CA, USA
  - Center for Computational Biology, Berkeley, CA, USA
- Kannan Ramchandran
  - Department of Electrical Engineering and Computer Sciences, Berkeley, CA, USA.
10
Schneider P, Walters WP, Plowright AT, Sieroka N, Listgarten J, Goodnow RA, Fisher J, Jansen JM, Duca JS, Rush TS, Zentgraf M, Hill JE, Krutoholow E, Kohler M, Blaney J, Funatsu K, Luebkemann C, Schneider G. Rethinking drug design in the artificial intelligence era. Nat Rev Drug Discov 2019. [DOI: 10.1038/s41573-019-0050-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/29/2022]
11
Najm FJ, Strand C, Donovan KF, Hegde M, Sanson KR, Vaimberg EW, Sullender ME, Hartenian E, Kalani Z, Fusi N, Listgarten J, Younger ST, Bernstein BE, Root DE, Doench JG. Orthologous CRISPR-Cas9 enzymes for combinatorial genetic screens. Nat Biotechnol 2018; 36:179-189. [PMID: 29251726 PMCID: PMC5800952 DOI: 10.1038/nbt.4048] [Citation(s) in RCA: 162] [Impact Index Per Article: 27.0] [Received: 03/13/2017] [Accepted: 11/04/2017] [Indexed: 12/21/2022]
Abstract
Combinatorial genetic screening using CRISPR-Cas9 is a useful approach to uncover redundant genes and to explore complex gene networks. However, current methods suffer from interference between the single-guide RNAs (sgRNAs) and from limited gene targeting activity. To increase the efficiency of combinatorial screening, we employ orthologous Cas9 enzymes from Staphylococcus aureus and Streptococcus pyogenes. We used machine learning to establish S. aureus Cas9 sgRNA design rules and paired S. aureus Cas9 with S. pyogenes Cas9 to achieve dual targeting in a high fraction of cells. We also developed a lentiviral vector and cloning strategy to generate high-complexity pooled dual-knockout libraries to identify synthetic lethal and buffering gene pairs across multiple cell types, including MAPK pathway genes and apoptotic genes. Our orthologous approach also enabled a screen combining gene knockouts with transcriptional activation, which revealed genetic interactions with TP53. The "Big Papi" (paired aureus and pyogenes for interactions) approach described here will be widely applicable for the study of combinatorial phenotypes.
Affiliation(s)
- Fadi J Najm
  - Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
  - Department of Pathology, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, USA
  - Center for Cancer Research, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, USA
- Christine Strand
  - Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
- Mudra Hegde
  - Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
- Kendall R Sanson
  - Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
- Emma W Vaimberg
  - Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
- Ella Hartenian
  - Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
- Zohra Kalani
  - Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
- Nicolo Fusi
  - Microsoft Research New England, Cambridge, Massachusetts, USA
- Scott T Younger
  - Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
- Bradley E Bernstein
  - Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
  - Department of Pathology, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, USA
  - Center for Cancer Research, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, USA
- David E Root
  - Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
- John G Doench
  - Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
12
Fusi N, Listgarten J. Flexible Modeling of Genetic Effects on Function-Valued Traits. J Comput Biol 2017; 24:524-535. [PMID: 28056190 DOI: 10.1089/cmb.2016.0174] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
Abstract
Genome-wide association studies commonly examine one trait at a time. Occasionally they examine several related traits with the hope of increasing power; in such a setting, the traits are not generally smoothly varying in any way such as time or space. However, for function-valued traits, the trait is often smoothly varying along the axis of interest, such as space or time. For instance, in the case of longitudinal traits such as growth curves, the axis of interest is time; for spatially varying traits such as chromatin accessibility, it would be position along the genome. Although there have been efforts to perform genome-wide association studies with such function-valued traits, the statistical approaches developed for this purpose often have limitations such as requiring the trait to behave linearly in time or space, or constraining the genetic effect itself to be constant or linear in time. Herein, we present a flexible model for this problem, the Partitioned Gaussian Process, which removes many such limitations and is especially effective as the number of time points increases. The theoretical basis of this model provides machinery for handling missing and unaligned function values such as would occur when not all individuals are measured at the same time points. Furthermore, we make use of algebraic refactorizations to substantially reduce the time complexity of our model beyond the naive implementation. Finally, we apply our approach and several others to synthetic data before closing with some directions for improved modeling and statistical testing.
Affiliation(s)
- Nicolo Fusi
  - Microsoft Research, Cambridge, Massachusetts
13
Germanguz I, Listgarten J, Cinkornpumin J, Solomon A, Gaeta X, Lowry WE. Identifying gene expression modules that define human cell fates. Stem Cell Res 2016; 16:712-24. [PMID: 27108395 DOI: 10.1016/j.scr.2016.04.008] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 11/02/2015] [Revised: 04/01/2016] [Accepted: 04/07/2016] [Indexed: 10/21/2022]
Abstract
Using a compendium of cell-state-specific gene expression data, we identified genes that uniquely define cell states, including those thought to represent various developmental stages. Our analysis sheds light on human cell fate through the identification of core genes that are altered over several developmental milestones, and across regional specification. Here we present cell-type specific gene expression data for 17 distinct cell states and demonstrate that these modules of genes can in fact define cell fate. Lastly, we introduce a web-based database to disseminate the results.
Affiliation(s)
- I Germanguz
  - Molecular, Cell and Developmental Biology, UCLA, United States; Eli and Edythe Broad Center for Regenerative Medicine, UCLA, United States
- J Cinkornpumin
  - Molecular, Cell and Developmental Biology, UCLA, United States; Eli and Edythe Broad Center for Regenerative Medicine, UCLA, United States
- A Solomon
  - Molecular, Cell and Developmental Biology, UCLA, United States; Eli and Edythe Broad Center for Regenerative Medicine, UCLA, United States
- X Gaeta
  - Molecular, Cell and Developmental Biology, UCLA, United States; Eli and Edythe Broad Center for Regenerative Medicine, UCLA, United States
- W E Lowry
  - Molecular, Cell and Developmental Biology, UCLA, United States; Eli and Edythe Broad Center for Regenerative Medicine, UCLA, United States; Molecular Biology Institute, UCLA, United States.
14
Doench JG, Fusi N, Sullender M, Hegde M, Vaimberg EW, Donovan KF, Smith I, Tothova Z, Wilen C, Orchard R, Virgin HW, Listgarten J, Root DE. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat Biotechnol 2016; 34:184-191. [PMID: 26780180 PMCID: PMC4744125 DOI: 10.1038/nbt.3437] [Citation(s) in RCA: 2368] [Impact Index Per Article: 296.0] [Received: 05/07/2015] [Accepted: 11/19/2015] [Indexed: 12/12/2022]
Abstract
CRISPR-Cas9-based genetic screens are a powerful new tool in biology. By simply altering the sequence of the single-guide RNA (sgRNA), one can reprogram Cas9 to target different sites in the genome with relative ease, but the on-target activity and off-target effects of individual sgRNAs can vary widely. Here, we use recently devised sgRNA design rules to create human and mouse genome-wide libraries, perform positive and negative selection screens and observe that the use of these rules produced improved results. Additionally, we profile the off-target activity of thousands of sgRNAs and develop a metric to predict off-target sites. We incorporate these findings from large-scale, empirical data to improve our computational design rules and create optimized sgRNA libraries that maximize on-target activity and minimize off-target effects to enable more effective and efficient genetic screens and genome engineering.
Affiliation(s)
- John G Doench
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
- Nicolo Fusi
  - Microsoft Research New England, Cambridge, Massachusetts, USA
- Meagan Sullender
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
- Mudra Hegde
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
- Emma W Vaimberg
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
- Ian Smith
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
- Zuzana Tothova
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
  - Dana Farber Cancer Institute, Division of Hematologic Malignancies, Boston, Massachusetts, USA
- Craig Wilen
  - Washington University School of Medicine, Department of Pathology and Immunology, St. Louis, Missouri, USA
- Robert Orchard
  - Washington University School of Medicine, Department of Pathology and Immunology, St. Louis, Missouri, USA
- Herbert W Virgin
  - Washington University School of Medicine, Department of Pathology and Immunology, St. Louis, Missouri, USA
- David E Root
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
15
Dudley JT, Listgarten J, Stegle O, Brenner SE, Parts L. Personalized medicine: from genotypes, molecular phenotypes and the quantified self, towards improved medicine. Pac Symp Biocomput 2015:342-346. [PMID: 25592594 PMCID: PMC5893135] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2023]
Abstract
Advances in molecular profiling and sensor technologies are expanding the scope of personalized medicine beyond genotypes, providing new opportunities for developing richer and more dynamic multi-scale models of individual health. Recent studies demonstrate the value of scoring high-dimensional microbiome, immune, and metabolic traits from individuals to inform personalized medicine. Efforts to integrate multiple dimensions of clinical and molecular data towards predictive multi-scale models of individual health and wellness are already underway. Improved methods for mining and discovery of clinical phenotypes from electronic medical records and technological developments in wearable sensor technologies present new opportunities for mapping and exploring the critical yet poorly characterized "phenome" and "envirome" dimensions of personalized medicine. There are ambitious new projects underway to collect multi-scale molecular, sensor, clinical, behavioral, and environmental data streams from large population cohorts longitudinally to enable more comprehensive and dynamic models of individual biology and personalized health. Personalized medicine stands to benefit from inclusion of rich new sources and dimensions of data. However, realizing these improvements in care relies upon novel informatics methodologies, tools, and systems to make full use of these data to advance both the science and translational applications of personalized medicine.
Affiliation(s)
- Joel T Dudley
- Icahn School of Medicine at Mount Sinai, 1425 Madison Ave., New York, USA.
16
Widmer C, Lippert C, Weissbrod O, Fusi N, Kadie C, Davidson R, Listgarten J, Heckerman D. Further improvements to linear mixed models for genome-wide association studies. Sci Rep 2014; 4:6874. [PMID: 25387525 PMCID: PMC4230738 DOI: 10.1038/srep06874] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Received: 08/09/2013] [Accepted: 10/14/2014] [Indexed: 11/09/2022] Open
Abstract
We examine improvements to the linear mixed model (LMM) that better correct for population structure and family relatedness in genome-wide association studies (GWAS). LMMs rely on the estimation of a genetic similarity matrix (GSM), which encodes the pairwise similarity between every two individuals in a cohort. These similarities are estimated from single nucleotide polymorphisms (SNPs) or other genetic variants. Traditionally, all available SNPs are used to estimate the GSM. In empirical studies across a wide range of synthetic and real data, we find that modifications to this approach improve GWAS performance as measured by type I error control and power. Specifically, when only population structure is present, a GSM constructed from SNPs that well predict the phenotype in combination with principal components as covariates controls type I error and yields more power than the traditional LMM. In any setting, with or without population structure or family relatedness, a GSM consisting of a mixture of two component GSMs, one constructed from all SNPs and another constructed from SNPs that well predict the phenotype again controls type I error and yields more power than the traditional LMM. Software implementing these improvements and the experimental comparisons are available at http://microsoft.com/science.
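As an illustration of the genetic similarity matrix mentioned in this abstract, the following is a minimal sketch, assuming the commonly used construction in which each SNP is standardized and the GSM is the average outer product over SNPs. The function name `genetic_similarity_matrix` and the toy genotype data are hypothetical and are not taken from the paper or its software.

```python
import numpy as np


def genetic_similarity_matrix(snps):
    """Hypothetical helper: build an (n x n) similarity matrix from an
    (n_individuals x n_snps) array of minor-allele counts (0, 1, 2)."""
    snps = np.asarray(snps, dtype=float)
    centered = snps - snps.mean(axis=0)   # center each SNP column
    std = snps.std(axis=0)                # per-SNP standard deviation
    keep = std > 0                        # drop SNPs with no variation
    z = centered[:, keep] / std[keep]     # standardized genotypes
    return z @ z.T / z.shape[1]           # average over the retained SNPs


# Toy data (assumed for illustration): 5 individuals, 200 SNPs.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(5, 200))
K = genetic_similarity_matrix(genotypes)
```

Restricting the columns of `genotypes` to a chosen subset of SNPs before calling the function would give a GSM built from selected SNPs only, which is the kind of modification the abstract discusses; how that subset is chosen is not specified here.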
Affiliation(s)
- Christian Widmer
- eScience Group, Microsoft Research, 1100 Glendon Avenue, Suite PH1, Los Angeles, CA, 90024, United States
- Christoph Lippert
- eScience Group, Microsoft Research, 1100 Glendon Avenue, Suite PH1, Los Angeles, CA, 90024, United States
- Omer Weissbrod
- Computer Science Department, Technion - Israel Institute of Technology, Haifa 32000, Israel
- Nicolo Fusi
- eScience Group, Microsoft Research, 1100 Glendon Avenue, Suite PH1, Los Angeles, CA, 90024, United States
- Carl Kadie
- eScience Group, Microsoft Research, One Microsoft Way, Redmond, WA, 98052, United States
- Robert Davidson
- eScience Group, Microsoft Research, One Microsoft Way, Redmond, WA, 98052, United States
- Jennifer Listgarten
- eScience Group, Microsoft Research, 1100 Glendon Avenue, Suite PH1, Los Angeles, CA, 90024, United States
- David Heckerman
- eScience Group, Microsoft Research, 1100 Glendon Avenue, Suite PH1, Los Angeles, CA, 90024, United States
17
Patterson M, Gaeta X, Loo K, Edwards M, Smale S, Cinkornpumin J, Xie Y, Listgarten J, Azghadi S, Douglass SM, Pellegrini M, Lowry WE. let-7 miRNAs can act through notch to regulate human gliogenesis. Stem Cell Reports 2014; 3:758-73. [PMID: 25316189 PMCID: PMC4235151 DOI: 10.1016/j.stemcr.2014.08.015] [Citation(s) in RCA: 75] [Impact Index Per Article: 7.5] [Received: 01/19/2014] [Revised: 08/27/2014] [Accepted: 08/28/2014] [Indexed: 12/13/2022] Open
Abstract
It is clear that neural differentiation from human pluripotent stem cells generates cells that are developmentally immature. Here, we show that let-7 plays a functional role in the developmental decision making of human neural progenitors, controlling whether these cells make neurons or glia. Through gain- and loss-of-function studies on both tissue- and pluripotent-derived cells, our data show that let-7 specifically regulates decision making in this context by regulation of a key chromatin-associated protein, HMGA2. Furthermore, we provide evidence that the let-7/HMGA2 circuit acts on HES5, a NOTCH effector and well-established node that regulates fate decisions in the nervous system. These data link the let-7 circuit to NOTCH signaling and suggest that this interaction serves to regulate human developmental progression.
- let-7 miRNAs influence developmental maturity of neural progenitors
- let-7 miRNAs act through HMGA2 and NOTCH to regulate gliogenesis
- HMGA2 expression regulates access of NICD to HES5 promoter
- Induction of let-7 miRNAs can accelerate oligodendrogenesis
Affiliation(s)
- M Patterson
- Eli and Edythe Broad Center for Regenerative Medicine, UCLA, Box 957357, Los Angeles, CA 90095, USA; Department of Molecular, Cell and Developmental Biology, UCLA, 621 Charles E. Young Drive East, Los Angeles, CA 90095, USA
- X Gaeta
- Eli and Edythe Broad Center for Regenerative Medicine, UCLA, Box 957357, Los Angeles, CA 90095, USA; Department of Molecular, Cell and Developmental Biology, UCLA, 621 Charles E. Young Drive East, Los Angeles, CA 90095, USA; Molecular Biology Institute, UCLA, 611 Charles E. Young Drive East, Los Angeles, CA 90095, USA
- K Loo
- Department of Molecular, Cell and Developmental Biology, UCLA, 621 Charles E. Young Drive East, Los Angeles, CA 90095, USA
- M Edwards
- Eli and Edythe Broad Center for Regenerative Medicine, UCLA, Box 957357, Los Angeles, CA 90095, USA
- S Smale
- Eli and Edythe Broad Center for Regenerative Medicine, UCLA, Box 957357, Los Angeles, CA 90095, USA
- J Cinkornpumin
- Eli and Edythe Broad Center for Regenerative Medicine, UCLA, Box 957357, Los Angeles, CA 90095, USA; Department of Molecular, Cell and Developmental Biology, UCLA, 621 Charles E. Young Drive East, Los Angeles, CA 90095, USA
- Y Xie
- Eli and Edythe Broad Center for Regenerative Medicine, UCLA, Box 957357, Los Angeles, CA 90095, USA; Department of Molecular, Cell and Developmental Biology, UCLA, 621 Charles E. Young Drive East, Los Angeles, CA 90095, USA
- J Listgarten
- Microsoft Research, 1100 Glendon Avenue Suite PH1, Los Angeles, CA 90024, USA
- S Azghadi
- Department of Molecular, Cell and Developmental Biology, UCLA, 621 Charles E. Young Drive East, Los Angeles, CA 90095, USA
- S M Douglass
- Eli and Edythe Broad Center for Regenerative Medicine, UCLA, Box 957357, Los Angeles, CA 90095, USA; Department of Molecular, Cell and Developmental Biology, UCLA, 621 Charles E. Young Drive East, Los Angeles, CA 90095, USA
- M Pellegrini
- Eli and Edythe Broad Center for Regenerative Medicine, UCLA, Box 957357, Los Angeles, CA 90095, USA; Department of Molecular, Cell and Developmental Biology, UCLA, 621 Charles E. Young Drive East, Los Angeles, CA 90095, USA
- W E Lowry
- Eli and Edythe Broad Center for Regenerative Medicine, UCLA, Box 957357, Los Angeles, CA 90095, USA; Department of Molecular, Cell and Developmental Biology, UCLA, 621 Charles E. Young Drive East, Los Angeles, CA 90095, USA; Molecular Biology Institute, UCLA, 611 Charles E. Young Drive East, Los Angeles, CA 90095, USA.
18
Lippert C, Xiang J, Horta D, Widmer C, Kadie C, Heckerman D, Listgarten J. Greater power and computational efficiency for kernel-based association testing of sets of genetic variants. Bioinformatics 2014; 30:3206-14. [PMID: 25075117 PMCID: PMC4221116 DOI: 10.1093/bioinformatics/btu504] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Indexed: 11/12/2022]
Abstract
Motivation: Set-based variance component tests have been identified as a way to increase power in association studies by aggregating weak individual effects. However, the choice of test statistic has been largely ignored even though it may play an important role in obtaining optimal power. We compared a standard statistical test, a score test, with a recently developed likelihood ratio (LR) test. Further, when correction for hidden structure is needed, or gene–gene interactions are sought, state-of-the-art algorithms for both the score and LR tests can be computationally impractical. Thus we develop new computationally efficient methods. Results: After reviewing theoretical differences in performance between the score and LR tests, we find empirically on real data that the LR test generally has more power. In particular, on 15 of 17 real datasets, the LR test yielded at least as many associations as the score test (up to 23 more associations), whereas the score test yielded at most one more association than the LR test in the two remaining datasets. On synthetic data, we find that the LR test yielded up to 12% more associations, consistent with our results on real data, but also observe a regime of extremely small signal where the score test yielded up to 25% more associations than the LR test, consistent with theory. Finally, our computational speedups now enable (i) efficient LR testing when the background kernel is full rank, and (ii) efficient score testing when the background kernel changes with each test, as for gene–gene interaction tests. The latter yielded a factor of 2000 speedup on a cohort of size 13 500. Availability: Software available at http://research.microsoft.com/en-us/um/redmond/projects/MSCompBio/Fastlmm/. Contact: heckerma@microsoft.com. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Christoph Lippert
- eScience Research Group, Microsoft Research, Los Angeles, CA, 90024 and eScience Research Group, Microsoft Research, Redmond, WA, 98052, USA
- Jing Xiang
- eScience Research Group, Microsoft Research, Los Angeles, CA, 90024 and eScience Research Group, Microsoft Research, Redmond, WA, 98052, USA
- Danilo Horta
- eScience Research Group, Microsoft Research, Los Angeles, CA, 90024 and eScience Research Group, Microsoft Research, Redmond, WA, 98052, USA
- Christian Widmer
- eScience Research Group, Microsoft Research, Los Angeles, CA, 90024 and eScience Research Group, Microsoft Research, Redmond, WA, 98052, USA
- Carl Kadie
- eScience Research Group, Microsoft Research, Los Angeles, CA, 90024 and eScience Research Group, Microsoft Research, Redmond, WA, 98052, USA
- David Heckerman
- eScience Research Group, Microsoft Research, Los Angeles, CA, 90024 and eScience Research Group, Microsoft Research, Redmond, WA, 98052, USA
- Jennifer Listgarten
- eScience Research Group, Microsoft Research, Los Angeles, CA, 90024 and eScience Research Group, Microsoft Research, Redmond, WA, 98052, USA
19
Listgarten J, Stegle O, Morris Q, Brenner SE, Parts L. Personalized medicine: from genotypes and molecular phenotypes towards therapy- session introduction. Pac Symp Biocomput 2014; 19:224-228. [PMID: 24297549 PMCID: PMC5215523 DOI: 10.1142/9789814583220_0022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/02/2023]
20
Bartha I, Carlson JM, Brumme CJ, McLaren PJ, Brumme ZL, John M, Haas DW, Martinez-Picado J, Dalmau J, López-Galíndez C, Casado C, Rauch A, Günthard HF, Bernasconi E, Vernazza P, Klimkait T, Yerly S, O'Brien SJ, Listgarten J, Pfeifer N, Lippert C, Fusi N, Kutalik Z, Allen TM, Müller V, Harrigan PR, Heckerman D, Telenti A, Fellay J. A genome-to-genome analysis of associations between human genetic variation, HIV-1 sequence diversity, and viral control. eLife 2013; 2:e01123. [PMID: 24171102 PMCID: PMC3807812 DOI: 10.7554/elife.01123] [Citation(s) in RCA: 93] [Impact Index Per Article: 8.5] [Received: 07/01/2013] [Accepted: 09/26/2013] [Indexed: 12/31/2022] Open
Abstract
HIV-1 sequence diversity is affected by selection pressures arising from host genomic factors. Using paired human and viral data from 1071 individuals, we ran >3000 genome-wide scans, testing for associations between host DNA polymorphisms, HIV-1 sequence variation and plasma viral load (VL), while considering human and viral population structure. We observed significant human SNP associations to a total of 48 HIV-1 amino acid variants (p < 2.4 × 10⁻¹²). All associated SNPs mapped to the HLA class I region. Clinical relevance of host and pathogen variation was assessed using VL results. We identified two critical advantages to the use of viral variation for identifying host factors: (1) association signals are much stronger for HIV-1 sequence variants than VL, reflecting the 'intermediate phenotype' nature of viral variation; (2) association testing can be run without any clinical data. The proposed genome-to-genome approach highlights sites of genomic conflict and is a strategy generally applicable to studies of host-pathogen interaction. DOI: http://dx.doi.org/10.7554/eLife.01123.001.
Affiliation(s)
- István Bartha
- School of Life Sciences , École Polytechnique Fédérale de Lausanne , Lausanne , Switzerland ; Institute of Microbiology , University Hospital and University of Lausanne , Lausanne , Switzerland ; Research Group of Theoretical Biology and Evolutionary Ecology , Eötvös Loránd University and the Hungarian Academy of Sciences , Budapest , Hungary ; Swiss Institute of Bioinformatics , Lausanne , Switzerland
21
Listgarten J, Lippert C, Heckerman D. FaST-LMM-Select for addressing confounding from spatial structure and rare variants. Nat Genet 2013; 45:470-1. [PMID: 23619783 DOI: 10.1038/ng.2620] [Citation(s) in RCA: 56] [Impact Index Per Article: 5.1] [Indexed: 12/17/2022]
22
Abstract
MOTIVATION: Approaches for testing sets of variants, such as a set of rare or common variants within a gene or pathway, for association with complex traits are important. In particular, set tests allow for aggregation of weak signal within a set, can capture interplay among variants and reduce the burden of multiple hypothesis testing. Until now, these approaches did not address confounding by family relatedness and population structure, a problem that is becoming more important as larger datasets are used to increase power. RESULTS: We introduce a new approach for set tests that handles confounders. Our model is based on the linear mixed model and uses two random effects: one to capture the set association signal and one to capture confounders. We also introduce a computational speedup for two-random-effects models that makes this approach feasible even for extremely large cohorts. Using this model with both the likelihood ratio test and score test, we find that the former yields more power while controlling type I error. Application of our approach to richly structured Genetic Analysis Workshop 14 data demonstrates that our method successfully corrects for population structure and family relatedness, whereas application of our method to a 15 000 individual Crohn's disease case-control cohort demonstrates that it additionally recovers genes not recoverable by univariate analysis. AVAILABILITY: A Python-based library implementing our approach is available at http://mscompbio.codeplex.com.
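One way to write down a linear mixed model with two random effects of the kind this abstract describes is sketched below. The notation is assumed for illustration and does not come from the abstract: $\mathbf{y}$ is the phenotype vector, $\mathbf{X}$ holds fixed-effect covariates, $\mathbf{K}_{s}$ is a similarity matrix built from the variants in the tested set, and $\mathbf{K}_{c}$ is a similarity matrix capturing confounders such as relatedness and population structure.

```latex
\begin{aligned}
\mathbf{y} &= \mathbf{X}\boldsymbol{\beta} + \mathbf{g}_{s} + \mathbf{g}_{c} + \boldsymbol{\varepsilon}, \\
\mathbf{g}_{s} &\sim \mathcal{N}\!\left(\mathbf{0},\, \sigma_{s}^{2}\mathbf{K}_{s}\right), \qquad
\mathbf{g}_{c} \sim \mathcal{N}\!\left(\mathbf{0},\, \sigma_{c}^{2}\mathbf{K}_{c}\right), \qquad
\boldsymbol{\varepsilon} \sim \mathcal{N}\!\left(\mathbf{0},\, \sigma_{e}^{2}\mathbf{I}\right).
\end{aligned}
```

Under this assumed formulation, a set test compares the null hypothesis $\sigma_{s}^{2} = 0$ against the alternative $\sigma_{s}^{2} > 0$, which either a likelihood ratio test or a score test can assess.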
23
Quon G, Lippert C, Heckerman D, Listgarten J. Patterns of methylation heritability in a genome-wide analysis of four brain regions. Nucleic Acids Res 2013; 41:2095-104. [PMID: 23303775 PMCID: PMC3575819 DOI: 10.1093/nar/gks1449] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.5] [Received: 08/14/2012] [Revised: 12/07/2012] [Accepted: 12/12/2012] [Indexed: 01/08/2023] Open
Abstract
DNA methylation has been implicated in a number of diseases and other phenotypes. It is, therefore, of interest to identify and understand the genetic determinants of methylation and epigenomic variation. We investigated the extent to which genetic variation in cis-DNA sequence explains variation in CpG dinucleotide methylation in publicly available data for four brain regions from unrelated individuals, finding that 3-4% of CpG loci assayed were heritable, with a mean estimated narrow-sense heritability of 30% over the heritable loci. Over all loci, the mean estimated heritability was 3%, as compared with a recent twin-based study reporting 18%. Heritable loci were enriched for open chromatin regions and binding sites of CTCF, an influential regulator of transcription and chromatin architecture. Additionally, heritable loci were proximal to genes enriched in several known pathways, suggesting a possible functional role for these loci. Our estimates of heritability are conservative, and we suspect that the number of identified heritable loci will increase as the methylome is assayed across a broader range of cell types and the density of the tested loci is increased. Finally, we show that the number of heritable loci depends on the window size parameter commonly used to identify candidate cis-acting single-nucleotide polymorphism variants.
Affiliation(s)
- Gerald Quon
- eScience Group, Microsoft Research, 1100 Glendon Avenue, Suite PH1, Los Angeles, CA 90024, USA and Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar Street, Room 32-D516, Cambridge, MA 02139, USA
- Christoph Lippert
- eScience Group, Microsoft Research, 1100 Glendon Avenue, Suite PH1, Los Angeles, CA 90024, USA and Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar Street, Room 32-D516, Cambridge, MA 02139, USA
- David Heckerman
- eScience Group, Microsoft Research, 1100 Glendon Avenue, Suite PH1, Los Angeles, CA 90024, USA and Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar Street, Room 32-D516, Cambridge, MA 02139, USA
- Jennifer Listgarten
- eScience Group, Microsoft Research, 1100 Glendon Avenue, Suite PH1, Los Angeles, CA 90024, USA and Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar Street, Room 32-D516, Cambridge, MA 02139, USA
24
Lippert C, Listgarten J, Davidson RI, Baxter S, Poon H, Poong H, Kadie CM, Heckerman D. An exhaustive epistatic SNP association analysis on expanded Wellcome Trust data. Sci Rep 2013; 3:1099. [PMID: 23346356 PMCID: PMC3551227 DOI: 10.1038/srep01099] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.6] [Received: 09/06/2012] [Accepted: 12/17/2012] [Indexed: 11/21/2022] Open
Abstract
We present an approach for genome-wide association analysis with improved power on the Wellcome Trust data consisting of seven common phenotypes and shared controls. We achieved improved power by expanding the control set to include other disease cohorts, multiple races, and closely related individuals. Within this setting, we conducted exhaustive univariate and epistatic interaction association analyses. Use of the expanded control set identified more known associations with Crohn's disease and potential new biology, including several plausible epistatic interactions in several diseases. Our work suggests that carefully combining data from large repositories could reveal many new biological insights through increased power. As a community resource, all results have been made available through an interactive web server.
25
Morris Q, Brenner SE, Listgarten J, Stegle O. The future of genome-based medicine. Pac Symp Biocomput 2013:456-458. [PMID: 23424151 PMCID: PMC5894348] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2023]
Affiliation(s)
- Quaid Morris
- University of Toronto, Donnelly Centre, 160 College Street, Toronto, ON M5S 3E1, Canada.
26
Stegle O, Brenner SE, Morris Q, Listgarten J. Personalized medicine: from genotypes and molecular phenotypes towards computed therapy. Pac Symp Biocomput 2013; 18:171-174. [PMID: 23424122 PMCID: PMC5894351] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2023]
Affiliation(s)
- Oliver Stegle
- Max Planck Institutes Tübingen, 72076 Tübingen, Germany.
27
Zhang X, Cheng W, Listgarten J, Kadie C, Huang S, Wang W, Heckerman D. Learning transcriptional regulatory relationships using sparse graphical models. PLoS One 2012; 7:e35762. [PMID: 22586449 PMCID: PMC3346750 DOI: 10.1371/journal.pone.0035762] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 10/21/2011] [Accepted: 03/21/2012] [Indexed: 11/19/2022] Open
Abstract
Understanding the organization and function of transcriptional regulatory networks by analyzing high-throughput gene expression profiles is a key problem in computational biology. The challenges in this work are 1) the lack of complete knowledge of the regulatory relationship between the regulators and the associated genes, 2) the potential for spurious associations due to confounding factors, and 3) a number of parameters to learn that is usually larger than the number of available microarray experiments. We present a sparse (L1 regularized) graphical model to address these challenges. Our model incorporates known transcription factors and introduces hidden variables to represent possible unknown transcription and confounding factors. The expression level of a gene is modeled as a linear combination of the expression levels of known transcription factors and hidden factors. Using gene expression data covering 39,296 oligonucleotide probes from 1109 human liver samples, we demonstrate that our model better predicts out-of-sample data than a model with no hidden variables. We also show that some of the gene sets associated with hidden variables are strongly correlated with Gene Ontology categories. The software including source code is available at http://grnl1.codeplex.com.
Affiliation(s)
- Xiang Zhang
- Microsoft Research, Los Angeles, California, United States of America
- Case Western Reserve University, Cleveland, Ohio, United States of America
- University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Wei Cheng
- University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Carl Kadie
- Microsoft Research, Los Angeles, California, United States of America
- Shunping Huang
- University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Wei Wang
- University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- David Heckerman
- Microsoft Research, Los Angeles, California, United States of America
- * E-mail:
28
Carlson JM, Listgarten J, Pfeifer N, Tan V, Kadie C, Walker BD, Ndung'u T, Shapiro R, Frater J, Brumme ZL, Goulder PJR, Heckerman D. Widespread impact of HLA restriction on immune control and escape pathways of HIV-1. J Virol 2012; 86:5230-43. [PMID: 22379086 PMCID: PMC3347390 DOI: 10.1128/jvi.06728-11] [Citation(s) in RCA: 99] [Impact Index Per Article: 8.3] [Received: 11/04/2011] [Accepted: 02/20/2012] [Indexed: 11/20/2022] Open
Abstract
The promiscuous presentation of epitopes by similar HLA class I alleles holds promise for a universal T-cell-based HIV-1 vaccine. However, in some instances, cytotoxic T lymphocytes (CTL) restricted by HLA alleles with similar or identical binding motifs are known to target epitopes at different frequencies, with different functional avidities and with different apparent clinical outcomes. Such differences may be illuminated by the association of similar HLA alleles with distinctive escape pathways. Using a novel computational method featuring phylogenetically corrected odds ratios, we systematically analyzed differential patterns of immune escape across all optimally defined epitopes in Gag, Pol, and Nef in 2,126 HIV-1 clade C-infected adults. Overall, we identified 301 polymorphisms in 90 epitopes associated with HLA alleles belonging to shared supertypes. We detected differential escape in 37 of 38 epitopes restricted by more than one allele, which included 278 instances of differential escape at the polymorphism level. The majority (66 to 97%) of these resulted from the selection of unique HLA-specific polymorphisms rather than differential epitope targeting rates, as confirmed by gamma interferon (IFN-γ) enzyme-linked immunosorbent spot assay (ELISPOT) data. Discordant associations between HLA alleles and viral load were frequently observed between allele pairs that selected for differential escape. Furthermore, the total number of associated polymorphisms strongly correlated with average viral load. These studies confirm that differential escape is a widespread phenomenon and may be the norm when two alleles present the same epitope. Given the clinical correlates of immune escape, such heterogeneity suggests that certain epitopes will lead to discordant outcomes if applied universally in a vaccine.
29
Matthews PC, Listgarten J, Carlson JM, Payne R, Huang KHG, Frater J, Goedhals D, Steyn D, van Vuuren C, Paioni P, Jooste P, Ogwu A, Shapiro R, Mncube Z, Ndung'u T, Walker BD, Heckerman D, Goulder PJR. Co-operative additive effects between HLA alleles in control of HIV-1. PLoS One 2012; 7:e47799. [PMID: 23094091 PMCID: PMC3477121 DOI: 10.1371/journal.pone.0047799] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Received: 05/01/2012] [Accepted: 09/17/2012] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND: HLA class I genotype is a major determinant of the outcome of HIV infection, and the impact of certain alleles on HIV disease outcome is well studied. Recent studies have demonstrated that certain HLA class I alleles that are in linkage disequilibrium, such as HLA-A*74 and HLA-B*57, appear to function co-operatively to result in greater immune control of HIV than mediated by either single allele alone. We here investigate the extent to which HLA alleles, irrespective of linkage disequilibrium, function co-operatively. METHODOLOGY/PRINCIPAL FINDINGS: We here refined a computational approach to the analysis of >2000 subjects infected with C-clade HIV first to discern the individual effect of each allele on disease control, and second to identify pairs of alleles that mediate 'co-operative additive' effects, either to improve disease suppression or to contribute to immunological failure. We identified six pairs of HLA class I alleles that have a co-operative additive effect in mediating HIV disease control and four hazardous pairs of alleles that, occurring together, are predictive of worse disease outcomes (q<0.05 in each case). We developed a novel 'sharing score' to quantify the breadth of CD8+ T cell responses made by pairs of HLA alleles across the HIV proteome, and used this to demonstrate that successful viraemic suppression correlates with breadth of unique CD8+ T cell responses (p = 0.03). CONCLUSIONS/SIGNIFICANCE: These results identify co-operative effects between HLA Class I alleles in the control of HIV-1 in an extended Southern African cohort, and underline complementarity and breadth of the CD8+ T cell targeting as one potential mechanism for this effect.
30
Lippert C, Listgarten J, Liu Y, Kadie CM, Davidson RI, Heckerman D. FaST linear mixed models for genome-wide association studies. Nat Methods 2011; 8:833-5. [PMID: 21892150 DOI: 10.1038/nmeth.1681] [Citation(s) in RCA: 707] [Impact Index Per Article: 54.4] [Received: 04/05/2011] [Accepted: 08/02/2011] [Indexed: 02/07/2023]
Abstract
We describe factored spectrally transformed linear mixed models (FaST-LMM), an algorithm for genome-wide association studies (GWAS) that scales linearly with cohort size in both run time and memory use. On Wellcome Trust data for 15,000 individuals, FaST-LMM ran an order of magnitude faster than current efficient algorithms. Our algorithm can analyze data for 120,000 individuals in just a few hours, whereas current algorithms fail on data for even 20,000 individuals (http://mscompbio.codeplex.com/).
31
Matthews PC, Adland E, Listgarten J, Leslie A, Mkhwanazi N, Carlson JM, Harndahl M, Stryhn A, Payne RP, Ogwu A, Huang KHG, Frater J, Paioni P, Kloverpris H, Jooste P, Goedhals D, van Vuuren C, Steyn D, Riddell L, Chen F, Luzzi G, Balachandran T, Ndung'u T, Buus S, Carrington M, Shapiro R, Heckerman D, Goulder PJR. HLA-A*7401-mediated control of HIV viremia is independent of its linkage disequilibrium with HLA-B*5703. J Immunol 2011; 186:5675-86. [PMID: 21498667 DOI: 10.4049/jimmunol.1003711] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.6] [Indexed: 12/26/2022]
Abstract
The potential contribution of HLA-A alleles to viremic control in chronic HIV type 1 (HIV-1) infection has been relatively understudied compared with HLA-B. In these studies, we show that HLA-A*7401 is associated with favorable viremic control in extended southern African cohorts of >2100 C-clade-infected subjects. We present evidence that HLA-A*7401 operates an effect that is independent of HLA-B*5703, with which it is in linkage disequilibrium in some populations, to mediate lowered viremia. We describe a novel statistical approach to detecting additive effects between class I alleles in control of HIV-1 disease, highlighting improved viremic control in subjects with HLA-A*7401 combined with HLA-B*57. In common with HLA-B alleles that are associated with effective control of viremia, HLA-A*7401 presents highly targeted epitopes in several proteins, including Gag, Pol, Rev, and Nef, of which the Gag epitopes appear immunodominant. We identify eight novel putative HLA-A*7401-restricted epitopes, of which three have been defined to the optimal epitope. In common with HLA-B alleles linked with slow progression, viremic control through an HLA-A*7401-restricted response appears to be associated with the selection of escape mutants within Gag epitopes that reduce viral replicative capacity. These studies highlight the potentially important contribution of an HLA-A allele to immune control of HIV infection, which may have been concealed by a stronger effect mediated by an HLA-B allele with which it is in linkage disequilibrium. In addition, these studies identify a factor contributing to different HIV disease outcomes in individuals expressing HLA-B*5703.
Affiliation(s)
- Philippa C Matthews
- Department of Paediatrics, University of Oxford, Oxford OX1 3SY, United Kingdom
|
32
|
Rousseau CM, Lockhart DW, Listgarten J, Maley SN, Kadie C, Learn GH, Nickle DC, Heckerman DE, Deng W, Brander C, Ndung'u T, Coovadia H, Goulder PJ, Korber BT, Walker BD, Mullins JI. Rare HLA drive additional HIV evolution compared to more frequent alleles. AIDS Res Hum Retroviruses 2009; 25:297-303. [PMID: 19327049 DOI: 10.1089/aid.2008.0208] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Indexed: 12/15/2022] Open
Abstract
HIV-1 can evolve HLA-specific escape variants in response to HLA-mediated cellular immunity. HLA alleles that are common in the host population may increase the frequency of such escape variants at the population level. When loss of viral fitness is caused by immune escape variation, these variants may revert upon infection of a new host who does not have the corresponding HLA allele. Furthermore, additional escape variants may appear in response to the nonconcordant HLA alleles. Because individuals with rare HLA alleles are less likely to be infected by a partner with concordant HLA alleles, viral populations infecting hosts with rare HLA alleles may undergo a greater amount of evolution than those infecting hosts with common alleles due to the loss of preexisting escape variants followed by new immune escape. This hypothesis was evaluated using maximum likelihood phylogenetic trees of each gene from 272 full-length HIV-1 sequences. Recent viral evolution, as measured by the external branch length, was found to be inversely associated with HLA frequency in nef (p < 0.02), env (p < 0.03), and pol (p ≤ 0.05), suggesting that rare HLA alleles provide a disproportionate force driving viral evolution compared to common alleles, likely due to the loss of preexisting escape variants during early stages postinfection.
Affiliation(s)
- David W. Lockhart
- Department of Biostatistics, University of Washington, Seattle, Washington 98103
- Stephen N. Maley
- Department of Microbiology, University of Washington, Seattle, Washington 98103
- Carl Kadie
- eScience Research Group, Microsoft Research, Redmond, Washington 98052
- Gerald H. Learn
- Department of Microbiology, University of Washington, Seattle, Washington 98103
- David C. Nickle
- Department of Microbiology, University of Washington, Seattle, Washington 98103
- Wenjie Deng
- Department of Microbiology, University of Washington, Seattle, Washington 98103
- Christian Brander
- Partners AIDS Research Center, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts 02114
- Thumbi Ndung'u
- Partners AIDS Research Center, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts 02114
- HIV Pathogenesis Program, Nelson R. Mandela School of Medicine, University of KwaZulu-Natal, Durban, South Africa
- Hoosen Coovadia
- HIV Pathogenesis Program, Nelson R. Mandela School of Medicine, University of KwaZulu-Natal, Durban, South Africa
- Philip J.R. Goulder
- Partners AIDS Research Center, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts 02114
- HIV Pathogenesis Program, Nelson R. Mandela School of Medicine, University of KwaZulu-Natal, Durban, South Africa
- Department of Pediatrics, Nuffield Department of Medicine, Oxford, England
- Bette T. Korber
- Los Alamos National Laboratory, Los Alamos, New Mexico 87544
- Santa Fe Institute, Santa Fe, New Mexico 87501
- Bruce D. Walker
- Partners AIDS Research Center, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts 02114
- HIV Pathogenesis Program, Nelson R. Mandela School of Medicine, University of KwaZulu-Natal, Durban, South Africa
- Howard Hughes Medical Institute, Chevy Chase, Maryland 20815
- James I. Mullins
- Department of Microbiology, University of Washington, Seattle, Washington 98103
- Department of Medicine, University of Washington, Seattle, Washington 98103
|
33
|
Listgarten J, Brumme Z, Kadie C, Xiaojiang G, Walker B, Carrington M, Goulder P, Heckerman D. 168-P: In silico resolution of ambiguous HLA typing data. Hum Immunol 2008. [DOI: 10.1016/j.humimm.2008.08.187] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
|
34
|
Listgarten J, Brumme Z, Kadie C, Xiaojiang G, Walker B, Carrington M, Goulder P, Heckerman D. Statistical resolution of ambiguous HLA typing data. PLoS Comput Biol 2008; 4:e1000016. [PMID: 18392148 PMCID: PMC2289775 DOI: 10.1371/journal.pcbi.1000016] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.8] [Received: 10/23/2007] [Accepted: 01/30/2008] [Indexed: 11/18/2022] Open
Abstract
High-resolution HLA typing plays a central role in many areas of immunology, such as in identifying immunogenetic risk factors for disease, in studying how the genomes of pathogens evolve in response to immune selection pressures, and also in vaccine design, where identification of HLA-restricted epitopes may be used to guide the selection of vaccine immunogens. Perhaps one of the most immediate applications is in direct medical decisions concerning the matching of stem cell transplant donors to unrelated recipients. However, high-resolution HLA typing is frequently unavailable due to its high cost or the inability to re-type historical data. In this paper, we introduce and evaluate a method for statistical, in silico refinement of ambiguous and/or low-resolution HLA data. Our method, which requires an independent, high-resolution training data set drawn from the same population as the data to be refined, uses linkage disequilibrium in HLA haplotypes as well as four-digit allele frequency data to probabilistically refine HLA typings. Central to our approach is the use of haplotype inference. We introduce new methodology to this area, improving upon the Expectation-Maximization (EM)-based approaches currently used within the HLA community. Our improvements are achieved by using a parsimonious parameterization for haplotype distributions and by smoothing the maximum likelihood (ML) solution. These improvements make it possible to scale the refinement to a larger number of alleles and loci in a more computationally efficient and stable manner. We also show how to augment our method in order to incorporate ethnicity information (as HLA allele distributions vary widely according to race/ethnicity as well as geographic area), and demonstrate the potential utility of this experimentally. A tool based on our approach is freely available for research purposes at http://microsoft.com/science. 
At the core of the human adaptive immune response is the train-to-kill mechanism in which specialized immune cells are sensitized to recognize small peptides from foreign sources (e.g., from HIV or bacteria). Following this sensitization, these immune cells are then activated to kill other cells which display this same peptide (and which contain this same foreign peptide). However, in order for sensitization and killing to occur, the foreign peptide must be “paired up” with one of the infected person's other specialized immune molecules—an HLA molecule. The way in which peptides interact with these HLA molecules defines if and how an immune response will be generated. There is a huge repertoire of such HLA molecules, with almost no two people having the same set. Furthermore, a person's HLA type can determine their susceptibility to disease, or the success of a transplant, for example. However, obtaining high quality HLA data for patients is often difficult because of the great cost and specialized laboratories required, or because the data are historical and cannot be retyped with modern methods. Therefore, we introduce a statistical model which can make use of existing high-quality HLA data, to infer higher-quality HLA data from lower-quality data.
Affiliation(s)
- Jennifer Listgarten
- Microsoft Research, Redmond, Washington, United States of America
- * E-mail: (JL); (DH)
- Zabrina Brumme
- Partners AIDS Research Center, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts, United States of America
- Carl Kadie
- Microsoft Research, Redmond, Washington, United States of America
- Gao Xiaojiang
- SAIC-Frederick, National Cancer Institute, Frederick, Maryland, United States of America
- Bruce Walker
- Partners AIDS Research Center, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts, United States of America
- Howard Hughes Medical Institute, Frederick, Maryland, United States of America
- Mary Carrington
- SAIC-Frederick, National Cancer Institute, Frederick, Maryland, United States of America
- Philip Goulder
- Partners AIDS Research Center, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts, United States of America
- Department of Paediatrics, University of Oxford, Oxford, United Kingdom
- David Heckerman
- Microsoft Research, Redmond, Washington, United States of America
- * E-mail: (JL); (DH)
|
35
|
Listgarten J, Frahm N, Kadie C, Brander C, Heckerman D. A statistical framework for modeling HLA-dependent T cell response data. PLoS Comput Biol 2007; 3:1879-86. [PMID: 17937494 PMCID: PMC2014793 DOI: 10.1371/journal.pcbi.0030188] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 07/18/2007] [Accepted: 08/14/2007] [Indexed: 12/18/2022] Open
Abstract
The identification of T cell epitopes and their HLA (human leukocyte antigen) restrictions is important for applications such as the design of cellular vaccines for HIV. Traditional methods for such identification are costly and time-consuming. Recently, a more expeditious laboratory technique using ELISpot assays has been developed that allows for rapid screening of specific responses. However, this assay does not directly provide information concerning the HLA restriction of a response, a critical piece of information for vaccine design. Thus, we introduce, apply, and validate a statistical model for identifying HLA-restricted epitopes from ELISpot data. By looking at patterns across a broad range of donors, in conjunction with our statistical model, we can determine (probabilistically) which of the HLA alleles are likely to be responsible for the observed reactivities. Additionally, we can provide a good estimate of the number of false positives generated by our analysis (i.e., the false discovery rate). This model allows us to learn about new HLA-restricted epitopes from ELISpot data in an efficient, cost-effective, and high-throughput manner. We applied our approach to data from donors infected with HIV and identified many potential new HLA restrictions. Among 134 such predictions, six were confirmed in the lab and the remainder could not be ruled as invalid. These results shed light on the extent of HLA class I promiscuity, which has significant implications for the understanding of HLA class I antigen presentation and vaccine development.
At the core of the human adaptive immune response is the train-to-kill mechanism in which specialized immune cells are sensitized to recognize small peptides from foreign pathogens (e.g., HIV virus). Following this sensitization, these cells are then activated to kill other cells that display this same peptide (and that are infected by this same pathogen). However, for sensitization and killing to occur, the pathogen peptide must be “paired up” with one of the infected person's other specialized immune molecules—an HLA (human leukocyte antigen) molecule. The way in which pathogen peptides interact with these HLA molecules defines if and how an immune response will be generated, which has implications for vaccine design where one may artificially introduce select peptides to pre-train the immune system. Furthermore, there is a huge repertoire of such HLA molecules, with almost no two people having the same set. We introduce a statistical approach for identifying which HLA molecules interact with which pathogen peptides, given a particular kind of laboratory data. Our approach takes as input, data that tells us only which pathogen peptides generate a response, but not which HLA molecules support the response. Our statistical approach fills in this missing information.
Affiliation(s)
- Nicole Frahm
- Partners AIDS Research Center, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
- Carl Kadie
- Microsoft Research, Redmond, Washington, United States of America
- Christian Brander
- Partners AIDS Research Center, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
- David Heckerman
- Microsoft Research, Redmond, Washington, United States of America
- * To whom correspondence should be addressed. E-mail:
|
36
|
Young LC, Listgarten J, Trotter MJ, Andrew SE, Tron VA. Evidence that dysregulated DNA mismatch repair characterizes human nonmelanoma skin cancer. Br J Dermatol 2007; 158:59-69. [PMID: 17970804 DOI: 10.1111/j.1365-2133.2007.08249.x] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Indexed: 12/01/2022]
Abstract
BACKGROUND In addition to an established role in the repair of postreplicative DNA errors, DNA mismatch repair (MMR) proteins also contribute to cellular responses to exogenous DNA damage. Previously, we have shown that Msh2-null mice display increased sensitivity to ultraviolet (UV) B-induced tumorigenesis, but squamous cell carcinomas (SCC) generated are microsatellite stable, suggesting a role for MMR other than postreplicative repair in UV-induced cutaneous tumour formation. OBJECTIVES We questioned whether there was evidence of MMR dysfunction in human SCC, thus validating the mouse models of MMR-dependent UVB-induced skin cancer. METHODS Using tissue microarrays we examined both nuclear and cytoplasmic levels of MMR proteins MSH2, MSH6, MSH3, MLH1 and PMS2 in more than 200 cases of cutaneous SCC and basal cell carcinoma (BCC). RESULTS We found that subsets of these 10 MMR protein measures were increased in nonmelanoma skin cancer (NMSC) compared with normal epidermal samples; this was particularly true of SCC. In fact, based on post hoc tests and MMR protein distribution patterns, BCC was distinct from SCC. With the exception of nuclear MSH2, the BCC had lower levels of identified MMR protein measures than SCC. We believe this to be important because not only is SCC more aggressive than BCC, but evidence suggests that these two NMSC subtypes arise through different molecular pathways. CONCLUSIONS In combination with previously established roles for MMR proteins in response to UVB-induced DNA damage, our data point towards an expanded perspective of the importance of MMR proteins in the suppression of UVB-induced tumorigenesis and, potentially, tumour behaviour.
Affiliation(s)
- L C Young
- Department of Medical Genetics, University of Alberta, Edmonton, Alberta, Canada
|
37
|
Frahm N, Yusim K, Suscovich TJ, Adams S, Sidney J, Hraber P, Hewitt HS, Linde CH, Kavanagh DG, Woodberry T, Henry LM, Faircloth K, Listgarten J, Kadie C, Jojic N, Sango K, Brown NV, Pae E, Zaman MT, Bihl F, Khatri A, John M, Mallal S, Marincola FM, Walker BD, Sette A, Heckerman D, Korber BT, Brander C. Extensive HLA class I allele promiscuity among viral CTL epitopes. Eur J Immunol 2007; 37:2419-33. [PMID: 17705138 PMCID: PMC2628559 DOI: 10.1002/eji.200737365] [Citation(s) in RCA: 106] [Impact Index Per Article: 6.2] [Indexed: 11/07/2022]
Abstract
Promiscuous binding of T helper epitopes to MHC class II molecules has been well established, but few examples of promiscuous class I-restricted epitopes exist. To address the extent of promiscuity of HLA class I peptides, responses to 242 well-defined viral epitopes were tested in 100 subjects regardless of the individuals' HLA type. Surprisingly, half of all detected responses were seen in the absence of the originally reported restricting HLA class I allele, and only 3% of epitopes were recognized exclusively in the presence of their original allele. Functional assays confirmed the frequent recognition of HLA class I-restricted T cell epitopes on several alternative alleles across HLA class I supertypes and encoded on different class I loci. These data have significant implications for the understanding of MHC class I-restricted antigen presentation and vaccine development.
Affiliation(s)
- Nicole Frahm
- Partners AIDS Research Center, Massachusetts General Hospital and Division of AIDS, Harvard Medical School, Boston, MA
- Karina Yusim
- Theoretical Biophysics, Los Alamos National Laboratory, Los Alamos, NM
- Todd J. Suscovich
- Partners AIDS Research Center, Massachusetts General Hospital and Division of AIDS, Harvard Medical School, Boston, MA
- John Sidney
- La Jolla Institute of Allergy and Immunology, Redmond, WA
- Peter Hraber
- Theoretical Biophysics, Los Alamos National Laboratory, Los Alamos, NM
- Hannah S. Hewitt
- Partners AIDS Research Center, Massachusetts General Hospital and Division of AIDS, Harvard Medical School, Boston, MA
- Caitlyn H. Linde
- Partners AIDS Research Center, Massachusetts General Hospital and Division of AIDS, Harvard Medical School, Boston, MA
- Daniel G. Kavanagh
- Partners AIDS Research Center, Massachusetts General Hospital and Division of AIDS, Harvard Medical School, Boston, MA
- Tonia Woodberry
- Partners AIDS Research Center, Massachusetts General Hospital and Division of AIDS, Harvard Medical School, Boston, MA
- Leah M. Henry
- Partners AIDS Research Center, Massachusetts General Hospital and Division of AIDS, Harvard Medical School, Boston, MA
- Kellie Faircloth
- Partners AIDS Research Center, Massachusetts General Hospital and Division of AIDS, Harvard Medical School, Boston, MA
- Kaori Sango
- Partners AIDS Research Center, Massachusetts General Hospital and Division of AIDS, Harvard Medical School, Boston, MA
- Nancy V. Brown
- Partners AIDS Research Center, Massachusetts General Hospital and Division of AIDS, Harvard Medical School, Boston, MA
- Eunice Pae
- Fenway Community Health Center, Boston, MA
- Florian Bihl
- Partners AIDS Research Center, Massachusetts General Hospital and Division of AIDS, Harvard Medical School, Boston, MA
- Ashok Khatri
- Endocrine Unit, Massachusetts General Hospital and Division of AIDS, Harvard Medical School, Boston, MA
- Mina John
- Centre for Clinical Immunology and Biomedical Statistics, Royal Perth Hospital and Murdoch University, Perth, Australia
- Simon Mallal
- Centre for Clinical Immunology and Biomedical Statistics, Royal Perth Hospital and Murdoch University, Perth, Australia
- Bruce D. Walker
- Partners AIDS Research Center, Massachusetts General Hospital and Division of AIDS, Harvard Medical School, Boston, MA
- Howard Hughes Medical Institute, Massachusetts General Hospital and Division of AIDS, Harvard Medical School, Boston, MA
- Bette T. Korber
- Theoretical Biophysics, Los Alamos National Laboratory, Los Alamos, NM
- Santa Fe Institute, Santa Fe, NM, USA
- Christian Brander
- Partners AIDS Research Center, Massachusetts General Hospital and Division of AIDS, Harvard Medical School, Boston, MA
|
38
|
Abstract
We present a model for predicting HLA class I restricted CTL epitopes. In contrast to almost all other work in this area, we train a single model on epitopes from all HLA alleles and supertypes, yet retain the ability to make epitope predictions for specific HLA alleles. We are therefore able to leverage data across all HLA alleles and/or their supertypes, automatically learning what information should be shared and also how to combine allele-specific, supertype-specific, and global information in a principled way. We show that this leveraging can improve prediction of epitopes having HLA alleles with known supertypes, and dramatically increases our ability to predict epitopes having alleles which do not fall into any of the known supertypes. Our model, which is based on logistic regression, is simple to implement and understand, is solved by finding a single global maximum, and is more accurate (to our knowledge) than any other model.
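One way to picture the sharing of information described in this abstract is a feature encoding in which every peptide feature appears in a global copy, a supertype-specific copy, and an allele-specific copy, so that a single logistic regression can learn how much weight each level deserves. The sketch below is an assumption for illustration only, not the authors' published encoding: the function names, the per-position residue features, and the example peptide, allele, and supertype labels are all hypothetical.

```python
import math
from typing import Dict, Optional


def hierarchical_features(peptide: str, allele: str,
                          supertype: Optional[str] = None) -> Dict[str, float]:
    """Build a sparse feature dict with global, supertype, and allele copies.

    Hypothetical encoding: one indicator per (position, residue), replicated
    at each level of the allele hierarchy.
    """
    feats: Dict[str, float] = {}
    for pos, residue in enumerate(peptide):
        base = f"pos{pos}={residue}"
        feats[f"global|{base}"] = 1.0
        if supertype is not None:  # alleles without a known supertype skip this level
            feats[f"supertype:{supertype}|{base}"] = 1.0
        feats[f"allele:{allele}|{base}"] = 1.0
    return feats


def epitope_probability(features: Dict[str, float],
                        weights: Dict[str, float], bias: float = 0.0) -> float:
    """Logistic regression score: sigmoid of the weighted feature sum."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative 9-mer with an allele and supertype label (assumed inputs).
feats = hierarchical_features("SLYNTVATL", "A*0201", "A2")
```

With this layout, an allele that has no known supertype still receives the global copy of every feature, which is one plausible reading of how predictions for such alleles could benefit from pooled data.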
|
39
|
Abstract
MOTIVATION There is a pressing need for improved proteomic screening methods allowing for earlier diagnosis of disease, systematic monitoring of physiological responses and the uncovering of fundamental mechanisms of drug action. The combined platform of LC-MS (Liquid-Chromatography-Mass-Spectrometry) has shown promise in moving toward a solution in these areas. In this paper we present a technique for discovering differences in protein signal between two classes of samples of LC-MS serum proteomic data without use of tandem mass spectrometry, gels or labeling. This method works on data from a lower-precision MS instrument, the type routinely used by and available to the community at large today. We test our technique on a controlled (spike-in) but realistic (serum biomarker discovery) experiment which is therefore verifiable. We also develop a new method for helping to assess the difficulty of a given spike-in problem. Lastly, we show that the problem of class prediction, sometimes mistaken as a solution to biomarker discovery, is actually a much simpler problem. RESULTS Using precision-recall curves with experimentally extracted ground truth, we show that (1) our technique has good performance using seven replicates from each class, (2) performance degrades with decreasing number of replicates, (3) the signal that we are teasing out is not trivially available (i.e. the differences are not so large that the task is easy). Lastly, we easily obtain perfect classification results for data in which the problem of extracting differences does not produce absolutely perfect results. This emphasizes the different nature of the two problems and also their relative difficulties. AVAILABILITY Our data are publicly available as a benchmark for further studies of this nature at http://www.cs.toronto.edu/~jenn/LCMS
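The precision-recall evaluation mentioned in the RESULTS can be sketched as follows, assuming each candidate difference has a score and a ground-truth flag (for example, whether it corresponds to a spiked-in protein). The function name and the toy scores are hypothetical, and this is a generic construction rather than the authors' evaluation code.

```python
from typing import List, Sequence, Tuple


def precision_recall_points(scores: Sequence[float],
                            is_true: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (precision, recall) after each cutoff in the ranked candidate list."""
    ranked = sorted(zip(scores, is_true), key=lambda pair: pair[0], reverse=True)
    total_true = sum(1 for _, truth in ranked if truth)
    if total_true == 0:
        raise ValueError("need at least one true difference in the ground truth")
    points: List[Tuple[float, float]] = []
    hits = 0
    for k, (_, truth) in enumerate(ranked, start=1):
        hits += 1 if truth else 0
        points.append((hits / k, hits / total_true))
    return points


# Toy example: four candidates, three of which are true differences.
points = precision_recall_points([0.9, 0.8, 0.6, 0.3], [True, False, True, True])
```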
Affiliation(s)
- Jennifer Listgarten
- Department of Computer Science, University of Toronto, Toronto, Ontario M5S 3G4, Canada.
|
40
|
Dicken BJ, Graham K, Hamilton SM, Andrews S, Lai R, Listgarten J, Jhangri GS, Saunders LD, Damaraju S, Cass C. Lymphovascular invasion is associated with poor survival in gastric cancer: an application of gene-expression and tissue array techniques. Ann Surg 2006; 243:64-73. [PMID: 16371738 PMCID: PMC1449982 DOI: 10.1097/01.sla.0000194087.96582.3e] [Citation(s) in RCA: 100] [Impact Index Per Article: 5.6] [Indexed: 12/18/2022]
Abstract
OBJECTIVES To examine a population-based cohort for the association between clinicopathologic predictors of survival and immunohistochemical markers (IHC), and to assess changes in gene expression that are associated with lymphovascular invasion (LVI). SUMMARY BACKGROUND DATA LVI has been associated with poor survival and aggressive tumor behavior. The molecular changes responsible for the behavior of gastric cancer have yet to be determined. Characterization of IHC markers and gene expression profiles may identify molecular alterations governing tumor behavior. METHODS Clinicopathologic and survival data of 114 patients were reviewed. Archival specimens were used to construct a multitumor tissue array that was subjected to IHC of selected protein targets. Correlation of IHC with tumor thickness (T status), LVI and prognosis was studied. Microarray analysis of fresh gastric cancer tissue was conducted to examine the gene expression profile with respect to LVI. RESULTS In a multivariate analysis, nodal status (N), metastasis (M), and LVI were independent predictors of survival. LVI was associated with a 5-year survival of 13.9% versus 55.9% in patients in whom it was absent. LVI correlated with advancing T status (P = 0.001) and N status (P < 0.001). IHC staining of cyclooxygenase-2 (COX-2) correlated with T status, tumor grade, lymph node positivity, and IHC staining of matrix metalloproteinase-2 (MMP-2) and matrix metalloproteinase-9 (MMP-9). Microarray analyses suggested differential expression of oligophrenin-1 (OPHN1) and ribophorin-II (RPNII) with respect to LVI. CONCLUSION LVI was an independent predictor of survival in gastric cancer. Expression of COX-2 may facilitate tumor invasion through MMP-2 and MMP-9 activation. OPHN1 and RPN II appeared to be differentially expressed in gastric cancers exhibiting LVI. The reported function of OPHN1 and RPN II makes these gene products promising candidates for future studies involving LVI in gastric cancer.
Affiliation(s)
- Bryan J Dicken
- Department of Surgery, University of Alberta and Cross Cancer Institute, Edmonton, Alberta, Canada
|
41
|
Abstract
There is a pressing need for radically improved proteomic screening methods that allow for earlier diagnosis of disease, for systematic monitoring of physiological responses and for uncovering the fundamental mechanisms of drug action. Recent developments in proteomic technology offer tremendous, yet untapped, potential to yield novel biomarkers that are translatable to routine clinical use. Despite the significant conceptual promise of comparative proteomic profiling as a research platform for biomarker discovery, however, major hurdles remain for practical and clinical implementation. In particular, there is growing recognition that rigorous experimental design principles are urgently required to validate conclusively the unproven methodologies currently being touted. Debate and confusion persist about where the burden of proof lies: statistically, biologically or clinically? Moreover, there is no consensus about what constitutes a meaningful benchmark. An important question is how to achieve a scientifically rigorous, and therefore convincing, proof-of-concept that can be accepted by the field. Key analytical challenges related to these issues that must be addressed by the burgeoning biomarker community are discussed here.
Affiliation(s)
- Jennifer Listgarten
- Department of Computer Science, University of Toronto, Toronto, Ontario, M5S 3G4, Canada
|
42
|
Listgarten J, Emili A. Statistical and computational methods for comparative proteomic profiling using liquid chromatography-tandem mass spectrometry. Mol Cell Proteomics 2005; 4:419-34. [PMID: 15741312 DOI: 10.1074/mcp.r500005-mcp200] [Citation(s) in RCA: 229] [Impact Index Per Article: 12.1] [Indexed: 01/08/2023] Open
Abstract
The combined method of LC-MS/MS is increasingly being used to explore differences in the proteomic composition of complex biological systems. The reliability and utility of such comparative protein expression profiling studies is critically dependent on an accurate and rigorous assessment of quantitative changes in the relative abundance of the myriad of proteins typically present in a biological sample such as blood or tissue. In this review, we provide an overview of key statistical and computational issues relevant to bottom-up shotgun global proteomic analysis, with an emphasis on methods that can be applied to improve the dependability of biological inferences drawn from large proteomic datasets. Focusing on a start-to-finish approach, we address the following topics: 1) low-level data processing steps, such as formation of a data matrix, filtering, and baseline subtraction to minimize noise; 2) mid-level processing steps, such as data normalization, alignment in time, peak detection, peak quantification, peak matching, and error models, to facilitate profile comparisons; and 3) high-level processing steps, such as sample classification and biomarker discovery, and related topics such as significance testing, multiple testing, and choice of feature space. We report on approaches that have recently been developed for these steps, discussing their merits and limitations, and propose areas deserving of further research.
Affiliation(s)
- Jennifer Listgarten
- Department of Computer Science, University of Toronto, Toronto, Ontario M5S 3G4, Canada
|
43
|
Listgarten J, Damaraju S, Poulin B, Cook L, Dufour J, Driga A, Mackey J, Wishart D, Greiner R, Zanke B. Predictive models for breast cancer susceptibility from multiple single nucleotide polymorphisms. Clin Cancer Res 2004; 10:2725-37. [PMID: 15102677 DOI: 10.1158/1078-0432.ccr-1115-03] [Citation(s) in RCA: 137] [Impact Index Per Article: 6.9] [Indexed: 11/16/2022]
Abstract
Hereditary predisposition and causative environmental exposures have long been recognized in human malignancies. In most instances, cancer cases occur sporadically, suggesting that environmental influences are critical in determining cancer risk. To test the influence of genetic polymorphisms on breast cancer risk, we have measured 98 single nucleotide polymorphisms (SNPs) distributed over 45 genes of potential relevance to breast cancer etiology in 174 patients and have compared these with matched normal controls. Using machine learning techniques such as support vector machines (SVMs), decision trees, and naïve Bayes, we identified a subset of three SNPs as key discriminators between breast cancer and controls. The SVMs performed maximally among predictive models, achieving 69% predictive power in distinguishing between the two groups, compared with a 50% baseline predictive power obtained from the data after repeated random permutation of class labels (individuals with cancer or controls). However, the simpler naïve Bayes model as well as the decision tree model performed quite similarly to the SVM. The three SNP sites most useful in this model were (a) the +4536T/C site of the aldosterone synthase gene CYP11B2 at amino acid residue 386 Val/Ala (T/C) (rs4541); (b) the +4328C/G site of the aryl hydrocarbon hydroxylase CYP1B1 at amino acid residue 293 Leu/Val (C/G) (rs5292); and (c) the +4449C/T site of the transcription factor BCL6 at amino acid 387 Asp/Asp (rs1056932). No single SNP site on its own could achieve more than 60% in predictive accuracy. We have shown that multiple SNP sites from different genes over distant parts of the genome are better at identifying breast cancer patients than any one SNP alone. As high-throughput technology for SNPs improves and as more SNPs are identified, it is likely that much higher predictive accuracy will be achieved and a useful clinical tool developed.
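The permutation baseline described in this abstract (re-scoring a classifier after repeatedly shuffling the class labels) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the leave-one-out nearest-centroid classifier is a toy stand-in for the SVM, decision tree, and naïve Bayes models, and the function names and the one-dimensional data are hypothetical.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence


def loo_centroid_accuracy(x: Sequence[float], y: Sequence[int]) -> float:
    """Leave-one-out accuracy of a 1-D nearest-centroid classifier (toy model)."""
    correct = 0
    for i in range(len(x)):
        class0 = [xj for j, (xj, yj) in enumerate(zip(x, y)) if j != i and yj == 0]
        class1 = [xj for j, (xj, yj) in enumerate(zip(x, y)) if j != i and yj == 1]
        if not class0 or not class1:
            continue  # no centroid for one class: count the held-out sample as wrong
        pred = 0 if abs(x[i] - mean(class0)) <= abs(x[i] - mean(class1)) else 1
        correct += int(pred == y[i])
    return correct / len(x)


def permutation_baseline(score_fn: Callable[[Sequence[float], Sequence[int]], float],
                         x: Sequence[float], y: Sequence[int],
                         n_perm: int = 200, seed: int = 0) -> List[float]:
    """Scores obtained after shuffling labels, breaking any feature-label link."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_perm):
        shuffled = list(y)
        rng.shuffle(shuffled)
        scores.append(score_fn(x, shuffled))
    return scores


# Toy data: one informative feature, five controls (0) and five cases (1).
x = [0.1, 0.2, 0.3, 0.4, 0.5, 1.5, 1.6, 1.7, 1.8, 1.9]
y = [0] * 5 + [1] * 5
observed = loo_centroid_accuracy(x, y)
baseline = permutation_baseline(loo_centroid_accuracy, x, y)
```

Comparing `observed` with the distribution in `baseline` shows how far real performance sits above what the same pipeline achieves when labels carry no information, which is the role the 50% figure plays in the abstract.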
Affiliation(s)
- Jennifer Listgarten
- Cross Cancer Institute of the Alberta Cancer Board, Edmonton, Alberta, Canada
|
44
|
Listgarten J, Graham K, Damaraju S, Cass C, Mackey J, Zanke B. Clinically validated benchmarking of normalisation techniques for two-colour oligonucleotide spotted microarray slides. Appl Bioinformatics 2003; 2:219-28. [PMID: 15130793] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/29/2023]
Abstract
Acquisition of microarray data is prone to systematic errors. A correction, called normalisation, must be applied to the data before further analysis is performed. With many normalisation techniques published and in use, the best way of executing this correction remains an open question. In this study, a variety of single-slide normalisation techniques, and different parameter settings for these techniques, were compared over many replicated microarray experiments. Different normalisation techniques were assessed through the distribution of the standard deviation of replicates from one biological sample across different slides. It is shown that local normalisation outperformed global normalisation, and intensity-based 'LOWESS' outperformed trimmed mean and median normalisation techniques. Overall, the top performing normalisation technique was a print-tip-based LOWESS with zero robust iterations. Lastly, we validated this evaluation methodology by examining the ability to predict oestrogen receptor-positive and -negative breast cancer samples with data that had been normalised using different techniques.
MESH Headings
- Algorithms
- Benchmarking/methods
- Biomarkers, Tumor/genetics
- Breast Neoplasms/genetics
- Calibration/standards
- Gene Expression Profiling/instrumentation
- Gene Expression Profiling/methods
- Gene Expression Profiling/standards
- Genetic Testing/methods
- Humans
- Microscopy, Fluorescence, Multiphoton/instrumentation
- Microscopy, Fluorescence, Multiphoton/methods
- Microscopy, Fluorescence, Multiphoton/standards
- Oligonucleotide Array Sequence Analysis/instrumentation
- Oligonucleotide Array Sequence Analysis/methods
- Oligonucleotide Array Sequence Analysis/standards
- Quality Control
- Receptors, Estrogen/genetics
- Reference Standards
- Reproducibility of Results
- Sensitivity and Specificity
- Sequence Analysis, DNA/methods
- Sequence Analysis, DNA/standards
- Spectrometry, Fluorescence/instrumentation
- Spectrometry, Fluorescence/methods
- Spectrometry, Fluorescence/standards
Affiliation(s)
- Jennifer Listgarten
- PolyomX Program, Cross Cancer Institute, University of Alberta, Alberta Cancer Board, Edmonton, AB, Canada.
|