1
|
Walker CR, Li X, Chakravarthy M, Lounsbery-Scaife W, Choi YA, Singh R, Gürsoy G. Private information leakage from single-cell count matrices. Cell 2024:S0092-8674(24)01030-4. [PMID: 39362221 DOI: 10.1016/j.cell.2024.09.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2024] [Revised: 08/11/2024] [Accepted: 09/05/2024] [Indexed: 10/05/2024]
Abstract
The increase in publicly available human single-cell datasets, encompassing millions of cells from many donors, has significantly enhanced our understanding of complex biological processes. However, the accessibility of these datasets raises significant privacy concerns. Due to the inherent noise in single-cell measurements and the scarcity of population-scale single-cell datasets, recent private information quantification studies have focused on bulk gene expression data sharing. To address this gap, we demonstrate that individuals in single-cell gene expression datasets are vulnerable to linking attacks, where attackers can infer their sensitive phenotypic information using publicly available tissue or cell-type-specific expression quantitative trait loci (eQTLs) information. We further develop a method for genotype prediction and genotype-phenotype linking that remains effective without relying on eQTL information. We show that variants from one study can be exploited to uncover private information about individuals in another study.
Collapse
Affiliation(s)
- Conor R Walker
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA; New York Genome Center, New York, NY 10013, USA
| | - Xiaoting Li
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA; New York Genome Center, New York, NY 10013, USA
| | - Manav Chakravarthy
- Department of Computer Science, Brown University, Providence, RI 02912, USA
| | - William Lounsbery-Scaife
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA; New York Genome Center, New York, NY 10013, USA
| | - Yoolim A Choi
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA; New York Genome Center, New York, NY 10013, USA
| | - Ritambhara Singh
- Department of Computer Science, Brown University, Providence, RI 02912, USA
| | - Gamze Gürsoy
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA; New York Genome Center, New York, NY 10013, USA; Department of Computer Science, Columbia University, New York, NY 10032, USA.
| |
Collapse
|
2
|
Thomas M, Mackes N, Preuss-Dodhy A, Wieland T, Bundschus M. Assessing Privacy Vulnerabilities in Genetic Data Sets: Scoping Review. JMIR BIOINFORMATICS AND BIOTECHNOLOGY 2024; 5:e54332. [PMID: 38935957 PMCID: PMC11165293 DOI: 10.2196/54332] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/06/2023] [Revised: 03/26/2024] [Accepted: 03/29/2024] [Indexed: 06/29/2024]
Abstract
BACKGROUND Genetic data are widely considered inherently identifiable. However, genetic data sets come in many shapes and sizes, and the feasibility of privacy attacks depends on their specific content. Assessing the reidentification risk of genetic data is complex, yet there is a lack of guidelines or recommendations that support data processors in performing such an evaluation. OBJECTIVE This study aims to gain a comprehensive understanding of the privacy vulnerabilities of genetic data and create a summary that can guide data processors in assessing the privacy risk of genetic data sets. METHODS We conducted a 2-step search, in which we first identified 21 reviews published between 2017 and 2023 on the topic of genomic privacy and then analyzed all references cited in the reviews (n=1645) to identify 42 unique original research studies that demonstrate a privacy attack on genetic data. We then evaluated the type and components of genetic data exploited for these attacks as well as the effort and resources needed for their implementation and their probability of success. RESULTS From our literature review, we derived 9 nonmutually exclusive features of genetic data that are both inherent to any genetic data set and informative about privacy risk: biological modality, experimental assay, data format or level of processing, germline versus somatic variation content, content of single nucleotide polymorphisms, short tandem repeats, aggregated sample measures, structural variants, and rare single nucleotide variants. CONCLUSIONS On the basis of our literature review, the evaluation of these 9 features covers the great majority of privacy-critical aspects of genetic data and thus provides a foundation and guidance for assessing genetic data risk.
Collapse
|
3
|
Emani PS, Geradi MN, Gürsoy G, Grasty MR, Miranker A, Gerstein MB. Assessing and mitigating privacy risks of sparse, noisy genotypes by local alignment to haplotype databases. Genome Res 2023; 33:2156-2173. [PMID: 38097386 PMCID: PMC10760520 DOI: 10.1101/gr.278322.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2023] [Accepted: 11/18/2023] [Indexed: 01/04/2024]
Abstract
Single nucleotide polymorphisms (SNPs) from omics data create a reidentification risk for individuals and their relatives. Although the ability of thousands of SNPs (especially rare ones) to identify individuals has been repeatedly shown, the availability of small sets of noisy genotypes, from environmental DNA samples or functional genomics data, motivated us to quantify their informativeness. We present a computational tool suite, termed Privacy Leakage by Inference across Genotypic HMM Trajectories (PLIGHT), using population-genetics-based hidden Markov models (HMMs) of recombination and mutation to find piecewise alignment of small, noisy SNP sets to reference haplotype databases. We explore cases in which query individuals are either known to be in the database, or not, and consider several genotype queries, including those from environmental sample swabs from known individuals and from simulated "mosaics" (two-individual composites). Using PLIGHT on a database with ∼5000 haplotypes, we find for common, noise-free SNPs that only ten are sufficient to identify individuals, ∼20 can identify both components in two-individual mosaics, and 20-30 can identify first-order relatives. Using noisy environmental-sample-derived SNPs, PLIGHT identifies individuals in a database using ∼30 SNPs. Even when the individuals are not in the database, local genotype matches allow for some phenotypic information leakage based on coarse-grained SNP imputation. Finally, by quantifying privacy leakage from sparse SNP sets, PLIGHT helps determine the value of selectively sanitizing released SNPs without explicit assumptions about population membership or allele frequency. To make this practical, we provide a sanitization tool to remove the most identifying SNPs from genomic data.
Collapse
Affiliation(s)
- Prashant S Emani
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
| | - Maya N Geradi
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
| | - Gamze Gürsoy
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
| | - Monica R Grasty
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
| | - Andrew Miranker
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
| | - Mark B Gerstein
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA;
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
- Department of Computer Science, Yale University, New Haven, Connecticut 06520, USA
- Department of Statistics and Data Science, Yale University, New Haven, Connecticut 06520, USA
| |
Collapse
|
4
|
Sadhuka S, Fridman D, Berger B, Cho H. Assessing transcriptomic reidentification risks using discriminative sequence models. Genome Res 2023; 33:1101-1112. [PMID: 37541758 PMCID: PMC10538488 DOI: 10.1101/gr.277699.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2023] [Accepted: 04/19/2023] [Indexed: 08/06/2023]
Abstract
Gene expression data provide molecular insights into the functional impact of genetic variation, for example, through expression quantitative trait loci (eQTLs). With an improving understanding of the association between genotypes and gene expression comes a greater concern that gene expression profiles could be matched to genotype profiles of the same individuals in another data set, known as a linking attack. Prior works show such a risk could analyze only a fraction of eQTLs that is independent owing to restrictive model assumptions, leaving the full extent of this risk incompletely understood. To address this challenge, we introduce the discriminative sequence model (DSM), a novel probabilistic framework for predicting a sequence of genotypes based on gene expression data. By modeling the joint distribution over all known eQTLs in a genomic region, DSM improves the power of linking attacks with necessary calibration for linkage disequilibrium and redundant predictive signals. We show greater linking accuracy of DSM compared with existing approaches across a range of attack scenarios and data sets including up to 22,288 individuals, suggesting that DSM helps uncover a substantial additional risk overlooked by previous studies. Our work provides a unified framework for assessing the privacy risks of sharing diverse omics data sets beyond transcriptomics.
Collapse
Affiliation(s)
- Shuvom Sadhuka
- Computer Science and AI Lab, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
| | - Daniel Fridman
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02115, USA
| | - Bonnie Berger
- Computer Science and AI Lab, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
| | - Hyunghoon Cho
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA;
| |
Collapse
|
5
|
Wang S, Kim M, Li W, Jiang X, Chen H, Harmanci A. Privacy-aware estimation of relatedness in admixed populations. Brief Bioinform 2022; 23:bbac473. [PMID: 36384083 PMCID: PMC10144692 DOI: 10.1093/bib/bbac473] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2022] [Revised: 09/07/2022] [Accepted: 10/02/2022] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND Estimation of genetic relatedness, or kinship, is used occasionally for recreational purposes and in forensic applications. While numerous methods were developed to estimate kinship, they suffer from high computational requirements and often make an untenable assumption of homogeneous population ancestry of the samples. Moreover, genetic privacy is generally overlooked in the usage of kinship estimation methods. There can be ethical concerns about finding unknown familial relationships in third-party databases. Similar ethical concerns may arise while estimating and reporting sensitive population-level statistics such as inbreeding coefficients for the concerns around marginalization and stigmatization. RESULTS Here, we present SIGFRIED, which makes use of existing reference panels with a projection-based approach that simplifies kinship estimation in the admixed populations. We use simulated and real datasets to demonstrate the accuracy and efficiency of kinship estimation. We present a secure federated kinship estimation framework and implement a secure kinship estimator using homomorphic encryption-based primitives for computing relatedness between samples in two different sites while genotype data are kept confidential. Source code and documentation for our methods can be found at https://doi.org/10.5281/zenodo.7053352. CONCLUSIONS Analysis of relatedness is fundamentally important for identifying relatives, in association studies, and for estimation of population-level estimates of inbreeding. As the awareness of individual and group genomic privacy is growing, privacy-preserving methods for the estimation of relatedness are needed. Presented methods alleviate the ethical and privacy concerns in the analysis of relatedness in admixed, historically isolated and underrepresented populations. SHORT ABSTRACT Genetic relatedness is a central quantity used for finding relatives in databases, correcting biases in genome wide association studies and for estimating population-level statistics. Methods for estimating genetic relatedness have high computational requirements, and occasionally do not consider individuals from admixed ancestries. Furthermore, the ethical concerns around using genetic data and calculating relatedness are not considered. We present a projection-based approach that can efficiently and accurately estimate kinship. We implement our method using encryption-based techniques that provide provable security guarantees to protect genetic data while kinship statistics are computed among multiple sites.
Collapse
Affiliation(s)
- Su Wang
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Miran Kim
- Department of Mathematics, Hanyang University, Seoul, 04763. Republic of Korea
| | - Wentao Li
- Center for Secure Artificial intelligence For hEalthcare (SAFE), School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA
| | - Xiaoqian Jiang
- Center for Secure Artificial intelligence For hEalthcare (SAFE), School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA
| | - Han Chen
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Human Genetics Center, Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Arif Harmanci
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| |
Collapse
|
6
|
Abstract
Genomics data are important for advancing biomedical research, improving clinical care, and informing other disciplines such as forensics and genealogy. However, privacy concerns arise when genomic data are shared. In particular, the identifying nature of genetic information, its direct relationship to health status, and the potential financial harm and stigmatization posed to individuals and their blood relatives call for a survey of the privacy issues related to sharing genetic and related data and potential solutions to overcome these issues. In this work, we provide an overview of the importance of genomic privacy, the information gleaned from genomics data, the sources of potential private information leakages in genomics, and ways to preserve privacy while utilizing the genetic information in research. We discuss the relationship between trust in the scientific community and protecting privacy, illuminating a future roadmap for data sharing and study participation.
Collapse
Affiliation(s)
- Gamze Gürsoy
- Department of Biomedical Informatics, Columbia University, New York, NY, USA; .,New York Genome Center, New York, NY, USA
| |
Collapse
|
7
|
Functional genomics data: privacy risk assessment and technological mitigation. Nat Rev Genet 2022; 23:245-258. [PMID: 34759381 DOI: 10.1038/s41576-021-00428-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 10/18/2021] [Indexed: 12/15/2022]
Abstract
The generation of functional genomics data by next-generation sequencing has increased greatly in the past decade. Broad sharing of these data is essential for research advancement but poses notable privacy challenges, some of which are analogous to those that occur when sharing genetic variant data. However, there are also unique privacy challenges that arise from cryptic information leakage during the processing and summarization of functional genomics data from raw reads to derived quantities, such as gene expression values. Here, we review these challenges and present potential solutions for mitigating privacy risks while allowing broad data dissemination and analysis.
Collapse
|