1. Gadotti A, Rocher L, Houssiau F, Creţu AM, de Montjoye YA. Anonymization: The imperfect science of using data while preserving privacy. Science Advances 2024; 10:eadn7053. [PMID: 39018389; PMCID: PMC466941; DOI: 10.1126/sciadv.adn7053]
Abstract
Information about us, our actions, and our preferences is created at scale through surveys or scientific studies or as a result of our interaction with digital devices such as smartphones and fitness trackers. The ability to safely share and analyze such data is key for scientific and societal progress. Anonymization is considered by scientists and policy-makers as one of the main ways to share data while minimizing privacy risks. In this review, we offer a pragmatic perspective on the modern literature on privacy attacks and anonymization techniques. We discuss traditional de-identification techniques and their strong limitations in the age of big data. We then turn our attention to modern approaches to share anonymous aggregate data, such as data query systems, synthetic data, and differential privacy. We find that, although no perfect solution exists, applying modern techniques while auditing their guarantees against attacks is the best approach to safely use and share data today.
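The review's warning that even aggregate statistics can leak individual information is easy to make concrete with a classic difference attack, in which two permitted counting queries differ in exactly one person. A minimal Python sketch (the records, field names, and attacker knowledge are hypothetical illustrations, not taken from the paper):

```python
# Minimal sketch of a "difference attack" on a naive aggregate query
# interface: two allowed COUNT queries isolate one person's sensitive value.
# The records and field names below are hypothetical illustrations.
records = [
    {"name": "Alice", "age": 34, "zip": "SW7", "hiv_positive": True},
    {"name": "Bob",   "age": 41, "zip": "SW7", "hiv_positive": False},
    {"name": "Carol", "age": 29, "zip": "OX1", "hiv_positive": False},
]

def count(predicate):
    """An 'anonymous' aggregate interface: returns only a count."""
    return sum(1 for r in records if predicate(r))

# The attacker knows Alice is the only 34-year-old in zip SW7.
with_alice    = count(lambda r: r["zip"] == "SW7" and r["hiv_positive"])
without_alice = count(lambda r: r["zip"] == "SW7" and r["age"] != 34
                      and r["hiv_positive"])

# The difference of two aggregate answers is Alice's individual value.
print("Alice is HIV-positive:", bool(with_alice - without_alice))
```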
Affiliations
- Andrea Gadotti: Imperial College London, Exhibition Road, London SW7 2AZ, UK; University of Oxford, Wellington Square, Oxford OX1 2JD, UK
- Luc Rocher: Imperial College London, Exhibition Road, London SW7 2AZ, UK; University of Oxford, Wellington Square, Oxford OX1 2JD, UK
- Florimond Houssiau: Imperial College London, Exhibition Road, London SW7 2AZ, UK; Alan Turing Institute, 96 Euston Road, London NW1 2DB, UK
- Ana-Maria Creţu: Imperial College London, Exhibition Road, London SW7 2AZ, UK; EPFL, CH-1015 Lausanne, Switzerland
2. Asif H, Vaidya J, Papakonstantinou PA. Identifying Anomalies while Preserving Privacy. IEEE Transactions on Knowledge and Data Engineering 2023; 35:12264-12281. [PMID: 37974954; PMCID: PMC10651053; DOI: 10.1109/tkde.2021.3129633]
Abstract
Identifying anomalies in data is vital in many domains, including medicine, finance, and national security. However, privacy concerns pose a significant roadblock to carrying out such an analysis. Since existing privacy definitions do not allow good accuracy when doing outlier analysis, the notion of sensitive privacy has recently been proposed to deal with this problem. Sensitive privacy makes it possible to analyze data for anomalies with practically meaningful accuracy while providing a strong guarantee similar to differential privacy, which is the prevalent privacy standard today. In this work, we relate sensitive privacy to other important notions of data privacy so that one can port the technical developments and private mechanism constructions from these related concepts to sensitive privacy. Sensitive privacy critically depends on the underlying anomaly model. We develop a novel n-step lookahead mechanism to efficiently answer arbitrary outlier queries, which provably guarantees sensitive privacy if we restrict our attention to a common class of anomaly models. We also provide general constructions that give sensitively private mechanisms for identifying anomalies, and show the conditions under which these constructions are optimal.
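For context, the differential-privacy baseline that sensitive privacy is compared against can be sketched in a few lines. This is the standard Laplace mechanism applied to an outlier count, not the paper's n-step lookahead mechanism; the data and threshold are hypothetical:

```python
import numpy as np

# Not the paper's sensitive-privacy mechanism, only the differential-privacy
# baseline it improves on: answering "how many points are outliers?" with
# Laplace noise. Data, threshold, and epsilon are hypothetical.
rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=1000)

def dp_outlier_count(x, threshold=3.0, epsilon=0.5):
    # Adding or removing one record changes the count by at most 1,
    # so the L1 sensitivity is 1 and the Laplace scale is 1/epsilon.
    true_count = int(np.sum(np.abs(x) > threshold))
    return true_count + rng.laplace(scale=1.0 / epsilon)

print(dp_outlier_count(data))
```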
3. Jarmin RS, Abowd JM, Ashmead R, Cumings-Menon R, Goldschlag N, Hawes MB, Keller SA, Kifer D, Leclerc P, Reiter JP, Rodríguez RA, Schmutte I, Velkoff VA, Zhuravlev P. An in-depth examination of requirements for disclosure risk assessment. Proc Natl Acad Sci U S A 2023; 120:e2220558120. [PMID: 37831744; PMCID: PMC10614951; DOI: 10.1073/pnas.2220558120]
Abstract
The use of formal privacy to protect the confidentiality of responses in the 2020 Decennial Census of Population and Housing has triggered renewed interest and debate over how to measure the disclosure risks and societal benefits of the published data products. We argue that any proposal for quantifying disclosure risk should be based on prespecified, objective criteria. We illustrate this approach to evaluate the absolute disclosure risk framework, the counterfactual framework underlying differential privacy, and prior-to-posterior comparisons. We conclude that satisfying all the desiderata is impossible, but counterfactual comparisons satisfy the most while absolute disclosure risk satisfies the fewest. Furthermore, we explain that many of the criticisms levied against differential privacy would be levied against any technology that is not equivalent to direct, unrestricted access to confidential data. More research is needed, but in the near term, the counterfactual approach appears best-suited for privacy versus utility analysis.
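The counterfactual comparison the authors evaluate is the one underlying differential privacy: any output should be almost equally likely whether or not a given person's record is in the data. As a reference point, the textbook (ε, δ) form (a standard restatement, not the paper's formalism):

```latex
% Counterfactual comparison underlying (\varepsilon,\delta)-differential
% privacy: D and D' are neighboring datasets differing in one record.
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] + \delta
\quad \text{for all neighboring } D, D' \text{ and all output sets } S.
```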
Affiliations
- Ron S. Jarmin: U.S. Census Bureau, Office of the Deputy Director, Washington, DC 20233
- John M. Abowd: Department of Economics, Cornell University, Ithaca, NY 14853
- Robert Ashmead: U.S. Census Bureau, Office of the Deputy Director, Washington, DC 20233
- Nathan Goldschlag: U.S. Census Bureau, Office of the Deputy Director, Washington, DC 20233
- Michael B. Hawes: U.S. Census Bureau, Office of the Deputy Director, Washington, DC 20233
- Sallie Ann Keller: U.S. Census Bureau, Office of the Deputy Director, Washington, DC 20233; Biocomplexity Institute, University of Virginia, Charlottesville, VA 22904
- Daniel Kifer: U.S. Census Bureau, Office of the Deputy Director, Washington, DC 20233; Department of Computer Science and Engineering, Penn State University, University Park, PA 16802
- Philip Leclerc: U.S. Census Bureau, Office of the Deputy Director, Washington, DC 20233
- Jerome P. Reiter: U.S. Census Bureau, Office of the Deputy Director, Washington, DC 20233; Department of Statistical Science, Duke University, Durham, NC 27708
- Ian Schmutte: Department of Economics, University of Georgia, Athens, GA 30602
- Pavel Zhuravlev: U.S. Census Bureau, Office of the Deputy Director, Washington, DC 20233
4. Tasnim N, Mohammadi J, Sarwate AD, Imtiaz H. Approximating Functions with Approximate Privacy for Applications in Signal Estimation and Learning. Entropy (Basel) 2023; 25:e25050825. [PMID: 37238580; DOI: 10.3390/e25050825]
Abstract
Large corporations, government entities, and institutions such as hospitals and census bureaus routinely collect our personal and sensitive information for providing services. A key technological challenge is designing algorithms for these services that provide useful results while simultaneously maintaining the privacy of the individuals whose data are being shared. Differential privacy (DP) is a cryptographically motivated and mathematically rigorous approach for addressing this challenge. Under DP, a randomized algorithm provides privacy guarantees by approximating the desired functionality, leading to a privacy-utility trade-off. Strong (pure DP) privacy guarantees are often costly in terms of utility. Motivated by the need for a more efficient mechanism with a better privacy-utility trade-off, we propose Gaussian FM, an improvement to the functional mechanism (FM) that offers higher utility at the expense of a weakened (approximate) DP guarantee. We show analytically that the proposed Gaussian FM algorithm can offer noise that is orders of magnitude smaller than that of existing FM algorithms. We further extend our Gaussian FM algorithm to decentralized-data settings by incorporating the CAPE protocol, and propose capeFM. Our method can offer the same level of utility as its centralized counterparts for a range of parameter choices. We show empirically that our proposed algorithms outperform existing state-of-the-art approaches on synthetic and real datasets.
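The utility gap between pure and approximate DP that motivates Gaussian FM can be illustrated with the classical noise calibrations for the generic Laplace and Gaussian mechanisms. This sketch is not the paper's functional-mechanism construction; the dimensions and parameters are hypothetical:

```python
import math

# Classical calibrations (Dwork & Roth): not the paper's Gaussian FM itself,
# but the Laplace-vs-Gaussian trade-off that motivates it.
def laplace_scale(l1_sensitivity, epsilon):
    # Pure epsilon-DP: per-coordinate scale b = Delta_1 / epsilon.
    return l1_sensitivity / epsilon

def gaussian_sigma(l2_sensitivity, epsilon, delta):
    # Approximate (epsilon, delta)-DP, valid for epsilon < 1:
    # sigma = Delta_2 * sqrt(2 ln(1.25/delta)) / epsilon.
    return l2_sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon

# For a d-dimensional query whose coordinates each change by at most s when
# one record changes, Delta_1 = d*s grows linearly while Delta_2 = sqrt(d)*s
# grows as sqrt(d) -- the source of the large savings in high dimensions.
d, s, eps, delta = 10_000, 1.0, 0.5, 1e-6
print("Laplace scale :", laplace_scale(d * s, eps))
print("Gaussian sigma:", gaussian_sigma(math.sqrt(d) * s, eps, delta))
```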
Affiliations
- Naima Tasnim: Department of Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology, Dhaka P.O. Box 1205, Bangladesh
- Anand D Sarwate: Department of Electrical and Computer Engineering, Rutgers, The State University of New Jersey, 94 Brett Road, Piscataway, NJ 08854-8058, USA
- Hafiz Imtiaz: Department of Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology, Dhaka P.O. Box 1205, Bangladesh
5. Comparing approximate and probabilistic differential privacy parameters. Inform Process Lett 2023. [DOI: 10.1016/j.ipl.2023.106380]
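No abstract is indexed for this entry, so only the two notions being compared can be recalled here, in their standard textbook forms (the paper's exact parameter translations are not reproduced):

```latex
% Approximate (\varepsilon,\delta)-DP: for all neighboring D, D' and sets S,
\Pr[M(D) \in S] \le e^{\varepsilon}\,\Pr[M(D') \in S] + \delta .
% Probabilistic (\varepsilon,\delta)-DP: the privacy-loss bound may fail
% only on an output event of probability at most \delta:
\Pr_{o \sim M(D)}\!\left[\,\left|\ln \tfrac{\Pr[M(D)=o]}{\Pr[M(D')=o]}\right| > \varepsilon\right] \le \delta .
% Probabilistic DP implies approximate DP; the converse does not hold.
```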
6. Drechsler J. Differential Privacy for Government Agencies—Are We There Yet? J Am Stat Assoc 2023. [DOI: 10.1080/01621459.2022.2161385]
Affiliations
- Jörg Drechsler: Institute for Employment Research and University of Maryland
7. Imtiaz H, Mohammadi J, Silva R, Baker B, Plis SM, Sarwate AD, Calhoun VD. A Correlated Noise-assisted Decentralized Differentially Private Estimation Protocol, and its application to fMRI Source Separation. IEEE Transactions on Signal Processing 2021; 69:6355-6370. [PMID: 35755147; PMCID: PMC9232162; DOI: 10.1109/tsp.2021.3126546]
Abstract
Blind source separation algorithms such as independent component analysis (ICA) are widely used in the analysis of neuroimaging data. To leverage larger sample sizes, different data holders/sites may wish to collaboratively learn feature representations. However, such datasets are often privacy-sensitive, precluding centralized analyses that pool the data at one site. In this work, we propose a differentially private algorithm for performing ICA in a decentralized data setting. Due to the high dimension and small sample size, conventional approaches to decentralized differentially private algorithms suffer in terms of utility. When centralizing the data is not possible, we investigate the benefit of enabling limited collaboration in the form of generating jointly distributed random noise. We show that such (anti) correlated noise improves the privacy-utility trade-off, and can reach the same level of utility as the corresponding non-private algorithm for certain parameter choices. We validate this benefit using synthetic and real neuroimaging datasets. We conclude that it is possible to achieve meaningful utility while preserving privacy, even in complex signal processing systems.
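The anticorrelated-noise idea at the heart of this protocol can be shown with a toy simulation: each site's release is heavily noised locally, but the jointly generated noise shares sum to zero, so the aggregate is nearly noise-free. This is only a plain illustration, not the actual secure protocol, which generates the shares without any party learning the others'; all values are hypothetical:

```python
import numpy as np

# Toy illustration of the anticorrelated-noise idea behind CAPE-style
# protocols. Sites, statistics, and noise levels are hypothetical.
rng = np.random.default_rng(42)
n_sites, sigma_local, sigma_joint = 10, 1.0, 0.1

site_values = rng.normal(5.0, 1.0, size=n_sites)  # each site's local statistic

# Each site releases its value plus e_s + g_s, where the e_s are jointly
# generated so that they sum exactly to zero across sites.
e = rng.normal(0.0, sigma_local, size=n_sites)
e -= e.mean()                                     # enforce the zero-sum property
g = rng.normal(0.0, sigma_joint, size=n_sites)    # small independent noise
released = site_values + e + g

# Locally each release looks heavily noised; in the aggregate the e_s cancel,
# so only the small independent noise g remains.
print("true mean     :", site_values.mean())
print("aggregate mean:", released.mean())
```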
Affiliations
- Hafiz Imtiaz: Department of Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
- Rogers Silva: Tri-institutional Center for Translational Research in Neuroimaging and Data Science (TReNDS), Georgia State University, Georgia Institute of Technology, and Emory University, 55 Park Place NE, Atlanta, GA 30303
- Bradley Baker: Tri-institutional Center for Translational Research in Neuroimaging and Data Science (TReNDS), Georgia State University, Georgia Institute of Technology, and Emory University, 55 Park Place NE, Atlanta, GA 30303
- Sergey M Plis: Tri-institutional Center for Translational Research in Neuroimaging and Data Science (TReNDS), Georgia State University, Georgia Institute of Technology, and Emory University, 55 Park Place NE, Atlanta, GA 30303
- Anand D Sarwate: Department of Electrical and Computer Engineering, Rutgers University, 94 Brett Road, Piscataway, NJ 08854
- Vince D Calhoun: Tri-institutional Center for Translational Research in Neuroimaging and Data Science (TReNDS), Georgia State University, Georgia Institute of Technology, and Emory University, 55 Park Place NE, Atlanta, GA 30303
8. Wang JT, Lin WY. Privacy-Preserving Anonymity for Periodical Releases of Spontaneous Adverse Drug Event Reporting Data: Algorithm Development and Validation. JMIR Med Inform 2021; 9:e28752. [PMID: 34709197; PMCID: PMC8587328; DOI: 10.2196/28752]
Abstract
Background: Spontaneous reporting systems (SRSs) have been increasingly established to collect adverse drug events for fostering adverse drug reaction (ADR) detection and analysis research. SRS data contain personal information, so their publication requires data anonymization to prevent the disclosure of individuals' private information. We have previously proposed a privacy model called MS(k, θ*)-bounding and the associated MS-Anonymization algorithm to fulfill the anonymization of SRS data. In the real world, SRS data are usually released periodically (eg, the FDA Adverse Event Reporting System [FAERS]) to accommodate newly collected adverse drug events. The availability to an attacker of multiple anonymized releases of SRS data may thwart our single-release method, MS(k, θ*)-bounding.
Objective: We investigate the privacy threat caused by periodical releases of SRS data and propose anonymization methods that prevent the disclosure of personal information while maintaining the utility of the published data.
Methods: We identify potential attacks on periodical releases of SRS data, namely BFL-attacks, mainly caused by follow-up cases. We present a new privacy model called PPMS(k, θ*)-bounding and propose the associated PPMS-Anonymization algorithm along with 2 improvements: PPMS+-Anonymization and PPMS++-Anonymization. Empirical evaluations were performed using 32 selected FAERS quarterly data sets from 2004Q1 to 2011Q4. The performance of the proposed versions of PPMS-Anonymization was compared against MS-Anonymization on several aspects: data distortion, measured by normalized information loss; privacy risk of anonymized data, measured by dangerous identity ratio and dangerous sensitivity ratio; and data utility, measured by the bias of signal counting and strength (proportional reporting ratio).
Results: The best version of PPMS-Anonymization, PPMS++-Anonymization, achieves nearly the same quality as MS-Anonymization in both privacy protection and data utility. Overall, PPMS++-Anonymization ensures zero privacy risk from record and attribute linkage, exhibits 51%-78% and 59%-82% improvements in information loss over PPMS+-Anonymization and PPMS-Anonymization, respectively, and significantly reduces the bias of the ADR signal.
Conclusions: The proposed PPMS(k, θ*)-bounding model and PPMS-Anonymization algorithm are effective in anonymizing SRS data sets in the periodical data publishing scenario, preventing the series of releases from disclosing personal sensitive information through BFL-attacks while maintaining the data utility for ADR signal detection.
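For readers unfamiliar with the building block behind MS(k, θ*)-bounding, the underlying k-anonymity condition (every combination of quasi-identifier values must be shared by at least k records) is easy to state in code. This sketch is not the MS- or PPMS-Anonymization algorithms; the reports and fields are hypothetical:

```python
from collections import Counter

# Illustrates only the k-anonymity building block that MS(k, theta*)-bounding
# extends -- not the paper's MS/PPMS algorithms. Records are hypothetical
# adverse-event reports.
reports = [
    {"age_band": "30-39", "sex": "F", "drug": "drugA", "reaction": "rash"},
    {"age_band": "30-39", "sex": "F", "drug": "drugA", "reaction": "nausea"},
    {"age_band": "40-49", "sex": "M", "drug": "drugB", "reaction": "rash"},
]

def is_k_anonymous(rows, quasi_identifiers, k):
    """Every combination of quasi-identifier values must occur >= k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return all(size >= k for size in groups.values())

print(is_k_anonymous(reports, ["age_band", "sex"], k=2))  # False: one 40-49 male
```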
Affiliations
- Jie-Teng Wang: Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung, Taiwan
- Wen-Yang Lin: Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung, Taiwan
9. Differential Privacy at Risk: Bridging Randomness and Privacy Budget. Proceedings on Privacy Enhancing Technologies 2020. [DOI: 10.2478/popets-2021-0005]
Abstract
The calibration of noise for a privacy-preserving mechanism depends on the sensitivity of the query and the prescribed privacy level. A data steward must make the non-trivial choice of a privacy level that balances the requirements of users and the monetary constraints of the business entity.
Firstly, we analyse the roles of the two sources of randomness involved in the design of a privacy-preserving mechanism: the explicit randomness induced by the noise distribution and the implicit randomness induced by the data-generation distribution. This finer analysis enables us to provide stronger privacy guarantees with quantifiable risks. We therefore propose privacy at risk, a probabilistic calibration of privacy-preserving mechanisms. We provide a composition theorem that leverages privacy at risk, and we instantiate the probabilistic calibration for the Laplace mechanism with analytical results.
Secondly, we propose a cost model that bridges the gap between the privacy level and the compensation budget estimated by a GDPR-compliant business entity. The convexity of the proposed cost model leads to a unique fine-tuning of the privacy level that minimises the compensation budget. We show its effectiveness by illustrating a realistic scenario that avoids overestimation of the compensation budget by using privacy at risk for the Laplace mechanism. We show quantitatively that composition using the cost-optimal privacy at risk provides a stronger privacy guarantee than classical advanced composition. Although the illustration is specific to the chosen cost model, it extends naturally to any convex cost model. We also provide realistic illustrations of how a data steward can use privacy at risk to balance the trade-off between utility and privacy.
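The classical calibration that privacy at risk refines is the Laplace mechanism, whose noise scale is sensitivity divided by ε. A minimal sketch of that baseline (not the privacy-at-risk calibration itself; the query and parameters are hypothetical):

```python
import numpy as np

# The classical calibration that "privacy at risk" refines: Laplace noise
# with scale b = sensitivity / epsilon gives pure epsilon-DP. The query
# answer and parameter values are hypothetical.
rng = np.random.default_rng(7)

def laplace_mechanism(true_answer, sensitivity, epsilon):
    return true_answer + rng.laplace(scale=sensitivity / epsilon)

# A counting query changes by at most 1 when one record changes, so its
# sensitivity is 1; smaller epsilon means stronger privacy and more noise.
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps:>4}:", laplace_mechanism(1000, 1.0, eps))
```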
10. Differentially Private SQL with Bounded User Contribution. Proceedings on Privacy Enhancing Technologies 2020. [DOI: 10.2478/popets-2020-0025]
Abstract
Differential privacy (DP) provides formal guarantees that the output of a database query does not reveal too much information about any individual present in the database. While many differentially private algorithms have been proposed in the scientific literature, there are only a few end-to-end implementations of differentially private query engines. Crucially, existing systems assume that each individual is associated with at most one database record, which is unrealistic in practice. We propose a generic and scalable method to perform differentially private aggregations on databases, even when individuals can each be associated with arbitrarily many rows. We express this method as an operator in relational algebra and implement it in an SQL engine. To validate this system, we test the utility of typical queries on industry benchmarks and verify its correctness with a stochastic test framework we developed. We highlight the promises and pitfalls we encountered when deploying such a system in practice, and we publish its core components as open-source software.
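The core idea, bounding each user's total contribution before adding noise so that the bound becomes the sensitivity, can be sketched outside any SQL engine. This is an illustrative reconstruction, not the paper's relational-algebra operator; the rows and clamp bound are hypothetical:

```python
import numpy as np
from collections import defaultdict

# Sketch of the core idea -- bounding each user's contribution before adding
# noise -- not the paper's relational-algebra operator or SQL engine.
# Rows (user_id, value) and the clamp bound are hypothetical.
rng = np.random.default_rng(1)
rows = [("u1", 3.0), ("u1", 2.0), ("u1", 9.0), ("u2", 4.0), ("u3", 1.0)]

def dp_sum(rows, per_user_bound, epsilon):
    # Cap each user's total (nonnegative) contribution at per_user_bound;
    # one user can then change the sum by at most per_user_bound, which
    # becomes the sensitivity used to scale the Laplace noise.
    per_user = defaultdict(float)
    for user, value in rows:
        per_user[user] += value
    clamped = sum(min(total, per_user_bound) for total in per_user.values())
    return clamped + rng.laplace(scale=per_user_bound / epsilon)

print(dp_sum(rows, per_user_bound=5.0, epsilon=1.0))
```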