1. Wan S, Wang J. A Sequence Obfuscation Method for Protecting Personal Genomic Privacy. Front Genet 2022; 13:876686. [PMID: 35495121 PMCID: PMC9043694 DOI: 10.3389/fgene.2022.876686] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2022] [Accepted: 03/14/2022] [Indexed: 11/23/2022] [Open Access]
Abstract
With the technological advances in recent decades, determining whole genome sequencing of a person has become feasible and affordable. As a result, large-scale individual genomic sequences are produced and collected for genetic medical diagnoses and cancer drug discovery, which, however, simultaneously poses serious challenges to the protection of personal genomic privacy. It is highly urgent to develop methods which make the personal genomic data both utilizable and confidential. Existing genomic privacy-protection methods are either time-consuming for encryption or with low accuracy of data recovery. To tackle these problems, this paper proposes a sequence similarity-based obfuscation method, namely IterMegaBLAST, for fast and reliable protection of personal genomic privacy. Specifically, given a randomly selected sequence from a dataset of genomic sequences, we first use MegaBLAST to find its most similar sequence from the dataset. These two aligned sequences form a cluster, for which an obfuscated sequence was generated via a DNA generalization lattice scheme. These procedures are iteratively performed until all of the sequences in the dataset are clustered and their obfuscated sequences are generated. Experimental results on benchmark datasets demonstrate that under the same degree of anonymity, IterMegaBLAST significantly outperforms existing state-of-the-art approaches in terms of both utility accuracy and time complexity.
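The iterative pair-and-generalize loop described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. Three assumptions are made for illustration only: plain Hamming distance over equal-length, gap-free sequences stands in for the MegaBLAST similarity search; the DNA generalization lattice is approximated by IUPAC two-base ambiguity codes; and the dataset holds an even number of sequences. The names `hamming`, `generalize`, and `obfuscate` are hypothetical.

```python
import random

# IUPAC codes for single bases and two-base ambiguities (assumed stand-in
# for the DNA generalization lattice mentioned in the abstract).
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
}


def hamming(a, b):
    """Count mismatching positions (assumed stand-in for MegaBLAST)."""
    return sum(x != y for x, y in zip(a, b))


def generalize(a, b):
    """Merge two aligned sequences into one obfuscated sequence."""
    return "".join(IUPAC[frozenset(x + y)] for x, y in zip(a, b))


def obfuscate(seqs, seed=0):
    """Pair each randomly chosen sequence with its nearest unclustered
    neighbour and give both members the same generalized sequence."""
    if len(seqs) % 2:
        raise ValueError("this sketch assumes an even number of sequences")
    rng = random.Random(seed)
    remaining = dict(enumerate(seqs))  # index -> sequence not yet clustered
    out = {}
    while remaining:
        i = rng.choice(sorted(remaining))  # randomly selected query
        query = remaining.pop(i)
        # Most similar remaining sequence; ties broken by lowest index.
        j = min(remaining, key=lambda k: (hamming(query, remaining[k]), k))
        partner = remaining.pop(j)
        out[i] = out[j] = generalize(query, partner)
    return [out[k] for k in range(len(seqs))]


if __name__ == "__main__":
    print(obfuscate(["ACGT", "ACGA", "TTTT", "TTTA"]))
```

In this toy input the two members of each pair differ only at the last position, so each pair shares one generalized sequence in which that position becomes the ambiguity code `W` (A or T).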
Affiliation(s)
- Shibiao Wan
- Center for Applied Bioinformatics, St. Jude Children’s Research Hospital, Memphis, TN, United States
- Jieqiong Wang
- Department of Radiology, University of Pennsylvania, Philadelphia, PA, United States
- *Correspondence: Shibiao Wan; Jieqiong Wang
2. Azencott CA. Machine learning and genomics: precision medicine versus patient privacy. Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 2018; 376:rsta.2017.0350. [PMID: 30082298 DOI: 10.1098/rsta.2017.0350] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Accepted: 06/07/2018] [Indexed: 06/08/2023]
Abstract
Machine learning can have a major societal impact in computational biology applications. In particular, it plays a central role in the development of precision medicine, whereby treatment is tailored to the clinical or genetic features of the patient. However, these advances require collecting and sharing among researchers large amounts of genomic data, which generates much concern about privacy. Researchers, study participants and governing bodies should be aware of the ways in which the privacy of participants might be compromised, as well as of the large body of research on technical solutions to these issues. We review how breaches in patient privacy can occur, present recent developments in computational data protection and discuss how they can be combined with legal and ethical perspectives to provide secure frameworks for genomic data sharing. This article is part of a discussion meeting issue 'The growing ubiquity of algorithms in society: implications, impacts and innovations'.
Affiliation(s)
- C-A Azencott
- MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, 75006 Paris, France
- Institut Curie, PSL Research University, 75005 Paris, France
- INSERM, U900, 75005 Paris, France
3. Decouchant J, Fernandes M, Völp M, Couto FM, Esteves-Veríssimo P. Accurate filtering of privacy-sensitive information in raw genomic data. J Biomed Inform 2018; 82:1-12. [PMID: 29660494 DOI: 10.1016/j.jbi.2018.04.006] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 01/21/2018] [Accepted: 04/07/2018] [Indexed: 10/17/2022]
Abstract
Sequencing thousands of human genomes has enabled breakthroughs in many areas, among them precision medicine, the study of rare diseases, and forensics. However, mass collection of such sensitive data entails enormous risks if not protected to the highest standards. In this article, we follow the position and argue that post-alignment privacy is not enough and that data should be automatically protected as early as possible in the genomics workflow, ideally immediately after the data is produced. We show that a previous approach for filtering short reads cannot extend to long reads and present a novel filtering approach that classifies raw genomic data (i.e., whose location and content is not yet determined) into privacy-sensitive (i.e., more affected by a successful privacy attack) and non-privacy-sensitive information. Such a classification allows the fine-grained and automated adjustment of protective measures to mitigate the possible consequences of exposure, in particular when relying on public clouds. We present the first filter that can be indistinctly applied to reads of any length, i.e., making it usable with any recent or future sequencing technologies. The filter is accurate, in the sense that it detects all known sensitive nucleotides except those located in highly variable regions (less than 10 nucleotides remain undetected per genome instead of 100,000 in previous works). It has far less false positives than previously known methods (10% instead of 60%) and can detect sensitive nucleotides despite sequencing errors (86% detected instead of 56% with 2% of mutations). Finally, practical experiments demonstrate high performance, both in terms of throughput and memory consumption.
Affiliation(s)
- Jérémie Decouchant
- SnT - Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg, Luxembourg.
- Maria Fernandes
- SnT - Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg, Luxembourg.
- Marcus Völp
- SnT - Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg, Luxembourg.
- Francisco M Couto
- LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Portugal.
- Paulo Esteves-Veríssimo
- SnT - Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg, Luxembourg.
4. Sariyar M, Schlünder I. Reconsidering Anonymization-Related Concepts and the Term "Identification" Against the Backdrop of the European Legal Framework. Biopreserv Biobank 2016; 14:367-374. [PMID: 27104620 PMCID: PMC5073223 DOI: 10.1089/bio.2015.0100] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 01/08/2023] [Open Access]
Abstract
Sharing data in biomedical contexts has become increasingly relevant, but privacy concerns set constraints for free sharing of individual-level data. Data protection law protects only data relating to an identifiable individual, whereas “anonymous” data are free to be used by everybody. Usage of many terms related to anonymization is often not consistent among different domains such as statistics and law. The crucial term “identification” seems especially hard to define, since its definition presupposes the existence of identifying characteristics, leading to some circularity. In this article, we present a discussion of important terms based on a legal perspective that it is outlined before we present issues related to the usage of terms such as unique “identifiers,” “quasi-identifiers,” and “sensitive attributes.” Based on these terms, we have tried to circumvent a circular definition for the term “identification” by making two decisions: first, deciding which (natural) identifier should stand for the individual; second, deciding how to recognize the individual. In addition, we provide an overview of anonymization techniques/methods for preventing re-identification. The discussion of basic notions related to anonymization shows that there is some work to be done in order to achieve a mutual understanding between legal and technical experts concerning some of these notions. Using a dialectical definition process in order to merge technical and legal perspectives on terms seems important for enhancing mutual understanding.
Affiliation(s)
- Murat Sariyar
- Institute of Pathology, Charité-University Medicine Berlin, Berlin, Germany; TMF (Technologie- und Methodenplattform e.V.), Berlin, Germany
- Irene Schlünder
- TMF (Technologie- und Methodenplattform e.V.), Berlin, Germany
5. Naveed M, Ayday E, Clayton EW, Fellay J, Gunter CA, Hubaux JP, Malin BA, Wang X. Privacy in the Genomic Era. ACM Computing Surveys 2015; 48:6. [PMID: 26640318 PMCID: PMC4666540 DOI: 10.1145/2767007] [Citation(s) in RCA: 78] [Impact Index Per Article: 8.7] [Received: 05/01/2014] [Accepted: 04/01/2015] [Indexed: 05/19/2023]
Abstract
Genome sequencing technology has advanced at a rapid pace and it is now possible to generate highly-detailed genotypes inexpensively. The collection and analysis of such data has the potential to support various applications, including personalized medical services. While the benefits of the genomics revolution are trumpeted by the biomedical community, the increased availability of such data has major implications for personal privacy; notably because the genome has certain essential features, which include (but are not limited to) (i) an association with traits and certain diseases, (ii) identification capability (e.g., forensics), and (iii) revelation of family relationships. Moreover, direct-to-consumer DNA testing increases the likelihood that genome data will be made available in less regulated environments, such as the Internet and for-profit companies. The problem of genome data privacy thus resides at the crossroads of computer science, medicine, and public policy. While the computer scientists have addressed data privacy for various data types, there has been less attention dedicated to genomic data. Thus, the goal of this paper is to provide a systematization of knowledge for the computer science community. In doing so, we address some of the (sometimes erroneous) beliefs of this field and we report on a survey we conducted about genome data privacy with biomedical specialists. Then, after characterizing the genome privacy problem, we review the state-of-the-art regarding privacy attacks on genomic data and strategies for mitigating such attacks, as well as contextualizing these attacks from the perspective of medicine and public policy. This paper concludes with an enumeration of the challenges for genome data privacy and presents a framework to systematize the analysis of threats and the design of countermeasures as the field moves forward.
6. Cronskär M. Strength analysis of clavicle fracture fixation devices and fixation techniques using finite element analysis with musculoskeletal force input. Med Biol Eng Comput 2015; 53:759-69. [PMID: 25850983 DOI: 10.1007/s11517-015-1288-5] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 07/19/2013] [Accepted: 03/26/2015] [Indexed: 11/27/2022]
Abstract
In the cases, when clavicle fractures are treated with a fixation plate, opinions are divided about the best position of the plate, type of plate and type of screw units. Results from biomechanical studies of clavicle fixation devices are contradictory, probably partly because of simplified and varying load cases used in different studies. The anatomy of the shoulder region is complex, which makes it difficult and expensive to perform realistic experimental tests; hence, reliable simulation is an important complement to experimental tests. In this study, a method for finite element simulations of stresses in the clavicle plate and bone is used, in which muscle and ligament force data are imported from a multibody musculoskeletal model. The stress distribution in two different commercial plates, superior and anterior plating position and fixation including using a lag screw in the fracture gap or not, was compared. Looking at the clavicle fixation from a mechanical point of view, the results indicate that it is a major benefit to use a lag screw to fixate the fracture. The anterior plating position resulted in lower stresses in the plate, and the anatomically shaped plate is more stress resistant and stable than a regular reconstruction plate.
Affiliation(s)
- Marie Cronskär
- Department of Quality, Mechanics and Mathematics, Mid Sweden University, Akademigatan 1, 831 25, Östersund, Sweden
7. Cronskär M, Rasmussen J, Tinnsten M. Combined finite element and multibody musculoskeletal investigation of a fractured clavicle with reconstruction plate. Comput Methods Biomech Biomed Engin 2013; 18:740-8. [PMID: 24156391 DOI: 10.1080/10255842.2013.845175] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Indexed: 10/26/2022]
Abstract
This paper addresses the evaluation of clavicle fixation devices, by means of computational models. The aim was to develop a method for comparison of stress distribution in various fixation devices, to determine whether the use of multibody musculoskeletal input in such model is applicable and to report the approach. The focus was on realistic loading and the motivation for the work is that the treatment can be enhanced by a better understanding of the loading of the clavicle and fixation device. The method can be used to confirm the strength of customised plates, for optimisation of new plates and to complement experimental studies. A finite element (FE) mesh of the clavicle geometry was created from computed tomography data and imported into the FE solver where the model was subjected to muscle forces and other boundary conditions from a multibody musculoskeletal model performing a typical activity of daily life. A reconstruction plate and screws were also imported into the model. The combination models returned stresses and displacements of plausible magnitudes in all included parts and the result, upon further development and validation, may serve as a design guideline for improved clavicle fixation.
Affiliation(s)
- Marie Cronskär
- Department of Technology and Sustainable Development, Mid Sweden University, 83125 Östersund, Sweden
8. Malin B, Loukides G, Benitez K, Clayton EW. Identifiability in biobanks: models, measures, and mitigation strategies. Hum Genet 2011; 130:383-92. [PMID: 21739176 PMCID: PMC3621020 DOI: 10.1007/s00439-011-1042-5] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.3] [Received: 04/25/2011] [Accepted: 06/12/2011] [Indexed: 12/29/2022]
Abstract
The collection and sharing of person-specific biospecimens has raised significant questions regarding privacy. In particular, the question of identifiability, or the degree to which materials stored in biobanks can be linked to the name of the individuals from which they were derived, is under scrutiny. The goal of this paper is to review the extent to which biospecimens and affiliated data can be designated as identifiable. To achieve this goal, we summarize recent research in identifiability assessment for DNA sequence data, as well as associated demographic and clinical data, shared via biobanks. We demonstrate the variability of the degree of risk, the factors that contribute to this variation, and potential ways to mitigate and manage such risk. Finally, we discuss the policy implications of these findings, particularly as they pertain to biobank security and access policies. We situate our review in the context of real data sharing scenarios and biorepositories.
Affiliation(s)
- Bradley Malin
- Department of Biomedical Informatics, School of Medicine, Vanderbilt University, 2525 West End Avenue, Suite 600, Nashville, TN 37203, USA. Department of Electrical Engineering and Computer Science, School of Engineering, Vanderbilt University, Nashville, USA
- Grigorios Loukides
- Department of Biomedical Informatics, School of Medicine, Vanderbilt University, 2525 West End Avenue, Suite 600, Nashville, TN 37203, USA
- Kathleen Benitez
- Department of Biomedical Informatics, School of Medicine, Vanderbilt University, 2525 West End Avenue, Suite 600, Nashville, TN 37203, USA
- Ellen Wright Clayton
- Department of Pediatrics, School of Medicine, Vanderbilt, USA. Center for Biomedical Ethics and Society, School of Medicine, Vanderbilt University, 2525 West End Avenue, Suite 400, Nashville, TN 37203, USA. School of Law, Vanderbilt University, Nashville, USA