1
Rempel JL, Belfer E, Ray I, Morello-Frosch R. Access for sale? Overlying rights, land transactions, and groundwater in California. ENVIRONMENTAL RESEARCH LETTERS : ERL [WEB SITE] 2024; 19:024017. [PMID: 38283952 PMCID: PMC10811753 DOI: 10.1088/1748-9326/ad0f71] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2023] [Revised: 10/04/2023] [Accepted: 11/23/2023] [Indexed: 01/30/2024]
Abstract
Climate change intensifies longstanding tensions over groundwater sustainability and equity of access among users. Though private land ownership is a primary mechanism for accessing groundwater in many regions, few studies have systematically examined the extent to which farmland markets transform groundwater access patterns over time. This study begins to fill this gap by examining farmland transactions overlying groundwater from 2003-17 in California. We construct a novel dataset that downscales well construction behavior to the parcel level, and we use it to characterize changes in groundwater access patterns by buyer type on newly transacted parcels in the San Joaquin Valley groundwater basin during the 2011-17 drought. Our results demonstrate large-scale transitions in farmland ownership, with 21.1% of overlying agricultural acreage statewide sold at least once during the study period and with the highest rates of turnover occurring in critically overdrafted basins. By 2017, annual individual farmland acquisitions had halved, while acquisitions by limited liability companies increased to one-third of all overlying acres purchased. Together, these trends signal increasing corporate farmland acquisitions; new corporate farmland owners are associated with the construction, on comparable parcels, of agricultural wells 77-81 feet deeper than those drilled by new individual landowners. We discuss the implications of our findings for near-term governance of groundwater, and their relevance for understanding structural inequities in exposure to future groundwater level declines.
Affiliation(s)
- Jenny Linder Rempel
- Energy & Resources Group, University of California, Berkeley, CA, United States of America
- Ella Belfer
- Energy & Resources Group, University of California, Berkeley, CA, United States of America
- Isha Ray
- Energy & Resources Group, University of California, Berkeley, CA, United States of America
- Rachel Morello-Frosch
- Department of Environmental Science, Policy, and Management, University of California, Berkeley, CA, United States of America
- School of Public Health, University of California, Berkeley, CA, United States of America
2
Xu H, Li X, Zhang Z, Grannis S. Score test for assessing the conditional dependence in latent class models and its application to record linkage. J R Stat Soc Ser C Appl Stat 2022. [DOI: 10.1111/rssc.12590] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
Affiliation(s)
- Huiping Xu
- Department of Biostatistics and Health Data Science, Indiana University, Indianapolis, Indiana, USA
- Xiaochun Li
- Department of Biostatistics and Health Data Science, Indiana University, Indianapolis, Indiana, USA
3
4
A Record Linkage-Based Data Deduplication Framework with DataCleaner Extension. MULTIMODAL TECHNOLOGIES AND INTERACTION 2022. [DOI: 10.3390/mti6040027] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/16/2022] Open
Abstract
The data management process is characterised by a set of tasks in which data quality management (DQM) is one of the core components. Data quality, however, is a multidimensional concept, and data quality issues are very diverse in nature. One of the most widely anticipated data quality challenges is duplicates, or non-uniqueness, which becomes particularly important when data come from multiple sources, a typical situation in the current data-driven world. Moreover, duplicates have been recognised as one of the key domain-specific data quality dimensions in Internet of Things (IoT) application domains, among which smart grids and health dominate. Duplicate data lead to inaccurate analyses and therefore wrong decisions; negatively affect data-driven and data processing activities such as the development of models, forecasts and simulations; reduce the accuracy and trustworthiness of customer service, risk and crisis management, and service personalisation; and decrease user adoption and satisfaction. The process of identifying and eliminating duplicates is known as deduplication, while the process of finding records in one or more databases that refer to the same entities is known as record linkage. To find duplicates, records are compared with each other using similarity functions, which typically compare two input strings; comparing every pair requires quadratic time. To mitigate this quadratic complexity, especially in large data sources, record linkage methods such as blocking and sorted neighbourhood are used. In this paper, we propose a six-step record linkage deduplication framework. The operation of the framework is demonstrated on a simplified example of research data artifacts, such as publications and research projects, from a real-world research institution representing the Research Information Systems (RIS) domain.
To make the proposed framework usable, we integrated it into a tool that is already used in practice by developing a prototype extension for the well-known DataCleaner. The framework detects and visualises duplicates, presenting the identified redundancies to the user in a user-friendly manner so that they can be eliminated. Removing the redundancies improves the quality of the data and therefore improves analyses and decision-making. This study calls on other researchers to take a step towards the “golden record” that can be achieved when all data quality issues are recognised and resolved, thus moving towards absolute data quality.
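The abstract above names blocking and sorted neighbourhood as ways around quadratic pairwise comparison. The following is a minimal sketch of both ideas, not the paper's six-step framework or the DataCleaner extension; the function names, the toy records, and the choice of blocking key are all assumptions made for illustration.

```python
from itertools import combinations


def all_pairs(records):
    """Naive candidate generation: every pair, O(n^2) comparisons."""
    return list(combinations(range(len(records)), 2))


def blocking_pairs(records, block_key):
    """Standard blocking: only records sharing a block key become candidates."""
    blocks = {}
    for i, record in enumerate(records):
        blocks.setdefault(block_key(record), []).append(i)
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return sorted(pairs)


def sorted_neighbourhood_pairs(records, key, window=3):
    """Sorted neighbourhood: sort by a key, then pair each record only with
    the next (window - 1) records in sorted order."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


# Hypothetical toy data: index positions identify the records.
records = ["smith john", "smyth john", "jones mary", "jonas mary", "brown alan"]

print(len(all_pairs(records)))                          # number of naive pairs
print(blocking_pairs(records, lambda r: r[0]))          # block on first letter
print(sorted_neighbourhood_pairs(records, key=lambda r: r, window=2))
```

Both strategies trade recall for speed: a duplicate whose blocking key differs, or which sorts far from its partner, is never compared, so the key is usually chosen from the fields least affected by typing errors.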
5
Abstract
Whether the goal is to estimate the number of people that live in a congressional district, to estimate the number of individuals that have died in an armed conflict, or to disambiguate individual authors using bibliographic data, all these applications have a common theme: integrating information from multiple sources. Before such questions can be answered, databases must be cleaned and integrated in a systematic and accurate way, commonly known as structured entity resolution (record linkage or deduplication). Here, we review motivational applications and seminal papers that have led to the growth of this area. We review modern probabilistic and Bayesian methods in statistics, computer science, machine learning, database management, economics, political science, and other disciplines that are used throughout industry and academia in applications such as human rights, official statistics, medicine, and citation networks, among others. Last, we discuss current research topics of practical importance.
Affiliation(s)
- Olivier Binette
- Department of Statistical Science, Duke University, Durham, NC, USA
- Rebecca C Steorts
- Department of Statistical Science, Computer Science, Biostatistics and Bioinformatics, the Rhodes Information Initiative at Duke (iiD) and the Social Science Research Institute (SSRI), Duke University, Durham, NC, USA
- Principal Mathematical Statistician, United States Census Bureau, Washington, DC, USA
6
Abstract
Knowledge graphs (KGs) have rapidly emerged as an important area in AI over the last ten years. Building on a storied tradition of graphs in the AI community, a KG may be simply defined as a directed, labeled, multi-relational graph with some form of semantics. In part, this has been fueled by increased publication of structured datasets on the Web, and well-publicized successes of large-scale projects such as the Google Knowledge Graph and the Amazon Product Graph. However, another factor that is less discussed, but which has been equally instrumental in the success of KGs, is the cross-disciplinary nature of academic KG research. Arguably, because of the diversity of this research, a synthesis of how different KG research strands all tie together could serve a useful role in enabling more ‘moonshot’ research and large-scale collaborations. This review of the KG research landscape attempts to provide such a synthesis by first showing what the major strands of research are, and how those strands map to different communities, such as Natural Language Processing, Databases and Semantic Web. A unified framework is suggested in which to view the distinct, but overlapping, foci of KG research within these communities.
7
Desmet C, Cook DJ. Recent Developments in Privacy-Preserving Mining of Clinical Data. ACM/IMS TRANSACTIONS ON DATA SCIENCE 2021; 2:28. [PMID: 35018368 PMCID: PMC8746818 DOI: 10.1145/3447774] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2020] [Accepted: 01/01/2021] [Indexed: 06/14/2023]
Abstract
With the dramatic increases in both the capability to collect personal data and the capability to analyze large amounts of data, increasingly sophisticated and personal insights are being drawn. These insights are valuable for clinical applications but also open up possibilities for identification and abuse of personal information. In this paper, we survey recent research on classical methods of privacy-preserving data mining. Looking at dominant techniques and recent innovations to them, we examine the applicability of these methods to the privacy-preserving analysis of clinical data. We also discuss promising directions for future research in this area.
8
Xu H, Li X, Grannis S. A simple two-step procedure using the Fellegi-Sunter model for frequency-based record linkage. J Appl Stat 2021; 49:2789-2804. [PMID: 35909667 DOI: 10.1080/02664763.2021.1922615] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]
Abstract
The widely used Fellegi-Sunter model for probabilistic record linkage does not leverage information contained in field values and consequently leads to identical classification of match status regardless of whether records agree on rare or common values. Since agreement on rare values is less likely to occur by chance than agreement on common values, records agreeing on rare values are more likely to be matches. Existing frequency-based methods typically rely on knowledge of error probabilities associated with field values and frequencies of agreed field values among matches, often derived using prior studies or training data. When such information is unavailable, applications of these methods are challenging. In this paper, we propose a simple two-step procedure for frequency-based matching using the Fellegi-Sunter framework to overcome these challenges. Matching weights are adjusted based on frequency distributions of the agreed field values among matches and non-matches, estimated by the Fellegi-Sunter model without relying on prior studies or training data. Through a real-world application and simulation, our method is found to produce comparable or better performance than the unadjusted method. Furthermore, frequency-based matching provides greater improvement in matching accuracy when using poorly discriminating fields, with diminished benefit as the discriminating power of matching fields increases.
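To illustrate why agreement on a rare value carries more evidence than agreement on a common one, the sketch below contrasts the standard Fellegi-Sunter field weight with a textbook value-specific adjustment. It is not the two-step procedure proposed in the paper: the m- and u-probabilities and the value frequencies are made-up numbers, and the adjustment assumes that agreement on a value with relative frequency p occurs with probability about m·p among matches and about p² among non-matches, which gives a ratio of m/p.

```python
import math


def fs_weight(agree, m, u):
    """Standard Fellegi-Sunter field weight (log2 likelihood ratio).

    m = P(field agrees | records match), u = P(field agrees | non-match).
    The weight ignores which value the records agree on.
    """
    if agree:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def value_specific_agreement_weight(m, value_freq):
    """Illustrative frequency adjustment (an assumption, see lead-in):
    log2((m * p) / p**2) = log2(m / p) for a value with frequency p."""
    return math.log2(m / value_freq)


# Hypothetical parameters for a surname field.
m, u = 0.9, 0.1
standard = fs_weight(True, m, u)
rare = value_specific_agreement_weight(m, 0.01)    # surname held by 1% of records
common = value_specific_agreement_weight(m, 0.20)  # surname held by 20% of records

print(f"standard={standard:.2f} rare={rare:.2f} common={common:.2f}")
```

Under these assumed numbers the unadjusted weight sits between the two adjusted ones, which is the behaviour the abstract describes: the unadjusted model scores both agreements identically, while the adjusted weight rewards agreement on the rare value.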
Affiliation(s)
- Huiping Xu
- Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN, USA
- Xiaochun Li
- Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN, USA
9
Marchant NG, Kaplan A, Elazar DN, Rubinstein BIP, Steorts RC. d-blink: Distributed End-to-End Bayesian Entity Resolution. J Comput Graph Stat 2021. [DOI: 10.1080/10618600.2020.1825451] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 10/23/2022]
Affiliation(s)
- Neil G. Marchant
- School of Computing and Information Systems, University of Melbourne, Parkville, VIC, Australia
- Andee Kaplan
- Department of Statistics, Colorado State University, Fort Collins, CO
- Daniel N. Elazar
- Methodology Division, Australian Bureau of Statistics, Belconnen, ACT, Australia
- Rebecca C. Steorts
- Department of Statistical Science and Computer Science, Duke University, Durham, NC
- Principal Mathematical Statistician, United States Census Bureau (DRB #: CBDRB-FY20-309)
10
Salvati N, Fabrizi E, Ranalli MG, Chambers RL. Small area estimation with linked data. J R Stat Soc Series B Stat Methodol 2020. [DOI: 10.1111/rssb.12401] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/27/2022]
Affiliation(s)
- N. Salvati
- Dipartimento di Economia e Management, Università di Pisa, Pisa, Italy
- E. Fabrizi
- Dipartimento di Scienze Economiche e Sociali, Università Cattolica del Sacro Cuore, Milan, Italy
- M. G. Ranalli
- Dipartimento di Scienze Politiche, Università degli Studi di Perugia, Perugia, Italy
- R. L. Chambers
- National Institute for Applied Statistics Research Australia, School of Mathematics and Applied Statistics, University of Wollongong, Wollongong, Australia
11
Stammler S, Kussel T, Schoppmann P, Stampe F, Tremper G, Katzenbeisser S, Hamacher K, Lablans M. Mainzelliste SecureEpiLinker (MainSEL): Privacy-Preserving Record Linkage using Secure Multi-Party Computation. Bioinformatics 2020; 38:1657-1668. [PMID: 32871006 PMCID: PMC8896632 DOI: 10.1093/bioinformatics/btaa764] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 11/25/2019] [Revised: 07/24/2020] [Accepted: 08/25/2020] [Indexed: 11/17/2022] Open
Abstract
Motivation: Record linkage has versatile applications in real-world data analysis contexts, where several datasets need to be linked on the record level in the absence of any exact identifier connecting related records. One example is medical databases of patients, spread across institutions, that have to be linked on personally identifiable entries such as name, date of birth or ZIP code. At the same time, privacy laws may prohibit the exchange of this personally identifiable information (PII) across institutional boundaries, ruling out the outsourcing of the record linkage task to a trusted third party. We propose to employ privacy-preserving record linkage (PPRL) techniques that prevent, to various degrees, the leakage of PII while still allowing for the linkage of related records. Results: We develop a framework for fault-tolerant PPRL using secure multi-party computation with the medical record keeping software Mainzelliste as the data source. Our solution does not rely on any trusted third party and all PII is guaranteed not to leak under common cryptographic security assumptions. Benchmarks show the feasibility of our approach in realistic networking settings: linkage of a patient record against a database of 10 000 records can be done in 48 s over a heavily delayed (100 ms) network connection, or 3.9 s with a low-latency connection. Availability and implementation: The source code of the sMPC node is freely available on GitHub at https://github.com/medicalinformatics/SecureEpilinker subject to the AGPLv3 license. The source code of the modified Mainzelliste is available at https://github.com/medicalinformatics/MainzellisteSEL. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Martin Lablans
- German Cancer Research Center, Heidelberg, Germany; University Medical Centre Mannheim, Germany
12
Ong TC, Duca LM, Kahn MG, Crume TL. A hybrid approach to record linkage using a combination of deterministic and probabilistic methodology. J Am Med Inform Assoc 2020; 27:505-513. [PMID: 32049329 DOI: 10.1093/jamia/ocz232] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/17/2019] [Revised: 12/02/2019] [Accepted: 01/06/2020] [Indexed: 11/14/2022] Open
Abstract
OBJECTIVE: The disjointed healthcare system and the nonexistence of a universal patient identifier across systems necessitate accurate record linkage (RL). We aim to describe the implementation and evaluation of a hybrid record linkage method in a statewide surveillance system for congenital heart disease. MATERIALS AND METHODS: Clear-text personally identifiable information on individuals in the Colorado Congenital Heart Disease surveillance system was obtained from 5 electronic health record and medical claims data sources. Two deterministic methods and 1 probabilistic RL method using first name, last name, social security number, date of birth, and house number were initially implemented independently and then sequentially in a hybrid approach to assess RL performance. RESULTS: 16 480 nonunique individuals with congenital heart disease were ascertained. Deterministic linkage methods, when performed independently, yielded 4505 linked pairs (consisting of 2 records linked together within or across data sources). Probabilistic RL, using 3 initial characters of last name and gender for blocking, yielded 6294 linked pairs when executed independently. Using a hybrid linkage routine resulted in 6451 linkages and an additional 18%-24% correct linked pairs as compared to the independent methods. A hybrid linkage routine resulted in higher recall and F-measure scores compared to probabilistic and deterministic methods performed independently. DISCUSSION: The hybrid approach resulted in increased linkage accuracy and identified pairs of linked records that would have otherwise been missed when using any independent linkage technique. CONCLUSION: When performing RL within and across disparate data sources, the hybrid RL routine outperformed independent deterministic and probabilistic methods.
Affiliation(s)
- Toan C Ong
- Department of Pediatrics, School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, Colorado, USA
- Lindsey M Duca
- Department of Epidemiology, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, Colorado, USA
- Michael G Kahn
- Department of Pediatrics, School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, Colorado, USA
- Tessa L Crume
- Department of Epidemiology, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, Colorado, USA
13
Fernández-Álvarez D, Gayo JEL, Gayo-Avello D, Ordóñez de Pablos P. MERA. INT J SEMANT WEB INF 2017. [DOI: 10.4018/ijswis.2017100103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
In this paper, the authors describe Musical Entities Reconciliation Architecture (MERA), an architecture designed to link music-related databases adapting the reconciliation techniques to each particular case. MERA includes mechanisms to manage third party sources to improve the results and it makes use of semantic technologies, storing and organizing the information in RDF graphs. They have implemented a prototype of their approach and have used it to link sources with different levels of data quality. The prototype has been effective in more than 94% of the cases under the conditions of their experiments. The authors have also compared their prototype with a well-known music-specialized search engine, outperforming the search results in the two experiments that they performed.
Affiliation(s)
- Patricia Ordóñez de Pablos
- Department of Business Administration, Faculty of Economics and Business, University of Oviedo, Oviedo, Spain
14
Linkage disequilibrium matches forensic genetic records to disjoint genomic marker sets. Proc Natl Acad Sci U S A 2017; 114:5671-5676. [PMID: 28507140 PMCID: PMC5465933 DOI: 10.1073/pnas.1619944114] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Indexed: 01/02/2023] Open
Abstract
Combining genotypes across datasets is central in facilitating advances in genetics. Data aggregation efforts often face the challenge of record matching, the identification of dataset entries that represent the same individual. We show that records can be matched across genotype datasets that have no shared markers based on linkage disequilibrium between loci appearing in different datasets. Using two datasets for the same 872 people (one with 642,563 genome-wide SNPs and the other with 13 short tandem repeats (STRs) used in forensic applications), we find that 90-98% of forensic STR records can be connected to corresponding SNP records and vice versa. Accuracy increases to 99-100% when ∼30 STRs are used. Our method expands the potential of data aggregation, but it also suggests privacy risks intrinsic in maintenance of databases containing even small numbers of markers, including databases of forensic significance.
15
16
Croset S, Rupp J, Romacker M. Flexible data integration and curation using a graph-based approach. Bioinformatics 2016; 32:918-25. [PMID: 26556384 DOI: 10.1093/bioinformatics/btv644] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 06/05/2015] [Accepted: 10/21/2015] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: The increasing diversity of data available to the biomedical scientist holds promise for better understanding of diseases and discovery of new treatments for patients. In order to provide a complete picture of a biomedical question, data from many different origins needs to be combined into a unified representation. During this data integration process, inevitable errors and ambiguities present in the initial sources compromise the quality of the resulting data warehouse, and greatly diminish the scientific value of the content. Expensive and time-consuming manual curation is then required to improve the quality of the information. However, it becomes increasingly difficult to dedicate and optimize the resources for data integration projects as available repositories are growing both in size and in number every day. RESULTS: We present a new generic methodology to identify problematic records, causing what we describe as 'data hairball' structures. The approach is graph-based and relies on two metrics traditionally used in social sciences: the graph density and the betweenness centrality. We evaluate and discuss these measures and show their relevance for flexible, optimized and automated data curation and linkage. The methodology focuses on information coherence and correctness to improve the scientific meaningfulness of data integration endeavors, such as knowledge bases and large data warehouses. CONTACT: samuel.croset@roche.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
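The two metrics named in this abstract, graph density and betweenness centrality, can be sketched on a toy linkage graph as follows. This is an illustration under stated assumptions, not the authors' methodology: the graph is hypothetical, it is treated as undirected and unweighted, and betweenness is computed with Brandes' algorithm and left unnormalised.

```python
from collections import deque


def density(adj):
    """Undirected graph density: 2E / (V * (V - 1))."""
    n = len(adj)
    if n < 2:
        return 0.0
    edges = sum(len(nbrs) for nbrs in adj.values()) / 2
    return 2 * edges / (n * (n - 1))


def betweenness(adj):
    """Unnormalised betweenness centrality (Brandes' algorithm, BFS version)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack, queue = [], deque([s])
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:                 # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:      # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                            # accumulate dependencies
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}


# Hypothetical example: two tight record clusters joined by one suspicious record "x".
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "x"), ("x", "d")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

scores = betweenness(adj)
print(round(density(adj), 3), max(scores, key=scores.get))
```

In this toy graph the bridging record "x" has the highest betweenness, which is the kind of signal one might use to flag a record that glues unrelated clusters together for manual review.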
Affiliation(s)
- Samuel Croset
- Roche Innovation Center Basel, F. Hoffmann-La Roche AG, CH-4070 Basel, Switzerland
- Joachim Rupp
- Roche Innovation Center Basel, F. Hoffmann-La Roche AG, CH-4070 Basel, Switzerland
- Martin Romacker
- Roche Innovation Center Basel, F. Hoffmann-La Roche AG, CH-4070 Basel, Switzerland
17