1. A Record Linkage-Based Data Deduplication Framework with DataCleaner Extension. Multimodal Technologies and Interaction 2022. DOI: 10.3390/mti6040027. Open access.
Abstract
The data management process is characterised by a set of tasks in which data quality management (DQM) is a core component. Data quality, however, is a multidimensional concept, and the nature of data quality issues is very diverse. One of the most widely anticipated data quality challenges, which becomes particularly vital when data come from multiple sources (a typical situation in the current data-driven world), is duplicates, or non-uniqueness. Moreover, duplicates have been recognised as one of the key domain-specific data quality dimensions in Internet of Things (IoT) application domains, where smart grids and health dominate. Duplicate data lead to inaccurate analyses and thus to wrong decisions; negatively affect data-driven and data-processing activities such as the development of models, forecasts and simulations; harm customer service, risk and crisis management, and service personalisation in terms of both accuracy and trustworthiness; and decrease user adoption and satisfaction. The process of determining and eliminating duplicates is known as deduplication, while the process of finding records in one or more databases that refer to the same entity is known as record linkage. To find duplicates, data sets are compared with each other using similarity functions, which typically compare two input strings, and this requires quadratic time complexity. To defuse the quadratic complexity of the problem, especially for large data sources, record linkage methods such as blocking and sorted neighbourhood are used. In this paper, we propose a six-step record linkage deduplication framework. Its operation is demonstrated on a simplified example of research data artifacts, such as publications and research projects, from a real-world research institution in the Research Information Systems (RIS) domain.
To make the proposed framework usable, we integrated it into a tool that is already used in practice by developing a prototype extension for the well-known DataCleaner. The framework detects and visualises duplicates, presenting the identified redundancies to the user in a user-friendly manner and allowing their subsequent elimination. Removing these redundancies improves the quality of the data and, in turn, analyses and decision-making. This study calls on other researchers to take a step towards the “golden record” that can be achieved when all data quality issues are recognised and resolved, thus moving towards absolute data quality.
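The blocking idea this abstract describes — comparing records only within groups that share a blocking key, instead of all O(n²) pairs — can be sketched as follows. The publication records, the blocking key, and the use of Python's difflib ratio in place of a production similarity function are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; stands in for Jaro-Winkler, Levenshtein, etc."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def dedup_with_blocking(records, key, threshold=0.85):
    """Compare records only within blocks that share a blocking key,
    defusing the quadratic cost of comparing every pair."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    duplicates = []
    for block in blocks.values():
        for r1, r2 in combinations(block, 2):
            if similarity(r1["title"], r2["title"]) >= threshold:
                duplicates.append((r1["id"], r2["id"]))
    return duplicates

publications = [
    {"id": 1, "title": "A Record Linkage Framework", "year": 2022},
    {"id": 2, "title": "A Record Linkage Framework.", "year": 2022},
    {"id": 3, "title": "Sorted Neighbourhood Methods", "year": 2022},
]
# Blocking key: first letter of the title plus year; records falling in
# different blocks are never compared at all.
pairs = dedup_with_blocking(publications, key=lambda r: (r["title"][:1].upper(), r["year"]))
print(pairs)  # [(1, 2)]
```

The sorted neighbourhood method mentioned in the abstract works similarly, but sorts records on a key and slides a fixed-size window over the sorted list instead of hashing into blocks.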
2. Binette O, Steorts RC.
Abstract
Whether the goal is to estimate the number of people that live in a congressional district, to estimate the number of individuals who have died in an armed conflict, or to disambiguate individual authors using bibliographic data, all these applications have a common theme: integrating information from multiple sources. Before such questions can be answered, databases must be cleaned and integrated in a systematic and accurate way, a process commonly known as structured entity resolution (record linkage or deduplication). Here, we review motivational applications and seminal papers that have led to the growth of this area. We review modern probabilistic and Bayesian methods in statistics, computer science, machine learning, database management, economics, political science, and other disciplines that are used throughout industry and academia in applications such as human rights, official statistics, medicine, and citation networks, among others. Last, we discuss current research topics of practical importance.
Affiliation(s)
- Olivier Binette
- Department of Statistical Science, Duke University, Durham, NC, USA
- Rebecca C Steorts
- Department of Statistical Science, Computer Science, Biostatistics and Bioinformatics, the Rhodes Information Initiative at Duke (iiD) and the Social Science Research Institute (SSRI), Duke University, Durham, NC, USA
- Principal Mathematical Statistician, United States Census Bureau, Washington, DC, USA
3. Ilyas IF, Rekatsinas T. Machine Learning and Data Cleaning: Which Serves the Other? ACM Journal of Data and Information Quality 2022. DOI: 10.1145/3506712.
4. Ali A, Emran NA, Asmai SA. Missing values compensation in duplicates detection using hot deck method. Journal of Big Data 2021; 8:112. DOI: 10.1186/s40537-021-00502-1.
Abstract
Duplicate records are a common problem within data sets, especially in huge-volume databases. The accuracy of duplicate detection determines the efficiency of the duplicate removal process. However, duplicate detection has become more challenging due to the presence of missing values within records: during the clustering and matching process, missing values can cause records deemed similar to be inserted into the wrong group, leading to undetected duplicates. In this paper, an improvement to duplicate detection in the presence of missing values is proposed through the Duplicate Detection within the Incomplete Data set (DDID) method. Missing values were hypothetically added to the key attributes of three data sets under study, using an arbitrary pattern to simulate both complete and incomplete data sets, and the Hot Deck method was used to compensate for the missing values in the key attributes. It was hypothesised that using Hot Deck would improve duplicate detection performance. Furthermore, DDID was compared to an earlier duplicate detection method, DuDe, in terms of accuracy and speed. The findings show that even though the data sets were incomplete, DDID offered better accuracy and faster duplicate detection than DuDe. The results of this study offer insights into the constraints of duplicate detection within incomplete data sets.
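The hot-deck idea — filling a missing key attribute by copying it from a similar "donor" record before matching — can be sketched as follows. This is a generic hot-deck sketch, not the paper's DDID method, and the records and attribute names are invented for illustration.

```python
import random

def hot_deck_impute(records, key_attr, donor_attr):
    """Fill missing values of a key attribute by copying them from a
    'donor' record, preferring donors that agree on another attribute."""
    donors = [r for r in records if r.get(key_attr) is not None]
    for rec in records:
        if rec.get(key_attr) is None:
            pool = [d for d in donors
                    if d.get(donor_attr) == rec.get(donor_attr)] or donors
            rec[key_attr] = random.choice(pool)[key_attr]
    return records

people = [
    {"name": "Ann Lee",  "city": "Oslo",   "zip": "0150"},
    {"name": "Anne Lee", "city": "Oslo",   "zip": None},   # missing key attribute
    {"name": "Bob Roy",  "city": "Bergen", "zip": "5003"},
]
hot_deck_impute(people, key_attr="zip", donor_attr="city")
print(people[1]["zip"])  # 0150, copied from the matching Oslo donor
```

With the key attribute filled in, "Ann Lee" and "Anne Lee" can land in the same block and be recognised as likely duplicates instead of being separated by the missing value.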
5. Niknam M, Minaei-Bidgoli B, Dianat R. The role of transitive closure in evaluating blocking methods for dirty entity resolution. Journal of Intelligent Information Systems 2021. DOI: 10.1007/s10844-021-00676-3.
6. Naranjo-Zeledón L, Chacón-Rivas M, Peral J, Ferrández A. Architecture design of a reinforcement environment for learning sign languages. PeerJ Computer Science 2021; 7:e740. PMID: 34722873; PMCID: PMC8530094; DOI: 10.7717/peerj-cs.740.
Abstract
Different fields such as linguistics, teaching, and computing have shown special interest in the study of sign languages (SL). However, teaching and learning these languages is complex, since it is unusual to find SL teachers who are fluent in both the SL and the native language of the students; the teachings of deaf individuals are therefore unique, and it is important for students to be able to lean on supportive mechanisms while learning an SL. Bidirectional communication between deaf and hearing people through SL is a hot topic for achieving a higher level of inclusion, yet the same scarcity of bilingual teachers also makes it hard to provide computer-based teaching tools for different SLs. Moreover, the main aspects that a second-language learner of an SL finds difficult are phonology, non-manual components, and the use of space (the latter two being specific to SLs, not to spoken languages). This proposal appears to be the first of its kind to support the Costa Rican Sign Language (LESCO, for its Spanish acronym), as well as any other SL. Our research focuses on reinforcing the learning process of hearing end-users through a modular architectural design of a learning environment that relies on the concept of phonological proximity within a graphical tool with a high degree of usability. The aim of incorporating phonological proximity is to help individuals learn signs with similar handshapes. The architecture separates the logic and processing aspects from those associated with data access and generation, which makes it portable to other SLs in the future. The methodology consisted of defining 26 phonological parameters (13 for each hand), thus characterising each sign appropriately.
A similarity formula was then applied to compare each pair of signs; with these pre-calculations, the tool displays each sign together with its ten most similar signs. A SUS usability test, an open qualitative question, and a numerical evaluation by a group of learners were used to validate the proposal. To reach our research aims, we analysed previous work on teaching tools meant for students practising SL, as well as on the importance of phonological proximity in this teaching process; this prior work justifies the necessity of our proposal, whose benefits have been demonstrated through experiments conducted by different users on the usability and usefulness of the tool. To meet these needs, homonymous signs (signs with the same starting handshape) and paronyms (signs with highly similar handshapes) have been included to explore their impact on learning. The same perspective of our existing line of research can be applied to other SLs in the future.
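The parameter-based comparison the abstract describes can be sketched as follows. The paper's exact similarity formula is not reproduced here, so this sketch assumes a simple fraction-of-matching-parameters measure over a toy 6-parameter lexicon (the paper uses 26 parameters per sign, 13 per hand); the sign names and parameter values are invented.

```python
def sign_similarity(params_a, params_b):
    """Fraction of phonological parameters on which two signs agree
    (a stand-in for the paper's similarity formula)."""
    matches = sum(1 for a, b in zip(params_a, params_b) if a == b)
    return matches / len(params_a)

def top_similar(target, lexicon, k=10):
    """Rank all other signs by phonological proximity to the target,
    as the tool does when it displays the ten most similar signs."""
    ranked = sorted(
        ((name, sign_similarity(lexicon[target], params))
         for name, params in lexicon.items() if name != target),
        key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Toy lexicon: 6 parameters per sign instead of the paper's 26.
lexicon = {
    "HOUSE": ["flat", "palm-down", "chest", "arc", "none", "two-handed"],
    "ROOF":  ["flat", "palm-down", "head",  "arc", "none", "two-handed"],
    "NAME":  ["fist", "palm-side", "chin",  "tap", "nod",  "one-handed"],
}
print(top_similar("HOUSE", lexicon, k=2))  # ROOF (5/6 parameters shared) ranks first
```

Because all pairwise similarities can be pre-calculated offline, the interactive tool only needs a lookup at display time.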
Affiliation(s)
- Luis Naranjo-Zeledón
- Inclutec, Costa Rica Institute of Technology, Cartago, Costa Rica
- Department of Languages and Computing Systems, University of Alicante, Alicante, Spain
- Jesús Peral
- Department of Languages and Computing Systems, University of Alicante, Alicante, Spain
- Antonio Ferrández
- Department of Languages and Computing Systems, University of Alicante, Alicante, Spain
7. Fellah A. All-Three: Near-optimal and domain-independent algorithms for near-duplicate detection. Array 2021. DOI: 10.1016/j.array.2021.100070. Open access.
8. Al-Masaeed M, Alghawanmeh M, Al-Singlawi A, Alsababha R, Alqudah M. An Examination of COVID-19 Medications' Effectiveness in Managing and Treating COVID-19 Patients: A Comparative Review. Healthcare (Basel) 2021; 9:557. PMID: 34068474; PMCID: PMC8151388; DOI: 10.3390/healthcare9050557. Open access.
Abstract
Background: The review seeks to shed light on the administered and recommended COVID-19 treatment medications through an evaluation of their efficacy. Methods: Data were collected from key databases, including Scopus, Medline, Google Scholar, and CINAHL, as well as from WHO and FDA publications. The literature search was guided by the scope and trial-assessment parameters of the WHO Solidarity clinical trials for COVID-19. Results: The findings indicate that the use of antiretroviral drugs as an early treatment for COVID-19 patients has been useful: it has reduced hospital time, hastened the clinical cure period, delayed and reduced the need for mechanical and invasive ventilation, and reduced mortality rates. The use of vitamins, minerals, and supplements has been linked to increased immunity, offering the body a fighting chance. Antibiotics, however, do not correlate with improved patient wellbeing and are strongly discouraged by the clinical trials reviewed. Conclusions: The review demonstrates the need for additional randomised clinical trials with an extensive sample base, conducted over a longer period, to examine the potential side effects of the medications administered. Critically, the findings underscore vaccination as the only viable means of limiting the spread of the SARS-CoV-2 virus.
Affiliation(s)
- Mahmoud Al-Masaeed
- Faculty of Health and Medicine, University of Newcastle, Callaghan 2308, Australia;
- Rawan Alsababha
- School of Nursing and Midwifery, Western Sydney University, Sydney 2560, Australia
- Muhammad Alqudah
- Faculty of Health and Medicine, University of Newcastle, Callaghan 2308, Australia;
9. Li Y, Li J, Suhara Y, Wang J, Hirota W, Tan WC. Deep Entity Matching. ACM Journal of Data and Information Quality 2021. DOI: 10.1145/3431816.
Abstract
Entity matching refers to the task of determining whether two different representations refer to the same real-world entity. It continues to be a prevalent problem for many organisations where data reside in different sources and duplicates need to be identified and managed. The term "entity matching" also loosely refers to the broader problem of determining whether two heterogeneous representations of different entities should be associated together. This problem has an even wider scope of applications, from determining the subsidiaries of companies to matching jobs to job seekers, with impactful consequences.
In this article, we first report on our recent system Ditto, an example of a modern entity matching system based on pretrained language models. We then summarise recent solutions that apply deep learning and pretrained language models to the entity matching task. Finally, we discuss research directions beyond entity matching, including the promise of synergistically integrating the blocking and entity matching steps, the need for methods that alleviate the steep training data requirements typical of deep learning and pretrained language models, and the importance of generalising entity matching solutions to handle the broader entity matching problem, which makes the need to explain matching outcomes even more pressing.
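Ditto-style systems cast entity matching as sequence-pair classification: each record is serialized into a flat string and a pretrained language model is fine-tuned on the pair. A minimal sketch of that serialization step follows; the product records are invented, and the actual model fine-tuning is omitted.

```python
def serialize(record: dict) -> str:
    """Flatten a record into the 'COL <attribute> VAL <value>' form that
    Ditto-style matchers feed to a pretrained language model."""
    return " ".join(f"COL {k} VAL {v}" for k, v in record.items())

def make_pair(left: dict, right: dict) -> str:
    """A candidate pair becomes one sequence with a separator token; the
    language model is then fine-tuned to classify it as match / no-match."""
    return f"{serialize(left)} [SEP] {serialize(right)}"

a = {"title": "iPhone 12 64GB", "brand": "Apple"}
b = {"title": "Apple iPhone 12 (64 GB)", "brand": "Apple"}
print(make_pair(a, b))
```

Keeping attribute names in the sequence lets the language model learn which fields matter for a match, rather than relying on hand-crafted per-attribute similarity functions.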
10. Loster M, Koumarelas I, Naumann F. Knowledge Transfer for Entity Resolution with Siamese Neural Networks. ACM Journal of Data and Information Quality 2021. DOI: 10.1145/3410157.
Abstract
The integration of multiple data sources is a common problem in a large variety of applications. Traditionally, handcrafted similarity measures are used to discover, merge, and integrate multiple representations of the same entity—duplicates—into a large homogeneous collection of data. Often, these similarity measures do not cope well with the heterogeneity of the underlying dataset. In addition, domain experts are needed to manually design and configure such measures, which is both time-consuming and requires extensive domain expertise.
We propose a deep Siamese neural network capable of learning a similarity measure that is tailored to the characteristics of a particular dataset. Thanks to the properties of deep learning methods, we are able to eliminate the manual feature engineering process and thus considerably reduce the effort required for model construction. In addition, we show that it is possible to transfer knowledge acquired during the deduplication of one dataset to another, and thus significantly reduce the amount of data required to train a similarity measure. We evaluated our method on multiple datasets and compared our approach to state-of-the-art deduplication methods. Our approach outperforms competitors by up to +26 percent F-measure, depending on the task and dataset. In addition, we show that knowledge transfer is not only feasible, but in our experiments led to an improvement in F-measure of up to +4.7 percent.
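The shared-weights idea behind a Siamese network — both records pass through the same learned encoder, and the distance between the encodings serves as the similarity score — can be sketched in miniature with NumPy. The random "encoder" and toy feature vectors are illustrative assumptions; a real system would train the weights on labelled duplicate pairs, which is exactly the data requirement the paper's knowledge transfer reduces.

```python
import numpy as np

rng = np.random.default_rng(0)

class SiameseEncoder:
    """Both inputs pass through the SAME transformation (shared weights);
    the cosine of the two encodings is the similarity score."""
    def __init__(self, in_dim, out_dim):
        # Untrained random weights; a real system learns these from
        # labelled duplicate / non-duplicate pairs.
        self.W = rng.normal(size=(in_dim, out_dim)) * 0.1

    def encode(self, x):
        return np.tanh(x @ self.W)

    def similarity(self, x1, x2):
        a, b = self.encode(x1), self.encode(x2)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Toy numeric features for two records (e.g. character n-gram counts).
enc = SiameseEncoder(in_dim=8, out_dim=4)
rec = rng.normal(size=8)
near_dup = rec + rng.normal(size=8) * 0.01   # almost identical record
print(enc.similarity(rec, near_dup))          # close to 1.0 for a near-duplicate
```

Knowledge transfer, in this picture, amounts to initialising the encoder weights from a model trained on one dataset before fine-tuning on another.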
Affiliation(s)
- Michael Loster
- Hasso Plattner Institute, University of Potsdam, Potsdam, Germany
- Felix Naumann
- Hasso Plattner Institute, University of Potsdam, Potsdam, Germany
11. Araújo D, Santos Pires CE, Cassimiro Nascimento D. Leveraging active learning to reduce human effort in the generation of ground-truth for entity resolution. Computational Intelligence 2020. DOI: 10.1111/coin.12268.
Affiliation(s)
- Diego Araújo
- Center of Electrical Engineering and Informatics, Federal University of Campina Grande, Paraíba, Brazil
- Center of Exact and Applied Social Sciences, State University of Paraíba, Paraíba, Brazil
- Dimas Cassimiro Nascimento
- Center of Electrical Engineering and Informatics, Federal University of Campina Grande, Paraíba, Brazil
- Academic Unit of Garanhuns, Federal Rural University of Pernambuco, Pernambuco, Brazil
13. Bisandu DB, Prasad R, Liman MM. Data clustering using efficient similarity measures. Journal of Statistics & Management Systems 2019. DOI: 10.1080/09720510.2019.1565443.
Affiliation(s)
- Desmond Bala Bisandu
- Department of Computer Science, University of Jos, P.M.B. 2084 Jos, Plateau State 930001, Nigeria
- Rajesh Prasad
- Department of Computer Science, African University of Science and Technology, P.M.B. 681 Garki, Abuja F.C.T., Nigeria
- Musa Muhammad Liman
- Department of Computer Science, Universiti Putra Malaysia, 43400 Serdang, Selangor, Malaysia
16. van Gennip Y, Hunter B, Ma A, Moyer D, de Vera R, Bertozzi AL. Unsupervised record matching with noisy and incomplete data. International Journal of Data Science and Analytics 2018. DOI: 10.1007/s41060-018-0129-7.
17. Jurek A, Hong J, Chi Y, Liu W. A novel ensemble learning approach to unsupervised record linkage. Information Systems 2017. DOI: 10.1016/j.is.2017.06.006.
18. Sagi T, Gal A, Barkol O, Bergman R, Avram A. Multi-source uncertain entity resolution: Transforming Holocaust victim reports into people. Information Systems 2017. DOI: 10.1016/j.is.2016.12.003.
19. Sohail A, Yousaf MM. A proficient cost reduction framework for de-duplication of records in data integration. BMC Medical Informatics and Decision Making 2016; 16:42. PMID: 27067004; PMCID: PMC4828843; DOI: 10.1186/s12911-016-0280-9. Open access.
Abstract
Background: Record de-duplication is the process of identifying records that refer to the same entity. It plays a pivotal role in data mining applications involving the integration of multiple data sources and data cleansing, and it is a challenging task due to its computational complexity and the variations in data representation across different sources. Blocking and windowing are the commonly used methods for reducing the number of record comparisons during de-duplication; both require tuning a certain set of parameters, such as the choice of a particular variant of blocking or windowing, or the selection of an appropriate window size for different datasets.
Methods: In this paper, we propose a framework that employs blocking and windowing techniques in succession, so that figuring out these parameters is not required. We also evaluate the impact of different configurations on dirty and massively dirty datasets. To evaluate the proposed framework, experiments were performed using Febrl (Freely Extensible Biomedical Record Linkage).
Results: The proposed framework is comprehensively evaluated using a variety of quality and complexity measures, such as reduction ratio, precision, and recall. It is observed that the framework significantly reduces the number of record comparisons.
Conclusions: The selection of the linkage key is a critical performance factor for record linkage.
Electronic supplementary material: The online version of this article (doi:10.1186/s12911-016-0280-9) contains supplementary material, which is available to authorized users.
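Two of the standard measures mentioned above can be computed directly; the record and candidate-pair counts below are invented for illustration.

```python
def reduction_ratio(n_records: int, candidate_pairs: int) -> float:
    """Fraction of the full quadratic comparison space that blocking
    and windowing removed."""
    total_pairs = n_records * (n_records - 1) // 2
    return 1 - candidate_pairs / total_pairs

def pairs_completeness(true_matches_found: int, true_matches_total: int) -> float:
    """Share of the true duplicate pairs that survive into the
    candidate set (the recall of the blocking/windowing step)."""
    return true_matches_found / true_matches_total

# 10,000 records give ~50 million possible pairs. Suppose blocking and
# windowing applied in succession leave 120,000 candidate pairs and
# retain 950 of 1,000 known duplicate pairs.
print(round(reduction_ratio(10_000, 120_000), 4))  # 0.9976
print(pairs_completeness(950, 1_000))              # 0.95
```

A good configuration keeps the reduction ratio high without letting pairs completeness drop, which is exactly the trade-off the framework tries to sidestep by chaining the two techniques instead of tuning each one.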
Affiliation(s)
- Asif Sohail
- Punjab University College of Information Technology (PUCIT), University of the Punjab, Lahore, Pakistan
- Muhammad Murtaza Yousaf
- Punjab University College of Information Technology (PUCIT), University of the Punjab, Lahore, Pakistan
21. Randall SM, Boyd JH, Ferrante AM, Bauer JK, Semmens JB. Use of graph theory measures to identify errors in record linkage. Computer Methods and Programs in Biomedicine 2014; 115:55-63. PMID: 24768079; DOI: 10.1016/j.cmpb.2014.03.008.
Abstract
Ensuring high linkage quality is important in many record linkage applications, yet current methods for ensuring quality are manual and resource intensive. This paper seeks to determine the effectiveness of graph theory techniques in identifying record linkage errors. A range of graph theory techniques was applied to two linked datasets with known truth sets, and their ability to identify groups containing errors was compared to that of a widely used threshold-setting technique. The methodology shows promise; however, further investigation of graph theory techniques is required. The development of more efficient and effective methods of improving linkage quality will result in higher-quality datasets that can be delivered to researchers in shorter timeframes.
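One simple graph-theoretic signal for linkage errors is the density of each linked group. The sketch below assumes that groups formed by long chains of weak links (low density) are more error-prone than cliques in which every record pair matched directly; it illustrates the general idea, not the specific measures evaluated in the paper.

```python
from itertools import combinations

def group_density(nodes, edges):
    """Density of a linked group: 1.0 means every record pair in the
    group matched directly; low density suggests the group was formed
    by a chain of weak links and may contain linkage errors."""
    possible = len(nodes) * (len(nodes) - 1) / 2
    present = sum(1 for a, b in combinations(nodes, 2)
                  if (a, b) in edges or (b, a) in edges)
    return present / possible if possible else 1.0

# Group A: every pair matched directly (a clique) -> density 1.0.
clique = group_density({"r1", "r2", "r3"},
                       {("r1", "r2"), ("r1", "r3"), ("r2", "r3")})
# Group B: records glued together by a single chain -> density 2/3.
chain = group_density({"r4", "r5", "r6"}, {("r4", "r5"), ("r5", "r6")})
print(clique, round(chain, 2))
```

Groups with low density can then be prioritised for the manual clerical review that the paper aims to reduce.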
Affiliation(s)
- Sean M Randall
- Centre for Data Linkage, Curtin University, Kent Street, Bentley, WA 6102, Australia
- James H Boyd
- Centre for Data Linkage, Curtin University, Kent Street, Bentley, WA 6102, Australia
- Anna M Ferrante
- Centre for Data Linkage, Curtin University, Kent Street, Bentley, WA 6102, Australia
- Jacqueline K Bauer
- Centre for Data Linkage, Curtin University, Kent Street, Bentley, WA 6102, Australia
- James B Semmens
- Centre for Data Linkage, Curtin University, Kent Street, Bentley, WA 6102, Australia
23. Vatsalan D, Christen P, Verykios VS. A taxonomy of privacy-preserving record linkage techniques. Information Systems 2013. DOI: 10.1016/j.is.2012.11.005.
25. Panse F, van Keulen M, Ritter N. Indeterministic Handling of Uncertain Decisions in Deduplication. ACM Journal of Data and Information Quality 2013. DOI: 10.1145/2435221.2435225.
Abstract
In current research and practice, deduplication is usually treated as a deterministic approach in which database tuples are either declared to be duplicates or not. In ambiguous situations, however, it is often not completely clear-cut which tuples represent the same real-world entity, and deterministic approaches may ignore many realistic possibilities, which in turn can lead to false decisions. In this article, we present an indeterministic approach to deduplication that uses a probabilistic target model, including techniques for the proper probabilistic interpretation of similarity matching results. Instead of deciding for one of the most likely situations, all realistic situations are modelled in the resulting data, which minimises the negative impact of false decisions. Moreover, the deduplication process becomes almost fully automatic, and human effort can be largely reduced. To increase applicability, we introduce several semi-indeterministic methods that heuristically reduce the set of indeterministically handled decisions in several meaningful ways, and we describe a fully indeterministic method for theoretical and presentational reasons.
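The indeterministic idea — keeping every realistic interpretation of an ambiguous match, each weighted by a probability, instead of forcing a yes/no decision — can be sketched as follows. The thresholds and the possible-world representation are illustrative assumptions, not the article's model.

```python
def possible_worlds(pair, match_prob, low=0.2, high=0.8):
    """Indeterministic handling of a candidate pair: confident scores
    yield a single world, ambiguous scores keep BOTH interpretations,
    each carrying its probability forward into the resulting data."""
    if match_prob >= high:
        return [({"merged": pair}, 1.0)]
    if match_prob <= low:
        return [({"separate": pair}, 1.0)]
    # Ambiguous zone: do not decide; model both realistic situations.
    return [({"merged": pair}, match_prob),
            ({"separate": pair}, 1.0 - match_prob)]

worlds = possible_worlds(("t1", "t2"), match_prob=0.6)
for world, p in worlds:
    print(world, p)
```

The semi-indeterministic methods described above correspond to narrowing the ambiguous zone (raising `low`, lowering `high`) so that fewer pairs are carried forward as multiple worlds.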
26. A Supervised Machine Learning Approach for Duplicate Detection over Gazetteer Records. Geospatial Semantics 2011. DOI: 10.1007/978-3-642-20630-6_3.