1
Souza J, Caballero I, Vasco Santos J, Fernandes Lobo M, Pinto A, Viana J, Sáez C, Lopes F, Freitas A. Multisource and temporal variability in Portuguese hospital administrative datasets: data quality implications. J Biomed Inform 2022; 136:104242. [DOI: 10.1016/j.jbi.2022.104242] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/22/2021] [Revised: 08/18/2022] [Accepted: 11/06/2022] [Indexed: 11/13/2022]
2
Deep ensemble multitask classification of emergency medical call incidents combining multimodal data improves emergency medical dispatch. Artif Intell Med 2021; 117:102088. [PMID: 34127234 DOI: 10.1016/j.artmed.2021.102088] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/09/2020] [Revised: 04/19/2021] [Accepted: 05/03/2021] [Indexed: 11/20/2022]
Abstract
The objective of this work was to develop a predictive model to aid non-clinical dispatchers in classifying emergency medical call incidents by their life-threatening level (yes/no), admissible response delay (undelayable, minutes, hours, days) and emergency system jurisdiction (emergency system/primary care) in real time. We used a total of 1 244 624 independent incidents from the Valencian emergency medical dispatch service in Spain, compiled retrospectively from 2009 to 2012, including clinical features, demographics, circumstantial factors and free text dispatcher observations. Based on them, we designed and developed DeepEMC2, a deep ensemble multitask model integrating four subnetworks: three specialized to context, clinical and text data, respectively, and another to ensemble the former. The four subnetworks are composed in turn of multi-layer perceptron modules, bidirectional long short-term memory units and a bidirectional encoder representations from transformers module. DeepEMC2 showed a macro F1-score of 0.759 in life-threatening classification, 0.576 in admissible response delay and 0.757 in emergency system jurisdiction. These results show a substantial performance increase of 12.5 %, 17.5 % and 5.1 %, respectively, with respect to the current in-house triage protocol of the Valencian emergency medical dispatch service. Besides, DeepEMC2 significantly outperformed a set of baseline machine learning models, including naive Bayes, logistic regression, random forest and gradient boosting (α = 0.05). Hence, DeepEMC2 is able to: 1) capture information present in emergency medical calls not considered by the existing triage protocol, and 2) model complex data dependencies not feasible for the tested baseline models. Likewise, our results suggest that most of this unconsidered information is present in the free text dispatcher observations.
To our knowledge, this study describes the first deep learning model undertaking emergency medical call incident classification. Its adoption in medical dispatch centers would potentially improve emergency dispatch processes, resulting in a positive impact on patient wellbeing and health services sustainability.
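The abstract reports performance as macro F1-scores. As a reading aid, the following is a minimal sketch of how a macro F1-score is generally computed (the unweighted mean of per-class F1). The function name `macro_f1` and the toy labels are hypothetical and are not taken from the paper or its code.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with made-up life-threatening labels (yes/no)
y_true = ["yes", "yes", "no", "no", "no", "no"]
y_pred = ["yes", "no", "no", "no", "no", "yes"]
print(macro_f1(y_true, y_pred))  # per-class F1: yes = 0.5, no = 0.75
```

Because each class contributes equally regardless of its frequency, the macro average is sensitive to performance on rare classes, which is relevant for imbalanced outcomes.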
3
Sáez C, Romero N, Conejero JA, García-Gómez JM. Potential limitations in COVID-19 machine learning due to data source variability: A case study in the nCov2019 dataset. J Am Med Inform Assoc 2021; 28:360-364. [PMID: 33027509 PMCID: PMC7797735 DOI: 10.1093/jamia/ocaa258] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Received: 06/19/2020] [Revised: 09/07/2020] [Accepted: 09/28/2020] [Indexed: 02/02/2023]
Abstract
OBJECTIVE The lack of representative coronavirus disease 2019 (COVID-19) data is a bottleneck for reliable and generalizable machine learning. Data sharing is insufficient without data quality, in which source variability plays an important role. We showcase and discuss potential biases from data source variability for COVID-19 machine learning. MATERIALS AND METHODS We used the publicly available nCov2019 dataset, including patient-level data from several countries. We aimed to discover and classify severity subgroups using symptoms and comorbidities. RESULTS Cases from the 2 countries with the highest prevalence were divided into separate subgroups with distinct severity manifestations. This variability can reduce the representativeness of training data with respect to the model target populations and increase model complexity at risk of overfitting. CONCLUSIONS Data source variability is a potential contributor to bias in distributed research networks. We call for systematic assessment and reporting of data source variability and data quality in COVID-19 data sharing, as key information for reliable and generalizable machine learning.
Affiliation(s)
- Carlos Sáez
  - Biomedical Data Science Lab, Instituto Universitario de Tecnologías de la Información y Comunicaciones, Universitat Politècnica de València, Camino de Vera s/n, Valencia 46022, España
- Nekane Romero
  - Biomedical Data Science Lab, Instituto Universitario de Tecnologías de la Información y Comunicaciones, Universitat Politècnica de València, Camino de Vera s/n, Valencia 46022, España
- J Alberto Conejero
  - Instituto Universitario de Matemática Pura y Aplicada, Universitat Politècnica de València, Valencia, Spain
- Juan M García-Gómez
  - Biomedical Data Science Lab, Instituto Universitario de Tecnologías de la Información y Comunicaciones, Universitat Politècnica de València, Camino de Vera s/n, Valencia 46022, España
4
Sáez C, Gutiérrez-Sacristán A, Kohane I, García-Gómez JM, Avillach P. EHRtemporalVariability: delineating temporal data-set shifts in electronic health records. Gigascience 2020; 9:giaa079. [PMID: 32729900 PMCID: PMC7391413 DOI: 10.1093/gigascience/giaa079] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 10/29/2019] [Revised: 05/28/2020] [Accepted: 07/03/2020] [Indexed: 11/18/2022]
Abstract
BACKGROUND Temporal variability in health-care processes or protocols is intrinsic to medicine. Such variability can potentially introduce dataset shifts, a data quality issue when reusing electronic health records (EHRs) for secondary purposes. Temporal data-set shifts can present as trends, as well as abrupt or seasonal changes in the statistical distributions of data over time. The latter are particularly complicated to address in multimodal and highly coded data. These changes, if not delineated, can harm population and data-driven research, such as machine learning. Given that biomedical research repositories are increasingly being populated with large sets of historical data from EHRs, there is a need for specific software methods to help delineate temporal data-set shifts to ensure reliable data reuse. RESULTS EHRtemporalVariability is an open-source R package and Shiny app designed to explore and identify temporal data-set shifts. EHRtemporalVariability estimates the statistical distributions of coded and numerical data over time; projects their temporal evolution through non-parametric information geometric temporal plots; and enables the exploration of changes in variables through data temporal heat maps. We demonstrate the capability of EHRtemporalVariability to delineate data-set shifts in three impact case studies, one of which is available for reproducibility. CONCLUSIONS EHRtemporalVariability enables the exploration and identification of data-set shifts, contributing to the broad examination and repurposing of large, longitudinal data sets. Our goal is to help ensure reliable data reuse for a wide range of biomedical data users. 
EHRtemporalVariability is designed for technical users who are programmatically utilizing the R package, as well as users who are not familiar with programming via the Shiny user interface.
Availability: https://github.com/hms-dbmi/EHRtemporalVariability/
Reproducible vignette: https://cran.r-project.org/web/packages/EHRtemporalVariability/vignettes/EHRtemporalVariability.html
Online demo: http://ehrtemporalvariability.upv.es/
Affiliation(s)
- Carlos Sáez
  - Biomedical Data Science Lab, Instituto Universitario de Tecnologías de la Información y Comunicaciones, Universitat Politècnica de València, Camino de Vera s/n, Valencia 46022, España
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
- Isaac Kohane
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
- Juan M García-Gómez
  - Biomedical Data Science Lab, Instituto Universitario de Tecnologías de la Información y Comunicaciones, Universitat Politècnica de València, Camino de Vera s/n, Valencia 46022, España
- Paul Avillach
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
  - Computational Health Informatics Program, Boston Children’s Hospital, Boston, Massachusetts, USA
5
Rockenschaub P, Nguyen V, Aldridge RW, Acosta D, García-Gómez JM, Sáez C. Data-driven discovery of changes in clinical code usage over time: a case-study on changes in cardiovascular disease recording in two English electronic health records databases (2001-2015). BMJ Open 2020; 10:e034396. [PMID: 32060159 PMCID: PMC7045100 DOI: 10.1136/bmjopen-2019-034396] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 12/17/2022]
Abstract
OBJECTIVES To demonstrate how data-driven variability methods can be used to identify changes in disease recording in two English electronic health records databases between 2001 and 2015. DESIGN Repeated cross-sectional analysis that applied data-driven temporal variability methods to assess month-by-month changes in routinely collected medical data. A measure of difference between months was calculated based on joint distributions of age, gender, socioeconomic status and recorded cardiovascular diseases. Distances between months were used to identify temporal trends in data recording. SETTING 400 English primary care practices from the Clinical Practice Research Datalink (CPRD GOLD) and 451 hospital providers from the Hospital Episode Statistics (HES). MAIN OUTCOMES The proportion of patients (CPRD GOLD) and hospital admissions (HES) with a recorded cardiovascular disease (CPRD GOLD: coronary heart disease, heart failure, peripheral arterial disease, stroke; HES: International Classification of Disease codes I20-I69/G45). RESULTS Both databases showed gradual changes in cardiovascular disease recording between 2001 and 2008. The recorded prevalence of included cardiovascular diseases in CPRD GOLD increased by 47%-62%, which partially reversed after 2008. For hospital records in HES, there was a relative decrease in angina pectoris (-34.4%) and unspecified stroke (-42.3%) over the same time period, with a concomitant increase in chronic coronary heart disease (+14.3%). Multiple abrupt changes in the use of myocardial infarction codes in hospital were found in March/April 2010, 2012 and 2014, possibly linked to updates of clinical coding guidelines. CONCLUSIONS Identified temporal variability could be related to potentially non-medical causes such as updated coding guidelines. These artificial changes may introduce temporal correlation among diagnoses inferred from routine data, violating the assumptions of frequently used statistical methods. 
Temporal variability measures provide an objective and robust technique to identify, and subsequently account for, those changes in electronic health records studies without any prior knowledge of the data collection process.
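The abstract describes computing a measure of difference between months from their data distributions and using those distances to spot temporal changes. The sketch below illustrates that general idea with the Jensen-Shannon distance between monthly code-frequency distributions. The choice of Jensen-Shannon distance is an assumption made here for illustration (the abstract does not name the measure), and the function names, diagnosis codes and months are hypothetical.

```python
import math
from collections import Counter


def distribution(codes, support):
    """Relative frequency of each code in `support` (illustrative helper)."""
    counts = Counter(codes)
    total = sum(counts[c] for c in support)
    return [counts[c] / total for c in support]


def js_distance(p, q):
    """Jensen-Shannon distance (base-2 logs, so the result lies in [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i = 0 contribute 0
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(jsd, 0.0))


# Made-up monthly batches of recorded diagnosis codes
months = {
    "2010-01": ["I21", "I21", "I20", "I25"],
    "2010-02": ["I21", "I20", "I20", "I25"],
    "2010-03": ["I22", "I22", "I22", "I25"],  # abrupt change in code usage
}
support = sorted({code for batch in months.values() for code in batch})
dists = {month: distribution(batch, support) for month, batch in months.items()}

keys = sorted(dists)
for prev, curr in zip(keys, keys[1:]):
    print(prev, "->", curr, round(js_distance(dists[prev], dists[curr]), 3))
```

In this toy data the distance between February and March is much larger than between January and February, which is the kind of signal that would flag an abrupt change in recording practice for closer inspection.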
Affiliation(s)
- Patrick Rockenschaub
  - Institute of Health Informatics, University College London, London, UK
  - Health Data Research UK, London, UK
- Vincent Nguyen
  - Institute of Health Informatics, University College London, London, UK
  - Health Data Research UK, London, UK
- Robert W Aldridge
  - Institute of Health Informatics, University College London, London, UK
  - Health Data Research UK, London, UK
- Dionisio Acosta
  - Institute of Health Informatics, University College London, London, UK
  - Health Data Research UK, London, UK
- Juan Miguel García-Gómez
  - Instituto de Aplicaciones de las Tecnologías de la Información y de las Comunicaciones Avanzadas (ITACA), Universitat Politècnica de València, Valencia, Spain
- Carlos Sáez
  - Instituto de Aplicaciones de las Tecnologías de la Información y de las Comunicaciones Avanzadas (ITACA), Universitat Politècnica de València, Valencia, Spain
6
Sáez C, Liaw ST, Kimura E, Coorevits P, Garcia-Gomez JM. Guest editorial: Special issue in biomedical data quality assessment methods. Comput Methods Programs Biomed 2019; 181:104954. [PMID: 31242965 DOI: 10.1016/j.cmpb.2019.06.013] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 06/09/2023]
Affiliation(s)
- Carlos Sáez
  - Biomedical Data Science Lab, Instituto Universitario de Tecnologías de la Información y Comunicaciones (ITACA), Universitat Politècnica de València (UPV), Camino de Vera s/n, Valencia 46022, Spain.
- Siaw-Teng Liaw
  - WHO Collaborating Centre on eHealth, School of Public Health & Community Medicine, UNSW Sydney, Australia.
- Juan M Garcia-Gomez
  - Biomedical Data Science Lab, Instituto Universitario de Tecnologías de la Información y Comunicaciones (ITACA), Universitat Politècnica de València (UPV), Camino de Vera s/n, Valencia 46022, Spain.
7
Pérez-Benito FJ, Sáez C, Conejero JA, Tortajada S, Valdivieso B, García-Gómez JM. Temporal variability analysis reveals biases in electronic health records due to hospital process reengineering interventions over seven years. PLoS One 2019; 14:e0220369. [PMID: 31390350 PMCID: PMC6685618 DOI: 10.1371/journal.pone.0220369] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 04/07/2019] [Accepted: 07/15/2019] [Indexed: 12/28/2022]
Abstract
OBJECTIVE To evaluate the effects of Process-Reengineering interventions on the Electronic Health Records (EHR) of a hospital over 7 years. MATERIALS AND METHODS Temporal Variability Assessment (TVA) based on probabilistic data quality assessment was applied to the historic monthly-batched admission data of Hospital La Fe, Valencia, Spain, from 2010 to 2016. Routine healthcare data with a complete EHR was expanded with processed variables such as the Charlson Comorbidity Index. RESULTS Four Process-Reengineering interventions were detected by quantifiable effects on the EHR: (1) the hospital relocation in 2011 involved a progressive reduction of admissions during the next four months; (2) the hospital services re-configuration increased the number of inter-services transfers; (3) the care-services re-distribution led to transfers between facilities; and (4) the assignment to the hospital of a new area with 80,000 patients in 2015 prompted the discharge to home for follow-up and the update of the pre-surgery planned admissions protocol, which produced a significant decrease in patient length of stay. DISCUSSION TVA provides an indicator of the effect of process re-engineering interventions on healthcare practice. Evaluating the effect of facilities' relocation and the increase in citizens (findings 1, 3-4), the impact of strategies (findings 2-3), and gradual changes in protocols (finding 4) may help hospital management by optimizing interventions based on their effect on EHRs or on data reuse. CONCLUSIONS The effects of process re-engineering interventions on hospital EHRs can be evaluated using the TVA methodology. Being aware of conditioned variations in EHR is of the utmost importance for the reliable reuse of routine hospitalization data.
Affiliation(s)
- Francisco Javier Pérez-Benito
  - Biomedical Data Science Lab, Instituto Universitario de Tecnologías de Información y Comunicaciones Avanzadas (ITACA), Universitat Politècnica de València, València, Spain
  - Instituto Universitario de Matemática Pura y Aplicada, Universitat Politècnica de València, València, Spain
- Carlos Sáez
  - Biomedical Data Science Lab, Instituto Universitario de Tecnologías de Información y Comunicaciones Avanzadas (ITACA), Universitat Politècnica de València, València, Spain
- J. Alberto Conejero
  - Instituto Universitario de Matemática Pura y Aplicada, Universitat Politècnica de València, València, Spain
- Salvador Tortajada
  - Biomedical Data Science Lab, Instituto Universitario de Tecnologías de Información y Comunicaciones Avanzadas (ITACA), Universitat Politècnica de València, València, Spain
  - Unidad conjunta de investigación en reingeniería de procesos socio-sanitarios, Instituto de Investigación Sanitaria La Fe, Hospital Universitario La Fe, València, Spain
  - Red de Investigación en Servicios de Salud en Enfermedades Crónicas (REDISSEC), València, Spain
- Bernardo Valdivieso
  - Unidad conjunta de investigación en reingeniería de procesos socio-sanitarios, Instituto de Investigación Sanitaria La Fe, Hospital Universitario La Fe, València, Spain
- Juan M. García-Gómez
  - Biomedical Data Science Lab, Instituto Universitario de Tecnologías de Información y Comunicaciones Avanzadas (ITACA), Universitat Politècnica de València, València, Spain
  - Unidad conjunta de investigación en reingeniería de procesos socio-sanitarios, Instituto de Investigación Sanitaria La Fe, Hospital Universitario La Fe, València, Spain
  - Red de Investigación en Servicios de Salud en Enfermedades Crónicas (REDISSEC), València, Spain
8
Sáez C, García-Gómez JM. Kinematics of Big Biomedical Data to characterize temporal variability and seasonality of data repositories: Functional Data Analysis of data temporal evolution over non-parametric statistical manifolds. Int J Med Inform 2018; 119:109-124. [DOI: 10.1016/j.ijmedinf.2018.09.015] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 05/17/2018] [Revised: 09/05/2018] [Accepted: 09/13/2018] [Indexed: 01/26/2023]
9
Sáez C, Zurriaga O, Pérez-Panadés J, Melchor I, Robles M, García-Gómez JM. Applying probabilistic temporal and multisite data quality control methods to a public health mortality registry in Spain: a systematic approach to quality control of repositories. J Am Med Inform Assoc 2016; 23:1085-1095. [PMID: 27107447 DOI: 10.1093/jamia/ocw010] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Received: 08/30/2015] [Revised: 12/21/2015] [Accepted: 01/17/2016] [Indexed: 11/14/2022]
Abstract
Objective To assess the variability in data distributions among data sources and over time through a case study of a large multisite repository as a systematic approach to data quality (DQ).
Materials and Methods Novel probabilistic DQ control methods based on information theory and geometry are applied to the Public Health Mortality Registry of the Region of Valencia, Spain, with 512 143 entries from 2000 to 2012, disaggregated into 24 health departments. The methods provide DQ metrics and exploratory visualizations for (1) assessing the variability among multiple sources and (2) monitoring and exploring changes with time. The methods are suited to big data and multitype, multivariate, and multimodal data.
Results The repository was partitioned into 2 probabilistically separated temporal subgroups following a change in the Spanish National Death Certificate in 2009. Punctual temporal anomalies were noticed due to a punctual increment in the missing data, along with outlying and clustered health departments due to differences in populations or in practices.
Discussion Changes in protocols, differences in populations, biased practices, or other systematic DQ problems affected data variability. Even if semantic and integration aspects are addressed in data sharing infrastructures, probabilistic variability may still be present. Solutions include fixing or excluding data and analyzing different sites or time periods separately. A systematic approach to assessing temporal and multisite variability is proposed.
Conclusion Multisite and temporal variability in data distributions affects DQ, hindering data reuse, and an assessment of such variability should be a part of systematic DQ procedures.
Affiliation(s)
- Carlos Sáez
  - Instituto Universitario de Aplicaciones de las Tecnologías de la Información y de las Comunicaciones Avanzadas. Universitat Politècnica de València. Camino de Vera s/n. 46022 Valencia, España
  - Centre for Health Technologies and Services Research, University of Porto, Porto, Portugal
- Oscar Zurriaga
  - Dirección General de Salud Pública, Conselleria de Sanidad, Valencia, Spain
  - FISABIO – Salud Pública, Consellería de Sanidad, Valencia, Spain
  - CIBERESP, Madrid, Spain
- Inma Melchor
  - Dirección General de Salud Pública, Conselleria de Sanidad, Valencia, Spain
- Montserrat Robles
  - Instituto Universitario de Aplicaciones de las Tecnologías de la Información y de las Comunicaciones Avanzadas. Universitat Politècnica de València. Camino de Vera s/n. 46022 Valencia, España
- Juan M García-Gómez
  - Instituto Universitario de Aplicaciones de las Tecnologías de la Información y de las Comunicaciones Avanzadas. Universitat Politècnica de València. Camino de Vera s/n. 46022 Valencia, España
  - Unidad Mixta de Investigación en TICs aplicadas a la Reingeniería de Procesos Sociosanitarios (eRPSS), Instituto de Investigación Sanitaria del Hospital Universitario y Politécnico La Fe, Valencia, Spain
10
García-de-León-Chocano R, Muñoz-Soler V, Sáez C, García-de-León-González R, García-Gómez JM. Construction of quality-assured infant feeding process of care data repositories: Construction of the perinatal repository (Part 2). Comput Biol Med 2016; 71:214-22. [DOI: 10.1016/j.compbiomed.2016.01.007] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 08/31/2015] [Revised: 12/03/2015] [Accepted: 01/06/2016] [Indexed: 10/22/2022]