1
How to harness the power of web scraping for medical and surgical research: An application in estimating international collaboration. World J Surg 2024. [PMID: 38794809 DOI: 10.1002/wjs.12220] [Received: 03/07/2024] [Accepted: 05/10/2024] [Indexed: 05/26/2024]
Abstract
A comprehensive analysis of the applications and impact of web scraping shows that its transformative potential for surgical research is now within reach. This manuscript describes the role of web scraping in driving innovation, enabling more effective management of human capital dynamics, and enhancing patient outcomes in the surgical field. As an example, we demonstrate how web scraping can uncover insights into international collaboration in surgery research, revealing limited collaboration between surgeons in developed and developing countries.
2
iCEED: Integrated customized extraction of enzyme data. J Bioinform Comput Biol 2024:2450005. [PMID: 38779780 DOI: 10.1142/s0219720024500057] [Indexed: 05/25/2024]
Abstract
Enzymes catalyze diverse biochemical reactions and are building blocks of cellular and metabolic pathways. Data and metadata of enzymes are distributed across databases and are archived in various formats. The enzyme databases provide utilities for efficient searches and downloading enzyme records in batch mode but do not support organism-specific extraction of subsets of data. Users are required to write scripts for parsing entries for customized data extraction prior to downstream analysis. Integrated Customized Extraction of Enzyme Data (iCEED) has been developed to provide organism-specific customized data extraction utilities for seven commonly used enzyme databases and brings these resources under an integrated portal. iCEED provides dropdown menus and search boxes using a typeahead utility for submission of queries as well as an enzyme class-based browsing utility. A utility to facilitate mapping and visualization of functionally important features on the three-dimensional (3D) structures of enzymes is integrated. The customized data extraction utilities provided in iCEED are expected to be useful for biochemists, biotechnologists, computational biologists, and life science researchers to build curated datasets of their choice through an easy-to-navigate web-based interface. The integrated feature visualization system is useful for a fine-grained understanding of the enzyme structure-function relationship. Desired subsets of data, extracted and curated using iCEED, can be subsequently used for downstream processing, analyses, and knowledge discovery. iCEED can also be used for training and teaching purposes.
3
Definitions and Measurements for Atypical Presentations at Risk for Diagnostic Errors in Internal Medicine: Protocol for a Scoping Review. JMIR Res Protoc 2024; 13:e56933. [PMID: 38526541 PMCID: PMC11002735 DOI: 10.2196/56933] [Received: 01/30/2024] [Accepted: 02/26/2024] [Indexed: 03/26/2024] Open
Abstract
BACKGROUND Atypical presentations have been increasingly recognized as a significant contributing factor to diagnostic errors in internal medicine. However, research to address associations between atypical presentations and diagnostic errors has not been evaluated due to the lack of widely applicable definitions and criteria for what is considered an atypical presentation. OBJECTIVE The aim of the study is to describe how atypical presentations are defined and measured in studies of diagnostic errors in internal medicine and use this new information to develop new criteria to identify atypical presentations at high risk for diagnostic errors. METHODS This study will follow an established framework for conducting scoping reviews. Inclusion criteria are developed according to the participants, concept, and context framework. This review will consider studies that fulfill all of the following criteria: include adult patients (participants); explore the association between atypical presentations and diagnostic errors using any definition, criteria, or measurement to identify atypical presentations and diagnostic errors (concept); and focus on internal medicine (context). Regarding the type of sources, this scoping review will consider quantitative, qualitative, and mixed methods study designs; systematic reviews; and opinion papers for inclusion. Case reports, case series, and conference abstracts will be excluded. The data will be extracted through MEDLINE, Web of Science, CINAHL, Embase, Cochrane Library, and Google Scholar searches. No limits will be applied to language, and papers indexed from database inception to December 31, 2023, will be included. Two independent reviewers (YH and RK) will conduct study selection and data extraction. 
The data extracted will include specific details about the patient characteristics (eg, age, sex, and disease), the definitions and measuring methods for atypical presentations and diagnostic errors, clinical settings (eg, department and outpatient or inpatient), type of evidence source, and the association between atypical presentations and diagnostic errors relevant to the review question. The extracted data will be presented in tabular format with descriptive statistics, allowing us to identify the key components or types of atypical presentations and develop new criteria to identify atypical presentations for future studies of diagnostic errors. Developing the new criteria will follow guidance for a basic qualitative content analysis with an inductive approach. RESULTS As of January 2024, a literature search through multiple databases is ongoing. We will complete this study by December 2024. CONCLUSIONS This scoping review aims to provide rigorous evidence to develop new criteria to identify atypical presentations at high risk for diagnostic errors in internal medicine. Such criteria could facilitate the development of a comprehensive conceptual model to understand the associations between atypical presentations and diagnostic errors in internal medicine. TRIAL REGISTRATION Open Science Framework; www.osf.io/27d5m. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID) DERR1-10.2196/56933.
4
Data extraction for evidence synthesis using a large language model: A proof-of-concept study. Res Synth Methods 2024. [PMID: 38432227 DOI: 10.1002/jrsm.1710] [Received: 10/02/2023] [Revised: 12/18/2023] [Accepted: 01/26/2024] [Indexed: 03/05/2024]
Abstract
Data extraction is a crucial, yet labor-intensive and error-prone part of evidence synthesis. To date, efforts to harness machine learning for enhancing efficiency of the data extraction process have fallen short of achieving sufficient accuracy and usability. With the release of large language models (LLMs), new possibilities have emerged to increase efficiency and accuracy of data extraction for evidence synthesis. The objective of this proof-of-concept study was to assess the performance of an LLM (Claude 2) in extracting data elements from published studies, compared with human data extraction as employed in systematic reviews. Our analysis utilized a convenience sample of 10 English-language, open-access publications of randomized controlled trials included in a single systematic review. We selected 16 distinct types of data, posing varying degrees of difficulty (160 data elements across 10 studies). We used the browser version of Claude 2 to upload the portable document format of each publication and then prompted the model for each data element. Across 160 data elements, Claude 2 demonstrated an overall accuracy of 96.3% with a high test-retest reliability (replication 1: 96.9%; replication 2: 95.0% accuracy). Overall, Claude 2 made 6 errors on 160 data items. The most common errors (n = 4) were missed data items. Importantly, Claude 2's ease of use was high; it required no technical expertise or labeled training data for effective operation (i.e., zero-shot learning). Based on findings of our proof-of-concept study, leveraging LLMs has the potential to substantially enhance the efficiency and accuracy of data extraction for evidence syntheses.
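The headline accuracy in this abstract follows directly from the two counts it reports (6 errors on 160 data items). A minimal sketch of that arithmetic is given here; the function name `accuracy_percent` is illustrative and does not come from the study.

```python
def accuracy_percent(total_items: int, errors: int) -> float:
    """Share of correctly extracted items, as a percentage."""
    if total_items <= 0:
        raise ValueError("total_items must be positive")
    return 100.0 * (total_items - errors) / total_items


# Counts taken from the abstract: 160 data elements, 6 errors overall.
overall = accuracy_percent(total_items=160, errors=6)
print(f"{overall:.2f}%")  # 96.25%, which the abstract reports as 96.3%
```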
5
Methods for using Bing's AI-powered search engine for data extraction for a systematic review. Res Synth Methods 2024; 15:347-353. [PMID: 38066713 DOI: 10.1002/jrsm.1689] [Received: 06/28/2023] [Revised: 11/08/2023] [Accepted: 11/20/2023] [Indexed: 12/21/2023]
Abstract
Data extraction is a time-consuming and resource-intensive task in the systematic review process. Natural language processing (NLP) artificial intelligence (AI) techniques have the potential to automate data extraction, saving time and resources, accelerating the review process, and enhancing the quality and reliability of extracted data. In this paper, we propose a method for using Bing AI and Microsoft Edge as a second reviewer to verify and enhance data items first extracted by a single human reviewer. We describe a worked example of the steps involved in instructing the Bing AI Chat tool to extract study characteristics as data items from a PDF document into a table so that they can be compared with data extracted manually. We show that this technique may provide an additional verification process for data extraction where there are limited resources available or for novice reviewers. However, it should not be seen as a replacement for already established and validated double independent data extraction methods without further evaluation and verification. Use of AI techniques for data extraction in systematic reviews should be transparently and accurately described in reports. Future research should focus on the accuracy, efficiency, completeness, and user experience of using Bing AI for data extraction compared with traditional methods using two or more reviewers independently.
6
Overview and quality assessment of health economic evaluations for homeopathic therapy: an updated systematic review. Expert Rev Pharmacoecon Outcomes Res 2024; 24:117-142. [PMID: 37795998 DOI: 10.1080/14737167.2023.2266136] [Received: 07/13/2023] [Accepted: 09/28/2023] [Indexed: 10/06/2023]
Abstract
INTRODUCTION As with other medical interventions, economic evaluations of homeopathy contribute to the evidence base of therapeutic concepts and are needed for socioeconomic decision-making. A 2013 review was updated and extended to gain a current overview. METHODS A systematic literature search of the terms 'cost' and 'homeopathy' from January 2012 to July 2022 was performed in electronic databases. Two independent reviewers checked records, extracted data, and assessed study quality using the Consensus on Health Economic Criteria (CHEC) list. RESULTS Six studies were added to 15 from the previous review. Synthesizing both health outcomes and costs showed homeopathic treatment to be at least equally effective at lower or similar cost compared with control in 14 of 21 studies. Three found improved outcomes at higher costs, two of which showed cost-effectiveness for homeopathy by incremental analysis. One found similar results and three similar outcomes at higher costs for homeopathy. CHEC values ranged between two and 16, with studies before 2009 having lower values (Mean ± SD: 6.7 ± 3.4) than newer studies (9.4 ± 4.3). CONCLUSION Although results of the CHEC assessment show a positive chronological development, the favorable cost-effectiveness of homeopathic treatments seen in a small number of high-quality studies is undercut by too many examples of methodologically poor research.
7
Catchii: Empowering literature review screening in healthcare. Res Synth Methods 2024; 15:157-165. [PMID: 37771210 DOI: 10.1002/jrsm.1675] [Received: 03/22/2023] [Revised: 08/17/2023] [Accepted: 09/18/2023] [Indexed: 09/30/2023]
Abstract
A systematic review is a type of literature review that aims to collect and analyse all available evidence from the literature on a particular topic. The process of screening and identifying eligible articles from the vast amounts of literature is a time-consuming task. Specialised software has been developed to aid in the screening process and save significant time and labour. However, the most suitable software tools that are available often come with a cost or only offer either a limited or a trial version for free. In this paper, we report the release of a new software application, Catchii, which contains all the important features of a systematic review screening application while being completely free. It supports a user at different stages of screening, from detecting duplicates to creating the final flowchart for a publication. Catchii is designed to provide a good user experience and streamline the screening process through its clean and user-friendly interface on both computers and mobile devices. All in all, Catchii is a valuable addition to the current selection of systematic review screening applications. It enables researchers without financial resources to access features found in the best paid tools, while also diminishing costs for those who have previously relied on paid applications. Catchii is available at https://catchii.org.
8
Update of the Xylella spp. host plant database - systematic literature search up to 30 June 2023. EFSA J 2023; 21:e8477. [PMID: 38107375 PMCID: PMC10722330 DOI: 10.2903/j.efsa.2023.8477] [Indexed: 12/19/2023] Open
Abstract
This scientific report provides an update of the Xylella spp. host plant database, aiming to provide information and scientific support to risk assessors, risk managers and researchers dealing with Xylella spp. Upon a mandate of the European Commission, EFSA created and regularly updates a database of host plant species of Xylella spp. The current mandate covers the period 2021-2026. This report is related to the ninth version of the database published in Zenodo in the EFSA Knowledge Junction community, covering literature published from 1 January 2023 up to 30 June 2023, and recent Europhyt outbreak notifications. Informative data have been extracted from 47 selected publications. Seven new host plants were identified and added to the database. These plant species were naturally infected by X. fastidiosa subsp. multiplex in France, Spain and the United States. No additional data were retrieved for X. taiwanensis, and no additional multilocus sequence types (STs) were identified worldwide. New information on the tolerant/resistant response of plant species to X. fastidiosa infection was added to the database. The Xylella spp. host plant species were listed in different categories based on the number and type of detection methods applied for each finding. The overall number of Xylella spp. host plants determined with at least two different detection methods or positive with one method (between sequencing and pure culture isolation) (category A) now reaches 439 plant species, 200 genera and 69 families. Such numbers rise to 696 plant species, 307 genera and 88 families if considered regardless of the detection methods applied (category E).
9
Development and Evaluation of a Natural Language Processing System for Curating a Trans-Thoracic Echocardiogram (TTE) Database. Bioengineering (Basel) 2023; 10:1307. [PMID: 38002431 PMCID: PMC10669818 DOI: 10.3390/bioengineering10111307] [Received: 09/26/2023] [Revised: 11/03/2023] [Accepted: 11/09/2023] [Indexed: 11/26/2023] Open
Abstract
BACKGROUND Although electronic health records (EHR) provide useful insights into disease patterns and patient treatment optimisation, their reliance on unstructured data presents a difficulty. Echocardiography reports, which provide extensive pathology information for cardiovascular patients, are particularly challenging to extract and analyse, because of their narrative structure. Although natural language processing (NLP) has been utilised successfully in a variety of medical fields, it is not commonly used in echocardiography analysis. OBJECTIVES To develop an NLP-based approach for extracting and categorising data from echocardiography reports by accurately converting continuous (e.g., LVOT VTI, AV VTI and TR Vmax) and discrete (e.g., regurgitation severity) outcomes in a semi-structured narrative format into a structured and categorised format, allowing for future research or clinical use. METHODS 135,062 Trans-Thoracic Echocardiogram (TTE) reports were derived from 146,967 baseline echocardiogram reports and split into three cohorts: Training and Validation (n = 1075), Test Dataset (n = 98) and Application Dataset (n = 133,889). The NLP system was developed and was iteratively refined using medical expert knowledge. The system was used to curate a moderate-fidelity database from extractions of 133,889 reports. A hold-out validation set of 98 reports was blindly annotated and extracted by two clinicians for comparison with the NLP extraction. Agreement, discrimination, accuracy and calibration of outcome measure extractions were evaluated. RESULTS Continuous outcomes including LVOT VTI, AV VTI and TR Vmax exhibited perfect inter-rater reliability using intra-class correlation scores (ICC = 1.00, p < 0.05) alongside high R² values, demonstrating an ideal alignment between the NLP system and clinicians.
A good level (ICC = 0.75-0.9, p < 0.05) of inter-rater reliability was observed for outcomes such as LVOT Diam, Lateral MAPSE, Peak E Velocity, Lateral E' Velocity, PV Vmax, Sinuses of Valsalva and Ascending Aorta diameters. Furthermore, the accuracy rate for discrete outcome measures was 91.38% in the confusion matrix analysis, indicating effective performance. CONCLUSIONS The NLP-based technique yielded good results when it came to extracting and categorising data from echocardiography reports. The system demonstrated a high degree of agreement and concordance with clinician extractions. This study contributes to the effective use of semi-structured data by providing a useful tool for converting semi-structured text to a structured echo report that can be used for data management. Additional validation and implementation in healthcare settings can improve data availability and support research and clinical decision-making.
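The abstract does not describe how the NLP system is implemented, so the following is only a simplified regular-expression sketch of the general task it names: pulling continuous measurements such as LVOT VTI, AV VTI and TR Vmax out of narrative text into a structured record. The sample report string, the `MEASUREMENTS` tuple and the function `extract_measurements` are all hypothetical and are not the authors' method.

```python
import re

# Measurement labels named in the abstract; the matching rules are assumptions.
MEASUREMENTS = ("LVOT VTI", "AV VTI", "TR Vmax")


def extract_measurements(report: str) -> dict[str, float]:
    """Return the first numeric value that follows each known label."""
    found = {}
    for name in MEASUREMENTS:
        # Label, optional ':' or '=', then an integer or decimal number.
        pattern = rf"(?<![A-Za-z]){re.escape(name)}\s*[:=]?\s*(\d+(?:\.\d+)?)"
        match = re.search(pattern, report, flags=re.IGNORECASE)
        if match:
            found[name] = float(match.group(1))
    return found


# Hypothetical report text, invented for illustration only.
report = "LVOT VTI: 21.5 cm. AV VTI 48 cm. TR Vmax = 2.6 m/s. Mild mitral regurgitation."
values = extract_measurements(report)
print(values)
```

A real system would also need unit handling, negation and the discrete outcomes (for example, regurgitation severity), which this sketch leaves out.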
10
Advice for improving the reproducibility of data extraction in meta-analysis. Res Synth Methods 2023; 14:911-915. [PMID: 37571802 DOI: 10.1002/jrsm.1663] [Received: 03/17/2023] [Revised: 07/26/2023] [Accepted: 07/27/2023] [Indexed: 08/13/2023]
Abstract
Extracting data from studies is the norm in meta-analyses, enabling researchers to generate effect sizes when raw data are otherwise not available. While there has been a general push for increased reproducibility in meta-analysis, the transparency and reproducibility of the data extraction phase are still lagging behind. Unfortunately, there is little guidance on how to make this process more transparent and shareable. To address this, we provide several steps to help increase the reproducibility of data extraction in meta-analysis. We also provide suggestions of R software that can further help with reproducible data policies: the shinyDigitise and juicr packages. Adopting the guiding principles listed here and using the appropriate software will provide a more transparent form of data extraction in meta-analyses.
11
Robotic Process Automation Based Data Extraction from Handwritten Medical Forms. Stud Health Technol Inform 2023; 309:68-72. [PMID: 37869808 DOI: 10.3233/shti230741] [Indexed: 10/24/2023]
Abstract
This paper proposes to create an RPA (robotic process automation) based software robot that can digitalize and extract data from handwritten medical forms. The RPA robot uses a taxonomy that is specific to the medical form and associates the extracted data with the taxonomy. This is accomplished using UiPath Studio to create the robot, Google Cloud Vision OCR (optical character recognition) to create the DOM (digital object model) file and the UiPath machine learning (ML) API to extract the data from the medical form. Because the medical form is in a non-standard format, a data extraction template had to be applied. After the extraction process, the data can be saved into databases or into spreadsheets.
12
Effectiveness of eHealth Smoking Cessation Interventions: Systematic Review and Meta-Analysis. J Med Internet Res 2023; 25:e45111. [PMID: 37505802 PMCID: PMC10422176 DOI: 10.2196/45111] [Received: 12/16/2022] [Revised: 04/13/2023] [Accepted: 04/24/2023] [Indexed: 07/29/2023] Open
Abstract
BACKGROUND Rapid advancements in eHealth and mobile health (mHealth) technologies have driven researchers to design and evaluate numerous technology-based interventions to promote smoking cessation. The evolving nature of cessation interventions emphasizes a strong need for knowledge synthesis. OBJECTIVE This systematic review and meta-analysis aimed to summarize recent evidence from randomized controlled trials regarding the effectiveness of eHealth-based smoking cessation interventions in promoting abstinence and assess nonabstinence outcome indicators, such as cigarette consumption and user satisfaction, via narrative synthesis. METHODS We searched for studies published in English between 2017 and June 30, 2022, in 4 databases: PubMed (including MEDLINE), PsycINFO, Embase, and Cochrane Library. Two independent reviewers performed study screening, data extraction, and quality assessment based on the GRADE (Grading of Recommendations, Assessment, Development, and Evaluations) framework. We pooled comparable studies based on the population, follow-up time, intervention, and control characteristics. Two researchers performed an independent meta-analysis on smoking abstinence using the Sidik-Jonkman random-effects model and log risk ratio (RR) as the effect measurement. For studies not included in the meta-analysis, the outcomes were narratively synthesized. RESULTS A total of 464 studies were identified through an initial database search after removing duplicates. Following screening and full-text assessments, we deemed 39 studies (n=37,341 participants) eligible for this review. Of these, 28 studies were shortlisted for meta-analysis. According to the meta-analysis, SMS or app text messaging can significantly increase both short-term (3 months) abstinence (log RR=0.50, 95% CI 0.25-0.75; I²=0.72%) and long-term (6 months) abstinence (log RR=0.77, 95% CI 0.49-1.04; I²=8.65%), relative to minimal cessation support.
The frequency of texting did not significantly influence treatment outcomes. mHealth apps may significantly increase abstinence in the short term (log RR=0.76, 95% CI 0.09-1.42; I²=88.02%) but not in the long term (log RR=0.15, 95% CI -0.18 to 0.48; I²=80.06%), in contrast to less intensive cessation support. In addition, personalized or interactive interventions showed a moderate increase in cessation for both the short term (log RR=0.62, 95% CI 0.30-0.94; I²=66.50%) and long term (log RR=0.28, 95% CI 0.04-0.53; I²=73.42%). In contrast, studies without any personalized or interactive features had no significant impact. Finally, the treatment effect was similar between trials that used biochemically verified or self-reported abstinence. Among studies reporting outcomes besides abstinence (n=20), a total of 11 studies reported significantly improved nonabstinence outcomes in cigarette consumption (3/14, 21%) or user satisfaction (8/19, 42%). CONCLUSIONS Our review of 39 randomized controlled trials found that recent eHealth interventions might promote smoking cessation, with mHealth being the dominant approach. Despite their success, the effectiveness of such interventions may diminish with time. The design of more personalized interventions could potentially benefit future studies. TRIAL REGISTRATION PROSPERO CRD42022347104; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=347104.
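The effects in this abstract are reported on the log risk ratio scale, which can be harder to read than a plain risk ratio. As a reading aid, and not something the abstract itself reports, a log RR and its confidence limits can be exponentiated to obtain the corresponding RR. The sketch below uses the short-term text messaging estimate; the helper name `log_rr_to_rr` is illustrative.

```python
import math


def log_rr_to_rr(log_rr: float, ci_low: float, ci_high: float) -> tuple[float, float, float]:
    """Back-transform a log risk ratio and its CI limits to the RR scale."""
    return math.exp(log_rr), math.exp(ci_low), math.exp(ci_high)


# Short-term abstinence with SMS or app text messaging, from the abstract:
# log RR = 0.50, 95% CI 0.25 to 0.75.
rr, lo, hi = log_rr_to_rr(0.50, 0.25, 0.75)
print(f"RR = {rr:.2f} (95% CI {lo:.2f} to {hi:.2f})")  # RR = 1.65 (95% CI 1.28 to 2.12)
```

On this scale a log RR of 0 corresponds to an RR of 1 (no effect), which is why the long-term mHealth app interval of -0.18 to 0.48 is described as not significant.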
13
Identifying Patient Populations in Texts Describing Drug Approvals Through Deep Learning-Based Information Extraction: Development of a Natural Language Processing Algorithm. JMIR Form Res 2023; 7:e44876. [PMID: 37347514 DOI: 10.2196/44876] [Received: 12/07/2022] [Revised: 03/30/2023] [Accepted: 04/17/2023] [Indexed: 06/23/2023] Open
Abstract
BACKGROUND New drug treatments are regularly approved, and it is challenging to remain up-to-date in this rapidly changing environment. Fast and accurate visualization is important to allow a global understanding of the drug market. Automation of this information extraction provides a helpful starting point for the subject matter expert, helps to mitigate human errors, and saves time. OBJECTIVE We aimed to semiautomate disease population extraction from the free text of oncology drug approval descriptions from the BioMedTracker database for 6 selected drug targets. More specifically, we intended to extract (1) line of therapy, (2) stage of cancer of the patient population described in the approval, and (3) the clinical trials that provide evidence for the approval. We aimed to use these results in downstream applications, aiding the searchability of relevant content against related drug project sources. METHODS We fine-tuned a state-of-the-art deep learning model, Bidirectional Encoder Representations from Transformers, for each of the 3 desired outputs. We independently applied rule-based text mining approaches. We compared the performances of deep learning and rule-based approaches and selected the best method, which was then applied to new entries. The results were manually curated by a subject matter expert and then used to train new models. RESULTS The training data set is currently small (433 entries) and will enlarge over time when new approval descriptions become available or if a choice is made to take another drug target into account. The deep learning models achieved 61% and 56% 5-fold cross-validated accuracies for line of therapy and stage of cancer, respectively, which were treated as classification tasks. Trial identification is treated as a named entity recognition task, and the 5-fold cross-validated F1-score is currently 87%. 
Although the scores of the classification tasks could seem low, the models comprise 5 classes each, and such scores are a marked improvement when compared to random classification. Moreover, we expect improved performance as the input data set grows, since deep learning models need to be trained on a large enough amount of data to be able to learn the task they are taught. The rule-based approach achieved 60% and 74% 5-fold cross-validated accuracies for line of therapy and stage of cancer, respectively. No attempt was made to define a rule-based approach for trial identification. CONCLUSIONS We developed a natural language processing algorithm that is currently assisting subject matter experts in disease population extraction, which supports health authority approvals. This algorithm achieves semiautomation, enabling subject matter experts to leverage the results for deeper analysis and to accelerate information retrieval in a crowded clinical environment such as oncology.
14
Abstract
This scientific report provides an update of the Xylella spp. host plant database, aiming to provide information and scientific support to risk assessors, risk managers and researchers dealing with Xylella spp. Upon a mandate of the European Commission, EFSA created and regularly updates a database of host plant species of Xylella spp. The current mandate covers the period 2021-2026. This report is related to the eighth version of the database published in Zenodo in the EFSA Knowledge Junction community, covering literature published from 1 July 2022 up to 31 December 2022, and recent Europhyt outbreak notifications. Informative data have been extracted from 21 selected publications. Twelve new host plants were identified and added to the database. Nine plant species were reported from Portugal and naturally infected by subsp. multiplex or unknown (i.e. not reported). Three plant species were successfully artificially infected by subsp. fastidiosa. No additional data were retrieved for X. taiwanensis, and no additional STs were identified worldwide. New information on the tolerant/resistant response of plant species to X. fastidiosa infection was added to the database. The overall number of Xylella spp. host plants determined with at least two different detection methods or positive with one method (between sequencing and pure culture isolation) now reaches 433 plant species, 197 genera and 68 families. Such numbers rise to 690 plant species, 306 genera and 88 families if considered regardless of the detection methods applied.
15
gtfs2net: Extraction of General Transit Feed Specification Data Sets to Abstract Networks and Their Analysis. BIG DATA 2023. [PMID: 37092983 DOI: 10.1089/big.2022.0269] [Indexed: 05/03/2023]
Abstract
Mass transportation networks of cities or regions are worth studying to understand which properties make for a better topology and transportation system. One way to do this is based on the spatial information of stations and routes. As we show, however, interesting findings can also be gained by studying the abstract network topologies of these systems. To obtain these abstract networks, we have developed a tool that can extract a network of connected stops from General Transit Feed Specification feeds. As we found during development, service providers do not follow the specification consistently, so as a postprocessing step we have introduced virtual stations into the abstract networks that group nearby stops together. We also analyze the effect of these new stations on the abstract map.
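The two ideas in this abstract, linking stops that follow each other on a trip and merging nearby stops into virtual stations, can be illustrated with a short sketch. This is not the gtfs2net implementation: the function names, the 50 m threshold and the tiny in-memory feed are assumptions made for illustration. Only the field names (`trip_id`, `stop_id`, `stop_sequence`, and latitude/longitude per stop) follow the GTFS `stop_times.txt` and `stops.txt` files.

```python
from collections import defaultdict
from math import asin, cos, radians, sin, sqrt


def build_edges(stop_times):
    """Connect stops that are consecutive within the same trip."""
    trips = defaultdict(list)
    for row in stop_times:
        trips[row["trip_id"]].append((int(row["stop_sequence"]), row["stop_id"]))
    edges = set()
    for visits in trips.values():
        visits.sort()
        for (_, a), (_, b) in zip(visits, visits[1:]):
            if a != b:
                edges.add((a, b))
    return edges


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    p1, p2 = radians(lat1), radians(lat2)
    a = sin((p2 - p1) / 2) ** 2 + cos(p1) * cos(p2) * sin(radians(lon2 - lon1) / 2) ** 2
    return 2 * 6371000.0 * asin(sqrt(a))


def virtual_stations(stops, max_dist_m):
    """Map each stop to a representative stop using union-find on close pairs."""
    parent = {s: s for s in stops}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    ids = sorted(stops)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if haversine_m(*stops[a], *stops[b]) <= max_dist_m:
                parent[find(b)] = find(a)
    return {s: find(s) for s in stops}


# Hypothetical feed: A and A2 are about 11 m apart, B is about 1.1 km away.
stops = {"A": (47.0, 19.0), "A2": (47.0001, 19.0), "B": (47.01, 19.0)}
stop_times = [
    {"trip_id": "t1", "stop_id": "A", "stop_sequence": "1"},
    {"trip_id": "t1", "stop_id": "B", "stop_sequence": "2"},
    {"trip_id": "t2", "stop_id": "A2", "stop_sequence": "1"},
    {"trip_id": "t2", "stop_id": "B", "stop_sequence": "2"},
]

edges = build_edges(stop_times)
station_of = virtual_stations(stops, max_dist_m=50)
merged = {(station_of[a], station_of[b]) for a, b in edges if station_of[a] != station_of[b]}
print(sorted(edges), sorted(merged))
```

In this toy feed the two raw edges collapse into one once A and A2 are treated as the same virtual station, which is the kind of change to the abstract map the authors say they analyze. The pairwise distance loop is quadratic in the number of stops, so a real feed would need a spatial index.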
|
16
|
A profile of the Grampian Data Safe Haven, a regional Scottish safe haven for health and population data research. Int J Popul Data Sci 2023; 4:1817. [PMID: 37671386 PMCID: PMC10476148 DOI: 10.23889/ijpds.v4i2.1817] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/18/2023] Open
Abstract
There has been a recent emphasis to establish and codify large-scale or national Trusted Research Environments (TREs) in the United Kingdom, with a view to limit smaller, local TREs. The basis for this argument is that it avoids duplication of infrastructure, information governance, privacy risks, monopolies and will promote innovation, particularly with commercial partners. However, the work around establishing TREs in the UK largely ignores the long-established local TRE landscape in Scotland, and the way in which local TREs can actually improve data quality, solve technical architecture challenges, promote information governance and risk minimisation, and encourage innovation and collaboration (both academic and commercial). This data centre profile focuses on the Grampian Data Safe Haven (DaSH), a secure, virtual healthcare data analysis and storage centre located in Aberdeen, Scotland. DaSH was co-established by the NHS Grampian Health Board and University of Aberdeen to allow for the secure processing and linking of health data for the Grampian and Scottish population when it is not practicable to obtain consent from individual patients. As an established trusted research environment now in its 10th operating year, DaSH technology ensures healthcare, social care data and other types of sensitive data, routinely collected and used without individual patient consent, are made accessible for both academic research and clinical service evaluation and improvements whilst protecting individuals' privacy at the local, national and international levels. DaSH has registered almost 600 projects and facilitated over 200 distinct research projects with data hosting, extraction, and novel linkages to completion. 
Ongoing innovation and collaboration between DaSH and the NHS Grampian Health Board continue to expand researcher access to new types of data and data linkages, introduce new technologies for advanced statistical research methods, and support interdisciplinary research using population health and social care data for research, clinical and commercial advancements, and real-world practitioner applications. The purpose of this paper is to present DaSH's data population, operating model, architecture and information technology, governance, legislation and management, privacy-by-design principles and data access, data linkage methods, data sources, noteworthy research outputs, and further developments in order to demonstrate the value of local TREs within the data management and access debate.
|
17
|
Update of the Xylella spp. host plant database - systematic literature search up to 30 June 2022. EFSA J 2023; 21:e07726. [PMID: 36628332 PMCID: PMC9827234 DOI: 10.2903/j.efsa.2023.7726] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/11/2023] Open
Abstract
This scientific report provides an update of the Xylella spp. host plant database, aiming to provide information and scientific support to risk assessors, risk managers and researchers dealing with Xylella spp. Upon a mandate of the European Commission, EFSA created and regularly updates a database of host plant species of Xylella spp. The current mandate covers the period 2021-2026. This report is related to the seventh version of the database published in Zenodo in the EFSA Knowledge Junction community, covering literature published from 1 January 2022 up to 30 June 2022, and recent Europhyt outbreak notifications. Informative data have been extracted from 30 selected publications. Fifteen new host plants were identified and added to the database. Those plant species were reported from Brazil, France, Italy, Portugal and Spain, and were infected by subsp. multiplex, pauca or an unknown (i.e. not reported) subspecies. No additional data were retrieved for X. taiwanensis. Two new STs (namely ST88 and ST89) belonging to subspecies multiplex were identified in host plants in natural conditions, and new information on the tolerant/resistant response of plant species to X. fastidiosa infection was added to the database. The overall number of Xylella spp. host plants determined with at least two different detection methods or positive with one method (either sequencing or pure culture isolation) now reaches 423 plant species, 194 genera and 68 families. These numbers rise to 679 plant species, 304 genera and 88 families if considered regardless of the detection methods applied.
|
18
|
A Hybrid Architecture (CO-CONNECT) to Facilitate Rapid Discovery and Access to Data Across the United Kingdom in Response to the COVID-19 Pandemic: Development Study. J Med Internet Res 2022; 24:e40035. [PMID: 36322788 PMCID: PMC9822177 DOI: 10.2196/40035] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/07/2022] [Revised: 10/12/2022] [Accepted: 11/01/2022] [Indexed: 11/05/2022] Open
Abstract
BACKGROUND COVID-19 data have been generated across the United Kingdom as a by-product of clinical care and public health provision, as well as numerous bespoke and repurposed research endeavors. Analysis of these data has underpinned the United Kingdom's response to the pandemic, and informed public health policies and clinical guidelines. However, these data are held by different organizations, and this fragmented landscape has presented challenges for public health agencies and researchers as they struggle to find relevant data to access and interrogate the data they need to inform the pandemic response at pace. OBJECTIVE We aimed to transform UK COVID-19 diagnostic data sets to be findable, accessible, interoperable, and reusable (FAIR). METHODS A federated infrastructure model (COVID - Curated and Open Analysis and Research Platform [CO-CONNECT]) was rapidly built to enable the automated and reproducible mapping of health data partners' pseudonymized data to the Observational Medical Outcomes Partnership Common Data Model without the need for any data to leave the data controllers' secure environments, and to support federated cohort discovery queries and meta-analysis. RESULTS A total of 56 data sets from 19 organizations are being connected to the federated network. The data include research cohorts and COVID-19 data collected through routine health care provision linked to longitudinal health care records and demographics. The infrastructure is live, supporting aggregate-level querying of data across the United Kingdom. CONCLUSIONS CO-CONNECT was developed by a multidisciplinary team. It enables rapid COVID-19 data discovery and instantaneous meta-analysis across data sources, and it is researching streamlined data extraction for use in a Trusted Research Environment for research and public health analysis. 
CO-CONNECT has the potential to make UK health data more interconnected and better able to answer national-level research questions while maintaining patient confidentiality and local governance procedures.
|
19
|
Construction of Cohorts of Similar Patients From Automatic Extraction of Medical Concepts: Phenotype Extraction Study. JMIR Med Inform 2022; 10:e42379. [PMID: 36534446 PMCID: PMC9808583 DOI: 10.2196/42379] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/01/2022] [Revised: 10/17/2022] [Accepted: 10/22/2022] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND Reliable and interpretable automatic extraction of clinical phenotypes from large electronic medical record databases remains a challenge, especially in a language other than English. OBJECTIVE We aimed to provide an automated end-to-end extraction of cohorts of similar patients from electronic health records for systemic diseases. METHODS Our multistep algorithm includes a named-entity recognition step, a multilabel classification using medical subject headings ontology, and the computation of patient similarity. A selection of cohorts of similar patients on a priori annotated phenotypes was performed. Six phenotypes were selected for their clinical significance: P1, osteoporosis; P2, nephritis in systemic lupus erythematosus; P3, interstitial lung disease in systemic sclerosis; P4, lung infection; P5, obstetric antiphospholipid syndrome; and P6, Takayasu arteritis. We used a training set of 151 clinical notes and an independent validation set of 256 clinical notes, with annotated phenotypes, both extracted from the Assistance Publique-Hôpitaux de Paris data warehouse. We evaluated the 3 patients closest to the index patient for each phenotype using precision-at-3, recall, and average precision. RESULTS For P1-P4, the precision-at-3 ranged from 0.85 (95% CI 0.75-0.95) to 0.99 (95% CI 0.98-1), the recall ranged from 0.53 (95% CI 0.50-0.55) to 0.83 (95% CI 0.81-0.84), and the average precision ranged from 0.58 (95% CI 0.54-0.62) to 0.88 (95% CI 0.85-0.90). P5-P6 phenotypes could not be analyzed due to the limited number of phenotypes. CONCLUSIONS Using a method close to clinical reasoning, we built a scalable and interpretable end-to-end algorithm for extracting cohorts of similar patients.
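For readers unfamiliar with the ranking metrics named in this abstract, the sketch below shows one common way to compute precision-at-k and average precision for a single index patient. It is not the authors' code: the function names and the ranked list are made up, and this variant of average precision divides by the number of relevant patients found in the list, which assumes every relevant patient appears somewhere in the ranking.

```python
def precision_at_k(relevant: list[bool], k: int) -> float:
    """Fraction of the k nearest patients that share the index phenotype."""
    return sum(relevant[:k]) / k


def average_precision(relevant: list[bool]) -> float:
    """Mean of precision-at-rank taken at each rank where a relevant patient occurs."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Hypothetical ranking: patients sorted by decreasing similarity to the index
# patient, flagged True when they carry the same annotated phenotype.
RANKED = [True, False, True, True, False]

P_AT_3 = precision_at_k(RANKED, 3)      # 2 of the 3 nearest are relevant
AVG_PREC = average_precision(RANKED)    # (1/1 + 2/3 + 3/4) / 3
```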
|
20
|
Abstract
This scientific report provides an update of the Xylella spp. host plant database, aiming to provide information and scientific support to risk assessors, risk managers and researchers dealing with Xylella spp. Upon a mandate of the European Commission, EFSA created and regularly updates a database of host plant species of Xylella spp. The current mandate covers the period 2021–2026. This report is related to the sixth version of the database published in Zenodo in the EFSA Knowledge Junction community, covering literature published from 1 July 2021 up to 31 December 2021, and recent Europhyt outbreak notifications. Informative data have been extracted from 29 selected publications. Eleven new host plants were identified and added to the database: six plant species naturally infected by subsp. multiplex of X. fastidiosa in the EU (France, Italy and Portugal) and five plant species artificially infected by different X. fastidiosa subspecies (multiplex, pauca, fastidiosa and sandyi). No additional data were retrieved for X. taiwanensis. New information on the tolerant/resistant response of plant species to X. fastidiosa infection was added, while no new STs have been identified worldwide compared to the previous update published in January 2022. The overall number of Xylella spp. host plants determined with at least two different detection methods or positive with one method (either sequencing or pure culture isolation) now reaches 412 plant species, 190 genera and 68 families. These numbers rise to 664 plant species, 299 genera and 88 families if considered regardless of the detection methods applied.
|
21
|
An Adverse Drug Reaction Database for Clinical Use - Potential of and Difficulties with the Summary of Product Characteristics. Stud Health Technol Inform 2022; 294:450-454. [PMID: 35612120 DOI: 10.3233/shti220499] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Adverse drug reactions (ADRs) for all drugs in Europe are described in the legally approved Summary of Product Characteristics (SmPC). An overview of all ADRs of the patients' drug list can help healthcare staff link patient symptoms to possible ADRs. We review the possibilities and challenges of extracting ADR information from SmPCs and present the development of our semi-automated procedure for extraction of ADRs from the tabulated section of the SmPCs to create a database, named Bikt, which is regularly updated and used at point of care in Sweden. The existence of five major table formats for ADRs used in the SmPCs required the development of different parsing scripts. Manual checks for correctness of all content have to be performed. The quality of extraction was investigated for all SmPCs by measuring precision, recall and F1 scores (i.e. the weighted harmonic mean of precision and recall) and compared with other methods published. We conclude that it is possible to semi-automatically extract ADR information from SmPCs. However, clear technical and content guidelines and standards for ADR tables and terms from drug registration authorities would lead to improved extraction and usability of ADR information at point of care.
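The extraction-quality measures mentioned in this abstract can be illustrated with a short sketch. It assumes that extracted and manually verified ADR terms are compared as plain sets of strings; the function name `prf1` and the example terms are hypothetical and are not taken from the Bikt database. F1 here is the harmonic mean in which precision and recall carry equal weight.

```python
def prf1(extracted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Precision, recall and F1 of extracted terms against a gold standard."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: terms parsed from one SmPC table vs. a manual check.
EXTRACTED = {"nausea", "headache", "rash", "dizziness"}
GOLD = {"nausea", "headache", "rash", "fatigue", "insomnia"}

PRECISION, RECALL, F1 = prf1(EXTRACTED, GOLD)  # 3 of 4 extracted, 3 of 5 gold
```

In practice, term matching would likely need normalisation (case, synonyms, coding to a terminology) before set comparison; that step is omitted here.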
|
22
|
Privacy-preserving local analysis of digital trace data: A proof-of-concept. PATTERNS (NEW YORK, N.Y.) 2022; 3:100444. [PMID: 35510190 PMCID: PMC9058917 DOI: 10.1016/j.patter.2022.100444] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/08/2021] [Revised: 11/22/2021] [Accepted: 01/13/2022] [Indexed: 11/16/2022]
Abstract
We present PORT, a software platform for local data extraction and analysis of digital trace data. While digital trace data hold huge potential for social-scientific discovery, their most useful parts have been unattainable for scientists because of privacy concerns and prohibitive access to application programming interfaces. Recently, a workflow was introduced allowing citizens to donate their digital traces to scientists. In this workflow, citizens’ digital traces are processed locally on their machines before providing informed consent to share a subset of the data with researchers. In this paper, we present the newly developed software PORT that implements the local processing part of this workflow, protecting privacy by shielding sensitive data from outside observers, including the researchers themselves. When using PORT, researchers can tailor the local processing procedure suitable to the data download package and research question. Thus, PORT enables a host of potential applications of social data science to hitherto unobtainable data.
Software that allows for privacy-preserving analysis of digital trace data. Participants can give true informed consent regarding data they share with researchers. The software is provided via open source. The software can be tailored toward different research questions or data sources.
Since the General Data Protection Regulation, individuals can request a copy of all the digital traces they leave behind, which are then provided in so-called data download packages (DDPs). This makes it theoretically possible for individuals to share these DDPs for research purposes. However, DDPs can contain very sensitive information, making individuals unwilling to share them with researchers. In addition, researchers are often interested in only a small part of the large amount of information that is found in the DDP. The software introduced in this paper overcomes this privacy issue that currently prevents the use of DDPs for scientific research. By doing so, the huge amount of digital traces that are left behind by individuals in many aspects of their lives are finally becoming available for research purposes, while the participants are involved in the sharing process and can provide true informed consent regarding the information that they share.
|
23
|
An "order of data acquisition" for digital forensic investigations. J Forensic Sci 2022; 67:1215-1220. [PMID: 34997585 DOI: 10.1111/1556-4029.14979] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2021] [Revised: 11/23/2021] [Accepted: 12/14/2021] [Indexed: 11/30/2022]
Abstract
Data acquisition is a fundamental stage of the digital forensic workflow, without which it may not be possible to conduct many criminal inquiries effectively. While any investigative team may want access to all digital data available, this is no longer an approach that is considered justifiable or proportionate in all cases. There is now an increasing narrative highlighting the invasiveness of digital data acquisition processes and their impact upon privacy, with calls to ensure greater scrutiny is placed upon their use. This work proposes the "Order of Data Acquisition," which defines 10 digital data acquisition methods that are available to practitioners as part of a forensic examination, derived from a review of existing literature and best practice acquisition approaches, and arranged by their "invasiveness." Each method is discussed with examples provided in order to clarify and formalize the process of determining a suitable acquisition method in every case while acknowledging privacy invasion concerns. Finally, conclusions are drawn.
|
24
|
Automated data extraction of electronic medical records: Validity of data mining to construct research databases for eligibility in gastroenterological clinical trials. Ups J Med Sci 2022; 127:8260. [PMID: 35173908 PMCID: PMC8809051 DOI: 10.48101/ujms.v127.8260] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 02/20/2021] [Revised: 10/03/2021] [Accepted: 12/07/2021] [Indexed: 11/21/2022] Open
Abstract
BACKGROUND Electronic medical records (EMRs) are adopted for storing patient-related healthcare information. Using data mining techniques, it is possible to make use of and derive benefit from this massive amount of data effectively. We aimed to evaluate the validity of data extracted by the Customized eXtraction Program (CXP). METHODS The CXP extracts and structures data in rapid standardised processes. The CXP was programmed to extract TNFα-naïve active ulcerative colitis (UC) patients from EMRs using defined International Classification of Disease-10 (ICD-10) codes. Extracted data were read in parallel with manual assessment of the EMR to compare with CXP-extracted data. RESULTS From the complete EMR set, 2,802 patients with code K51 (UC) were extracted. Then, CXP extracted 332 patients according to inclusion and exclusion criteria. Of these, 97.5% were correctly identified, resulting in a final set of 320 cases eligible for the study. When comparing CXP-extracted data against manually assessed EMRs, the recovery rate was 95.6-101.1% over the years with 96.1% weighted average sensitivity. CONCLUSION Utilisation of the CXP software can be considered an effective way to extract relevant EMR data without significant errors. Hence, by extracting from EMRs, CXP accurately identifies patients and has the capacity to facilitate research studies and clinical trials by finding patients with the requested code as well as funnelling down itemised individuals according to specified inclusion and exclusion criteria. Beyond this, medical procedures and laboratory data can rapidly be retrieved from the EMRs to create tailored databases of extracted material for immediate use in clinical trials.
|
25
|
Abstract
Following a request from the European Commission, EFSA was asked to create and regularly update a database of host plant species of Xylella spp. The mandate now covers the period 2021-2026 and EFSA is requested to release an update of the database twice per year. The aim of the database is to provide information and scientific support to risk assessors, risk managers and researchers dealing with Xylella spp. This report is related to the fifth version of the database published in Zenodo in the EFSA Knowledge Junction community, covering literature published from 1 January 2021 up to 30 June 2021, and recent Europhyt outbreak notifications. Informative data have been extracted from 41 selected publications. Nineteen new host plants were identified and added to the database since the previous update published in June 2021. Those plant species were reported as naturally infected by subsp. multiplex or an unknown (i.e. not reported in the publication) subspecies of X. fastidiosa in the EU (France, Spain and Portugal). No additional data were retrieved for X. taiwanensis. New information on the tolerant/resistant response of plant species to X. fastidiosa infection was added, while no new STs have been identified worldwide compared to the previous update published in May 2021. The overall number of Xylella spp. host plants determined with at least two different detection methods or positive with one method (either sequencing or pure culture isolation) now reaches 407 plant species, 185 genera and 68 families. These numbers rise to 655 plant species, 293 genera and 88 families if considered regardless of the detection method applied.
|
26
|
Blueprint for aligned data exchange for research and public health. J Am Med Inform Assoc 2021; 28:2702-2706. [PMID: 34613371 DOI: 10.1093/jamia/ocab210] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/29/2021] [Revised: 08/05/2021] [Accepted: 09/28/2021] [Indexed: 11/13/2022] Open
Abstract
Making EHR Data More Available for Research and Public Health (MedMorph) is a Centers for Disease Control and Prevention-led initiative developing and demonstrating a reference architecture (RA) and implementation, including Health Level Seven International Fast Healthcare Interoperability Resources (HL7 FHIR) implementation guides (IGs), describing how to leverage FHIR for aligned research and public health access to clinical data for automated data exchange. MedMorph engaged a technical expert panel of more than 100 members to model representative use cases, develop IGs (architectural and content), align with existing efforts in the FHIR community, and demonstrate the RA in research and public health uses. The RA IG documents common workflows needed to automatically send research data to Research Patient Data Repositories for multiple use cases. Sharing a common RA and canonical data model will improve data sharing for research and public health needs and generate evidence. MedMorph delivers a robust, reusable method to utilize data from electronic health records addressing multiple research and public health needs.
|
27
|
Hand-Object Interaction: From Human Demonstrations to Robot Manipulation. Front Robot AI 2021; 8:714023. [PMID: 34660702 PMCID: PMC8517111 DOI: 10.3389/frobt.2021.714023] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/24/2021] [Accepted: 09/14/2021] [Indexed: 11/13/2022] Open
Abstract
Human-object interaction is of great relevance for robots to operate in human environments. However, state-of-the-art robotic hands are far from replicating human skills. It is, therefore, essential to study how humans use their hands to develop similar robotic capabilities. This article presents a deep dive into hand-object interaction and human demonstrations, highlighting the main challenges in this research area and suggesting desirable future developments. To this end, the article presents a general definition of the hand-object interaction problem together with a concise review of each of the main subproblems involved, namely: sensing, perception, and learning. Furthermore, the article discusses the interplay between these subproblems and describes how their interaction in learning from demonstration contributes to the success of robot manipulation. In this way, the article provides a broad overview of the interdisciplinary approaches necessary for a robotic system to learn new manipulation skills by observing human behavior in the real world.
|
28
|
Sysrev: A FAIR Platform for Data Curation and Systematic Evidence Review. Front Artif Intell 2021; 4:685298. [PMID: 34423285 PMCID: PMC8374944 DOI: 10.3389/frai.2021.685298] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 03/24/2021] [Accepted: 07/13/2021] [Indexed: 11/16/2022] Open
Abstract
Well-curated datasets are essential to evidence-based decision making and to the integration of artificial intelligence with human reasoning across disciplines. However, many sources of data remain siloed, unstructured, and/or unavailable for complementary and secondary research. Sysrev was developed to address these issues. First, Sysrev was built to aid in systematic evidence reviews (SER), where digital documents are evaluated according to a well-defined process, and where Sysrev provides an easy-to-access, publicly available and free platform for collaborating in SER projects. Secondly, Sysrev addresses the issue of unstructured, siloed, and inaccessible data in the context of generalized data extraction, where human and machine learning algorithms are combined to extract insights and evidence for better decision making across disciplines. Sysrev uses FAIR - Findability, Accessibility, Interoperability, and Reuse of digital assets - as primary principles in design. Sysrev was developed primarily because of an observed need to reduce redundancy, reduce inefficient use of human time and increase the impact of evidence-based decision making. This publication is an introduction to Sysrev as a novel technology, with an overview of the features, motivations and use cases of the tool. Methods: Sysrev.com is a FAIR-motivated web platform for data curation and SER. Sysrev allows users to create data curation projects called "sysrevs" wherein users upload documents, define review tasks, recruit reviewers, perform review tasks, and automate review tasks. Conclusion: Sysrev is a web application designed to facilitate data curation and SERs. Thousands of publicly accessible Sysrev projects have been created, accommodating research in a wide variety of disciplines. Described use cases include data curation, managed reviews, and SERs.
|
29
|
Abstract
Following a request from the European Commission, EFSA was asked to create and regularly update a database of host plant species of Xylella spp. Complying with an extension of the previous mandate, which now covers the period 2021-2026, the current version of the Xylella spp. host plant database updates the previous release dated April 2020. Informative data have been extracted from 86 recent publications retrieved through an extensive literature search. This report is related to the fourth version of the database published in Zenodo in the EFSA Knowledge Junction community, covering articles selected from: a systematic literature review conducted up to 31 December 2020, Europhyt outbreak notifications up to 18 March 2021 and communications from research groups and national authorities. Forty-three new host plant species of X. fastidiosa, identified through the data extracted from the selected publications, have been added to the database. Those plant species were reported as naturally or artificially infected by subsp. fastidiosa, multiplex, pauca or unknown (i.e. not reported in the publication) subspecies of X. fastidiosa. New information on the tolerant/resistant response of plant species or varieties to X. fastidiosa infection is also reported. No additional data were retrieved for X. taiwanensis. This new version of the database includes no update on the number of Sequence Types (STs) identified so far, which remains unchanged. The overall number of Xylella spp. host plants determined with at least two different detection methods or positive with one method (either sequencing or pure culture isolation) now reaches 385 plant species, 179 genera and 67 families. These numbers rise to 638 plant species, 289 genera and 87 families if considered regardless of the detection method applied. The database will be issued twice per year, with the aim of providing information and scientific support to risk assessors, risk managers and researchers dealing with Xylella spp.
|
30
|
Automating Stroke Data Extraction From Free-Text Radiology Reports Using Natural Language Processing: Instrument Validation Study. JMIR Med Inform 2021; 9:e24381. [PMID: 33944791 PMCID: PMC8132979 DOI: 10.2196/24381] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/16/2020] [Revised: 11/10/2020] [Accepted: 04/16/2021] [Indexed: 01/06/2023] Open
Abstract
BACKGROUND Diagnostic neurovascular imaging data are important in stroke research, but obtaining these data typically requires laborious manual chart reviews. OBJECTIVE We aimed to determine the accuracy of a natural language processing (NLP) approach to extract information on the presence and location of vascular occlusions as well as other stroke-related attributes based on free-text reports. METHODS From the full reports of 1320 consecutive computed tomography (CT), CT angiography, and CT perfusion scans of the head and neck performed at a tertiary stroke center between October 2017 and January 2019, we manually extracted data on the presence of proximal large vessel occlusion (primary outcome), as well as distal vessel occlusion, ischemia, hemorrhage, Alberta stroke program early CT score (ASPECTS), and collateral status (secondary outcomes). Reports were randomly split into training (n=921) and validation (n=399) sets, and attributes were extracted using rule-based NLP. We reported the sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and the overall accuracy of the NLP approach relative to the manually extracted data. RESULTS The overall prevalence of large vessel occlusion was 12.2%. In the training sample, the NLP approach identified this attribute with an overall accuracy of 97.3% (95.5% sensitivity, 98.1% specificity, 84.1% PPV, and 99.4% NPV). In the validation set, the overall accuracy was 95.2% (90.0% sensitivity, 97.4% specificity, 76.3% PPV, and 98.5% NPV). The accuracy of identifying distal or basilar occlusion as well as hemorrhage was also high, but there were limitations in identifying cerebral ischemia, ASPECTS, and collateral status. CONCLUSIONS NLP may improve the efficiency of large-scale imaging data collection for stroke surveillance and research.
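The validation measures reported in this abstract all derive from a 2×2 confusion matrix of NLP output against manual extraction. The sketch below shows the standard definitions; the `Confusion` class is hypothetical and the counts are round illustrative numbers, not figures from the study.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts of NLP results compared with the manual reference standard."""
    tp: int  # attribute present in both NLP output and manual extraction
    fp: int  # flagged by NLP only
    fn: int  # found by manual extraction only
    tn: int  # absent in both

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    @property
    def ppv(self) -> float:
        return self.tp / (self.tp + self.fp)

    @property
    def npv(self) -> float:
        return self.tn / (self.tn + self.fn)

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)


# Illustrative counts only (200 hypothetical reports).
M = Confusion(tp=40, fp=10, fn=10, tn=140)
```

Because PPV and NPV depend on prevalence, a low-prevalence attribute can show a modest PPV even when sensitivity and specificity are both high, which is consistent with the pattern the abstract reports.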
|
31
|
Extracting data from graphs: A case-study on animal research with implications for meta-analyses. Res Synth Methods 2021; 12:701-710. [PMID: 33555134 DOI: 10.1002/jrsm.1481] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 10/19/2020] [Revised: 01/11/2021] [Accepted: 01/27/2021] [Indexed: 02/06/2023]
Abstract
Systematic reviews with meta-analyses are powerful tools that can answer research questions based on data from published studies. Ideally, all relevant data is directly available in the text or tables, but often it is only presented in graphs. In those cases, the data can be extracted from graphs, but this potentially introduces errors. Here, we investigate to what extent the extracted outcome and error values differ from the original data and whether these differences could affect the results of a meta-analysis. Six extractors extracted 36 outcome values and corresponding errors from 22 articles. Differences between extractors were compared using overall concordance correlation coefficients (OCCC); differences between the original and extracted data were compared using concordance correlation coefficients (CCC). To test the possible influence on meta-analyses, random-effects meta-analyses on mean difference comparing original and extracted data were performed. The OCCCs and CCCs were high for both outcome values and errors; CCCs were >0.99 for the outcome and >0.92 for errors. The meta-analyses showed that the overall effect on outcome was very small (median: 0.025, interquartile range: 0.016-0.046). Therefore, data extraction from graphs is a good method to harvest data if it is not provided in the text or tables, and the original authors cannot provide the data.
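As an illustration of the concordance correlation coefficient used in this abstract, the sketch below implements Lin's CCC for two paired series, assuming the common formulation with 1/n (population) moments. The function name and the values are made up, and the overall coefficient for several extractors (OCCC) is not covered.

```python
from statistics import fmean


def ccc(x: list[float], y: list[float]) -> float:
    """Lin's concordance correlation coefficient for paired measurements.

    Computes 2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2) with 1/n moments.
    """
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


# Hypothetical original values and values read off a graph with a constant
# offset of +1: the Pearson correlation is 1, but the CCC is penalised by the shift.
ORIGINAL = [1.0, 2.0, 3.0, 4.0]
EXTRACTED = [2.0, 3.0, 4.0, 5.0]

CCC_SHIFTED = ccc(ORIGINAL, EXTRACTED)   # 2*1.25 / (1.25 + 1.25 + 1.0)
CCC_PERFECT = ccc(ORIGINAL, ORIGINAL)
```

Unlike a plain correlation, the CCC drops when extracted values are systematically shifted or rescaled, which is why it suits the question of whether graph-extracted data reproduce the originals.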
|
32
|
ADR databases for on-site clinical use: Potentials of summary of products characteristics. Basic Clin Pharmacol Toxicol 2021; 128:557-567. [PMID: 33523597 DOI: 10.1111/bcpt.13564] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2020] [Revised: 01/25/2021] [Accepted: 01/25/2021] [Indexed: 11/28/2022]
Abstract
Adverse drug reactions (ADRs) for all drugs in Europe are described in the legally approved Summary of Product Characteristics (SmPC). An overview of all ADRs of the patients' drug list can support healthcare staff to link patient symptoms to possible ADRs. We review the possibilities and challenges to extract ADR information from SmPCs or American Structured Product Labels and present the development of our semi-automated procedure for extraction of ADRs from the tabulated section in the SmPCs to create a database, named Bikt, which is regularly updated and used at point of care in Sweden. The existence of five major table formats for ADRs used in the SmPCs required the development of different parsing scripts. Manual checks for correctness for all content have to be performed. The quality of extraction was investigated for all SmPCs by measuring precision, recall and F1 scores and compared with other methods published. We conclude that it is possible to semi-automatically extract ADR information from SmPCs. However, clear technical and content guidelines and standards for ADR tables and terms from drug registration authorities would lead to improved extraction and usability of ADR information at point of care.
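The precision, recall and F1 scores mentioned in this abstract can be computed by comparing the set of extracted ADR terms with a manually checked reference set. This is a minimal sketch under that assumption; the function name and the example terms are hypothetical and are not taken from the Bikt database.

```python
def extraction_scores(extracted, reference):
    """Precision, recall and F1 of an extracted term set against a reference set."""
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        # F1 is the harmonic mean of precision and recall
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With an extracted set `{"nausea", "headache", "rash", "dizziness"}` and a reference set `{"nausea", "headache", "rash", "fatigue", "insomnia"}`, three terms agree, so precision is 0.75, recall is 0.60 and F1 is about 0.67.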
|
33
|
Instrumental Odour Monitoring System Classification Performance Optimization by Analysis of Different Pattern-Recognition and Feature Extraction Techniques. SENSORS 2020; 21:s21010114. [PMID: 33375421 PMCID: PMC7794822 DOI: 10.3390/s21010114] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 11/24/2020] [Revised: 12/14/2020] [Accepted: 12/24/2020] [Indexed: 11/17/2022]
Abstract
Instrumental odour monitoring systems (IOMS) are intelligent electronic sensing tools for which the primary application is the generation of odour metrics that are indicators of odour as perceived by human observers. The quality of the odour sensor signal, the mathematical treatment of the acquired data, and the validation of the correlation of the odour metric are key topics to control in order to ensure a robust and reliable measurement. The research presents and discusses the use of different pattern recognition and feature extraction techniques in the elaboration and effectiveness of the odour classification monitoring model (OCMM). The effects of the rise, intermediate, and peak periods of the original response curve, in combination with Linear Discriminant Analysis (LDA) and Artificial Neural Networks (ANN) as pattern recognition algorithms, were investigated. Laboratory analyses were performed with real odour samples collected in a complex industrial plant, using an advanced smart IOMS. The results demonstrate the influence of the choice of method on the quality of the OCMM produced. The peak period in combination with the ANN proved to be the best combination on the basis of high classification rates. The paper provides information to develop a solution to optimize the performance of IOMS.
|
34
|
Developing Criteria and Associated Instructions for Consistent and Useful Quality Improvement Study Data Extraction for Health Systems. J Gen Intern Med 2020; 35:802-807. [PMID: 32808207 PMCID: PMC7652974 DOI: 10.1007/s11606-020-06098-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/26/2019] [Revised: 03/13/2020] [Accepted: 07/30/2020] [Indexed: 10/23/2022]
Abstract
BACKGROUND The Agency for Healthcare Research and Quality (AHRQ) could devote resources to collate and assess quality improvement studies to support learning health systems (LHS) but there is no reliable data on the consistency of data extraction for important criteria. METHODS We identified quality improvement studies and evaluated the consistency of data extraction from two experienced independent reviewers at three time points: baseline, first revision (where explicit instructions for each criterion were created), and final revision (where the instructions were revised). Six investigators looked at the data extracted by the two systematic reviewers and determined the extent of similarity on a scale of 0 to 10 (where 0 represented no similarity and 10 perfect similarity). There were 42 assessments for baseline, 42 assessments for the first revision, and 42 assessments for the final revision. We asked two LHS participants to assess the relative value of our criteria. RESULTS The consistency of extraction improved from 1.17 ± 1.85 at baseline to 6.07 ± 2.76 after revision 1 (P < 0.001) and to 6.81 ± 1.94 out of 10 for the final revision (P < 0.001). However, the final revision was not significantly improved over the first revision (P = 0.14). One key informant rated the difficulty in finding and using quality improvement studies a 6 (moderately difficult) while the other a 4 (moderately difficult). When asked how valuable it would be if AHRQ found and collated the demographic information about the health systems and the interventions used in published quality improvement studies, they rated it a 9 (highly valuable) and a 6 (moderately valuable). CONCLUSION Creating explicit instructions for extracting data for quality improvement studies helps enhance the consistency of data extraction. This is important because it is difficult for LHS to vet these quality improvement studies on their own and they would value AHRQ's support in that regard.
|
35
|
Amazon Employees Resources Access Data Extraction via Clonal Selection Algorithm and Logic Mining Approach. ENTROPY 2020; 22:e22060596. [PMID: 33286368 PMCID: PMC7517133 DOI: 10.3390/e22060596] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 03/23/2020] [Revised: 04/16/2020] [Accepted: 04/20/2020] [Indexed: 11/16/2022]
Abstract
Amazon.com Inc. seeks alternative ways to improve its manual transaction system for granting employees resource access in the field of data science. The work constructs a modified Artificial Neural Network (ANN) by incorporating a Discrete Hopfield Neural Network (DHNN) and Clonal Selection Algorithm (CSA) with 3-Satisfiability (3-SAT) logic to initiate an Artificial Intelligence (AI) model that executes optimization tasks for industrial data. The selection of 3-SAT logic is vital in data mining to represent entries of Amazon Employees Resources Access (AERA) via information theory. The proposed model employs CSA to improve the learning phase of DHNN by capitalizing on features of CSA such as hypermutation and the cloning process. This results in the formation of the proposed model as an alternative machine learning model to identify factors that should be prioritized in the approval of employees' resource applications. Subsequently, a reverse analysis method (SATRA) is integrated into our proposed model to extract the relationship of AERA entries based on logical representation. The study is presented by implementing simulated, benchmark and AERA data sets with multiple performance evaluation metrics. Based on the findings, the proposed model outperformed the other existing methods in AERA data extraction.
|
36
|
Automated Motion Tracking and Data Extraction for Red Blood Cell Biomechanics. Curr Protoc Cytom 2020; 93:e75. [PMID: 32391975 DOI: 10.1002/cpcy.75] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/05/2022]
Abstract
Red blood cell biomechanics can provide us with a deeper understanding of macroscopic physiology and have the potential of being used for diagnostic purposes. In diseases like sickle cell anemia and malaria, reduced red blood cell deformability can be used as a biomarker, leading to further assays and diagnoses. A microfluidic system is useful for studying these biomechanical properties. We can observe detailed red blood cell mechanical behavior as they flow through microcapillaries using high-speed imaging and microscopy. Microfluidic devices are advantageous over traditional methods because they can serve as high-throughput tests. However, to rapidly analyze thousands of cells, there is a need for powerful image processing tools and software automation. We describe a workflow process using Image-Pro to identify and track red blood cells in a video, take measurements, and export the data for use in statistical analysis tools. The information in this protocol can be applied to large-scale blood studies where entire cell populations need to be analyzed from many cohorts of donors. © 2020 The Authors. Basic Protocol 1: Enhancing raw video for motion tracking Basic Protocol 2: Extracting motion tracking data from enhanced video.
|
37
|
Abstract
Following a request from the European Commission, EFSA was asked to create and regularly update a database of host plant species of Xylella spp. In 2018, EFSA released a new Xylella spp. host plant database, which has now been updated with informative data extracted from 76 recent publications retrieved through an extensive literature search. This report is related to the third version of the database published in Zenodo in the EFSA Knowledge Junction community, covering articles selected from: a systematic literature review conducted up to 30 June 2019; Europhyt database up to 15 October 2019; and relevant articles identified by EFSA Horizon scanning and personal communications from experts. Some data on Xylella fastidiosa strains and geographical coordinates included in the already published database were updated or modified with the purpose of increasing the accuracy and consistency of the database itself. Thirty-seven new host plant species of X. fastidiosa, identified through the data extracted from the selected publications, have been added to the database. Those plant species were reported as naturally infected, artificially infected or infected under unspecified conditions by subsp. multiplex, pauca or unknown (i.e. not reported in the publication) subspecies of X. fastidiosa. No additional data were retrieved for Xylella taiwanensis. Six new Sequence Types (STs) have been identified in Brazil, Italy and the USA. Information on the tolerant/resistant response of plant species or varieties to X. fastidiosa infection is also reported in the database. The overall number of Xylella spp. host plants now reaches 343 plant species, 163 genera and 64 families determined with two different detection methods, rising to 595 plant species, 275 genera and 85 families regardless of the detection method applied. The EFSA database on Xylella spp. host plants is updated regularly with the aim of providing information and scientific support to risk assessors, risk managers and researchers dealing with Xylella spp.
|
38
|
Advancing Evidence Synthesis from Effectiveness to Implementation: Integration of Implementation Measures into Evidence Reviews. J Gen Intern Med 2020; 35:1219-1226. [PMID: 31848862 PMCID: PMC7174479 DOI: 10.1007/s11606-019-05586-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 05/22/2019] [Revised: 10/07/2019] [Accepted: 11/26/2019] [Indexed: 10/25/2022]
Abstract
BACKGROUND In order to close the gap between discoveries that could improve health, and widespread impact on routine health care practice, there is a need for greater attention to the factors that influence dissemination and implementation of evidence-based practices. Evidence synthesis projects (e.g., systematic reviews) could contribute to this effort by collecting and synthesizing data relevant to dissemination and implementation. Such an advance would facilitate the spread of high-value, effective, and sustainable interventions. OBJECTIVE The objective of this paper is to evaluate the feasibility of extracting factors related to implementation during evidence synthesis in order to enhance the replicability of successes of studies of interventions in health care settings. DESIGN Drawing on the implementation science literature, we suggest 10 established implementation measures that should be considered when conducting evidence synthesis projects. We describe opportunities to assess these constructs in current literature and illustrate these methods through an example of a systematic review. SUBJECTS Twenty-nine studies of interventions aimed at improving clinician-patient communication in clinical settings. KEY RESULTS We identified acceptability, adoption, appropriateness, feasibility, fidelity, implementation cost, intervention complexity, penetration, reach, and sustainability as factors that are feasible and appropriate to extract during an evidence synthesis project. CONCLUSIONS To fully understand the potential value of a health care innovation, it is important to consider not only its effectiveness, but also the process, demands, and resource requirements involved in downstream implementation. While there is variation in the degree to which intervention studies currently report implementation factors, there is a growing demand for this information. Abstracting information about these factors may enhance the value of systematic reviews and other evidence synthesis efforts, improving the dissemination and adoption of interventions that are effective, feasible, and sustainable across different contexts.
|
39
|
Using Online Survey Software to Enhance Rigor and Efficiency of Knowledge Synthesis Reviews. West J Nurs Res 2020; 42:838-845. [PMID: 32129156 DOI: 10.1177/0193945920904442] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/16/2022]
Abstract
With the explosion of scientific literature, information technologies, and the rise of evidence-based health care, methodologies for literature reviews continue to advance. Yet there remains a lack of clarity about techniques to rigorously and efficiently extract and synthesize data from primary sources. We developed a new method for data extraction and synthesis for completing rigorous knowledge synthesis using freely available online survey software, which results in a review-specific online data extraction and synthesis tool. The purpose of this paper is to delineate this method using our published integrative review as an exemplar. Although the purpose of online survey software is to obtain and analyze survey responses, these software programs allow for the efficient extraction and synthesis of disparate study features from primary sources. Importantly, use of the method has the potential to increase the rigor and efficiency of published reviews, bringing the promise of advancing multiple areas of health science.
|
40
|
Healthcare Associated Infections: An Interoperable Infrastructure for Multidrug Resistant Organism Surveillance. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2020; 17:E465. [PMID: 31936787 PMCID: PMC7013448 DOI: 10.3390/ijerph17020465] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 11/19/2019] [Revised: 12/25/2019] [Accepted: 12/30/2019] [Indexed: 01/26/2023]
Abstract
Prevention and surveillance of healthcare associated infections caused by multidrug resistant organisms (MDROs) has been given increasing attention in recent years and is nowadays a major priority for health care systems. The creation of automated regional, national and international surveillance networks plays a key role in this respect. A surveillance system has been designed for the Abruzzo region in Italy, focusing on the monitoring of the MDROs prevalence in patients, on the appropriateness of antibiotic prescription in hospitalized patients and on foreseeable interactions with other networks at national and international level. The system has been designed according to the Service Oriented Architecture (SOA) principles, and Healthcare Service Specification (HSSP) standards and Clinical Document Architecture Release 2 (CDAR2) have been adopted. A description is given with special reference to implementation state, specific design and implementation choices and next foreseeable steps. The first release will be delivered at the Complex Operating Unit of Infectious Diseases of the Local Health Authority of Pescara (Italy).
|
41
|
Comparison of Word Embeddings for Extraction from Medical Records. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2019; 16:ijerph16224360. [PMID: 31717300 PMCID: PMC6888408 DOI: 10.3390/ijerph16224360] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 09/04/2019] [Revised: 10/31/2019] [Accepted: 11/04/2019] [Indexed: 11/24/2022]
Abstract
This paper is an extension of the work originally presented in the 16th International Conference on Wearable, Micro and Nano Technologies for Personalized Health. Despite using electronic medical records, free narrative text is still widely used for medical records. To make data from texts available for decision support systems, supervised machine learning algorithms might be successfully applied. In this work, we developed and compared a prototype of a medical data extraction system based on different artificial neural network architectures to process free medical texts in the Russian language. Three classifiers were applied to extract entities from snippets of text. Multi-layer perceptron (MLP) and convolutional neural network (CNN) classifiers showed similar results to all three embedding models. MLP exceeded convolutional network on pipelines that used the embedding model trained on medical records with preliminary lemmatization. Nevertheless, the highest F-score was achieved by CNN. CNN slightly exceeded MLP when the biggest word2vec model was applied (F-score 0.9763).
|
42
|
Exploring why patients with cancer consult GPs: a 1-year data extraction. BJGP Open 2019; 3:bjgpopen19X101663. [PMID: 31581120 PMCID: PMC6995854 DOI: 10.3399/bjgpopen19x101663] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 04/10/2019] [Accepted: 06/23/2019] [Indexed: 11/12/2022] Open
Abstract
BACKGROUND Survival rates of patients with cancer are increasing owing to improvements in diagnostics and therapies. The traditional hospital-based follow-up model faces challenges because of the consequent increasing workload, and it has been suggested that selected patients with cancer could be followed up by GPs. The hypothesis of the study was that, regardless of the hospital-based follow-up care, GPs see their patients with cancer both for cancer-related problems and for other reasons. Thus, a formalised follow-up by GPs would not mean too large a change in GPs' workloads. AIM To explore to what extent patients with cancer consult their GPs, and for what reasons. DESIGN & SETTING A 1-year explorative study was undertaken, based on data from 91 Norwegian GPs from 2016-2017. METHOD The data were electronically extracted from GPs' electronic medical records (EMR). RESULTS Data were collected from 91 GPs. There were 11 074 consultations in total, generated by 1932 patients with cancer. The mean consultation rate was higher among the patients with cancer compared with Norwegian patients in general. In one-third of the consultations, cancer was the main diagnosis. Apart from cancer, cardiovascular and musculoskeletal diagnoses were common. Patients with cancer who had multiple diagnoses or psychological diagnoses did not consult their GP significantly more often than patients with cancer without such comorbidity. CONCLUSION This study confirms that patients with cancer consult their GP more often than other patients, both for cancer-related reasons and for various comorbidities. A formalised follow-up by GPs would probably be feasible, and GPs should prepare for this responsibility.
|
43
|
The development and evaluation of an online application to assist in the extraction of data from graphs for use in systematic reviews. Wellcome Open Res 2019; 3:157. [PMID: 30809592 PMCID: PMC6372928 DOI: 10.12688/wellcomeopenres.14738.2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Accepted: 01/15/2019] [Indexed: 02/02/2023] Open
Abstract
Background: The extraction of data from the reports of primary studies, on which the results of systematic reviews depend, needs to be carried out accurately. To aid reliability, it is recommended that two researchers carry out data extraction independently. The extraction of statistical data from graphs in PDF files is particularly challenging, as the process is usually completely manual, and reviewers need sometimes to revert to holding a ruler against the page to read off values: an inherently time-consuming and error-prone process. Methods: To mitigate some of the above problems we integrated and customised two existing JavaScript libraries to create a new web-based graphical data extraction tool to assist reviewers in extracting data from graphs. This tool aims to facilitate more accurate and timely data extraction through a user interface which can be used to extract data through mouse clicks. We carried out a non-inferiority evaluation to examine its performance in comparison to standard practice. Results: We found that the customised graphical data extraction tool is not inferior to users' prior preferred current approaches. Our study was not designed to show superiority, but suggests that there may be a saving in time of around 6 minutes per graph, accompanied by a substantial increase in accuracy. Conclusions: Our study suggests that the incorporation of this type of tool in online systematic review software would be beneficial in facilitating the production of accurate and timely evidence synthesis to improve decision-making.
|
44
|
The development and evaluation of an online application to assist in the extraction of data from graphs for use in systematic reviews. Wellcome Open Res 2019; 3:157. [PMID: 30809592 PMCID: PMC6372928 DOI: 10.12688/wellcomeopenres.14738.3] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Accepted: 03/05/2019] [Indexed: 02/02/2023] Open
Abstract
Background: The extraction of data from the reports of primary studies, on which the results of systematic reviews depend, needs to be carried out accurately. To aid reliability, it is recommended that two researchers carry out data extraction independently. The extraction of statistical data from graphs in PDF files is particularly challenging, as the process is usually completely manual, and reviewers need sometimes to revert to holding a ruler against the page to read off values: an inherently time-consuming and error-prone process. Methods: To mitigate some of the above problems we integrated and customised two existing JavaScript libraries to create a new web-based graphical data extraction tool to assist reviewers in extracting data from graphs. This tool aims to facilitate more accurate and timely data extraction through a user interface which can be used to extract data through mouse clicks. We carried out a non-inferiority evaluation to examine its performance in comparison with participants' standard practice for extracting data from graphs in PDF documents. Results: We found that the customised graphical data extraction tool is not inferior to users' (N=10) prior standard practice. Our study was not designed to show superiority, but suggests that, on average, participants saved around 6 minutes per graph using the new tool, accompanied by a substantial increase in accuracy. Conclusions: Our study suggests that the incorporation of this type of tool in online systematic review software would be beneficial in facilitating the production of accurate and timely evidence synthesis to improve decision-making.
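Click-based graph extraction tools of this kind typically rely on axis calibration: the user clicks two points of known value on each axis, and later clicks are converted from pixel positions to data values by linear interpolation. The sketch below illustrates that general idea only; it assumes linear (not logarithmic) axes, uses hypothetical names, and is written in Python rather than the JavaScript libraries the authors integrated.

```python
def make_axis_mapper(pixel_a, value_a, pixel_b, value_b):
    """Build a pixel-to-value converter from two calibration clicks on one axis.

    Assumes a linear axis. Works for inverted pixel axes too (screen y
    coordinates usually grow downwards), because the sign of the scale
    follows from the two calibration points.
    """
    if pixel_a == pixel_b:
        raise ValueError("calibration points must be at different pixel positions")
    scale = (value_b - value_a) / (pixel_b - pixel_a)

    def to_value(pixel):
        return value_a + (pixel - pixel_a) * scale

    return to_value
```

For instance, if x pixel 100 is calibrated as 0 and x pixel 500 as 10, a click at x pixel 300 maps to 5; if y pixel 400 is calibrated as 0 and y pixel 100 as 30, a click at y pixel 250 maps to 15.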
|
45
|
The development and evaluation of an online application to assist in the extraction of data from graphs for use in systematic reviews. Wellcome Open Res 2019; 3:157. [PMID: 30809592 PMCID: PMC6372928 DOI: 10.12688/wellcomeopenres.14738.1] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Accepted: 11/13/2018] [Indexed: 02/02/2023] Open
Abstract
Background: The extraction of data from the reports of primary studies, on which the results of systematic reviews depend, needs to be carried out accurately. To aid reliability, it is recommended that two researchers carry out data extraction independently. The extraction of statistical data from graphs in PDF files is particularly challenging, as the process is usually completely manual, and reviewers need sometimes to revert to holding a ruler against the page to read off values: an inherently time-consuming and error-prone process. Methods: To mitigate some of the above problems we developed a new web-based graphical data extraction tool to assist reviewers in extracting data from graphs. This tool aims to facilitate more accurate and timely data extraction through a user interface which can be used to extract data through mouse clicks. We carried out a non-inferiority evaluation to examine its performance in comparison to standard practice. Results: We found that our new graphical data extraction tool is not inferior to users' prior preferred current approaches. Our study was not designed to show superiority, but suggests that there may be a saving in time of around 6 minutes per graph, accompanied by a substantial increase in accuracy. Conclusions: Our study suggests that the incorporation of this type of tool in online systematic review software would be beneficial in facilitating the production of accurate and timely evidence synthesis to improve decision-making.
|
46
|
Scalable Extraction of Big Macromolecular Data in Azure Data Lake Environment. Molecules 2019; 24:molecules24010179. [PMID: 30621295 PMCID: PMC6337464 DOI: 10.3390/molecules24010179] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 11/08/2018] [Revised: 12/29/2018] [Accepted: 01/01/2019] [Indexed: 11/16/2022] Open
Abstract
Calculation of structural features of proteins, nucleic acids, and nucleic acid-protein complexes on the basis of their geometries and studying various interactions within these macromolecules, for which high-resolution structures are stored in Protein Data Bank (PDB), require parsing and extraction of suitable data stored in text files. To perform these operations on large scale in the face of the growing amount of macromolecular data in public repositories, we propose to perform them in the distributed environment of Azure Data Lake and scale the calculations on the Cloud. In this paper, we present dedicated data extractors for PDB files that can be used in various types of calculations performed over protein and nucleic acids structures in the Azure Data Lake. Results of our tests show that the Cloud storage space occupied by the macromolecular data can be successfully reduced by using compression of PDB files without significant loss of data processing efficiency. Moreover, our experiments show that the performed calculations can be significantly accelerated when using large sequential files for storing macromolecular data and by parallelizing the calculations and data extractions that precede them. Finally, the paper shows how all the calculations can be performed in a declarative way in U-SQL scripts for Data Lake Analytics.
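To make the parsing step concrete, the sketch below reads one coordinate record from a PDB text file using the fixed column positions of the PDB format. It is an illustration in Python, not the U-SQL extractors described in the paper, and the function name `parse_atom_line` is an assumption.

```python
def parse_atom_line(line):
    """Parse one ATOM/HETATM record of a PDB file into a dict.

    PDB records are fixed-width, so fields are taken by column slice
    rather than by splitting on whitespace (adjacent fields may touch).
    Returns None for any other record type.
    """
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        return None
    return {
        "record": record,
        "serial": int(line[6:11]),         # columns 7-11
        "atom_name": line[12:16].strip(),  # columns 13-16
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain_id": line[21].strip(),      # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38, in angstroms
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
    }
```

Applying this function line by line and keeping the non-`None` results yields the atom coordinates needed for geometric calculations; column slicing is chosen over `str.split` because large coordinate values can leave no whitespace between neighbouring fields.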
|
47
|
Extraction from Medical Records. Stud Health Technol Inform 2019; 261:62-67. [PMID: 31156092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Despite the use of electronic medical records, free narrative text is still widely used for medical records. Such text cannot be analyzed by statistical tools or processed by decision support systems. To make data from texts available for such tasks, supervised machine learning algorithms might be successfully applied. In this work, we develop and compare a prototype of a medical data extraction system based on different artificial neural network architectures to process free medical texts in the Russian language. The best F-score (0.9763) was achieved with a combination of a CNN prediction model and a large pre-trained word2vec model. A very close result (0.9741) was shown by the MLP model with the same embedding.
|
48
|
Abstract
Following a request from the European Commission, EFSA periodically updates the database on the host plants of Xylella spp. While previous editions of the database (2015 and 2016) dealt with the species Xylella fastidiosa only, this database version addresses the whole genus Xylella, including therefore both species X. fastidiosa and Xylella taiwanensis. The database now includes information on host plants of Xylella spp. retrieved from scientific literature up to November 2017 and from EUROPHYT notifications up to May 2018. An extensive literature search was performed to screen the scientific and technical literature published between the previous database update conducted in December 2015 and December 2017. The literature screening was supported by the DistillerSR software platform. The applied protocol for the extensive literature review and extensive information search, together with examples of data extraction, are described in detail in this report. This report also includes published information on resistance or tolerance of plant varieties to Xylella spp. The current database includes 563 plant species reported to be infected by X. fastidiosa, of which for 312 plant species the infection has been determined with at least two different detection methods. These species cover hundreds of host plant genera in 82 botanical families (61 botanical families when considering only records with at least two different detection methods). The update of this database of host plants of Xylella spp. reported world-wide provides a key tool for risk management, risk assessment and research on this polyphagous bacterial plant pathogen.
|
49
|
Supporting Prescriptions with Synonym Matching of Section Names in Prospectuses. Stud Health Technol Inform 2018; 251:153-156. [PMID: 29968625] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
The field of medicine still reports errors caused by insufficient knowledge or resources, workload, or data not being available at the right time and place, and such errors may be fatal for a patient. To improve healthcare quality, a doctor needs accurate and complex information processing when accessing drug information. Our work aims to improve access to drug information for better treatment through homogenization of the sections in a prospectus. The section names in a prospectus may differ from one source to another, and in this article we propose a method to homogenize the content of all drug prospectuses. Once a correct homogenization of the sections has been established, the prospectuses can be used in clinical decision applications to provide the necessary data for physicians. Classification of the section names uses the cosine similarity method and the Scikit-learn machine learning software. The best results were obtained with the Scikit-learn software.
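The cosine-similarity matching mentioned in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the character-trigram representation, the `CANONICAL` section list, and the function names are all hypothetical choices made to keep the example self-contained.

```python
import math
from collections import Counter

# Hypothetical canonical section names; the abstract does not list the real ones.
CANONICAL = ["Contraindications", "Adverse reactions", "Dosage and administration"]

def trigrams(text: str) -> Counter:
    """Bag of character trigrams of the lowercased, space-padded text."""
    padded = f"  {text.lower().strip()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def match_section(name: str) -> str:
    """Return the canonical section name most similar to the given heading."""
    vec = trigrams(name)
    return max(CANONICAL, key=lambda c: cosine(vec, trigrams(c)))
```

Under these assumptions, a variant heading such as "Contra-indications" would be mapped to the closest canonical name, which is the kind of homogenization the abstract describes.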
|
50
|
Estimating data from figures with a Web-based program: Considerations for a systematic review. Res Synth Methods 2017; 8:258-262. [PMID: 28268241 DOI: 10.1002/jrsm.1232] [Citation(s) in RCA: 103] [Impact Index Per Article: 14.7] [Received: 08/04/2016] [Revised: 10/10/2016] [Accepted: 11/25/2016] [Indexed: 11/08/2022]
Abstract
BACKGROUND Systematic reviewers often encounter incomplete or missing data, and the information desired may be difficult to obtain from a study author. Thus, systematic reviewers may have to resort to estimating data from figures with little or no raw data in a study's corresponding text or tables. METHODS We discuss a case study in which participants used a publicly available Web-based program, called webplotdigitizer, to estimate data from 2 figures. We evaluated the intraclass coefficient and the accuracy of the estimates relative to the true data, and used both to inform considerations when using estimated data from figures in systematic reviews. RESULTS The estimates for both figures were consistent, although the distribution of estimates in the figure of a continuous outcome was slightly higher. For the continuous outcome, the percent difference ranged from 0.23% to 30.35% while the percent difference of the event rate ranged from 0.22% to 8.92%. For both figures, the intraclass coefficient was excellent (>0.95). CONCLUSIONS Systematic reviewers should consider and be transparent when estimating data from figures when the information cannot be obtained from study authors and perform sensitivity analyses of pooled results to reduce bias.
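The percent differences quoted above compare a value read off a figure with the true value. The abstract does not give the exact formula, so the sketch below assumes the common definition, absolute error relative to the true value; the function name `percent_difference` is hypothetical.

```python
def percent_difference(estimate: float, true_value: float) -> float:
    """Absolute difference between estimate and true value, as a percentage
    of the true value.

    Assumption: the study's exact formula is not stated in the abstract;
    this is the conventional relative-error definition.
    """
    if true_value == 0:
        raise ValueError("true_value must be non-zero")
    return abs(estimate - true_value) / abs(true_value) * 100.0
```

For example, an estimate of 10.5 against a true value of 10.0 gives a percent difference of about 5%.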
|