51. Khumrin P, Ryan A, Juddy T, Verspoor K. DrKnow: A Diagnostic Learning Tool with Feedback from Automated Clinical Decision Support. AMIA Annu Symp Proc 2018; 2018:1348-1357. [PMID: 30815179; PMCID: PMC6371235]
Abstract
Providing medical trainees with effective feedback is critical to the successful development of their diagnostic reasoning skills. We present the design of DrKnow, a web-based learning application that utilises a clinical decision support system (CDSS) and virtual cases to support the development of problem-solving and decision-making skills in medical students. Based on the clinical information they request and prioritise, DrKnow provides personalised feedback to help students develop differential and provisional diagnoses at key decision points as they work through the virtual cases. Once students make a final diagnosis, DrKnow presents them with information about their overall diagnostic performance as well as recommendations for diagnosing similar cases. This paper argues that designing DrKnow around a task-sensitive CDSS enables positive student learning outcomes while overcoming the resource challenges of expert clinician-supported bedside teaching.
52. Pang PCI, Chang S, Verspoor K, Clavisi O. The Use of Web-Based Technologies in Health Research Participation: Qualitative Study of Consumer and Researcher Experiences. J Med Internet Res 2018; 20:e12094. [PMID: 30377139; PMCID: PMC6234342; DOI: 10.2196/12094]
Abstract
BACKGROUND Health consumers are often recruited for involvement in health research, including randomized controlled trials, focus groups, interviews, and surveys. However, as many studies report, recruitment and engagement of consumers in academic research remains challenging. In addition, little literature describes what consumers look for and want to achieve by participating in research.

OBJECTIVE Understanding and responding to the needs of consumers is crucial to the success of health research projects. In this study, we aim to understand consumers' needs and investigate opportunities for addressing them with Web-based technologies, particularly Web-based research registers and social networking sites (SNSs).

METHODS We took a qualitative approach, interviewing both consumers and medical researchers. With the help of an Australian organization supporting people with musculoskeletal conditions, we interviewed 23 consumers and 10 researchers. All interviews were transcribed and analyzed using thematic analysis; data collection stopped once the themes reached saturation.

RESULTS We found that consumers perceive research as a learning opportunity and therefore expect high research transparency and regular updates. Before participating in any research, they also consider the sources of information about research projects, the trust between consumers and researchers, and their own mobility. Researchers need to be aware of these needs when designing recruitment campaigns for their studies. For their part, researchers have attempted to establish rapport with consumer participants, design research around consumers' needs, and use technologies to reach out to consumers. A systematic approach to integrating a variety of technologies is needed.

CONCLUSIONS On the basis of feedback from both consumers and researchers, we propose three future directions for using Web-based technologies to address consumers' needs and engage with consumers in health research: (1) researchers can make use of consumer registers and Web-based research portals, (2) SNSs and new media should be frequently used as an aid, and (3) new technologies should be adopted to collect data remotely and reduce the administrative work of obtaining consumers' consent.
53. Downie LE, Makrai E, Bonggotgetsakul Y, Dirito LJ, Kristo K, Pham MAN, You M, Verspoor K, Pianta MJ. Appraising the Quality of Systematic Reviews for Age-Related Macular Degeneration Interventions: A Systematic Review. JAMA Ophthalmol 2018; 136:1051-1061. [PMID: 29978192; DOI: 10.1001/jamaophthalmol.2018.2620]
Abstract
Importance Age-related macular degeneration (AMD) is a leading cause of vision impairment. It is imperative that AMD care is timely, appropriate, and evidence-based, and thus essential that AMD systematic reviews are robust; however, little is known about the quality of this literature.

Objectives To investigate the methodological quality of systematic reviews of AMD intervention studies and to evaluate their use for guiding evidence-based care.

Evidence Review This systematic review adhered to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement. All studies that self-identified as a systematic review in their title or abstract, or were categorized as a systematic review by a medical subject heading, and that investigated the safety, efficacy, and/or effectiveness of an AMD intervention were included. Comprehensive electronic searches were performed in Ovid MEDLINE, Embase, and the Cochrane Library from inception to March 2017. Two reviewers independently assessed titles and abstracts, then full texts, for eligibility. Quality was assessed using the Assessing the Methodological Quality of Systematic Reviews (AMSTAR) tool. Study characteristics (publication year, type of intervention, journal, citation rate, and funding source) were extracted.

Findings Of 983 citations retrieved, 71 studies (7.6%) were deemed eligible. The first systematic review relating to an AMD intervention was published in 2003; more than half were published since 2014. Methodological quality was highly variable. The mean (SD) AMSTAR score was 5.8 (3.2) of 11.0, with no significant improvement over time (r = -0.03; 95% CI, -0.26 to 0.21; P = .83). Cochrane systematic reviews were overall of higher quality than reviews in other journals (mean [SD] AMSTAR score, 9.9 [1.2], n = 15 vs 4.7 [2.2], n = 56; P < .001). Overall, there was poor adherence to referring to an a priori design (22 articles [31%]) and to reporting conflicts of interest in both the review and the included studies (16 articles [23%]). Reviews funded by government grants and/or institutions were generally of higher quality than industry-sponsored reviews or reviews for which the funding source was not reported.

Conclusions and Relevance There are gaps in the conduct of systematic reviews in the field of AMD. Stronger endorsement of the PRISMA statement by refereed journals may improve review quality and improve the dissemination of reliable evidence relating to AMD interventions to clinicians.
54. Hameed PN, Verspoor K, Kusljic S, Halgamuge S. A two-tiered unsupervised clustering approach for drug repositioning through heterogeneous data integration. BMC Bioinformatics 2018; 19:129. [PMID: 29642848; PMCID: PMC5896044; DOI: 10.1186/s12859-018-2123-4]
Abstract
Background Drug repositioning is the process of identifying new uses for existing drugs. Computational drug repositioning methods can reduce the time, costs, and risks of drug development by automating the analysis of the relationships in pharmacology networks. Pharmacology networks are large and heterogeneous; clustering drugs into small groups can simplify them, and these subgroups can also be used as a starting point for repositioning drugs. In this paper, we propose a two-tiered drug-centric unsupervised clustering approach for drug repositioning, integrating heterogeneous drug data profiles: drug-chemical, drug-disease, drug-gene, drug-protein and drug-side effect relationships.

Results The proposed drug repositioning approach is threefold: (i) clustering drugs based on their homogeneous profiles using the Growing Self Organizing Map (GSOM); (ii) clustering drugs based on drug-drug relation matrices derived from the previous step, considering three state-of-the-art graph clustering methods; and (iii) inferring drug repositioning candidates and assigning a confidence value to each identified candidate. We compare our two-tiered clustering approach against two existing heterogeneous data integration approaches, with reference to the Anatomical Therapeutic Chemical (ATC) classification, using GSOM. Our approach yields Normalized Mutual Information (NMI) and Standardized Mutual Information (SMI) of 0.66 and 36.11, respectively, while the two existing methods yield NMI of 0.60 and 0.64 and SMI of 22.26 and 33.59. Moreover, the two existing approaches failed to produce useful cluster separations when using graph clustering algorithms, whereas our approach identifies useful clusters for drug repositioning. Furthermore, we provide clinical evidence for four predicted results (Chlorthalidone, Indomethacin, Metformin and Thioridazine) to support that our proposed approach can be reliably used to infer ATC codes and drug repositioning candidates.

Conclusion The proposed two-tiered unsupervised clustering approach is suitable for drug clustering and enables heterogeneous data integration. It also identifies reliable repositioning candidates with reference to the ATC therapeutic classification. Repositioning candidates identified consistently by multiple clustering algorithms and with high confidence have a higher likelihood of being effective.
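The NMI score reported in this abstract measures agreement between a clustering and the reference ATC classification. A minimal, self-contained sketch of NMI (arithmetic-mean normalization, hand-rolled so the definition is explicit; the drug labels below are hypothetical stand-ins for ATC classes and GSOM cluster assignments):

```python
from collections import Counter
from math import log

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def mutual_info(u, v):
    n = len(u)
    pu, pv = Counter(u), Counter(v)
    return sum(c / n * log((c / n) / ((pu[a] / n) * (pv[b] / n)))
               for (a, b), c in Counter(zip(u, v)).items())

def nmi(u, v):
    """Normalized Mutual Information: MI over the mean of the entropies."""
    hu, hv = entropy(u), entropy(v)
    return 1.0 if hu == hv == 0 else mutual_info(u, v) / ((hu + hv) / 2)

# Hypothetical ATC classes vs. cluster assignments for eight drugs
atc = [0, 0, 1, 1, 2, 2, 2, 3]
clusters = [0, 0, 1, 1, 2, 2, 3, 3]
print(round(nmi(atc, clusters), 3))
```

Identical partitions score 1.0 and independent partitions score 0.0, so the paper's 0.66 vs 0.60/0.64 comparison is on a bounded, comparable scale.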
55. Panyam NC, Verspoor K, Cohn T, Ramamohanarao K. Exploiting graph kernels for high performance biomedical relation extraction. J Biomed Semantics 2018; 9:7. [PMID: 29382397; PMCID: PMC5791373; DOI: 10.1186/s13326-017-0168-3]
Abstract
Background Relation extraction from biomedical publications is an important task in semantic mining of text. Kernel methods for supervised relation extraction are often preferred over manual feature engineering when classifying highly ordered structures, such as the trees and graphs obtained from syntactic parsing of a sentence. Tree kernels such as the Subset Tree Kernel and the Partial Tree Kernel are effective for classifying constituency parse trees and basic dependency parse graphs, while graph kernels such as the All Path Graph (APG) kernel and the Approximate Subgraph Matching (ASM) kernel suit general graphs with cycles, such as the enhanced dependency parse graph of a sentence. In this work, we present a high-performance Chemical-Induced Disease (CID) relation extraction system. We present a comparative study of kernel methods for the CID task and extend the study to Protein-Protein Interaction (PPI) extraction, another important biomedical relation extraction task. We discuss novel modifications to the ASM kernel to boost its performance, and a method for applying graph kernels to relations expressed across multiple sentences.

Results Our CID relation extraction system attains an F-score of 60% without using external knowledge sources or task-specific heuristics or rules. In comparison, the state-of-the-art chemical-disease relation extraction system achieves an F-score of 56% using an ensemble of multiple machine learning methods, boosted to 61% with a rule-based system employing task-specific post-processing rules. For the CID task, graph kernels substantially outperform tree kernels; the best performance is obtained with the APG kernel (F-score 60%), followed by the ASM kernel (57%), although the difference between the two for sentence-level CID relation extraction is not significant. In our evaluation on the PPI task, ASM outperformed the APG kernel on the BioInfer dataset in the Area Under Curve (AUC) measure (74% vs 69%); however, on all the other PPI datasets (AIMed, HPRD50, IEPA and LLL), ASM is substantially outperformed by the APG kernel in both F-score and AUC.

Conclusions We demonstrate high-performance Chemical-Induced Disease relation extraction without external knowledge sources or task-specific heuristics, and show that graph kernels are effective for extracting relations expressed across multiple sentences. The graph kernels, namely the ASM and APG kernels, substantially outperform the tree kernels. Among the graph kernels, ASM is effective for biomedical relation extraction, with performance comparable to APG on sentence-level CID extraction and on BioInfer for PPI; overall, however, the APG kernel is significantly more accurate, achieving better performance on most datasets.
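The APG and ASM kernels themselves involve dependency-graph-specific machinery not reproduced here, but the underlying idea of a graph kernel can be illustrated with a simple label-preserving walk-counting kernel over the direct product of two labeled graphs (a basic random-walk kernel, not the authors' implementation; graphs, labels, and the decay parameter below are illustrative):

```python
import numpy as np

def product_walk_kernel(A1, labels1, A2, labels2, steps=3, lam=0.5):
    """Count label-preserving common walks of length <= steps via the
    direct product graph: a simple member of the graph-kernel family."""
    # Product nodes: pairs (i, j) whose node labels match
    pairs = [(i, j) for i in range(len(labels1))
                    for j in range(len(labels2)) if labels1[i] == labels2[j]]
    n = len(pairs)
    Ax = np.zeros((n, n))
    for a, (i, j) in enumerate(pairs):
        for b, (k, l) in enumerate(pairs):
            Ax[a, b] = A1[i][k] * A2[j][l]  # edge present in both graphs
    total, P = 0.0, np.eye(n)
    for s in range(1, steps + 1):
        P = P @ Ax                 # walks of length s in the product graph
        total += (lam ** s) * P.sum()
    return total
```

Similar sentence graphs share many label-preserving walks and score highly; graphs with disjoint label sets have an empty product graph and score zero, which is the similarity signal a kernel classifier exploits.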
56. Chen Q, Panyam NC, Elangovan A, Verspoor K. BioCreative VI Precision Medicine Track system performance is constrained by entity recognition and variations in corpus characteristics. Database (Oxford) 2018; 2018:bay122. [PMID: 30576491; PMCID: PMC6301335; DOI: 10.1093/database/bay122]
Abstract
Precision medicine aims to provide personalized treatments based on individual patient profiles. One critical step towards precision medicine is leveraging knowledge derived from biomedical publications, a tremendous literature resource presenting the latest scientific discoveries on genes, mutations and diseases. Biomedical natural language processing (BioNLP) plays a vital role in supporting automation of this process. BioCreative VI Track 4 brings community effort to the task of automatically identifying and extracting protein-protein interactions (PPIs) affected by mutations (PPIm), which are important in the precision medicine context for capturing individual genotype variation related to disease.

We present the READ-BioMed team's approach to identifying PPIm-related publications and to extracting specific PPIm information from those publications in the context of the BioCreative VI PPIm track. We observe that current BioNLP tools are insufficient for recognising the relevant entities: the best existing mutation recognition tool achieves only 55% recall on the document triage training set, and relation extraction performance is limited by the low recall of gene entity recognition. We develop our models accordingly. For document triage, we build term lists capturing interactions and mutations to complement BioNLP tools and select effective features via a feature contribution study; for relation extraction, we employ an ensemble of BioNLP tools.

Our best document triage model achieves an F-score of 66.77%, and our best relation extraction model an F-score of 35.09%, over the final (updated post-task) test set. Notably, the characteristics of mutations differ statistically between the training and test sets, which affected the document triage task. While identifying genetic variation of substantial biological significance is a vital new direction for biomedical text mining research, this early attempt highlights the importance of representative training data and the cascading impact of tool limitations in a modular system.
57. Bouadjenek MR, Verspoor K, Zobel J. Literature consistency of bioinformatics sequence databases is effective for assessing record quality. Database (Oxford) 2017; 2017:bax021. [PMID: 28365737; PMCID: PMC5467556; DOI: 10.1093/database/bax021]
Abstract
Bioinformatics sequence databases such as GenBank or UniProt contain hundreds of millions of records of genomic data. These records are derived from direct submissions by individual laboratories as well as bulk submissions from large-scale sequencing centres; their diversity and scale mean that they suffer from a range of data quality issues, including errors, discrepancies, redundancies, ambiguities, incompleteness and inconsistencies with the published literature. In this work, we investigate and analyze the data quality of sequence databases from the perspective of a curator, who must detect anomalous and suspicious records; specifically, we emphasize the detection of records that are inconsistent with the literature. Focusing on GenBank, we propose a set of 24 quality indicators, based on treating a record as a query into the published literature and applying query quality predictors. An analysis shows that the proposed quality indicators and the quality of the records are mutually related. We propose to represent record-literature consistency as a vector of these quality indicators. By reducing the dimensionality of this representation with principal component analysis for visualization, we show that records previously reported as inconsistent with the literature fall roughly in the same area, and therefore share similar characteristics. By manually analyzing records not previously known to be erroneous that fall in the same area as records known to be inconsistent, we find that one record in four is inconsistent with respect to the literature. This high density of inconsistent records opens the way towards the development of automatic methods for detecting faulty records. We conclude that literature inconsistency is a meaningful signal for identifying suspicious records. Database URL: https://github.com/rbouadjenek/DQBioinformatics
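The visualization step above projects each record's 24-dimensional quality-indicator vector onto two principal components. A minimal PCA sketch via NumPy's SVD (the indicator matrix below is synthetic; the paper's actual indicators are query-quality predictors):

```python
import numpy as np

def pca_2d(X):
    """Project rows of X onto the first two principal components."""
    Xc = X - X.mean(axis=0)                      # centre each indicator
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                         # scores on PC1 and PC2

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 24))   # 10 records x 24 synthetic quality indicators
Z = pca_2d(X)
print(Z.shape)
```

Records with similar indicator profiles land near each other in the 2-D plot, which is what lets known-inconsistent records cluster visibly in the paper's figures.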
58. Chen Q, Wan Y, Zhang X, Lei Y, Zobel J, Verspoor K. Comparative Analysis of Sequence Clustering Methods for Deduplication of Biological Databases. ACM J Data Inf Qual 2017. [DOI: 10.1145/3131611]
Abstract
The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale biological studies. However, the underlying data quality of these resources is a critical concern. A particular challenge is duplication, in which multiple records have similar sequences, creating a high level of redundancy that impacts database storage, curation, and search. Biological database deduplication has two direct applications: for database curation, where detected duplicates are removed to improve curation efficiency, and for database search, where detected duplicate sequences may be flagged but remain available to support analysis.
Clustering methods have been widely applied to biological sequences for database deduplication. Since an exhaustive all-by-all pairwise comparison of sequences cannot scale to high volumes of data, heuristic approaches have been adopted, such as the use of simple similarity thresholds. In this article, we present a comparison between CD-HIT and UCLUST, the two best-known clustering tools for sequence database deduplication. Our contributions include a detailed assessment of the redundancy remaining after deduplication, application of standard clustering evaluation metrics to quantify the cohesion and separation of the clusters generated by each method, and a biological case study that assesses intra-cluster function annotation consistency to demonstrate the impact of these factors on a practical application of the sequence clustering methods. Our results show that the trade-off between efficiency and accuracy becomes acute when low threshold values are used and when cluster sizes are large. This evaluation leads to practical recommendations for more effective use of sequence clustering tools for deduplication.
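Both CD-HIT and UCLUST follow a greedy incremental scheme: sort sequences (typically longest first), make the first unassigned sequence a cluster representative, and assign each subsequent sequence to the first representative it matches above the identity threshold. A toy sketch of that scheme, with a deliberately naive identity measure (the real tools use k-mer filtering and optimized alignment, not positional matching):

```python
def identity(a, b):
    """Naive identity: matching positions over the longer length."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_cluster(seqs, threshold=0.9):
    """CD-HIT/UCLUST-style greedy clustering: longest-first, first-fit."""
    reps, clusters = [], []
    for s in sorted(seqs, key=len, reverse=True):
        for rep, members in zip(reps, clusters):
            if identity(rep, s) >= threshold:
                members.append(s)   # join an existing cluster
                break
        else:
            reps.append(s)          # start a new cluster as representative
            clusters.append([s])
    return clusters
```

The threshold drives the trade-off the abstract describes: lower thresholds merge more sequences per cluster (cheaper storage, but looser clusters), which is exactly where cohesion and annotation consistency degrade.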
59. Cohen KB, Lanfranchi A, Choi MJY, Bada M, Baumgartner WA, Panteleyeva N, Verspoor K, Palmer M, Hunter LE. Coreference annotation and resolution in the Colorado Richly Annotated Full Text (CRAFT) corpus of biomedical journal articles. BMC Bioinformatics 2017; 18:372. [PMID: 28818042; PMCID: PMC5561560; DOI: 10.1186/s12859-017-1775-9]
Abstract
BACKGROUND Coreference resolution is the task of finding strings in text that have the same referent as other strings. Failures of coreference resolution are a common cause of false negatives in information extraction from the scientific literature. In order to better understand the nature of coreference in biomedical publications and to increase performance on the task, we annotated the Colorado Richly Annotated Full Text (CRAFT) corpus with coreference relations.

RESULTS The corpus was manually annotated with coreference relations, including identity and appositives, for all coreferring base noun phrases, using the OntoNotes annotation guidelines with minor adaptations. Interannotator agreement ranges from 0.480 (entity-based CEAF) to 0.858 (Class-B3), depending on the metric used to assess it. The resulting corpus adds nearly 30,000 annotations to the previous release of the CRAFT corpus. Differences from related projects include a much broader definition of markables, connection to extensive annotation of several domain-relevant semantic classes, and connection to complete syntactic annotation. Tool performance was benchmarked on the data: a publicly available, out-of-the-box, general-domain coreference resolution system achieved an F-measure of 0.14 (B3), a simple domain-adapted rule-based system achieved 0.42, and an ensemble of the two reached 0.46. Following the IDENTITY chains in the data would add 106,263 named entities in the full 97-paper corpus, an increase of 76% in the semantic classes of the eight ontologies annotated in earlier versions of the CRAFT corpus.

CONCLUSIONS The project produced a large data set for further investigation of coreference and coreference resolution in the scientific literature. The work raised issues in the phenomenon of reference in this domain and genre; we propose that many mentions that would be considered generic in the general domain are not generic in the biomedical domain, because their referents are specific classes in domain-specific ontologies. The comparison of a publicly available, well-understood coreference resolution system with a domain-adapted system produced results consistent with the notion that the requirements for successful coreference resolution in this genre differ substantially from those of the general domain, and suggests that the baseline performance gap is large.
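The B3 scores quoted above average, per mention, the overlap between the system chain and the gold chain containing that mention. A compact sketch of the metric (mentions as strings, chains as sets; restricted to mentions present in both partitions):

```python
def b_cubed(gold_chains, sys_chains):
    """B3 precision, recall and F over mentions shared by both partitions."""
    def chain_of(mention, chains):
        return next(c for c in chains if mention in c)
    mentions = set().union(*gold_chains) & set().union(*sys_chains)
    p = r = 0.0
    for m in mentions:
        g, s = chain_of(m, gold_chains), chain_of(m, sys_chains)
        overlap = len(g & s)
        p += overlap / len(s)   # how pure is the system chain for m?
        r += overlap / len(g)   # how much of the gold chain is recovered?
    p, r = p / len(mentions), r / len(mentions)
    return p, r, 2 * p * r / (p + r)
```

For example, merging the gold chains {a, b} and {c} into one system chain keeps recall at 1.0 but lowers B3 precision, which is the over-merging behaviour the metric is designed to penalize.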
60. Bouadjenek MR, Verspoor K, Zobel J. Automated detection of records in biological sequence databases that are inconsistent with the literature. J Biomed Inform 2017. [PMID: 28624643; DOI: 10.1016/j.jbi.2017.06.015]
Abstract
We investigate and analyse the data quality of nucleotide sequence databases, with the objective of automatically detecting data anomalies and suspicious records. Specifically, we demonstrate that the published literature associated with each data record can be used to automatically evaluate its quality, by cross-checking the consistency of the key content of the database record against the referenced publications. Focusing on GenBank, we describe a set of quality indicators based on the relevance paradigm of information retrieval (IR). We then use these quality indicators to train an anomaly detection algorithm that classifies records as "confident" or "suspicious". Our experiments on the PubMed Central collection show that assessing the coherence between the literature and database records with our algorithms is an effective mechanism for assisting curators in data cleansing. Although fewer than 0.25% of the records in our data set are known to be faulty, we expect that many more in GenBank have not yet been identified. Through automated comparison with the literature, such records can be identified with a precision of up to 10% and a recall of up to 30%, strongly outperforming several baselines. While these results leave substantial room for improvement, they reflect both the very imbalanced nature of the data and the limited explicitly labelled data available. Overall, the results show promise for a new kind of approach to detecting low-quality and suspicious sequence records based on literature analysis and consistency. From a practical point of view, this will greatly help curators to identify inconsistent records in large-scale sequence databases by highlighting records that are likely to be inconsistent with the literature.
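The pipeline shape here is: record → quality-indicator feature vector → anomaly detector → "confident"/"suspicious" label. A schematic stand-in using a simple z-score rule (the paper trains a proper anomaly detection algorithm on IR-based indicators; the threshold and synthetic columns below are assumptions for illustration only):

```python
import numpy as np

def flag_suspicious(X, z_threshold=3.0):
    """Flag records whose quality-indicator vector deviates strongly from
    the bulk of the data. One row per record, one column per indicator."""
    mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-12
    z = np.abs((X - mu) / sd)            # per-indicator z-scores
    return z.max(axis=1) > z_threshold   # True = "suspicious"
```

With heavily imbalanced data like the 0.25% faulty rate cited above, the evaluation focus on precision and recall (rather than accuracy) is essential: a detector that flags nothing would still be 99.75% "accurate".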
61. Hameed PN, Verspoor K, Kusljic S, Halgamuge S. Positive-Unlabeled Learning for inferring drug interactions based on heterogeneous attributes. BMC Bioinformatics 2017; 18:140. [PMID: 28249566; PMCID: PMC5333429; DOI: 10.1186/s12859-017-1546-7]
Abstract
BACKGROUND Investigating and understanding drug-drug interactions (DDIs), which can occur when two or more drugs are administered together, is important in improving the effectiveness of clinical care. Experimental DDI detection methods incur substantial cost and time; hence, there is great interest in developing efficient and useful computational methods for inferring potential DDIs. Standard binary classifiers require both positives and negatives for training. In a DDI context, drug pairs known to interact can serve as positives, but negatives (drug pairs confirmed to have no interaction) are scarce. To address this lack of negatives, we introduce a Positive-Unlabeled Learning method for inferring potential DDIs.

RESULTS The proposed method consists of three steps: (i) applying Growing Self Organizing Maps (GSOM) to infer negatives from the unlabeled dataset; (ii) using a pairwise similarity function to quantify the overlap between individual features of drugs; and (iii) using a support vector machine classifier to infer DDIs. We obtained 6036 DDIs from the DrugBank database. Using the proposed approach, we inferred 589 drug pairs that are likely not to interact with each other; these pairs are used as representative negative-class data in binary classification for DDI prediction. Moreover, given the particular importance of Cytochrome P450 (CYP) enzymes in clinically significant interaction effects, we classify the predicted DDIs as CYP-dependent or CYP-independent based on their locations on the GSOM. Further, we provide a case study of three predicted CYP-dependent DDIs to evaluate the clinical relevance of this work.

CONCLUSION Our approach showed an absolute improvement in F1-score of 14% and 38%, depending on the choice of similarity function, in comparison to a method that randomly selects unlabeled data points as likely negatives. We inferred 5300 possible CYP-dependent DDIs and 592 CYP-independent DDIs with the highest posterior probabilities. These discoveries can be used to improve clinical care as well as the research outcomes of drug development.
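The core PU-learning move described above is to mine "reliable negatives" from the unlabeled pool before training a standard binary classifier. A schematic sketch using a plain distance heuristic (the paper uses GSOM for this step, not the nearest-positive distance shown here; feature vectors and the quantile cutoff are illustrative):

```python
import numpy as np

def reliable_negatives(positives, unlabeled, quantile=0.75):
    """Select the unlabeled points farthest from every known positive as
    likely negatives (a distance heuristic standing in for the GSOM step)."""
    d = np.array([min(np.linalg.norm(u - p) for p in positives)
                  for u in unlabeled])
    return unlabeled[d >= np.quantile(d, quantile)]
```

The selected points then play the negative-class role for a conventional classifier such as an SVM, which is exactly how the paper's 589 inferred non-interacting pairs are used.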
62. Chen Q, Zobel J, Verspoor K. Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study. Database (Oxford) 2017; 2017:baw163. [PMID: 28077566; PMCID: PMC5225397; DOI: 10.1093/database/baw163]
Abstract
GenBank, the EMBL European Nucleotide Archive and the DNA DataBank of Japan, known collectively as the International Nucleotide Sequence Database Collaboration or INSDC, are the three most significant nucleotide sequence databases. Their records are derived from laboratory work undertaken by different individuals, by different teams, with a range of technologies and assumptions and over a period of decades. As a consequence, they contain a great many duplicates, redundancies and inconsistencies, but neither the prevalence nor the characteristics of various types of duplicates have been rigorously assessed. Existing duplicate detection methods in bioinformatics only address specific duplicate types, with inconsistent assumptions; and the impact of duplicates in bioinformatics databases has not been carefully assessed, making it difficult to judge the value of such methods. Our goal is to assess the scale, kinds and impact of duplicates in bioinformatics databases, through a retrospective analysis of merged groups in INSDC databases. Our outcomes are threefold: (1) We analyse a benchmark dataset consisting of duplicates manually identified in INSDC—a dataset of 67 888 merged groups with 111 823 duplicate pairs across 21 organisms from INSDC databases – in terms of the prevalence, types and impacts of duplicates. (2) We categorize duplicates at both sequence and annotation level, with supporting quantitative statistics, showing that different organisms have different prevalence of distinct kinds of duplicate. (3) We show that the presence of duplicates has practical impact via a simple case study on duplicates, in terms of GC content and melting temperature. We demonstrate that duplicates not only introduce redundancy, but can lead to inconsistent results for certain tasks. Our findings lead to a better understanding of the problem of duplication in biological databases. 
Database URL: the merged records are available at https://cloudstor.aarnet.edu.au/plus/index.php/s/Xef2fvsebBEAv9w
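The case study mentioned in this abstract compares duplicate records in terms of GC content and melting temperature. As a hedged illustration (not the authors' own code), both quantities can be computed directly from a sequence; the Wallace rule used here is a common rough approximation for short oligonucleotides:

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def wallace_tm(seq: str) -> float:
    """Wallace-rule melting temperature (degrees C), a rough
    approximation valid only for short oligos (~14-20 nt)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc

# Two near-duplicate records can still differ in both measures:
a = "ATGCATGCATGCAT"
b = "ATGCATGCATGAAT"  # single-base change C->A
print(gc_content(a), wallace_tm(a))  # 0.428..., 40
print(gc_content(b), wallace_tm(b))  # 0.357..., 38
```

This is the sense in which duplicates "lead to inconsistent results": two records treated as the same entity yield different derived values for downstream tasks.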
Collapse
|
63
|
Chen Q, Zobel J, Verspoor K. Benchmarks for measurement of duplicate detection methods in nucleotide databases. Database (Oxford) 2017; 2023:2870676. [PMID: 28334741 PMCID: PMC10755258 DOI: 10.1093/database/baw164] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/10/2016] [Revised: 11/17/2016] [Accepted: 11/21/2016] [Indexed: 01/01/2023]
Abstract
Duplication of information in databases is a major data quality challenge. The presence of duplicates, implying either redundancy or inconsistency, can have a range of impacts on the quality of analyses that use the data. To provide a sound basis for research on this issue in databases of nucleotide sequences, we have developed new, large-scale validated collections of duplicates, which can be used to test the effectiveness of duplicate detection methods. Previous collections were either designed primarily to test efficiency, or contained only a limited number of duplicates of limited kinds. To date, duplicate detection methods have been evaluated on separate, inconsistent benchmarks, leading to results that cannot be compared and, due to limitations of the benchmarks, of questionable generality. In this study, we present three nucleotide sequence database benchmarks, based on information drawn from a range of resources, including information derived from mapping to two data sections within the UniProt Knowledgebase (UniProtKB), UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. Each benchmark has distinct characteristics. We quantify these characteristics and argue for their complementary value in evaluation. The benchmarks collectively contain a vast number of validated biological duplicates; the largest has nearly half a billion duplicate pairs (although this is probably only a tiny fraction of the total that is present). They are also the first benchmarks targeting the primary nucleotide databases. The records include the 21 most heavily studied organisms in molecular biology research. Our quantitative analysis shows that duplicates in the different benchmarks, and in different organisms, have different characteristics. It is thus unreliable to evaluate duplicate detection methods against any single benchmark. 
For example, the benchmark derived from UniProtKB/Swiss-Prot mappings identifies more diverse types of duplicates, showing the importance of expert curation, but is limited to coding sequences. Overall, these benchmarks form a resource that we believe will be of great value for development and evaluation of the duplicate detection or record linkage methods that are required to help maintain these essential resources. Database URL: https://bitbucket.org/biodbqual/benchmarks
Collapse
|
64
|
Lu Y, Sinnott RO, Verspoor K. A Semantic-Based K-Anonymity Scheme for Health Record Linkage. Stud Health Technol Inform 2017; 239:84-90. [PMID: 28756441] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
Record linkage is a technique for integrating data from sources or providers where direct access to the data is not possible due to security and privacy considerations. This is a very common scenario for medical data, as patient privacy is a significant concern. To avoid privacy leakage, researchers have adopted k-anonymity to protect raw data from re-identification; however, they cannot avoid the associated information loss, e.g. due to generalisation. Given that individual-level data is often not disclosed in linkage cases, yet remains potentially re-discoverable, we propose semantic-based linkage k-anonymity to de-identify record linkage with fewer generalisations and to eliminate inference disclosure through semantic reasoning.
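The trade-off this abstract describes, k-anonymity achieved by generalisation at the cost of information loss, can be sketched with a toy example. The attribute names and bucket widths below are illustrative assumptions, not the paper's actual scheme:

```python
from collections import Counter

def generalise_age(age: int, width: int) -> str:
    """Replace an exact age with a range bucket of the given width."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def is_k_anonymous(records, k: int) -> bool:
    """True if every quasi-identifier combination occurs >= k times."""
    counts = Counter(records)
    return all(c >= k for c in counts.values())

ages = [23, 27, 25, 41, 44, 46]
raw = [(a, "F") for a in ages]                      # exact ages: all unique
coarse = [(generalise_age(a, 10), "F") for a in ages]
print(is_k_anonymous(raw, 2))      # False: each record is re-identifiable
print(is_k_anonymous(coarse, 2))   # True: buckets 20-29 and 40-49
```

Wider buckets give stronger anonymity but lose more information; the paper's semantic-reasoning approach aims to reach k-anonymity with fewer such generalisations.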
Collapse
|
65
|
Pitson G, Banks P, Cavedon L, Verspoor K. Developing a Manually Annotated Corpus of Clinical Letters for Breast Cancer Patients on Routine Follow-Up. Stud Health Technol Inform 2017; 235:196-200. [PMID: 28423782] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
This paper introduces the annotation schema and annotation process for a corpus of clinical letters describing the disease course and treatment of oestrogen receptor positive breast cancer patients, after completion of primary surgery and radiotherapy treatment. Concepts related to therapy, clinical signs, and recurrence, as well as relationships linking these, are identified and annotated in 200 letters. This corpus will provide the basis for development of natural language processing tools for automatic extraction of key clinical factors from such letters.
Collapse
|
66
|
Li X, Verspoor K, Gray K, Barnett S. Understanding Health Professionals' Informal Learning in Online Social Networks: A Cross-Sectional Survey. Stud Health Technol Inform 2017; 239:77-83. [PMID: 28756440] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
Online social networks (OSNs) enable health professionals to learn informally, for example by sharing medical knowledge, or discussing practice management challenges and clinical issues. Understanding how learning occurs in OSNs is necessary to better support this type of learning. Through a cross-sectional survey, this study found that learning interaction in OSNs is generally low, with only a small number of active users. Some health professionals actively used OSNs to support their practice, including sharing practical and experiential knowledge, benchmarking themselves, and keeping up to date on policy, advanced information and news in the field. These health professionals had an overall positive learning experience in OSNs.
Collapse
|
67
|
Li X, Gray K, Verspoor K, Barnett S. Understanding the Context of Learning in an Online Social Network for Health Professionals' Informal Learning. Stud Health Technol Inform 2017; 235:353-357. [PMID: 28423813] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
Online social networks (OSN) enable health professionals to learn informally, for example by sharing medical knowledge, or discussing practice management challenges and clinical issues. Understanding the learning context in OSN is necessary to get a complete picture of the learning process, in order to better support this type of learning. This study proposes critical contextual factors for understanding the learning context in OSN for health professionals, and demonstrates how these contextual factors can be used to analyse the learning context in a designated online learning environment for health professionals.
Collapse
|
68
|
Kiossoglou P, Borda A, Gray K, Martin-Sanchez F, Verspoor K, Lopez-Campos G. Characterising the Scope of Exposome Research: A Generalisable Approach. Stud Health Technol Inform 2017; 245:457-461. [PMID: 29295136] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
Scientific advancement and the development of new research fields bring uncertainties about what the current topics of research emphasis are and thus, what new knowledge might need to be represented. The exposome is an example of one such new field for which these uncertainties exist. The exposome is the analogue to the genome, from an environmental exposure perspective; research on the exposome has gained momentum only since 2011. In this work, we propose a generally applicable methodology that aims to characterise the landscape of a new research area based on linguistic analysis of its associated publications. Using abstracts of 261 exposome research articles, we illustrate a methodology that combines (1) inductive analysis based on word frequency counts, and term analysis to identify the topics, methods and applications of the new field and (2) deductive analysis using the NCBO Ontology Recommender to identify to what extent this new area is covered by current knowledge representation tools. Applying this method to the exposome literature, we uncover both the current focus of exposome research and the ontologies that are most relevant to the domain.
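The inductive step this abstract describes, word-frequency counts over publication abstracts, reduces to a simple counting pipeline. The tokeniser and stopword list below are illustrative simplifications, not the authors' actual method:

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "and", "to", "in", "a", "is", "on", "for"}

def top_terms(abstracts, n=5):
    """Tokenise abstracts, drop stopwords, return the n most frequent terms."""
    counts = Counter()
    for text in abstracts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)

# Toy stand-ins for the 261 exposome abstracts analysed in the paper:
abstracts = [
    "The exposome complements the genome in studies of environmental exposure.",
    "Measuring the exposome requires new sensors and exposure biomarkers.",
]
print(top_terms(abstracts, 3))
```

The resulting high-frequency terms would then feed the deductive step, e.g. as input to an ontology recommender such as the NCBO tool named in the abstract.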
Collapse
|
69
|
Khumrin P, Ryan A, Judd T, Verspoor K. Diagnostic Machine Learning Models for Acute Abdominal Pain: Towards an e-Learning Tool for Medical Students. Stud Health Technol Inform 2017; 245:447-451. [PMID: 29295134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
Computer-aided learning systems (e-learning systems) can help medical students gain more experience with diagnostic reasoning and decision making. Within this context, providing feedback that matches students' needs (i.e. personalised feedback) is both critical and challenging. In this paper, we describe the development of a machine learning model to support medical students' diagnostic decisions. Machine learning models were trained on 208 clinical cases presenting with abdominal pain, to predict five diagnoses. We assessed which of these models are likely to be most effective for use in an e-learning tool that allows students to interact with a virtual patient. The broader goal is to utilise these models to generate personalised feedback based on the specific patient information requested by students and their active diagnostic hypotheses.
Collapse
|
70
|
Bouadjenek MR, Verspoor K. Multi-field query expansion is effective for biomedical dataset retrieval. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2017; 2017:4107606. [PMID: 29220457 PMCID: PMC5737205 DOI: 10.1093/database/bax062] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/20/2017] [Accepted: 07/31/2017] [Indexed: 01/01/2023]
Abstract
In the context of the bioCADDIE challenge addressing information retrieval of biomedical datasets, we propose a method for retrieval of biomedical datasets with heterogeneous schemas through query reformulation. In particular, the proposed method transforms the initial query into a multi-field query that is then enriched with terms that are likely to occur in the relevant datasets. We compare and evaluate two query expansion strategies, one based on the Rocchio method and another based on a biomedical lexicon. We then perform a comprehensive comparative evaluation of our method on the bioCADDIE dataset collection for biomedical retrieval. We demonstrate the effectiveness of our multi-field query method compared to two baselines, with MAP improved from 0.2171 and 0.2669 to 0.2996. We also show the benefits of query expansion, where the Rocchio expansion method improves the MAP for our two baselines from 0.2171 and 0.2669 to 0.335. We show that the Rocchio query expansion method slightly outperforms the one based on the biomedical lexicon as a source of terms, with an improvement of roughly 3% for MAP. However, the query expansion method based on the biomedical lexicon is much less resource intensive, since it does not require computation of any relevance feedback set or any initial execution of the query. Hence, in terms of the trade-off between efficiency, execution time and retrieval accuracy, we argue that the query expansion method based on the biomedical lexicon offers the best performance for a prototype biomedical data search engine intended to be used at a large scale. In the official bioCADDIE challenge results, although our approach is ranked seventh in terms of the infNDCG evaluation metric, it ranks second in terms of P@10 and NDCG. Hence, the method proposed here provides overall good retrieval performance in relation to the approaches of other competitors.
Consequently, the observations made in this paper should benefit the development of a Data Discovery Index prototype or the improvement of the existing one.
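The Rocchio expansion strategy this abstract compares against the lexicon-based one reweights the query vector toward terms from a relevance feedback set. A minimal positive-feedback-only sketch (gamma = 0), with illustrative alpha/beta weights and toy term vectors rather than the paper's actual configuration:

```python
from collections import Counter

def rocchio_expand(query, feedback_docs, alpha=1.0, beta=0.75, top_k=3):
    """Expand a query with the highest-weighted terms from feedback docs
    (positive-feedback-only Rocchio, i.e. gamma = 0)."""
    # Start from the original query term weights, scaled by alpha.
    weights = Counter({t: alpha * w for t, w in Counter(query).items()})
    # Add the beta-scaled centroid of the feedback documents.
    for doc in feedback_docs:
        for term, tf in Counter(doc).items():
            weights[term] += beta * tf / len(feedback_docs)
    expansion = [t for t, _ in weights.most_common() if t not in query][:top_k]
    return query + expansion

q = ["breast", "cancer", "dataset"]
docs = [["gene", "expression", "breast", "cancer"],
        ["gene", "microarray", "expression"]]
print(rocchio_expand(q, docs))
```

This is also where the efficiency argument in the abstract comes from: Rocchio needs an initial retrieval run to build `feedback_docs`, whereas a lexicon-based expansion does not.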
Collapse
|
71
|
Verspoor K, Oellrich A, Collier N, Groza T, Rocca-Serra P, Soldatova L, Dumontier M, Shah N. Thematic issue of the Second combined Bio-ontologies and Phenotypes Workshop. J Biomed Semantics 2016; 7:66. [PMID: 27955708 PMCID: PMC5154111 DOI: 10.1186/s13326-016-0108-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/29/2016] [Accepted: 11/18/2016] [Indexed: 12/04/2022] Open
Abstract
This special issue covers selected papers from the 18th Bio-Ontologies Special Interest Group meeting and Phenotype Day, which took place at the Intelligent Systems for Molecular Biology (ISMB) conference in Dublin in 2015. The papers presented in this collection range from descriptions of software tools supporting ontology development and annotation of objects with ontology terms, to applications of text mining for structured relation extraction involving diseases and phenotypes, to detailed proposals for new ontologies and mapping of existing ontologies. Together, the papers consider a range of representational issues in bio-ontology development, and demonstrate the applicability of bio-ontologies to support biological and clinical knowledge-based decision making and analysis. The full set of papers in the Thematic Issue is available at http://www.biomedcentral.com/collections/sig.
Collapse
|
72
|
Sun Y, Hameed PN, Verspoor K, Halgamuge S. A physarum-inspired prize-collecting steiner tree approach to identify subnetworks for drug repositioning. BMC SYSTEMS BIOLOGY 2016; 10:128. [PMID: 28105946 PMCID: PMC5249043 DOI: 10.1186/s12918-016-0371-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/17/2022]
Abstract
Background Drug repositioning can reduce the time, costs and risks of drug development by identifying new therapeutic effects for known drugs. It is challenging to reposition drugs, as pharmacological data is large and complex. Subnetwork identification has already been used to simplify the visualization and interpretation of biological data, but it has not been applied to drug repositioning so far. In this paper, we fill this gap by proposing a new Physarum-inspired Prize-Collecting Steiner Tree algorithm to identify subnetworks for drug repositioning. Results Drug Similarity Networks (DSNs) are generated using the chemical, therapeutic, protein, and phenotype features of drugs. In DSNs, vertex prizes and edge costs represent the similarities and dissimilarities between drugs respectively, and terminals represent drugs in the cardiovascular class, as defined in the Anatomical Therapeutic Chemical classification system. We apply both the proposed algorithm and the widely used GW algorithm to identify subnetworks in our 18 generated DSNs. In these DSNs, our proposed algorithm identifies subnetworks with an average Rand Index of 81.1%, while the GW algorithm identifies subnetworks with an average Rand Index of only 64.1%. We select 9 subnetworks with high Rand Index to find drug repositioning opportunities, and identify 10 frequently occurring drugs in these subnetworks as candidates to be repositioned for cardiovascular diseases. Conclusions We find evidence to support previous discoveries that nitroglycerin, theophylline and acarbose may be repositioned for cardiovascular diseases. Moreover, we identify seven previously unknown drug candidates that may also interact with the biological cardiovascular system. These discoveries show that our proposed Prize-Collecting Steiner Tree approach is a promising strategy for drug repositioning.
Electronic supplementary material The online version of this article (doi:10.1186/s12918-016-0371-3) contains supplementary material, which is available to authorized users.
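The prize-collecting Steiner tree objective described in this abstract (vertex prizes for drug similarity, edge costs for dissimilarity, terminals fixed to one ATC class) can be made concrete with a brute-force sketch on a toy graph. This is not the paper's Physarum-inspired algorithm, nor the GW algorithm; it only illustrates the objective both of them optimise, on a graph small enough to enumerate:

```python
from itertools import combinations

def mst_cost(nodes, edges):
    """Kruskal MST over the subgraph induced by `nodes`.
    Returns total edge cost, or None if that subgraph is disconnected."""
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    cost, joined = 0, 1
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        if u in parent and v in parent:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                cost += w
                joined += 1
    return cost if joined == len(nodes) else None

def best_pcst(prizes, edges, terminals):
    """Maximise sum(prizes of nodes in tree) - sum(edge costs in tree),
    over node subsets containing all terminals, by brute-force enumeration."""
    steiner = [n for n in prizes if n not in terminals]
    best = (float("-inf"), None)
    for r in range(len(steiner) + 1):
        for extra in combinations(steiner, r):
            nodes = set(terminals) | set(extra)
            c = mst_cost(nodes, edges)
            if c is not None:
                score = sum(prizes[n] for n in nodes) - c
                best = max(best, (score, tuple(sorted(nodes))))
    return best

# Toy drug-similarity network: A and B are terminal drugs in the target
# class; C is a high-prize intermediate drug worth routing through.
prizes = {"A": 1, "B": 1, "C": 5, "D": 0}
edges = [("A", "B", 4), ("A", "C", 1), ("C", "B", 1), ("A", "D", 1)]
print(best_pcst(prizes, edges, ["A", "B"]))  # (5, ('A', 'B', 'C'))
```

Non-terminal drugs pulled into the optimal tree (here, C) play the role of the repositioning candidates the paper extracts from its subnetworks; enumeration is exponential, which is why practical methods use approximation algorithms instead.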
Collapse
|
73
|
Mizuno S, Ogishima S, Nishigori H, Jamieson DG, Verspoor K, Tanaka H, Yaegashi N, Nakaya J. The Pre-Eclampsia Ontology: A Disease Ontology Representing the Domain Knowledge Specific to Pre-Eclampsia. PLoS One 2016; 11:e0162828. [PMID: 27788142 PMCID: PMC5082890 DOI: 10.1371/journal.pone.0162828] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/05/2016] [Accepted: 08/29/2016] [Indexed: 11/24/2022] Open
Abstract
Pre-eclampsia (PE) is a clinical syndrome characterized by new-onset hypertension and proteinuria at ≥20 weeks of gestation, and is a leading cause of maternal and perinatal morbidity and mortality. Previous studies have gathered abundant data about PE such as risk factors and pathological findings. However, most of these data are not semantically structured. Clinical data on PE patients are often generated with semantic heterogeneity, such as using disparate terminology to describe the same phenomena. In clinical studies, interoperability of heterogeneous clinical data is required in various situations; it is therefore necessary to develop an interoperable and standardized semantic framework to research the pathology of PE more comprehensively and to achieve interoperability of heterogeneous clinical data of PE patients. In this study, we developed an ontology representing clinical features, treatments, genetic factors, environmental factors, and other aspects of the current knowledge in the domain of PE. We call this pre-eclampsia ontology “PEO”. To achieve interoperability with other ontologies, the core structure of PEO was compliant with the hierarchy of the Basic Formal Ontology (BFO). The PEO incorporates a wide range of key concepts and terms of PE from clinical and biomedical research in structuring the knowledge base that is specific to PE; therefore, PEO is expected to enhance PE-specific information retrieval and knowledge discovery in both clinical and biomedical research fields.
Collapse
|
74
|
Kocbek S, Cavedon L, Martinez D, Bain C, Manus CM, Haffari G, Zukerman I, Verspoor K. Text mining electronic hospital records to automatically classify admissions against disease: Measuring the impact of linking data sources. J Biomed Inform 2016; 64:158-167. [PMID: 27742349 DOI: 10.1016/j.jbi.2016.10.008] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2016] [Revised: 08/20/2016] [Accepted: 10/10/2016] [Indexed: 10/20/2022]
Abstract
OBJECTIVE Text and data mining play an important role in obtaining insights from Health and Hospital Information Systems. This paper presents a text mining system for detecting admissions marked as positive for several diseases: Lung Cancer, Breast Cancer, Colon Cancer, Secondary Malignant Neoplasm of Respiratory and Digestive Organs, Multiple Myeloma and Malignant Plasma Cell Neoplasms, Pneumonia, and Pulmonary Embolism. We specifically examine the effect of linking multiple data sources on text classification performance. METHODS Support Vector Machine classifiers are built for eight data source combinations, and evaluated using the metrics of Precision, Recall and F-Score. Sub-sampling techniques are used to address unbalanced datasets of medical records. We use radiology reports as an initial data source and add other sources, such as pathology reports and patient and hospital admission data, in order to assess the impact of linking multiple data sources. Statistical significance is measured using the Wilcoxon signed-rank test. A second set of experiments explores aspects of the system in greater depth, focusing on Lung Cancer. We explore the impact of feature selection; analyse the learning curve; examine the effect of restricting admissions to only those containing reports from all data sources; and examine the impact of reducing the sub-sampling. These experiments provide better understanding of how to best apply text classification in the context of imbalanced data of variable completeness. RESULTS Radiology reports plus patient and hospital admission data contribute valuable information for detecting most of the diseases, significantly improving performance when added to radiology reports alone or to the combination of radiology and pathology reports. CONCLUSION Overall, linking data sources significantly improved classification performance for all the diseases examined.
However, there is no single approach that suits all scenarios; the choice of the most effective combination of data sources depends on the specific disease to be classified.
Collapse
|
75
|
Chen Q, Zobel J, Zhang X, Verspoor K. Supervised Learning for Detection of Duplicates in Genomic Sequence Databases. PLoS One 2016; 11:e0159644. [PMID: 27489953 PMCID: PMC4973881 DOI: 10.1371/journal.pone.0159644] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2016] [Accepted: 07/06/2016] [Indexed: 01/30/2023] Open
Abstract
MOTIVATION First identified as an issue in 1996, duplication in biological databases introduces redundancy and even leads to inconsistency when contradictory information appears. The amount of data makes purely manual de-duplication impractical, and existing automatic systems cannot detect duplicates as precisely as experts can. Supervised learning has the potential to address such problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While machine learning is a mature approach in other duplicate detection contexts, it has seen only preliminary application in genomic sequence databases. RESULTS We developed and evaluated a supervised duplicate detection method based on an expert curated dataset of duplicates, containing over one million pairs across five organisms derived from genomic sequence databases. We selected 22 features to represent distinct attributes of the database records, and developed a binary model and a multi-class model. Both models achieve promising performance; under cross-validation, the binary model had over 90% accuracy in each of the five organisms, while the multi-class model maintains high accuracy and is more robust in generalisation. We performed an ablation study to quantify the impact of different sequence record features, finding that features derived from meta-data, sequence identity, and alignment quality impact performance most strongly. The study demonstrates that machine learning can be an effective additional tool for de-duplication of genomic sequence databases. All data are available as described in the supplementary material.
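A hedged sketch of the kind of record-pair features this abstract describes (metadata agreement plus approximate sequence identity). The field names, the difflib-based identity measure, and the threshold rule are illustrative assumptions standing in for the paper's 22 features and trained classifiers:

```python
from difflib import SequenceMatcher

def pair_features(rec_a, rec_b):
    """Features for one candidate duplicate pair: metadata-match flags
    plus an approximate sequence identity (difflib ratio in [0, 1])."""
    return {
        "same_organism": rec_a["organism"] == rec_b["organism"],
        "same_description": rec_a["description"] == rec_b["description"],
        "seq_identity": SequenceMatcher(
            None, rec_a["sequence"], rec_b["sequence"]).ratio(),
    }

def looks_duplicate(features, identity_threshold=0.9):
    """Toy decision rule standing in for the trained binary model."""
    return features["same_organism"] and features["seq_identity"] >= identity_threshold

a = {"organism": "E. coli", "description": "16S rRNA", "sequence": "ATGCATGCAT"}
b = {"organism": "E. coli", "description": "16S ribosomal RNA", "sequence": "ATGCATGCAT"}
print(looks_duplicate(pair_features(a, b)))  # True: same organism, identical sequence
```

In the supervised setting, feature vectors like these would be fed to a learned classifier rather than a hand-set threshold; the paper's ablation finding (metadata, sequence identity and alignment quality matter most) corresponds to exactly these feature families.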
Collapse
|