1
Hurvitz N, Ilan Y. The Constrained-Disorder Principle Assists in Overcoming Significant Challenges in Digital Health: Moving from "Nice to Have" to Mandatory Systems. Clin Pract 2023; 13:994-1014. [PMID: 37623270 PMCID: PMC10453547 DOI: 10.3390/clinpract13040089] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 07/19/2023] [Revised: 08/16/2023] [Accepted: 08/18/2023] [Indexed: 08/26/2023] Open
Abstract
The success of artificial intelligence depends on whether it can penetrate the boundaries of evidence-based medicine, the lack of policies, and the resistance of medical professionals to its use. The failure of digital health to meet expectations requires rethinking some of the challenges faced. We discuss some of the most significant challenges faced by patients, physicians, payers, pharmaceutical companies, and health systems in the digital world. The goal of healthcare systems is to improve outcomes. Assisting in diagnosing, collecting data, and simplifying processes is a "nice to have" tool, but it is not essential. Many of these systems have yet to be shown to improve outcomes. Current outcome-based expectations and economic constraints make "nice to have," "assists," and "ease processes" insufficient. Complex biological systems are defined by their inherent disorder, bounded by dynamic boundaries, as described by the constrained disorder principle (CDP). It provides a platform for correcting systems' malfunctions by regulating their degree of variability. A CDP-based second-generation artificial intelligence system provides solutions to some challenges digital health faces. Therapeutic interventions are held to improve outcomes with these systems. In addition to improving clinically meaningful endpoints, CDP-based second-generation algorithms ensure patient and physician engagement and reduce the health system's costs.
Affiliation(s)
- Yaron Ilan
- Hadassah Medical Center, Department of Medicine, Faculty of Medicine, Hebrew University, POB 1200, Jerusalem IL91120, Israel
2
Bennett C, Thornton M, Park C, Henry G, Zhang Y, Malladi V, Kim D. SeqWho: reliable, rapid determination of sequence file identity using k-mer frequencies in Random Forest classifiers. Bioinformatics 2022; 38:1830-1837. [PMID: 35134110 PMCID: PMC8963323 DOI: 10.1093/bioinformatics/btac050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2021] [Revised: 01/12/2022] [Accepted: 01/26/2022] [Indexed: 02/05/2023] Open
Abstract
MOTIVATION With the vast improvements in sequencing technologies and increased number of protocols, sequencing is being used to answer complex biological problems. Subsequently, analysis pipelines have become more time consuming and complicated, usually requiring highly extensive prevalidation steps. Here, we present SeqWho, a program designed to assess heuristically the quality of sequencing files and reliably classify the organism and protocol type by using Random Forest classifiers trained on biases native in k-mer frequencies and repeat sequence identities. RESULTS Using one of our primary models, we show that our method accurately and rapidly classifies human and mouse sequences from nine different sequencing libraries by species, library and both together, 98.32%, 97.86% and 96.38% of the time, respectively. Ultimately, we demonstrate that SeqWho is a powerful method for reliably validating the quality and identity of the sequencing files used in any pipeline. AVAILABILITY AND IMPLEMENTATION https://github.com/DaehwanKimLab/seqwho. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
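The k-mer frequency features mentioned in this abstract can be illustrated with a minimal sketch. This is not SeqWho's implementation: the function name `kmer_frequencies`, the default `k=2`, and the choice to ignore k-mers containing non-ACGT characters are all assumptions made for illustration. A vector like this could serve as one input row for a classifier such as a Random Forest.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=2):
    """Return normalized k-mer frequencies over the ACGT alphabet.

    Hypothetical helper for illustration only. The result has 4**k entries,
    ordered lexicographically (AA, AC, AG, ...). K-mers that contain any
    character outside ACGT (for example N) are ignored.
    """
    sequence = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        # Sequence too short or made only of ambiguous bases.
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


if __name__ == "__main__":
    features = kmer_frequencies("ACGTACGTNNACGT", k=2)
    print(len(features), round(sum(features), 6))
```

Normalizing by the number of valid k-mers keeps the features comparable across reads of different lengths, which is one plausible reason to prefer frequencies over raw counts.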
Affiliation(s)
- Christopher Bennett
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern, Dallas, TX 75390, USA
- Micah Thornton
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern, Dallas, TX 75390, USA
- Chanhee Park
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern, Dallas, TX 75390, USA
- Gervaise Henry
- Department of Urology, University of Texas Southwestern, Dallas, TX 75390, USA
- Yun Zhang
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern, Dallas, TX 75390, USA
- Venkat Malladi
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern, Dallas, TX 75390, USA
- Daehwan Kim
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern, Dallas, TX 75390, USA
3
Mitchell G, Zadoks RN, Skuce PJ. A Universal Approach to Molecular Identification of Rumen Fluke Species Across Hosts, Continents, and Sample Types. Front Vet Sci 2021; 7:605259. [PMID: 33748201 PMCID: PMC7969503 DOI: 10.3389/fvets.2020.605259] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 09/11/2020] [Accepted: 11/10/2020] [Indexed: 12/14/2022] Open
Abstract
Rumen fluke are parasitic trematodes that affect domestic and wild ruminants across a wide range of countries and habitats. There are 6 major genera of rumen fluke and over 70 recognized species. Accurate species identification is important to investigate the epidemiology, pathophysiology and economic impact of rumen fluke species but paramphistomes are morphologically plastic, which has resulted in numerous instances of misclassification. Here, we present a universal approach to molecular identification of rumen fluke species, including different life-cycle stages (eggs, juvenile and mature fluke) and sample preservation methods (fresh, ethanol- or formalin-fixed, and paraffin wax-embedded). Among 387 specimens from 173 animals belonging to 10 host species and originating from 14 countries on 5 continents, 10 rumen fluke species were identified based on ITS-2 intergenic spacer sequencing, including members of the genera Calicophoron, Cotylophoron, Fischeroedius, Gastrothylax, Orthocoelium, and Paramphistomum. Pairwise comparison of ITS-2 sequences from this study and GenBank showed >98.5% homology for 80% of intra-species comparisons and <98.5% homology for 97% of inter-species comparisons, suggesting that some sequence data may have been entered into public repositories with incorrect species attribution based on morphological analysis. We propose that ITS-2 sequencing could be used as a universal tool for rumen fluke identification across host and parasite species from diverse technical and geographical origins and form the basis of an international reference database for accurate species identification.
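The pairwise comparison against a 98.5% threshold described in this abstract can be sketched as follows. This is an illustrative sketch, not the authors' pipeline: it assumes the two ITS-2 sequences are already aligned to equal length with `-` as the gap character, and the names `percent_identity` and `same_species_call` are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity between two pre-aligned sequences of equal length.

    Assumptions for this sketch: '-' marks a gap, columns where both
    sequences have a gap are skipped, and a gap opposite a base counts
    as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" and y == "-":
            continue  # no information in a gap-gap column
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


def same_species_call(aligned_a, aligned_b, threshold=98.5):
    """Return True when identity exceeds the threshold (98.5% by default)."""
    return percent_identity(aligned_a, aligned_b) > threshold


if __name__ == "__main__":
    print(percent_identity("ACGT-ACGT", "ACGT-ACGA"))
```

In practice the alignment step itself (for example with a dedicated alignment tool) has a large effect on the identity value, so the threshold is only meaningful when all pairs are aligned the same way.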
Affiliation(s)
- Gillian Mitchell
- Moredun Research Institute, Pentlands Science Park, Penicuik, United Kingdom
- Ruth N. Zadoks
- Moredun Research Institute, Pentlands Science Park, Penicuik, United Kingdom
- Sydney School of Veterinary Science, Faculty of Science, University of Sydney, Camden, NSW, Australia
- Philip J. Skuce
- Moredun Research Institute, Pentlands Science Park, Penicuik, United Kingdom
4
Response score of deep learning for out-of-distribution sample detection of medical images. J Biomed Inform 2020; 107:103442. [PMID: 32450299 DOI: 10.1016/j.jbi.2020.103442] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 12/10/2019] [Revised: 05/02/2020] [Accepted: 05/05/2020] [Indexed: 02/07/2023]
Abstract
Deep learning Convolutional Neural Networks have achieved remarkable performance in a variety of classification tasks. The data-driven nature of deep learning indicates that a model behaves in response to the data used to train the model, and the quality of datasets may lead to substantial influence on the model's performance, especially when dealing with complicated clinical images. In this paper, we propose a simple and novel method to investigate and quantify a deep learning model's response with respect to a given sample, allowing us to detect out-of-distribution samples based on a newly proposed metric, Response Score. The key idea is that samples belonging to different classes may have different degrees of influence on a model. We quantify the resulting consequence of a single sample to a trained-model and relate the quantitative measure of the consequence (by the Response Score) to detect the out-of-distribution samples. The proposed method can find multiple applications such as (1) recognizing abnormal samples, (2) detecting mixed-domain data, and (3) identifying mislabeled data. We present extensive experiments on the three different applications using four biomedical imaging datasets. Experimental results show that our method exhibits remarkable performance and outperforms the compared methods.
5
Bouadjenek MR, Zobel J, Verspoor K. Automated assessment of biological database assertions using the scientific literature. BMC Bioinformatics 2019; 20:216. [PMID: 31035936 PMCID: PMC6489365 DOI: 10.1186/s12859-019-2801-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/14/2018] [Accepted: 04/09/2019] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND The large biological databases such as GenBank contain vast numbers of records, the content of which is substantively based on external resources, including published literature. Manual curation is used to establish whether the literature and the records are indeed consistent. We explore in this paper an automated method for assessing the consistency of biological assertions, to assist biocurators, which we call BARC, Biocuration tool for Assessment of Relation Consistency. In this method a biological assertion is represented as a relation between two objects (for example, a gene and a disease); we then use our novel set-based relevance algorithm SaBRA to retrieve pertinent literature, and apply a classifier to estimate the likelihood that this relation (assertion) is correct. RESULTS Our experiments on assessing gene-disease relations and protein-protein interactions using the PubMed Central collection show that BARC can be effective at assisting curators to perform data cleansing. Specifically, the results obtained showed that BARC substantially outperforms the best baselines, with an improvement of F-measure of 3.5% and 13%, respectively, on gene-disease relations and protein-protein interactions. We have additionally carried out a feature analysis that showed that all feature types are informative, as are all fields of the documents. CONCLUSIONS BARC provides a clear benefit for the biocuration community, as there are no prior automated tools for identifying inconsistent assertions in large-scale biological databases.
Affiliation(s)
- Mohamed Reda Bouadjenek
- Department of Mechanical & Industrial Engineering, University of Toronto, Toronto, M5S 3G8 Canada
- Justin Zobel
- School of Computing and Information Systems, University of Melbourne, Melbourne, 3010 Australia
- Karin Verspoor
- School of Computing and Information Systems, University of Melbourne, Melbourne, 3010 Australia
6
Jacob S, Wolff JJ, Steinbach MS, Doyle CB, Kumar V, Elison JT. Neurodevelopmental heterogeneity and computational approaches for understanding autism. Transl Psychiatry 2019; 9:63. [PMID: 30718453 PMCID: PMC6362076 DOI: 10.1038/s41398-019-0390-0] [Citation(s) in RCA: 44] [Impact Index Per Article: 8.8] [Received: 05/12/2018] [Revised: 10/31/2018] [Accepted: 12/09/2018] [Indexed: 12/17/2022] Open
Abstract
In recent years, the emerging field of computational psychiatry has impelled the use of machine learning models as a means to further understand the pathogenesis of multiple clinical disorders. In this paper, we discuss how autism spectrum disorder (ASD) was and continues to be diagnosed in the context of its complex neurodevelopmental heterogeneity. We review machine learning approaches to streamline ASD's diagnostic methods, to discern similarities and differences from comorbid diagnoses, and to follow developmentally variable outcomes. Both supervised machine learning models for classification outcome and unsupervised approaches to identify new dimensions and subgroups are discussed. We provide an illustrative example of how computational analytic methods and a longitudinal design can improve our inferential ability to detect early dysfunctional behaviors that may or may not reach threshold levels for formal diagnoses. Specifically, an unsupervised machine learning approach of anomaly detection is used to illustrate how community samples may be utilized to investigate early autism risk, multidimensional features, and outcome variables. Because ASD symptoms and challenges are not static within individuals across development, computational approaches present a promising method to elucidate subgroups of etiological contributions to phenotype, alternative developmental courses, interactions with biomedical comorbidities, and to predict potential responses to therapeutic interventions.
Affiliation(s)
- Suma Jacob
- Department of Psychiatry, University of Minnesota, Minneapolis, MN, 55414, USA
- Jason J Wolff
- Department of Educational Psychology, University of Minnesota, Minneapolis, MN, 55455, USA
- Michael S Steinbach
- Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, 55416, USA
- Colleen B Doyle
- Institute of Child Development, University of Minnesota, Minneapolis, MN, 55455, USA
- Vipan Kumar
- Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, 55416, USA
- Jed T Elison
- Institute of Child Development, University of Minnesota, Minneapolis, MN, 55455, USA
7
Bouadjenek MR, Verspoor K, Zobel J. Literature consistency of bioinformatics sequence databases is effective for assessing record quality. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2017; 2017:3074790. [PMID: 28365737 PMCID: PMC5467556 DOI: 10.1093/database/bax021] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 11/01/2016] [Accepted: 02/20/2017] [Indexed: 11/18/2022]
Abstract
Bioinformatics sequence databases such as GenBank or UniProt contain hundreds of millions of records of genomic data. These records are derived from direct submissions from individual laboratories, as well as from bulk submissions from large-scale sequencing centres; their diversity and scale mean that they suffer from a range of data quality issues including errors, discrepancies, redundancies, ambiguities, incompleteness and inconsistencies with the published literature. In this work, we seek to investigate and analyze the data quality of sequence databases from the perspective of a curator, who must detect anomalous and suspicious records. Specifically, we emphasize the detection of inconsistent records with respect to the literature. Focusing on GenBank, we propose a set of 24 quality indicators, which are based on treating a record as a query into the published literature, and then use query quality predictors. We then carry out an analysis that shows that the proposed quality indicators and the quality of the records have a mutual relationship, in which one depends on the other. We propose to represent record-literature consistency as a vector of these quality indicators. By reducing the dimensionality of this representation for visualization purposes using principal component analysis, we show that records which have been reported as inconsistent with the literature fall roughly in the same area, and therefore share similar characteristics. By manually analyzing records not previously known to be erroneous that fall in the same area as records known to be inconsistent, we show that one record out of four is inconsistent with respect to the literature. This high density of inconsistent records opens the way towards the development of automatic methods for the detection of faulty records. We conclude that literature inconsistency is a meaningful strategy for identifying suspicious records. Database URL: https://github.com/rbouadjenek/DQBioinformatics
8
Bouadjenek MR, Verspoor K. Multi-field query expansion is effective for biomedical dataset retrieval. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2017; 2017:4107606. [PMID: 29220457 PMCID: PMC5737205 DOI: 10.1093/database/bax062] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 03/20/2017] [Accepted: 07/31/2017] [Indexed: 01/01/2023]
Abstract
In the context of the bioCADDIE challenge addressing information retrieval of biomedical datasets, we propose a method for retrieval of biomedical data sets with heterogeneous schemas through query reformulation. In particular, the method proposed transforms the initial query into a multi-field query that is then enriched with terms that are likely to occur in the relevant datasets. We compare and evaluate two query expansion strategies, one based on the Rocchio method and another based on a biomedical lexicon. We then perform a comprehensive comparative evaluation of our method on the bioCADDIE dataset collection for biomedical retrieval. We demonstrate the effectiveness of our multi-field query method compared to two baselines, with MAP improved from 0.2171 and 0.2669 to 0.2996. We also show the benefits of query expansion, where the Rocchio expansion method improves the MAP for our two baselines from 0.2171 and 0.2669 to 0.335. We show that the Rocchio query expansion method slightly outperforms the one based on the biomedical lexicon as a source of terms, with an improvement of roughly 3% for MAP. However, the query expansion method based on the biomedical lexicon is much less resource intensive since it does not require computation of any relevance feedback set or any initial execution of the query. Hence, in terms of trade-off between efficiency, execution time and retrieval accuracy, we argue that the query expansion method based on the biomedical lexicon offers the best performance for a prototype biomedical data search engine intended to be used at a large scale. In the official bioCADDIE challenge results, although our approach is ranked seventh in terms of the infNDCG evaluation metric, it ranks second in terms of P@10 and NDCG. Hence, the method proposed here provides overall good retrieval performance in relation to the approaches of other competitors. Consequently, the observations made in this paper should benefit the development of a Data Discovery Index prototype or the improvement of the existing one.
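The Rocchio-style expansion named in this abstract can be illustrated with a small sketch. It follows the textbook Rocchio update using only positive (pseudo-relevant) feedback documents, and it is not the authors' multi-field implementation: the function name `rocchio_expand`, the weights `alpha=1.0` and `beta=0.75`, the raw term-frequency weighting, and the toy documents are all assumptions.

```python
from collections import Counter


def rocchio_expand(query_terms, feedback_docs, alpha=1.0, beta=0.75, top_n=3):
    """Expand a query with terms from pseudo-relevant feedback documents.

    Hypothetical sketch: each document is a list of tokens, term weights
    are raw term frequencies, and no negative feedback term is used.
    Returns the expanded term list and the score of every candidate term.
    """
    query_vec = Counter(query_terms)
    # Sum term frequencies over the feedback set; dividing by the number
    # of documents below turns this into the centroid vector.
    summed = Counter()
    for doc in feedback_docs:
        summed.update(doc)

    scores = {term: alpha * weight for term, weight in query_vec.items()}
    if feedback_docs:
        for term, weight in summed.items():
            scores[term] = scores.get(term, 0.0) + beta * weight / len(feedback_docs)

    # Keep the highest-scoring terms that are not already in the query;
    # ties are broken alphabetically so the result is deterministic.
    candidates = [t for t in scores if t not in query_vec]
    candidates.sort(key=lambda t: (-scores[t], t))
    return list(query_terms) + candidates[:top_n], scores


if __name__ == "__main__":
    docs = [["cancer", "tumor", "gene"], ["tumor", "rna"]]
    expanded, _ = rocchio_expand(["cancer"], docs, top_n=2)
    print(expanded)
```

The need to run the original query first to obtain `feedback_docs` is the extra cost the abstract refers to when it contrasts Rocchio expansion with the lexicon-based strategy.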
Affiliation(s)
- Mohamed Reda Bouadjenek
- School of Computing and Information Systems, The University of Melbourne, Parkville, VIC, 3010, Australia
- Karin Verspoor
- School of Computing and Information Systems, The University of Melbourne, Parkville, VIC, 3010, Australia