1
Xing X, Sun M, Guo Z, Zhao Y, Cai Y, Zhou P, Wang H, Gao W, Li P, Yang H. Functional annotation map of natural compounds in traditional Chinese medicines library: TCMs with myocardial protection as a case. Acta Pharm Sin B 2023; 13:3802-3816. [PMID: 37719385 PMCID: PMC10502289 DOI: 10.1016/j.apsb.2023.06.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/14/2023] [Revised: 05/14/2023] [Accepted: 05/31/2023] [Indexed: 09/19/2023] Open
Abstract
The chemical complexity of traditional Chinese medicines (TCMs) makes the active and functional annotation of natural compounds challenging. Herein, we developed the TCMs-Compounds Functional Annotation platform (TCMs-CFA) for large-scale prediction of active compounds and their potential mechanisms from complex TCM systems, without isolating and activity-testing every compound one by one. The platform was established by integrating a TCMs knowledge base, chemome profiling, and high-content imaging. It mainly included: (1) selection of target herbal drugs based on the TCMs knowledge base; (2) chemome profiling of the TCMs extract library by LC‒MS; (3) cytological profiling of the TCMs extract library by high-content cell-based imaging; (4) discovery of active compounds by combining each mass signal with multi-parametric cell phenotypes; (5) construction of a functional annotation map for predicting the potential mechanisms of lead compounds. In this study, TCMs with myocardial protection were used as a case study to validate the feasibility and utility of the platform. Seven frequently used herbal drugs (Ginseng, etc.) were screened from 100,000 TCMs formulas for myocardial protection and subsequently prepared as a library of 700 extracts. Using the TCMs-CFA platform, 81 lead compounds, including 10 novel bioactive ones, were quickly identified by correlating 8089 mass signals with 170,100 cytological parameters from the extract library. The TCMs-CFA platform provides a new evidence-led tool that uses data mining strategies to speed up the discovery of novel lead compounds from TCMs. All computations are done in Python and are publicly available on GitHub.
Affiliation(s)
- Xudong Xing
- State Key Laboratory of Natural Medicines, School of Traditional Chinese Pharmacy, China Pharmaceutical University, Nanjing 211198, China
- Mengru Sun
- State Key Laboratory of Natural Medicines, School of Traditional Chinese Pharmacy, China Pharmaceutical University, Nanjing 211198, China
- Zifan Guo
- State Key Laboratory of Natural Medicines, School of Traditional Chinese Pharmacy, China Pharmaceutical University, Nanjing 211198, China
- Yongjuan Zhao
- State Key Laboratory of Natural Medicines, School of Traditional Chinese Pharmacy, China Pharmaceutical University, Nanjing 211198, China
- Yuru Cai
- State Key Laboratory of Natural Medicines, School of Traditional Chinese Pharmacy, China Pharmaceutical University, Nanjing 211198, China
- Ping Zhou
- State Key Laboratory of Natural Medicines, School of Traditional Chinese Pharmacy, China Pharmaceutical University, Nanjing 211198, China
- Huiying Wang
- State Key Laboratory of Natural Medicines, School of Traditional Chinese Pharmacy, China Pharmaceutical University, Nanjing 211198, China
- Wen Gao
- State Key Laboratory of Natural Medicines, School of Traditional Chinese Pharmacy, China Pharmaceutical University, Nanjing 211198, China
- Ping Li
- State Key Laboratory of Natural Medicines, School of Traditional Chinese Pharmacy, China Pharmaceutical University, Nanjing 211198, China
- Hua Yang
- State Key Laboratory of Natural Medicines, School of Traditional Chinese Pharmacy, China Pharmaceutical University, Nanjing 211198, China
2
Cuffy C, McInnes BT. Exploring a deep learning neural architecture for closed Literature-based discovery. J Biomed Inform 2023; 143:104362. [PMID: 37146741 DOI: 10.1016/j.jbi.2023.104362] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/28/2022] [Revised: 03/15/2023] [Accepted: 04/06/2023] [Indexed: 05/07/2023]
Abstract
Scientific literature presents a wealth of information yet to be explored. As the number of researchers increases with each passing year and more publications are released, specialized fields of research are becoming more prevalent. As this trend continues, it further widens the separation between interdisciplinary publications and makes keeping up to date with the literature a laborious task. Literature-based discovery (LBD) aims to mitigate these concerns by promoting information sharing among non-interacting literature while extracting potentially meaningful information. Furthermore, recent advances in neural network architectures and data representation techniques have fueled their respective research communities in achieving state-of-the-art performance in many downstream tasks. However, neural network-based methods for LBD remain to be explored. We introduce and explore a deep learning neural network-based approach for LBD. Additionally, we investigate various approaches to represent terms as concepts and analyze the effect of feature scaling the representations fed into our model. We compare the evaluation performance of our method on five hallmarks of cancer datasets utilized for closed discovery. Our results show that the representation chosen as input to our model affects evaluation performance. We found that feature scaling our input representations increases evaluation performance and decreases the number of epochs needed to achieve model generalization. We also explore two approaches to represent model output. We found that reducing the model's output to capture a subset of concepts improved evaluation performance at the cost of model generalizability. We also compare the efficacy of our method on the five hallmarks of cancer datasets to a set of randomly chosen relations between concepts. We found these experiments confirm our method's suitability for LBD.
Affiliation(s)
- Clint Cuffy
- Virginia Commonwealth University, 401 S. Main St., Richmond, VA 23284, USA.
- Bridget T McInnes
- Virginia Commonwealth University, 401 S. Main St., Richmond, VA 23284, USA.
3
Zaripova K, Cosmo L, Kazi A, Ahmadi SA, Bronstein MM, Navab N. Graph-in-Graph (GiG): Learning interpretable latent graphs in non-Euclidean domain for biological and healthcare applications. Med Image Anal 2023; 88:102839. [PMID: 37263109 DOI: 10.1016/j.media.2023.102839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2022] [Revised: 04/26/2023] [Accepted: 05/06/2023] [Indexed: 06/03/2023]
Abstract
Graphs are a powerful tool for representing and analyzing unstructured, non-Euclidean data ubiquitous in the healthcare domain. Two prominent examples are molecule property prediction and brain connectome analysis. Importantly, recent works have shown that considering relationships between input data samples has a positive regularizing effect on the downstream task in healthcare applications. These relationships are naturally modeled by a (possibly unknown) graph structure between input samples. In this work, we propose Graph-in-Graph (GiG), a neural network architecture for protein classification and brain imaging applications that exploits the graph representation of the input data samples and their latent relation. We assume an initially unknown latent-graph structure between graph-valued input data and propose to learn a parametric model for message passing within and across input graph samples, end-to-end along with the latent structure connecting the input graphs. Further, we introduce a Node Degree Distribution Loss (NDDL) that regularizes the predicted latent relationships structure. This regularization can significantly improve the downstream task. Moreover, the obtained latent graph can represent patient population models or networks of molecule clusters, providing a level of interpretability and knowledge discovery in the input domain, which is of particular value in healthcare.
Affiliation(s)
- Kamilia Zaripova
- Department of Computer Science, Technical University of Munich, Munich, Germany.
- Luca Cosmo
- Department of Environmental Sciences, Informatics and Statistics, Ca' Foscari University of Venice, Venice, Italy; Informatics Department, USI University of Lugano, Lugano, Switzerland
- Anees Kazi
- Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Harvard Medical School, Boston, USA
- Nassir Navab
- Department of Computer Science, Technical University of Munich, Munich, Germany; Whiting School of Engineering, Johns Hopkins University, Baltimore, USA
4
Li K, Marsic I, Sarcevic A, Yang S, Sullivan TM, Tempel PE, Milestone ZP, O'Connell KJ, Burd RS. Discovering interpretable medical process models: A case study in trauma resuscitation. J Biomed Inform 2023; 140:104344. [PMID: 36940896 PMCID: PMC10111432 DOI: 10.1016/j.jbi.2023.104344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2022] [Revised: 01/20/2023] [Accepted: 03/13/2023] [Indexed: 03/23/2023]
Abstract
Understanding the actual work (i.e., "work-as-done") rather than theorized work (i.e., "work-as-imagined") during complex medical processes is critical for developing approaches that improve patient outcomes. Although process mining has been used to discover process models from medical activity logs, it often omits critical steps or produces cluttered and unreadable models. In this paper, we introduce a TraceAlignment-based ProcessDiscovery method called TAD Miner to build interpretable process models for complex medical processes. TAD Miner creates simple linear process models using a threshold metric that optimizes the consensus sequence to represent the backbone process, and then identifies both concurrent activities and uncommon-but-critical activities to represent the side branches. TAD Miner also identifies the locations of repeated activities, an essential feature for representing medical treatment steps. We conducted a study using activity logs of 308 pediatric trauma resuscitations to develop and evaluate TAD Miner. TAD Miner was used to discover process models for five resuscitation goals: establishing intravenous (IV) access, administering non-invasive oxygenation, performing back assessment, administering blood transfusion, and performing intubation. We quantitatively evaluated the process models with several complexity and accuracy metrics, and performed a qualitative evaluation with four medical experts to assess the accuracy and interpretability of the discovered models. Through these evaluations, we compared the performance of our method to that of two state-of-the-art process discovery algorithms: Inductive Miner and Split Miner. The process models discovered by TAD Miner had lower complexity and better interpretability than the state-of-the-art methods, and the fitness and precision of the models were comparable. We used the TAD process models to identify (1) the errors and (2) the best locations for the tentative steps in knowledge-driven expert models. The knowledge-driven models were revised based on the modifications suggested by the discovered models. The improved modeling using TAD Miner may enhance understanding of complex medical processes.
Affiliation(s)
- Keyi Li
- Electrical and Computer Engineering Department, Rutgers University, 94 Brett Road, Piscataway, NJ 08854, USA.
- Ivan Marsic
- Electrical and Computer Engineering Department, Rutgers University, 94 Brett Road, Piscataway, NJ 08854, USA.
- Aleksandra Sarcevic
- College of Computing and Informatics, Drexel University, 3675 Market Street, Philadelphia, PA 19104, USA.
- Sen Yang
- Linkedin, 1000 W Maude Ave, Sunnyvale, CA 94085, USA.
- Travis M Sullivan
- Division of Trauma and Burn Surgery, Children's National Hospital, 111 Michigan Ave NW, Washington, DC 20010, USA.
- Peyton E Tempel
- Division of Trauma and Burn Surgery, Children's National Hospital, 111 Michigan Ave NW, Washington, DC 20010, USA.
- Zachary P Milestone
- Division of Trauma and Burn Surgery, Children's National Hospital, 111 Michigan Ave NW, Washington, DC 20010, USA.
- Karen J O'Connell
- Division of Trauma and Burn Surgery, Children's National Hospital, 111 Michigan Ave NW, Washington, DC 20010, USA.
- Randall S Burd
- Division of Trauma and Burn Surgery, Children's National Hospital, 111 Michigan Ave NW, Washington, DC 20010, USA.
5
Kumari N, Acharjya DP. A hybrid rough set shuffled frog leaping knowledge inference system for diagnosis of lung cancer disease. Comput Biol Med 2023; 155:106662. [PMID: 36805223 DOI: 10.1016/j.compbiomed.2023.106662] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2022] [Revised: 01/13/2023] [Accepted: 02/09/2023] [Indexed: 02/15/2023]
Abstract
Abundant medical data are generated in the digital world every second. Gathering useful information from these data is valuable but difficult. Such data also contain many extraneous features that do not influence prediction accuracy when diagnosing a disease. These extraneous features must be eliminated to obtain a better diagnosis; ultimately, the minimized information system will lead to a better diagnosis. In this paper, we introduce an incremental rough set shuffled frog leaping algorithm for knowledge inference. The proposed algorithm helps find a minimum set of features from an information system while handling complex databases with uncertainty and incompleteness. The proposed rough set shuffled frog leaping knowledge inference model works in two phases. In the first phase, the incremental rough set shuffled frog leaping algorithm is used to obtain the most relevant features. The relevant features are identified using a fitness function based on the rough degree of dependency, which retains the most information with the minimum number of features. The purpose of feature selection is to identify a feature subset from the original set of features without reducing predictive accuracy and to reduce the computational overhead of data processing. In the second phase, a rough set is used for knowledge discovery through rule generation. Decision rules are selected based on the accuracy of each rule and a predefined threshold value. An empirical analysis of the lung disease information system and a comparative study are conducted. Experimental outcomes show that the hybrid technique is feasible and achieves better classification accuracy.
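The rough degree of dependency named in the abstract can be illustrated with a minimal sketch. It uses the standard rough set definition, gamma(C, D) = |POS_C(D)| / |U|, where the positive region collects the objects whose condition-attribute values determine the decision unambiguously. The function name, attribute names, and toy records below are hypothetical and are not taken from the paper; the paper's incremental algorithm and frog leaping search are not reproduced here.

```python
from collections import defaultdict

def dependency_degree(rows, condition_attrs, decision_attr):
    """Rough set dependency degree gamma(C, D) = |POS_C(D)| / |U|.

    rows: list of dicts (the information system, hypothetical layout).
    An equivalence class belongs to the positive region when all of its
    members share the same decision value.
    """
    if not rows:
        return 0.0
    # Group decision values by the tuple of condition-attribute values.
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in condition_attrs)
        classes[key].append(row[decision_attr])
    # Count objects that fall in decision-consistent classes.
    positive = sum(len(d) for d in classes.values() if len(set(d)) == 1)
    return positive / len(rows)

# Toy records, invented purely for illustration.
records = [
    {"smoker": "yes", "cough": "yes", "diagnosis": "pos"},
    {"smoker": "yes", "cough": "yes", "diagnosis": "neg"},
    {"smoker": "yes", "cough": "no", "diagnosis": "neg"},
    {"smoker": "no", "cough": "yes", "diagnosis": "neg"},
    {"smoker": "no", "cough": "no", "diagnosis": "neg"},
]
gamma_full = dependency_degree(records, ["smoker", "cough"], "diagnosis")
gamma_smoker = dependency_degree(records, ["smoker"], "diagnosis")
```

A feature subset with a dependency degree close to that of the full attribute set keeps most of the discriminating information, which is the intuition behind using this quantity in a fitness function.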
Affiliation(s)
- Nancy Kumari
- School of Computer Science and Engineering, VIT, Vellore 632014, India
- D P Acharjya
- School of Computer Science and Engineering, VIT, Vellore 632014, India.
6
Shu X, Ye Y. Knowledge Discovery: Methods from data mining and machine learning. Soc Sci Res 2023; 110:102817. [PMID: 36796993 DOI: 10.1016/j.ssresearch.2022.102817] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2022] [Revised: 10/17/2022] [Accepted: 10/18/2022] [Indexed: 06/18/2023]
Abstract
The interdisciplinary field of knowledge discovery and data mining emerged from a necessity of big data requiring new analytical methods beyond the traditional statistical approaches to discover new knowledge from the data mine. This emergent approach is a dialectic research process that is both deductive and inductive. The data mining approach automatically or semi-automatically considers a larger number of joint, interactive, and independent predictors to address causal heterogeneity and improve prediction. Instead of challenging the conventional model-building approach, it plays an important complementary role in improving model goodness of fit, revealing valid and significant hidden patterns in data, identifying nonlinear and non-additive effects, providing insights into data developments, methods, and theory, and enriching scientific discovery. Machine learning builds models and algorithms by learning and improving from data when the explicit model structure is unclear and algorithms with good performance are difficult to attain. The most recent development is to incorporate this new paradigm of predictive modeling with the classical approach of parameter estimation regressions to produce improved models that combine explanation and prediction.
Affiliation(s)
- Yiwan Ye
- University of California Davis, USA
7
Zha Y, Chong H, Yang P, Ning K. Microbial Dark Matter: from Discovery to Applications. Genomics Proteomics Bioinformatics 2022; 20:867-881. [PMID: 35477055 PMCID: PMC10025686 DOI: 10.1016/j.gpb.2022.02.007] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 07/06/2021] [Revised: 09/28/2021] [Accepted: 03/22/2022] [Indexed: 01/12/2023]
Abstract
With the rapid increase of the microbiome samples and sequencing data, more and more knowledge about microbial communities has been gained. However, there is still much more to learn about microbial communities, including billions of novel species and genes, as well as countless spatiotemporal dynamic patterns within the microbial communities, which together form the microbial dark matter. In this work, we summarized the dark matter in microbiome research and reviewed current data mining methods, especially artificial intelligence (AI) methods, for different types of knowledge discovery from microbial dark matter. We also provided case studies on using AI methods for microbiome data mining and knowledge discovery. In summary, we view microbial dark matter not as a problem to be solved but as an opportunity for AI methods to explore, with the goal of advancing our understanding of microbial communities, as well as developing better solutions to global concerns about human health and the environment.
Affiliation(s)
- Yuguo Zha
- MOE Key Laboratory of Molecular Biophysics, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Hui Chong
- MOE Key Laboratory of Molecular Biophysics, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Pengshuo Yang
- MOE Key Laboratory of Molecular Biophysics, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Kang Ning
- MOE Key Laboratory of Molecular Biophysics, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China.
8
Pačínková A, Popovici V. Using empirical biological knowledge to infer regulatory networks from multi-omics data. BMC Bioinformatics 2022; 23:351. [PMID: 35996085 PMCID: PMC9396869 DOI: 10.1186/s12859-022-04891-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/24/2022] [Accepted: 08/08/2022] [Indexed: 12/13/2022] Open
Abstract
Background: Integration of multi-omics data can provide a more complex view of the biological system consisting of different interconnected molecular components, the crucial aspect for developing novel personalised therapeutic strategies for complex diseases. Various tools have been developed to integrate multi-omics data. However, an efficient multi-omics framework for regulatory network inference at the genome level that incorporates prior knowledge is still to emerge.
Results: We present IntOMICS, an efficient integrative framework based on Bayesian networks. IntOMICS systematically analyses gene expression, DNA methylation, copy number variation and biological prior knowledge to infer regulatory networks. IntOMICS complements the missing biological prior knowledge by so-called empirical biological knowledge, estimated from the available experimental data. Regulatory networks derived from IntOMICS provide deeper insights into the complex flow of genetic information on top of the increasing accuracy trend compared to a published algorithm designed exclusively for gene expression data. The ability to capture relevant crosstalks between multi-omics modalities is verified using known associations in microsatellite stable/instable colon cancer samples. Additionally, IntOMICS performance is compared with two algorithms for multi-omics regulatory network inference that can also incorporate prior knowledge in the inference framework. IntOMICS is also applied to detect potential predictive biomarkers in microsatellite stable stage III colon cancer samples.
Conclusions: We provide IntOMICS, a framework for multi-omics data integration using a novel approach to biological knowledge discovery. IntOMICS is a powerful resource for exploratory systems biology and can provide valuable insights into the complex mechanisms of biological processes that have a vital role in personalised medicine.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-022-04891-9.
Affiliation(s)
- Anna Pačínková
- RECETOX, Faculty of Science, Masaryk University, Kotlarska 2, Brno, Czech Republic; Faculty of Informatics, Masaryk University, Botanicka 68a, Brno, Czech Republic.
- Vlad Popovici
- RECETOX, Faculty of Science, Masaryk University, Kotlarska 2, Brno, Czech Republic
9
Mohammadiun S, Hu G, Gharahbagh AA, Li J, Hewage K, Sadiq R. Evaluation of machine learning techniques to select marine oil spill response methods under small-sized dataset conditions. J Hazard Mater 2022; 436:129282. [PMID: 35739791 DOI: 10.1016/j.jhazmat.2022.129282] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 04/12/2022] [Revised: 05/17/2022] [Accepted: 05/31/2022] [Indexed: 06/15/2023]
Abstract
Oil spill incidents can significantly impact marine ecosystems in Arctic/subarctic areas. Low biodegradation rate, harsh environments, remoteness, and lack of sufficient response infrastructure make those cold waters more susceptible to the impacts of oil spills. A major challenge in Arctic/subarctic areas is to timely select suitable oil spill response methods (OSRMs), concerning the process complexity and insufficient data for decision analysis. In this study, we used various regression-based machine learning techniques, including artificial neural networks (ANNs), Gaussian process regression (GPR), and support vector regression, to develop decision-support models for OSRM selection. Using a small hypothetical oil spill dataset, the modelling performance was thoroughly compared to find techniques working well under data constraints. The regression-based machine learning models were also compared with integrated and optimized fuzzy decision trees models (OFDTs) previously developed by the authors. OFDTs and GPR outperformed other techniques considering prediction power (> 30 % accuracy enhancement). Also, the use of the Bayesian regularization algorithm enhanced the performance of ANNs by reducing their sensitivity to the size of the training dataset (e.g., 29 % accuracy enhancement compared to an unregularized ANN).
Affiliation(s)
- Saeed Mohammadiun
- School of Engineering, University of British Columbia, Okanagan, 3333 University Way, Kelowna, BC V1V 1V7 Canada.
- Guangji Hu
- School of Engineering, University of British Columbia, Okanagan, 3333 University Way, Kelowna, BC V1V 1V7 Canada.
- Abdorreza Alavi Gharahbagh
- Department of Electrical and Computer Engineering, Azad University - Shahrood Branch, Shahrood 1584743311, Iran.
- Jianbing Li
- Environmental Engineering Program, University of Northern British Columbia, 3333 University Way, Prince George, BC V2N 4Z9 Canada.
- Kasun Hewage
- School of Engineering, University of British Columbia, Okanagan, 3333 University Way, Kelowna, BC V1V 1V7 Canada.
- Rehan Sadiq
- School of Engineering, University of British Columbia, Okanagan, 3333 University Way, Kelowna, BC V1V 1V7 Canada.
10
Schutte D, Vasilakes J, Bompelli A, Zhou Y, Fiszman M, Xu H, Kilicoglu H, Bishop JR, Adam T, Zhang R. Discovering novel drug-supplement interactions using SuppKG generated from the biomedical literature. J Biomed Inform 2022; 131:104120. [PMID: 35709900 PMCID: PMC9335448 DOI: 10.1016/j.jbi.2022.104120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2021] [Revised: 04/26/2022] [Accepted: 06/08/2022] [Indexed: 12/04/2022]
Abstract
Objective: Develop a novel methodology to create a comprehensive knowledge graph (SuppKG) to represent a domain with limited coverage in the Unified Medical Language System (UMLS), specifically dietary supplement (DS) information for discovering drug-supplement interactions (DSI), by leveraging biomedical natural language processing (NLP) technologies and a DS domain terminology. Materials and Methods: We created SemRepDS (an extension of an NLP tool, SemRep), capable of extracting semantic relations from abstracts by leveraging a DS-specific terminology (iDISK) containing 28,884 DS terms not found in the UMLS. PubMed abstracts were processed using SemRepDS to generate semantic relations, which were then filtered using a PubMedBERT model to remove incorrect relations before generating SuppKG. Two discovery pathways were applied to SuppKG to identify potential DSIs, which are then compared with an existing DSI database and also evaluated by medical professionals for mechanistic plausibility. Results: SemRepDS returned 158.5% more DS entities and 206.9% more DS relations than SemRep. The fine-tuned PubMedBERT model (significantly outperformed other machine learning and BERT models) obtained an F1 score of 0.8605 and removed 43.86% of semantic relations, improving the precision of the relations by 26.4% over pre-filtering. SuppKG consists of 56,635 nodes and 595,222 directed edges with 2,928 DS-specific nodes and 164,738 edges. Manual review of findings identified 182 of 250 (72.8%) proposed DS-Gene-Drug and 77 of 100 (77%) proposed DS-Gene1-Function-Gene2-Drug pathways to be mechanistically plausible. Discussion: With added DS terminology to the UMLS, SemRepDS has the capability to find more DS-specific semantic relationships from PubMed than SemRep. The utility of the resulting SuppKG was demonstrated using discovery patterns to find novel DSIs. 
Conclusion: For a domain with limited coverage in traditional terminologies (e.g., UMLS), we demonstrated an approach that leverages domain terminology and improves existing NLP tools to generate a more comprehensive knowledge graph for the downstream task. Although this study focuses on DSI, the method may be adapted to other domains.
Affiliation(s)
- Dalton Schutte
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA; Department of Pharmaceutical Care & Health Systems, University of Minnesota, Minneapolis, MN, USA
- Jake Vasilakes
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA; Department of Pharmaceutical Care & Health Systems, University of Minnesota, Minneapolis, MN, USA; National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, United Kingdom
- Anu Bompelli
- Department of Pharmaceutical Care & Health Systems, University of Minnesota, Minneapolis, MN, USA
- Yuqi Zhou
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA; Department of Pharmaceutical Care & Health Systems, University of Minnesota, Minneapolis, MN, USA
- Marcelo Fiszman
- NITES - Núcleo de Inovação e Tecnologia Em Saúde, Pontifical Catholic University of Rio de Janeiro, Brazil
- Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Halil Kilicoglu
- School of Information Sciences, University of Illinois, Champaign, IL, USA
- Jeffrey R Bishop
- Department of Experimental and Clinical Pharmacy, University of Minnesota, Minneapolis, MN, USA
- Terrence Adam
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA; Department of Pharmaceutical Care & Health Systems, University of Minnesota, Minneapolis, MN, USA
- Rui Zhang
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA; Department of Pharmaceutical Care & Health Systems, University of Minnesota, Minneapolis, MN, USA.
11
Sassi S, Ivanovic M, Chbeir R, Prasath R, Manolopoulos Y. Collective intelligence and knowledge exploration: an introduction. Int J Data Sci Anal 2022; 14:99-111. [PMID: 35730041 PMCID: PMC9205147 DOI: 10.1007/s41060-022-00338-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/02/2022]
Abstract
Collective intelligence and Knowledge Exploration (CI and KE) have been adopted to solve many problems. They are particularly used by companies as a support for innovation to efficiently obtain usable results. CI is usually defined as a group ability to perform consistently well across a wide variety of tasks, and it has to be combined with KD to ensure processes optimization, efficient management process, participative management, leadership, continuous teamwork, and so on. The importance of innovation grows the same way as the importance of mixing CI and KE, ensuring the successful exploitation of knowledge. Here, we present a quick review of current knowledge-oriented CI developments and applications. It aims at showing some observations about what's currently missing. Our editorial presents some recent interesting studies that we have gathered after a tight selection process. It also concludes by proposing avenue challenges to continue pushing CI and KE research forward, particularly regarding knowledge exploration.
Affiliation(s)
- Rajendra Prasath
- Indian Institute of Information Technology, Tiruchirappalli, India

12
Sebro R, Kahn CE. Causal Associations Among Diseases and Imaging Findings in Radiology Reports. Stud Health Technol Inform 2022; 294:411-412. [PMID: 35612109 DOI: 10.3233/shti220487] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
This study explored the ability to identify causal relationships between diseases and imaging findings from their co-occurrences in radiology reports. A natural language processing (NLP) system with negative-expression filtering detected positive mentions of 16,912 disorders, interventions, and imaging findings in 1,702,462 consecutive radiology reports; the 55,564 causal relations defined by the Radiology Gamuts Ontology (RGO) served as reference standard. Conditions were considered to co-occur if they were present in reports from the same patient. The ϕ and κ statistics both achieved AUC0.70, P<0.001 in identifying causal relationships from pairwise co-occurrence data. Analysis of radiology reports can identify a large proportion of known causal associations among diseases and imaging findings. Automated approaches hold promise to identify causal relationships among diseases and imaging findings from their co-occurrence in text-based radiology reports.
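The two pairwise statistics named in this abstract, the ϕ coefficient and Cohen's κ, can both be computed from a 2×2 table of patient counts. The following is a minimal sketch under stated assumptions: the function names, the cell layout (a = both conditions present, b = first only, c = second only, d = neither), and any counts passed in are hypothetical and are not taken from the study.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi coefficient of a 2x2 table.

    a: patients with both conditions, b: first condition only,
    c: second condition only, d: neither (hypothetical layout).
    """
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    # Return 0.0 when a marginal total is zero and phi is undefined.
    return (a * d - b * c) / denom if denom else 0.0


def cohen_kappa(a, b, c, d):
    """Cohen's kappa for the same 2x2 table (agreement beyond chance)."""
    n = a + b + c + d
    observed = (a + d) / n
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0
```

Both statistics equal 1 when the two conditions always appear together and 0 when they co-occur only as often as chance predicts, which is why either one can be thresholded to rank candidate pairs.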
Affiliation(s)
- Charles E Kahn
- University of Pennsylvania, Philadelphia, Pennsylvania, USA

13
Raja K. Biomedical Literature Mining and Its Components. Methods Mol Biol 2022; 2496:1-16. [PMID: 35713856 DOI: 10.1007/978-1-0716-2305-3_1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Published biomedical articles are the best source of knowledge for understanding the importance of biomedical entities such as diseases and drugs, and their role in different patient population groups. The volume of biomedical literature available and being published is increasing at an exponential rate with the use of large-scale experimental techniques. Manual extraction of such information is becoming extremely difficult because of the huge volume of literature available. Text mining approaches have therefore attracted much interest within biomedicine, because they extract such information automatically, in a more structured format, from unstructured biomedical text. Here, we present a text mining protocol to extract patient population information and to identify disease and drug mentions in PubMed titles and abstracts, together with a simple information retrieval approach that returns a list of relevant documents for a user query. The protocol presented in this chapter is useful for retrieving information on drugs for patients with a specific disease. It covers three major text mining tasks, namely information retrieval, information extraction, and knowledge discovery.
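To make the "simple information retrieval approach" concrete, here is one possible sketch that ranks documents against a query by a TF-IDF score. It is an assumption for illustration only: the chapter's actual retrieval method is not described in the abstract, and the function names and the smoothed IDF formula below are hypothetical choices.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_documents(query, documents):
    """Return indices of matching documents, best TF-IDF score first."""
    doc_tokens = [tokenize(doc) for doc in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in doc_tokens:
        doc_freq.update(set(tokens))
    scored = []
    for idx, tokens in enumerate(doc_tokens):
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term_freq[term]:
                # Smoothed inverse document frequency (an assumed variant).
                idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
                score += (term_freq[term] / len(tokens)) * idf
        scored.append((score, idx))
    # Sort by descending score; ties keep the original document order.
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [idx for score, idx in ranked if score > 0]
```

Calling `rank_documents("metformin diabetes", titles)` on a list of titles would return only the positions of titles that mention at least one query term, which matches the abstract's goal of a list of relevant documents for a user query.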
Affiliation(s)
- Kalpana Raja
- Regenerative Biology, Morgridge Institute for Research, Madison, WI, USA.

14
Abstract
Since the advent of high-throughput omics technologies, various molecular data such as genes, transcripts, proteins, and metabolites have been made widely available to researchers. This has afforded clinicians, bioinformaticians, statisticians, and data scientists the opportunity to apply their innovations in feature mining and predictive modeling to a rich data resource to develop a wide range of generalizable prediction models. What has become apparent over the last 10 years is that researchers have adopted deep neural networks (or "deep nets") as their preferred paradigm of choice for complex data modeling due to the superiority of performance over more traditional statistical machine learning approaches, such as support vector machines. A key stumbling block, however, is that deep nets inherently lack transparency and are considered to be a "black box" approach. This naturally makes it very difficult for clinicians and other stakeholders to trust their deep learning models even though the model predictions appear to be highly accurate. In this chapter, we therefore provide a detailed summary of the deep net architectures typically used in omics research, together with a comprehensive summary of the notable "deep feature mining" techniques researchers have applied to open up this black box and provide some insights into the salient input features and why these models behave as they do. We group these techniques into the following three categories: (a) hidden layer visualization and interpretation; (b) input feature importance and impact evaluation; and (c) output layer gradient analysis. 
While we find that omics researchers have made some considerable gains in opening up the black box through interpretation of the hidden layer weights and node activations to identify salient input features, we highlight other approaches for omics researchers, such as employing deconvolutional network-based approaches and development of bespoke attribute impact measures to enable researchers to better understand the relationships between the input data and hidden layer representations formed and thus the output behavior of their deep nets.
Affiliation(s)
- Abeer Alzubaidi
- School of Science and Technology, Department of Computer Science, Nottingham Trent University, Nottingham, UK.

15
Fana SE, Esmaeili F, Esmaeili S, Bandaryan F, Esfahani EN, Amoli MM, Razi F. Knowledge discovery in genetics of diabetes in Iran, a roadmap for future researches. J Diabetes Metab Disord 2021; 20:1785-1791. [PMID: 34900825 DOI: 10.1007/s40200-021-00838-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/12/2021] [Accepted: 06/18/2021] [Indexed: 12/12/2022]
Abstract
Purpose The pathogenesis of diabetes is considered polygenic, the result of complex interactions between genetic/epigenetic and environmental factors. This review evaluated the scientometrics and knowledge gaps of diabetes genetics research conducted in Iran, as an example of a developing country, and drew up a roadmap for future studies. Methods We searched the Scopus and PubMed databases from January 2015 until December 2019 using the keywords (diabetes OR diabetic) AND (Iran). All publications were reviewed by two experts, and the relevant articles were categorized by subject, level of evidence, study design, publication year, and type of genetic study. Results Of 10,540 records, 428 articles met the inclusion criteria. In general, the number of studies on diabetes genetics has risen since 2015. Case-control/cross-sectional and animal studies were the most common study designs, and the most frequent subject was genetic factors involved in diabetes development (38%). The seven genes most often evaluated for T2DM were, in order, TCF7L2, APOAII, FTO, PON1, ADIPOQ, MTHFR, and PPARG; for T1DM it was CTL4. The most frequently evaluated microRNAs were miR-21, miR-155, and miR-375, in that order. There were also six studies on lncRNAs. Discussion and Conclusion Research on the genetics of diabetes has progressed, although there are limitations such as non-homogeneous data from Iran, ethnic heterogeneity, and the rationale of the studies. Compared with the previous analysis in Iran, GWAS and large-scale studies are still required to achieve better policies for the management and control of diabetes. Supplementary Information The online version contains supplementary material available at 10.1007/s40200-021-00838-8.
Affiliation(s)
- Saeed Ebrahimi Fana
- Department of Clinical Biochemistry, Tehran University of Medical Sciences, Tehran, Iran
- Student Scientific Research Center, Tehran University of Medical Sciences, Tehran, Iran
- Fataneh Esmaeili
- Department of Clinical Biochemistry, Tehran University of Medical Sciences, Tehran, Iran
- Student Scientific Research Center, Tehran University of Medical Sciences, Tehran, Iran
- Shahnaz Esmaeili
- Endocrinology and Metabolism Research Center, Endocrinology and Metabolism Clinical Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran
- Fatemeh Bandaryan
- Metabolomics and Genomics Research Center, Endocrinology and Metabolism Molecular-Cellular Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran
- Ensieh Nasli Esfahani
- Diabetes Research Center, Endocrinology and Metabolism Clinical Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran
- Mahsa Mohammad Amoli
- Metabolic Disorders Research Center, Endocrinology and Metabolism Molecular-Cellular Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran
- Farideh Razi
- Diabetes Research Center, Endocrinology and Metabolism Clinical Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran

16
Trautman A, Linchangco R, Walstead R, Jay JJ, Brouwer C. The Aliment to Bodily Condition knowledgebase (ABCkb): a database connecting plants and human health. BMC Res Notes 2021; 14:433. [PMID: 34838100 PMCID: PMC8627056 DOI: 10.1186/s13104-021-05835-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2021] [Accepted: 11/03/2021] [Indexed: 11/10/2022] Open
Abstract
Objective Overconsumption of processed foods has led to an increase in chronic diet-related diseases such as obesity and type 2 diabetes. Although diets high in fresh fruits and vegetables are linked with healthier outcomes, the specific mechanisms for these relationships are poorly understood. Experiments examining plant phytochemical production and breeding programs, or separately the health effects of nutritional supplements, have yielded results that are sparse, siloed, and difficult to integrate between the domains of human health and agriculture. To connect plant products to health outcomes through their molecular mechanisms, an integrated computational resource is necessary. Results We created the Aliment to Bodily Condition Knowledgebase (ABCkb) to connect plants to human health by creating a stepwise path from plant → plant product → human gene → pathways → indication. ABCkb integrates 11 curated sources as well as relationships mined from Medline abstracts by loading them into a graph database, which is deployed via a Docker container. This new resource, provided in a queryable container with a user-friendly interface, connects plant products with human health outcomes for generating nutritive hypotheses. All scripts used are available on GitHub (https://github.com/atrautm1/ABCkb) along with basic directions for building the knowledgebase, and a browsable interface is available (https://abckb.charlotte.edu). Supplementary Information The online version contains supplementary material available at 10.1186/s13104-021-05835-x.
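The stepwise path described in this abstract (plant, plant product, human gene, pathway, indication) can be illustrated with a small in-memory graph traversal. This is only a sketch under stated assumptions: it does not use the ABCkb graph database or its query interface, and the type labels, function names, and node names are hypothetical placeholders.

```python
from collections import defaultdict

# Order of node types a valid path must follow (labels are assumptions).
STEPS = ["plant", "plant_product", "gene", "pathway", "indication"]


def build_graph(edges):
    """Build a directed adjacency map from (source, target) pairs."""
    graph = defaultdict(set)
    for source, target in edges:
        graph[source].add(target)
    return graph


def stepwise_paths(graph, node_types, start):
    """List every path from `start` that visits the node types in STEPS order."""
    paths = [[start]] if node_types.get(start) == STEPS[0] else []
    for expected_type in STEPS[1:]:
        # Extend each partial path by neighbours of the expected type only.
        paths = [
            path + [neighbour]
            for path in paths
            for neighbour in sorted(graph.get(path[-1], ()))
            if node_types.get(neighbour) == expected_type
        ]
    return paths
```

Restricting each hop to the next expected node type is what turns an arbitrary graph walk into the stepwise chain; a direct plant-to-indication edge, for example, would not yield a path on its own.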
Affiliation(s)
- Aaron Trautman
- Bioinformatics Services Division, UNC Charlotte, Charlotte, NC, USA; Department of Bioinformatics and Genomics, UNC Charlotte, Charlotte, NC, USA
- Richard Linchangco
- Bioinformatics Services Division, UNC Charlotte, Charlotte, NC, USA; Department of Bioinformatics and Genomics, UNC Charlotte, Charlotte, NC, USA
- Rachel Walstead
- Department of Bioinformatics and Genomics, UNC Charlotte, Charlotte, NC, USA
- Jeremy J Jay
- Bioinformatics Services Division, UNC Charlotte, Charlotte, NC, USA; Department of Bioinformatics and Genomics, UNC Charlotte, Charlotte, NC, USA
- Cory Brouwer
- Bioinformatics Services Division, UNC Charlotte, Charlotte, NC, USA; Department of Bioinformatics and Genomics, UNC Charlotte, Charlotte, NC, USA.

17
Liu J, Stewart H, Wiens C, Mcnitt-Gray J, Liu B. Development of an integrated biomechanics informatics system with knowledge discovery and decision support tools for research of injury prevention and performance enhancement. Comput Biol Med 2021; 141:105062. [PMID: 34836623 DOI: 10.1016/j.compbiomed.2021.105062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2021] [Revised: 11/11/2021] [Accepted: 11/20/2021] [Indexed: 11/03/2022]
Abstract
The field of biomechanics involves integrating a variety of data types such as waveform, video, discrete, and performance. These different sources of data must be efficiently and accurately associated to provide meaningful feedback to athletes, coaches, and healthcare professionals to prevent injury and improve rehabilitation/performance. There are many challenges in biomechanics research such as data storage, standardization, review, sharing, and accessibility. Data is stored in different formats, structures, and locations such as physical hard drives or Dropbox/Google Drive, leading to issues during sharing and collaboration. Data is reviewed and analyzed through different software applications that need to be downloaded and installed locally before they are available for use. An integrated biomechanics informatics system (IBIS) built based on the core principles in medical imaging informatics provides a solution to many of these challenges. The system provides a secure web-based platform that will be accessible remotely for authenticated users to upload, share, and download data. The web-based application includes built-in data viewers that are streamlined for reviewing multimedia data and decision support/knowledge discovery tools. These tools include automatic foot contact detection for pre-processing, built-in statistical analysis applications for longitudinal and cross-study analysis, and a multi-institutional collaboration module. The IBIS system creates a centralized hub to support multi-institutional collaborative biomechanics research and analysis that is remotely accessible to all users including athletes, coaches, researchers, and clinicians generating a novel streamlined research workflow, data analysis, and knowledge discovery process.
Affiliation(s)
- Joseph Liu
- Image Processing and Informatics Laboratory, Department of Biomedical Engineering, Univ. of Southern California, 1042 Downey Way, Los Angeles, CA, 90089, USA.
- Harper Stewart
- USC Biomechanics Lab, Department of Biological Sciences, Univ. of Southern California, Los Angeles, CA, 90089, USA
- Casey Wiens
- USC Biomechanics Lab, Department of Biological Sciences, Univ. of Southern California, Los Angeles, CA, 90089, USA
- Jill Mcnitt-Gray
- USC Biomechanics Lab, Department of Biological Sciences, Univ. of Southern California, Los Angeles, CA, 90089, USA
- Brent Liu
- Image Processing and Informatics Laboratory, Department of Biomedical Engineering, Univ. of Southern California, 1042 Downey Way, Los Angeles, CA, 90089, USA

18
Abstract
Background Nurses require a great deal of knowledge to provide comprehensive and effective nursing care. A number of patterns have been put into place to help nurses acquire this knowledge. The aim of this study was to describe the core variable in the process by which nurses use patterns of knowing in clinical practice. Methods The study used a qualitative, grounded theory approach and was conducted between April 2018 and January 2020. Semi-structured interviews were used for data collection, and all interviews were transcribed verbatim. Nineteen clinical nurses were interviewed, and eight observation sessions were conducted in different hospital departments. Participants were selected first through purposeful and then through theoretical sampling. Data were analyzed and interpreted using the constant comparison approach. Results The findings indicated that nurses apply the patterns of knowing in three ways in their clinical practice: “cohesion of patterns of knowing”, “domination of some patterns of knowing” and “elimination of some patterns of knowing”. The core variable of this process is cohesion of patterns of knowing in the domain of flexibility. Conclusion The findings of the present study indicate that the application of patterns of knowing in clinical settings varies with the degree of nurse flexibility.
Affiliation(s)
- Forough Rafii
- Professor, Nursing Care Research Center, Iran University of Medical Sciences, Tehran, Iran
- Fereshteh Javaheri Tehrani
- Correspondence: PhD of Nursing, Nursing Care Research Center, Iran University of Medical Sciences, Tehran, Iran

19
Steiner B, Saalfeld B, Elgert L, Haux R, Wolf KH. OnTARi: an ontology for factors influencing therapy adherence to rehabilitation. BMC Med Inform Decis Mak 2021; 21:153. [PMID: 33975585 PMCID: PMC8111729 DOI: 10.1186/s12911-021-01512-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2021] [Accepted: 04/28/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Adherence and motivation are key factors for successful treatment of patients with chronic diseases, especially in long-term care processes like rehabilitation. However, only a few patients achieve good treatment adherence. The causes are manifold. Adherence-influencing factors vary depending on indications, therapies, and individuals. Positive and negative effects are rarely confirmed or even contradictory. An ontology seems to be convenient to represent existing knowledge in this domain and to make it available for information retrieval. METHODS First, a manual data extraction of current knowledge in the domain of treatment adherence in rehabilitation was conducted. Data was retrieved from various sources, including basic literature, scientific publications, and health behavior models. Second, all adherence and motivation factors identified were formalized according to the ontology development methodology METHONTOLOGY. This comprises the specification, conceptualization, formalization, and implementation of the ontology "Ontology for factors influencing therapy adherence to rehabilitation" (OnTARi) in Protégé. A taxonomy-oriented evaluation was conducted by two domain experts. RESULTS OnTARi includes 281 classes implemented in ontology web language, ten object properties, 22 data properties, 1440 logical axioms, 244 individuals, and 1023 annotations. Six higher-level classes are differentiated: (1) Adherence, (2) AdherenceFactors, (3) AdherenceFactorCategory, (4) Rehabilitation, (5) RehabilitationForm, and (6) RehabilitationType. By means of the class AdherenceFactors 227 adherence factors, thereof 49 hard factors, are represented. Each factor involves a proper description, synonyms, possibly existing acronyms, and a German translation. OnTARi illustrates links between adherence factors through 160 influences-relations. 
Description logic queries implemented in Protégé allow multiple targeted requests, e.g., for the extraction of adherence factors in a specific rehabilitation area. CONCLUSIONS With OnTARi, a generic reference model was built to represent potential adherence and motivation factors and their interrelations in rehabilitation of patients with chronic diseases. In terms of information retrieval, this formalization can serve as a basis for implementation and adaptation of conventional rehabilitative measures, taking into account (patient-specific) adherence factors. OnTARi also enables the development of medical assistance systems to increase motivation and adherence in rehabilitation processes.
Affiliation(s)
- Bianca Steiner
- Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Hannover Medical School, Braunschweig, Germany.
- Birgit Saalfeld
- Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Hannover Medical School, Hannover, Germany
- Lena Elgert
- Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Hannover Medical School, Hannover, Germany
- Reinhold Haux
- Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Hannover Medical School, Braunschweig, Germany
- Klaus-Hendrik Wolf
- Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Hannover Medical School, Hannover, Germany

20
Eugenie R, Stattner E. DISGROU: an algorithm for discontinuous subgroup discovery. PeerJ Comput Sci 2021; 7:e512. [PMID: 33987462 PMCID: PMC8093955 DOI: 10.7717/peerj-cs.512] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2020] [Accepted: 04/07/2021] [Indexed: 06/12/2023]
Abstract
In this paper, we focus on the problem of the search for subgroups in numerical data. This approach aims to identify the subsets of objects, called subgroups, which exhibit interesting characteristics compared to the average, according to a quality measure calculated on a target variable. In this article, we present DISGROU, a new approach that identifies subgroups whose attribute intervals may be discontinuous. Unlike the main algorithms in the field, the originality of our proposal lies in the way it breaks down the intervals of the attributes during the subgroup research phase. The basic assumption of our approach is that the range of attributes defining the groups can be disjoint to improve the quality of the identified subgroups. Indeed the traditional methods in the field perform the subgroup search process only over continuous intervals, which results in the identification of subgroups defined over wider intervals thus containing some irrelevant objects that degrade the quality function. In this way, another advantage of our approach is that it does not require a prior discretization of the attributes, since it works directly on the numerical attributes. The efficiency of our proposal is first demonstrated by comparing the results with two algorithms that are references in the field and then by applying to a case study.
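The motivation given in this abstract, that a subgroup defined over one wide continuous interval can absorb irrelevant objects, can be shown with a toy quality computation. This sketch is not the DISGROU algorithm: it only scores a subgroup whose attribute range is a union of disjoint intervals, and the quality measure (deviation of the subgroup mean from the overall mean, weighted by the square root of the subgroup size) is an assumed, commonly used choice rather than one stated in the source.

```python
def in_intervals(value, intervals):
    """True if value falls inside at least one closed interval (low, high)."""
    return any(low <= value <= high for low, high in intervals)


def subgroup_quality(rows, attribute, intervals, target):
    """Score a subgroup selected by a possibly discontinuous attribute range."""
    overall_mean = sum(row[target] for row in rows) / len(rows)
    members = [row for row in rows if in_intervals(row[attribute], intervals)]
    if not members:
        return 0.0
    subgroup_mean = sum(row[target] for row in members) / len(members)
    # Assumed quality measure: mean shift weighted by sqrt of subgroup size.
    return (len(members) ** 0.5) * (subgroup_mean - overall_mean)
```

With data in which the target is high at both ends of an attribute's range and low in the middle, the union of the two outer intervals scores higher than the single interval spanning them, because the middle objects no longer dilute the subgroup mean.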
Affiliation(s)
- Reynald Eugenie
- Laboratory of Mathematics, Computer Science and Applications, Université des Antilles, Pointe-à-Pitre, Guadeloupe, France
- Erick Stattner
- Laboratory of Mathematics, Computer Science and Applications, Université des Antilles, Pointe-à-Pitre, Guadeloupe, France

21
Jacimovic J, Jakovljevic A, Nagendrababu V, Duncan HF, Dummer PMH. A bibliometric analysis of the dental scientific literature on COVID-19. Clin Oral Investig 2021; 25:6171-6183. [PMID: 33822288 PMCID: PMC8022306 DOI: 10.1007/s00784-021-03916-6] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 12/18/2020] [Accepted: 03/25/2021] [Indexed: 02/06/2023]
Abstract
Objectives The rapid production of a large volume of literature during the early phase of the COVID-19 outbreak created a substantial burden for clinicians and scientists. Therefore, this manuscript aims to identify and describe the scientific literature addressing COVID-19 from a dental research perspective, in terms of the manuscript origin, research domain, study type, and level of evidence (LoE). Materials and methods Data were retrieved from Web of Science, Scopus, and PubMed. A descriptive analysis of bibliographic data, collaboration network, and keyword co-occurrence analysis were performed. Articles were further classified according to the field of interest, main research question, type of study, and LoE. Results The present study identified 296 dental scientific COVID-19 original papers, published in 89 journals, and co-authored by 1331 individuals affiliated with 429 institutions from 53 countries. Although 81.4% were single-country papers, extensive collaboration among the institutions of single countries (Italian, British, and Brazilian institutions) was observed. The main research areas were as follows: the potential use of saliva and other oral fluids as promising samples for COVID-19 testing, dental education, and guidelines for the prevention of COVID-19 transmission in dental practice. The majority of articles were narrative reviews, cross-sectional studies, and short communications. The overall LoE in the analyzed dental literature was low, with only two systematic reviews with the highest LoE I. Conclusion The dental literature on the COVID-19 pandemic does not provide data relevant to the evidence-based decision-making process. Future studies with a high LoE are essential to gain precise knowledge on COVID-19 infection within the various fields of Dentistry. 
Clinical relevance The published dental literature on COVID-19 consists principally of articles with a low level of scientific evidence which do not provide sufficient reliable high-quality evidence that is essential for decision making in clinical dental practice. Supplementary Information The online version contains supplementary material available at 10.1007/s00784-021-03916-6.
Affiliation(s)
- Jelena Jacimovic
- Central Library, School of Dental Medicine, University of Belgrade, Belgrade, Serbia
- Aleksandar Jakovljevic
- Department of Pathophysiology, School of Dental Medicine, University of Belgrade, Belgrade, Serbia.
- Venkateshbabu Nagendrababu
- Department of Preventive and Restorative Dentistry, College of Dental Medicine, University of Sharjah, Sharjah, UAE
- Henry Fergus Duncan
- Division of Restorative Dentistry and Periodontology, Dublin Dental University Hospital, Trinity College Dublin, Dublin, Ireland
- Paul M H Dummer
- School of Dentistry, College of Biomedical and Life Sciences, Cardiff University, Cardiff, UK

22
Hegazi MO, Al-Dossari Y, Al-Yahy A, Al-Sumari A, Hilal A. Preprocessing Arabic text on social media. Heliyon 2021; 7:e06191. [PMID: 33644469 PMCID: PMC7895730 DOI: 10.1016/j.heliyon.2021.e06191] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 09/07/2019] [Revised: 05/19/2020] [Accepted: 02/01/2021] [Indexed: 11/04/2022] Open
Abstract
Currently, social media plays an important role in daily life and routine. Millions of people use social media for different purposes. Large amounts of data flow through online networks every second, and these data contain valuable information that can be extracted if the data are properly processed and analyzed. However, most processing results are affected by preprocessing difficulties. This paper presents an approach to extracting information from Arabic text on social media. It provides an integrated solution to the challenges of preprocessing Arabic text on social media in four stages: data collection, cleaning, enrichment, and availability. The preprocessed Arabic text is stored in structured database tables to provide a useful corpus to which information extraction and data analysis algorithms can be applied. The experiment in this study shows that implementing the proposed approach yields a useful, full-featured dataset and valuable information. The resulting dataset presents the Arabic text at three structured levels with more than 20 features. Additionally, the experiment provides valuable information and processed results such as topic classification and sentiment analysis.
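The cleaning stage mentioned in this abstract might look like the following sketch. The specific steps are assumptions chosen for illustration, since the abstract does not list them: removing URLs and user mentions, stripping Arabic diacritics (U+064B to U+0652), dropping the tatweel elongation character (U+0640), and collapsing whitespace. The function name is hypothetical.

```python
import re

# Patterns below are illustrative assumptions, not the paper's rule set.
URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")
DIACRITICS = re.compile(r"[\u064B-\u0652]")  # fathatan through sukun
TATWEEL = "\u0640"  # elongation character used for decorative stretching


def clean_arabic_text(text):
    """Apply a few common cleaning steps to a social media post."""
    text = URL_OR_MENTION.sub(" ", text)
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    # Collapse runs of whitespace left behind by the removals.
    return " ".join(text.split())
```

Normalizing posts this way before storage means that variants of the same word, with or without diacritics or elongation, map to one surface form in the resulting corpus.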
Affiliation(s)
- Mohamed Osman Hegazi
- Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Al-Kharj 11942, Saudi Arabia
- Yasser Al-Dossari
- Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Al-Kharj 11942, Saudi Arabia
- Abdullah Al-Yahy
- Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Al-Kharj 11942, Saudi Arabia
- Abdulaziz Al-Sumari
- Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Al-Kharj 11942, Saudi Arabia
- Anwer Hilal
- Department of Computer and Self Development, Preparatory Year Deanship, Prince Sattam Bin Abdulaziz University, Al-Kharj 11942, Saudi Arabia

23
Li X, Peng S, Du J. Towards medical knowmetrics: representing and computing medical knowledge using semantic predications as the knowledge unit and the uncertainty as the knowledge context. Scientometrics 2021;:1-27. [PMID: 33612884 DOI: 10.1007/s11192-021-03880-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/05/2020] [Accepted: 01/19/2021] [Indexed: 11/05/2022]
Abstract
In China, Prof. Hongzhou Zhao and Zeyuan Liu are the pioneers of the concepts of “knowledge unit” and “knowmetrics” for measuring knowledge. However, the definition of a “computable knowledge object” remains controversial across fields. For example, it is defined as (1) a quantitative scientific concept in natural science and engineering, (2) a knowledge point in education research, and (3) a semantic predication, i.e., a Subject-Predicate-Object (SPO) triple, in biomedical fields. The Semantic MEDLINE Database (SemMedDB), a high-quality public repository of SPO triples extracted from medical literature, provides a basic data infrastructure for measuring medical knowledge. In general, the study of extracting SPO triples as computable knowledge units from unstructured scientific text has focused overwhelmingly on scientific knowledge per se. Because SPO triples may be extracted from hypothetical or speculative statements, or even from conflicting and contradictory assertions, the knowledge status (i.e., the uncertainty), which is an integral and critical part of scientific knowledge, has been largely overlooked. This article puts forward a framework for Medical Knowmetrics that uses SPO triples as the knowledge unit and uncertainty as the knowledge context. A dataset of lung cancer publications is used to validate the proposed framework. The uncertainty of medical knowledge, and how its status evolves over time, indirectly reflects the strength of competing knowledge claims and the probability of certainty for a given SPO triple. We discuss the new insights gained from using uncertainty-centric approaches to detect research fronts and to identify knowledge claims with a high certainty level, in order to improve the efficacy of knowledge-driven decision support.
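One way to picture "SPO triples as the knowledge unit and uncertainty as the knowledge context" is a record that pairs each extracted triple with a certainty flag, from which a per-year share of certain assertions can be computed. This is a hypothetical sketch: the class, its fields, and the simple certain/total ratio are assumptions for illustration and are not the paper's framework or the SemMedDB schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Predication:
    """One extracted assertion (field names are hypothetical)."""
    subject: str
    predicate: str
    obj: str
    year: int
    certain: bool  # False when the source sentence is hedged or speculative


def certainty_by_year(predications):
    """Map (triple, year) to the share of assertions stated with certainty."""
    counts = defaultdict(lambda: [0, 0])  # key -> [certain, total]
    for p in predications:
        key = ((p.subject, p.predicate, p.obj), p.year)
        counts[key][1] += 1
        if p.certain:
            counts[key][0] += 1
    return {key: certain / total for key, (certain, total) in counts.items()}
```

Tracking this ratio for one triple across years gives a rough picture of how a knowledge claim moves from speculative to settled, which is the kind of evolution the abstract describes.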
|
24
|
Moro G, Masseroli M. Gene function finding through cross-organism ensemble learning. BioData Min 2021; 14:14. [PMID: 33579334 PMCID: PMC7879670 DOI: 10.1186/s13040-021-00239-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/15/2020] [Accepted: 01/10/2021] [Indexed: 11/12/2022] Open
Abstract
Background Structured biological information about genes and proteins is a valuable resource to improve discovery and understanding of complex biological processes via machine learning algorithms. Gene Ontology (GO) controlled annotations describe, in a structured form, features and functions of genes and proteins of many organisms. However, such valuable annotations are not always reliable and sometimes are incomplete, especially for rarely studied organisms. Here, we present GeFF (Gene Function Finder), a novel cross-organism ensemble learning method able to reliably predict new GO annotations of a target organism from GO annotations of another source organism evolutionarily related and better studied. Results Using a supervised method, GeFF predicts unknown annotations from random perturbations of existing annotations. The perturbation consists in randomly deleting a fraction of known annotations in order to produce a reduced annotation set. The key idea is to train a supervised machine learning algorithm with the reduced annotation set to predict, namely to rebuild, the original annotations. The resulting prediction model, in addition to accurately rebuilding the original known annotations for an organism from their perturbed version, also effectively predicts new unknown annotations for the organism. Moreover, the prediction model is also able to discover new unknown annotations in different target organisms without retraining. We combined our novel method with different ensemble learning approaches and compared them to each other and to an equivalent single model technique. We tested the method with five different organisms using their GO annotations: Homo sapiens, Mus musculus, Bos taurus, Gallus gallus and Dictyostelium discoideum. The outcomes demonstrate the effectiveness of the cross-organism ensemble approach, which can be customized with a trade-off between the desired number of predicted new annotations and their precision. A Web application to browse both input annotations used and predicted ones, choosing the ensemble prediction method to use, is publicly available at http://tiny.cc/geff/. Conclusions Our novel cross-organism ensemble learning method provides reliable predicted novel gene annotations, i.e., functions, ranked according to an associated likelihood value. They are very valuable both to speed the annotation curation, focusing it on the prioritized new annotations predicted, and to complement known annotations available.
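The perturbation step described in this abstract can be illustrated with a small sketch. This is not the GeFF implementation: the function name `perturb_annotations`, the toy gene names, and the GO identifiers (used here only as placeholders) are all assumptions made for illustration. It shows only how a reduced annotation set might be produced by randomly deleting a fraction of known (gene, term) pairs.

```python
import random


def perturb_annotations(annotations, fraction, seed=0):
    """Randomly delete a fraction of (gene, term) pairs.

    `annotations` maps a gene name to a set of annotation terms.
    Returns the reduced annotation set and the set of removed pairs.
    Hypothetical helper, not part of GeFF.
    """
    rng = random.Random(seed)
    # Flatten to a sorted list so sampling is reproducible for a given seed.
    pairs = sorted((gene, term)
                   for gene, terms in annotations.items()
                   for term in terms)
    n_remove = int(len(pairs) * fraction)
    removed = set(rng.sample(pairs, n_remove))
    reduced = {gene: {t for t in terms if (gene, t) not in removed}
               for gene, terms in annotations.items()}
    return reduced, removed


# Toy data: placeholder genes and term identifiers (assumed, not from the paper).
annotations = {
    "geneA": {"GO:0000001", "GO:0000002", "GO:0000003"},
    "geneB": {"GO:0000002", "GO:0000004"},
    "geneC": {"GO:0000001", "GO:0000004", "GO:0000005"},
}
reduced, removed = perturb_annotations(annotations, fraction=0.25)
```

In the setting the abstract describes, a supervised model would then take `reduced` as input and be trained to rebuild `annotations`; terms it predicts that are absent from the original set are the candidate new annotations.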
Affiliation(s)
- Gianluca Moro
- DISI - University of Bologna, Via dell'Università, Cesena (FC), Italy.
- Marco Masseroli
- DEIB, Politecnico di Milano, Piazza L. Da Vinci 32, Milan, 20133, Italy
|
25
|
Yu J, Liu G. Extracting and inserting knowledge into stacked denoising auto-encoders. Neural Netw 2021; 137:31-42. [PMID: 33545610 DOI: 10.1016/j.neunet.2021.01.010] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 12/29/2019] [Revised: 11/28/2020] [Accepted: 01/14/2021] [Indexed: 10/22/2022]
Abstract
Deep neural networks (DNNs) with a complex structure and multiple nonlinear processing units have achieved great success in feature learning for image and visualization analysis. Due to the interpretability ("black box") problem of DNNs, however, there are still many obstacles to applying DNNs in real-world cases. This paper proposes a new DNN model, knowledge-based deep stacked denoising auto-encoders (KBSDAE), which inserts knowledge (i.e., confidence and classification rules) into the deep network structure. This model not only offers a good understanding of the representations learned by the deep network but also improves the learning performance of the stacked denoising auto-encoder (SDAE). A knowledge discovery algorithm is proposed to extract confidence rules that interpret the layerwise network (i.e., the denoising auto-encoder (DAE)). A symbolic language is developed to describe the deep network and is shown to be suitable for representing quantitative reasoning in a deep network. Inserting confidence rules into the deep network improves the feature learning of DAEs. The classification rules extracted from the data offer a novel method for knowledge insertion into the classification layer of SDAE. The testing results of KBSDAE on various benchmark data indicate that the proposed method not only effectively extracts knowledge from the deep network, but also shows better feature learning performance than typical DNNs (e.g., SDAE).
Affiliation(s)
- Jianbo Yu
- School of Mechanical Engineering, Tongji University, Shanghai 201804, China.
- Guoliang Liu
- School of Mechanical Engineering, Tongji University, Shanghai 201804, China
|
26
|
Shahmoradi L, Ramezani A, Atlasi R, Namazi N, Larijani B. Visualization of knowledge flow in interpersonal scientific collaboration network endocrinology and metabolism research institute. J Diabetes Metab Disord 2020; 20:815-823. [PMID: 34222091 DOI: 10.1007/s40200-020-00644-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2020] [Accepted: 09/21/2020] [Indexed: 10/23/2022]
Abstract
Purpose Research collaborations can help to increase scientific productivity. The purpose of the present study was to draw up the knowledge flow network of the Endocrinology and Metabolism Research Institute (EMRI) affiliated to Tehran University of Medical Sciences. Methods The present study is a descriptive cross-sectional study of the publications of the EMRI. Web of Science Core Collection databases were searched for EMRI publications between 2002 and November 2019. Publications were classified and visualized based on authorship (institutes and country of affiliation) and keywords (co-occurrence and trend). Scientometric tools including VOSviewer and HistCite were used for descriptive statistics and data analysis. Results Total citations to the records were 47,528, and papers were published in 916 journals. The annual growth rates of publications and citations were 14.2% and 18.9%, respectively. A total of 9466 authors from 136 countries collaborated in the publications. The co-authorship patterns showed that the average co-authorship and the collaboration coefficient were 3.3 and 0.19, respectively. Conclusion Knowledge flow between EMRI researchers through international collaborations, engagement with leading countries, and interdisciplinary collaborations shows an increasing trend. To develop a full picture of co-authorship, the use of social network analysis indicators is suggested for future studies.
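The abstract does not say how its two co-authorship indicators were computed. The following is a minimal sketch, assuming the commonly used definitions: average co-authorship as the mean number of authors per paper, and the collaboration coefficient of Ajiferuke, Burell and Tague, CC = 1 - (1/N) * sum(1/k_i), where k_i is the number of authors of paper i and N is the number of papers. The function names and the toy input are hypothetical.

```python
def mean_authors_per_paper(authors_per_paper):
    """Average number of authors per paper."""
    if not authors_per_paper:
        raise ValueError("need at least one paper")
    return sum(authors_per_paper) / len(authors_per_paper)


def collaboration_coefficient(authors_per_paper):
    """Collaboration coefficient, assumed definition: 1 - mean(1 / k_i).

    The value is 0 when every paper has a single author and approaches 1
    as the number of authors per paper grows.
    """
    if not authors_per_paper:
        raise ValueError("need at least one paper")
    n = len(authors_per_paper)
    return 1.0 - sum(1.0 / k for k in authors_per_paper) / n


# Toy input (assumed): one entry per paper, giving its number of authors.
papers = [1, 2, 4]
avg = mean_authors_per_paper(papers)
cc = collaboration_coefficient(papers)
```

Under this definition, a corpus of single-author papers scores 0 and a corpus of two-author papers scores 0.5, which makes the coefficient easier to compare across corpora than the raw average.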
Affiliation(s)
- Leila Shahmoradi
- Halal Research Center of IRI, FDA, Tehran, Iran; Department of Health Information Management, School of Allied Medical Sciences, Tehran University of Medical Sciences, Tehran, Iran
- Aboozar Ramezani
- Department of Medical Library and Information Sciences, Virtual School, Tehran University of Medical Sciences, Tehran, Iran
- Rasha Atlasi
- Evidence Based Practice Research Center, Endocrinology and Metabolism Clinical Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran
- Nazli Namazi
- Diabetes Research Center, Endocrinology and Metabolism Clinical Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran
- Bagher Larijani
- Endocrinology and Metabolism Research Center, Endocrinology and Metabolism Clinical Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran
|
27
|
Piad-Morffis A, Gutiérrez Y, Almeida-Cruz Y, Muñoz R. A computational ecosystem to support eHealth Knowledge Discovery technologies in Spanish. J Biomed Inform 2020; 109:103517. [PMID: 32712157 PMCID: PMC7377985 DOI: 10.1016/j.jbi.2020.103517] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/05/2020] [Revised: 05/18/2020] [Accepted: 07/19/2020] [Indexed: 11/29/2022]
Abstract
The massive amount of biomedical information published online requires the development of automatic knowledge discovery technologies to make effective use of this content. To foster and support this, the research community creates linguistic resources, such as annotated corpora, and designs shared evaluation campaigns and academic competitive challenges. This work describes an ecosystem that facilitates research and development in knowledge discovery in the biomedical domain, specifically in the Spanish language. To this end, several resources are developed and shared with the research community, including a novel semantic annotation model, an annotated corpus of 1045 sentences, and computational resources to build and evaluate automatic knowledge discovery techniques. Furthermore, a research task is defined with objective evaluation criteria, and an online evaluation environment is set up and maintained, enabling researchers interested in this task to obtain immediate feedback and compare their results with the state of the art. As a case study, we analyze the results of a competitive challenge based on these resources and provide guidelines for future research. The constructed ecosystem provides an effective learning and evaluation environment to encourage research in knowledge discovery in Spanish biomedical documents.
Affiliation(s)
- Yoan Gutiérrez
- University Institute for Computing Research (IUII), University of Alicante, Alicante 03690, Spain; Department of Language and Computing Systems, University of Alicante, Alicante 03690, Spain.
- Rafael Muñoz
- University Institute for Computing Research (IUII), University of Alicante, Alicante 03690, Spain; Department of Language and Computing Systems, University of Alicante, Alicante 03690, Spain.
|
28
|
Matsuo R, Yamazaki T, Suzuki M, Toyama H, Araki K. A random forest algorithm-based approach to capture latent decision variables and their cutoff values. J Biomed Inform 2020; 110:103548. [PMID: 32866626 DOI: 10.1016/j.jbi.2020.103548] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/27/2020] [Revised: 08/21/2020] [Accepted: 08/25/2020] [Indexed: 11/18/2022]
Abstract
Although reference intervals (RIs) and clinical decision limits (CDLs) are vital laboratory information for supporting the interpretation of numerical clinical pathology results, there is evidence that RIs and CDLs vary in certain contexts, as well as other evidence that RIs and CDLs are flawed. We propose a random forest algorithm-based exploration methodology that uses phenotype transformation of independent variables in relation to dependent variables to capture latent decision variables and their cutoff values. We denote as latent CDLs those CDLs that lie within the RIs estimated by an indirect method and that affect some diagnostics or outcomes in the context of specific patients' conditions. We then apply the proposed methodology to clinical laboratory data on bodily fluids, such as blood and urine, collected at patient admission, to explore latent CDLs of hospital length of stay (HLOS) for each patient condition, identified by the diseases of patients undergoing surgery. From the exploration results, we found that free Thyroxine (T4) above five unique cutoff values (1.16 ng/dL, 1.19 ng/dL, 1.2 ng/dL, 1.23 ng/dL and 1.25 ng/dL) for tachyarrhythmia predicted longer HLOS, though these cutoff values fall within the estimated RIs as well as the hospital-determined RIs. In addition to the evidence that higher free Thyroxine (T4) levels within the RIs are associated with the corresponding disease, on the whole, the cutoff values other than 1.16 ng/dL tended to be associated with long HLOS, with significant differences. Clinical experts could discuss whether it is meaningful to use these cutoff values to alert clinicians to the risk of patients' conditions and long HLOS at admission. If clinical experts find them meaningful in clinical practice, the alerts could be embedded in electronic medical records for handling those risks at patient admission.
Affiliation(s)
- Ryosuke Matsuo
- Faculty of Medicine, University of Miyazaki Hospital, 5200, Kihara, Kiyotake-cho, Miyazaki-shi, Miyazaki, 889-1692, Japan.
- Tomoyoshi Yamazaki
- Faculty of Medicine, University of Miyazaki Hospital, 5200, Kihara, Kiyotake-cho, Miyazaki-shi, Miyazaki, 889-1692, Japan.
- Muneou Suzuki
- Faculty of Medicine, University of Miyazaki Hospital, 5200, Kihara, Kiyotake-cho, Miyazaki-shi, Miyazaki, 889-1692, Japan.
- Hinako Toyama
- Institute of Medical Data Sciences, 1-10-2, Tsukushino, Abiko-shi, Chiba, 270-1164, Japan.
- Kenji Araki
- Faculty of Medicine, University of Miyazaki Hospital, 5200, Kihara, Kiyotake-cho, Miyazaki-shi, Miyazaki, 889-1692, Japan.
|
29
|
Menychtas A, Tsanakas P, Maglogiannis I. Knowledge Discovery on IoT-Enabled mHealth Applications. Adv Exp Med Biol 2020; 1194:181-91. [PMID: 32468534 DOI: 10.1007/978-3-030-32622-7_16] [Citation(s) in RCA: 0] [Impact Index Per Article: 0]
Abstract
The exponential growth of the number and variety of IoT devices and applications for personal use, as well as the improvement of their quality and performance, facilitates the realization of intelligent eHealth concepts. Nowadays, it is easier than ever for individuals to monitor themselves, quantify, and log their everyday activities in order to gain insights about their body's performance and receive recommendations and incentives to improve it. Of course, in order for such systems to live up to the promise, given the treasure trove of data that is collected, machine learning techniques need to be integrated in the processing and analysis of the data. This systematic and automated quantification, logging, and analysis of personal data, using IoT and AI technologies, have given birth to the phenomenon of Quantified-Self. This work proposes a prototype decentralized Quantified-Self application, built on top of a dedicated IoT gateway that aggregates and analyzes data from multiple sources, such as biosignal sensors and wearables, and performs analytics on it.
|
30
|
Abstract
Objective Although sequencing and other high-throughput data production technologies are increasingly affordable, data analysis and interpretation remains a significant factor in the cost of -omics studies. Despite the broad acceptance of findable, accessible, interoperable, and reusable (FAIR) data principles which focus on data discoverability and annotation, data integration remains a significant bottleneck in linking prior work in order to better understand novel research. Relevant and timely information discovery is difficult for increasingly multi-disciplinary projects when scientists cannot easily keep up with work across multiple fields. Computational tools are necessary to accurately describe data contents, and empower linkage to existing resources without prior knowledge of the various database resources. Results We developed the Databio tool, accessible at https://datab.io/, to automate data parsing, identifier detection, and streamline common tasks to provide a point-and-click approach to data manipulation and integration in life sciences research and translational medicine. Databio uses fast real-time data structures and a data warehouse of 137 million identifiers, with automated heuristics to describe data provenance without highly specialized knowledge or bioinformatics training.
Affiliation(s)
- Robert W Reid
- Department of Bioinformatics and Genomics, College of Computing and Informatics, University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, NC, 28223, USA; North Carolina Research Campus, 150 N Research Campus Dr, Kannapolis, NC, 28081, USA
- Jacob W Ferrier
- Department of Bioinformatics and Genomics, College of Computing and Informatics, University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, NC, 28223, USA
- Jeremy J Jay
- Department of Bioinformatics and Genomics, College of Computing and Informatics, University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, NC, 28223, USA; North Carolina Research Campus, 150 N Research Campus Dr, Kannapolis, NC, 28081, USA.
|
31
|
Heo GE, Xie Q, Song M, Lee JH. Combining entity co-occurrence with specialized word embeddings to measure entity relation in Alzheimer's disease. BMC Med Inform Decis Mak 2019; 19:240. [PMID: 31801521 PMCID: PMC6894106 DOI: 10.1186/s12911-019-0934-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/11/2022] Open
Abstract
Background Extracting useful information from biomedical literature plays an important role in the development of modern medicine. In natural language processing, there have been rigorous attempts to find meaningful relationships between entities automatically by co-occurrence-based methods. It has been increasingly important to understand whether relationships exist, and if so how strong, between any two entities extracted from a large number of texts. One of the defining methods is to measure semantic similarity and relatedness between two entities. Methods We propose a hybrid ranking method that combines a co-occurrence approach considering both direct and indirect entity pair relationship with specialized word embeddings for measuring the relatedness of two entities. Results We evaluate the proposed ranking method comparatively with other well-known methods such as co-occurrence, Word2Vec, COALS (Correlated Occurrence Analog to Lexical Semantics), and random indexing by calculating top-ranked entities related to Alzheimer’s disease. In addition, we analyze gene, pathway, and gene–phenotype relationships. Overall, the proposed method tends to find more hidden relationships than the other methods. Conclusion Our proposed method is able to select more useful related entities that not only highly co-occur but also have more indirect relations for the target entity. In pathway analysis, our proposed method shows superior performance at identifying (functional) cross clustering and higher-level pathways. Our proposed method, resulting from phenotype analysis, has an advantage in identifying the common genotype relating to phenotypes from biological literature.
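The abstract does not give the scoring formula, so the following is only a sketch of one way a co-occurrence signal and an embedding signal could be combined into a single ranking. It is an assumption, not the authors' method: it uses direct co-occurrence counts only (the paper also considers indirect relations), a simple weighted sum with a hypothetical weight `alpha`, and made-up entity names and vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hybrid_rank(target, cooc, vectors, alpha=0.5):
    """Rank entities by alpha * normalized co-occurrence + (1 - alpha) * cosine.

    `cooc` maps an entity to its co-occurrence count with `target`;
    `vectors` maps every entity (including `target`) to an embedding.
    Hypothetical scoring, for illustration only.
    """
    if not cooc:
        return []
    max_count = max(cooc.values()) or 1
    scores = {}
    for entity, count in cooc.items():
        cooc_score = count / max_count
        emb_score = cosine(vectors[target], vectors[entity])
        scores[entity] = alpha * cooc_score + (1 - alpha) * emb_score
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (assumed): counts and 2-d embeddings for three candidate entities.
cooc = {"ent_a": 10, "ent_b": 5, "ent_c": 8}
vectors = {
    "target": [1.0, 0.0],
    "ent_a": [1.0, 0.0],
    "ent_b": [0.6, 0.8],
    "ent_c": [0.0, 1.0],
}
ranking = hybrid_rank("target", cooc, vectors)
```

In this toy case, `ent_b` co-occurs less often with the target than `ent_c` but sits closer to it in the embedding space, so the combined score ranks it higher than a ranking by raw counts would.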
Affiliation(s)
- Go Eun Heo
- Department of Library and Information Science, Yonsei University, 50 Yonsei-ro Seodaemun-gu, Seoul, 03722, Republic of Korea
- Qing Xie
- Department of Library and Information Science, Yonsei University, 50 Yonsei-ro Seodaemun-gu, Seoul, 03722, Republic of Korea
- Min Song
- Department of Library and Information Science, Yonsei University, 50 Yonsei-ro Seodaemun-gu, Seoul, 03722, Republic of Korea.
- Jeong-Hoon Lee
- Department of Creative IT Engineering, POSTECH, 77 Cheongam-ro Nam-gu, Pohang, Gyeongbuk, 37673, Republic of Korea
|
32
|
Cao XH, Han C, Glass LM, Kindman A, Obradovic Z. Time-to-event estimation by re-defining time. J Biomed Inform 2019; 100:103326. [PMID: 31678589 DOI: 10.1016/j.jbi.2019.103326] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2019] [Revised: 09/05/2019] [Accepted: 10/28/2019] [Indexed: 11/26/2022]
Abstract
The primary goal of a time-to-event estimation model is to accurately infer the occurrence time of a target event. Most existing studies focus on developing new models to effectively utilize the information in the censored observations. In this paper, we propose a model to tackle the time-to-event estimation problem from a completely different perspective. Our model relaxes a fundamental constraint that the target variable, time, is a univariate number which satisfies a partial order. Instead, the proposed model interprets each event occurrence time as a time concept with a vector representation. We hypothesize that the model will be more accurate and interpretable by capturing (1) the relationships between features and time concept vectors and (2) the relationships among time concept vectors. We also propose a scalable framework to simultaneously learn the model parameters and time concept vectors. Rigorous experiments and analysis have been conducted in medical event prediction task on seven gene expression datasets. The results demonstrate the efficiency and effectiveness of the proposed model. Furthermore, similarity information among time concept vectors helped in identifying time regimes, thus leading to a potential knowledge discovery related to the human cancer considered in our experiments.
Affiliation(s)
- Xi Hang Cao
- Center for Data Analytics and Biomedical Informatics, Temple University, 386 SERC, 1925 N. 12th St., Philadelphia, PA 19122, USA.
- Chao Han
- Center for Data Analytics and Biomedical Informatics, Temple University, 386 SERC, 1925 N. 12th St., Philadelphia, PA 19122, USA.
- Zoran Obradovic
- Center for Data Analytics and Biomedical Informatics, Temple University, 386 SERC, 1925 N. 12th St., Philadelphia, PA 19122, USA.
|
33
|
Abstract
Emerging applications of machine learning and artificial intelligence offer the opportunity to discover new clinical knowledge through secondary exploration of existing patient medical records. This new knowledge may in turn offer a foundation to build new types of clinical decision support (CDS) that provide patient-specific insights and guidance across a wide range of clinical questions and settings. This article will provide an overview of these emerging approaches to CDS, discussing both existing technologies as well as challenges that health systems and informaticists will need to address to allow these emerging approaches to reach their full potential.
Affiliation(s)
- Jason M Baron
- Department of Pathology, Massachusetts General Hospital, Harvard Medical School, 55 Fruit Street, Boston, MA 02214, USA.
- Danielle E Kurant
- Department of Pathology, Massachusetts General Hospital, Harvard Medical School, 55 Fruit Street, Boston, MA 02214, USA
- Anand S Dighe
- Department of Pathology, Massachusetts General Hospital, Harvard Medical School, 55 Fruit Street, Boston, MA 02214, USA
|
34
|
Rodriguez JC, Merino GA, Llera AS, Fernández EA. Massive integrative gene set analysis enables functional characterization of breast cancer subtypes. J Biomed Inform 2019; 93:103157. [PMID: 30928514 DOI: 10.1016/j.jbi.2019.103157] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 12/22/2018] [Revised: 03/11/2019] [Accepted: 03/22/2019] [Indexed: 01/31/2023]
Abstract
The availability of large-scale repositories and integrated cancer genome efforts have created unprecedented opportunities to study and describe cancer biology. In this sense, the aim of translational researchers is the integration of multiple omics data to achieve a better identification of homogeneous subgroups of patients in order to develop adequate diagnostic and treatment strategies from the personalized medicine perspective. So far, existing integrative methods have grouped together omics data information, leaving out individual omics data phenotypic interpretation. Here, we present the Massive and Integrative Gene Set Analysis (MIGSA) R package. This tool can analyze several high throughput experiments in a comprehensive way through a functional analysis strategy, relating a phenotype to its biological function counterpart defined by means of gene sets. By simultaneously querying different multiple omics data from the same or different groups of patients, common and specific functional patterns for each studied phenotype can be obtained. The usefulness of MIGSA was demonstrated by applying the package to functionally characterize the intrinsic breast cancer PAM50 subtypes. For each subtype, specific functional transcriptomic profiles and gene sets enriched by transcriptomic and proteomic data were identified. To achieve this, transcriptomic and proteomic data from 28 datasets were analyzed using MIGSA. As a result, enriched gene sets and important genes were consistently found as related to a specific subtype across experiments or data types and thus can be used as molecular signature biomarkers.
|
35
|
Abstract
Medical data is among the most rewarding and yet most complicated data to analyze. How can healthcare providers use modern data analytics tools and technologies to analyze and create value from complex data? Data analytics promises to efficiently discover valuable patterns by analyzing large amounts of unstructured, heterogeneous, non-standard and incomplete healthcare data. It not only forecasts but also helps in decision making, and is increasingly seen as a breakthrough whose goal is to improve the quality of patient care and reduce healthcare costs. The aim of this study is to provide a comprehensive and structured overview of the extensive research on the advancement of data analytics methods for disease prevention. This review first introduces disease prevention and its challenges, followed by traditional prevention methodologies. We summarize state-of-the-art data analytics algorithms used for classification of disease, clustering (unusually high incidence of a particular disease), anomaly detection (detection of disease) and association, as well as their respective advantages, drawbacks and guidelines for selecting a specific model, followed by a discussion of recent developments and successful applications of disease prevention methods. The article concludes with open research challenges and recommendations.
|
36
|
Abstract
Most biological processes including diseases are multifactorial and determined by a complex interplay of various genetic and environmental factors. This chapter aims to provide a user guide to data querying, analysis, and visualization with TargetMine and the associated auxiliary toolkit. We have also discussed some of the commonly used data queries for the researchers who are interested in gene set analysis within a data warehouse framework. Overall, TargetMine provides a convenient web browser-based interface that enables the discovery of new hypotheses interactively, by performing analysis of omics data using complicated searches without any scripting and programming efforts on the part of the user and also by providing the results in an easy-to-comprehend output format.
Affiliation(s)
- Yi-An Chen
- Laboratory of Bioinformatics, National Institutes of Biomedical Innovation, Health and Nutrition, Ibaraki, Osaka, Japan
- Lokesh P Tripathi
- Laboratory of Bioinformatics, National Institutes of Biomedical Innovation, Health and Nutrition, Ibaraki, Osaka, Japan.
- Kenji Mizuguchi
- Laboratory of Bioinformatics, National Institutes of Biomedical Innovation, Health and Nutrition, Ibaraki, Osaka, Japan.
|
37
|
Arji G, Safdari R, Rezaeizadeh H, Abbassian A, Mokhtaran M, Hossein Ayati M. A systematic literature review and classification of knowledge discovery in traditional medicine. Comput Methods Programs Biomed 2019; 168:39-57. [PMID: 30392889 DOI: 10.1016/j.cmpb.2018.10.017] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 08/31/2018] [Revised: 10/14/2018] [Accepted: 10/26/2018] [Indexed: 06/08/2023]
Abstract
INTRODUCTION AND OBJECTIVE Despite the importance of applying machine learning methods in traditional medicine, there is no systematic literature review or classification for this field. This is the first comprehensive literature review of the application of data mining methods in traditional medicine. METHOD We reviewed 5 databases from 2000 to 2017 based on the Kitchenham systematic review methodology. 502 articles were identified and reviewed for their relevance to the application of machine learning methods in traditional medicine, and 42 selected papers were classified and categorized on four dimensions: 1) the application domain of data mining techniques in traditional medicine; 2) the data mining methods most frequently used in traditional medicine; 3) the main strengths and limitations of data mining techniques in traditional medicine; 4) the performance evaluation methods for data mining methods in traditional medicine. RESULT The results showed that the main application domain of data mining techniques in traditional medicine was syndrome differentiation. Bayesian Networks (BNs), Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs) were the methods most frequently applied in traditional medicine. Furthermore, each data mining technique has its own strengths and limitations when applied in traditional medicine. Single scalar methods were frequently used for performance evaluation of data mining methods. CONCLUSION Machine learning methods have become an important research field in traditional medicine. Our research provides information about these methods by examining the related articles.
Affiliation(s)
- Goli Arji
- Department of Health Information Management, School of Allied Medical Sciences, Tehran University of Medical Sciences, Tehran, Iran
- Reza Safdari
- Department of Health Information Management, School of Allied Medical Sciences, Tehran University of Medical Sciences, Tehran, Iran.
- Hossein Rezaeizadeh
- Department of Traditional Medicine, School of Traditional Medicine, Tehran University of Medical Science, Tehran, Iran
- Alireza Abbassian
- Department of Traditional Medicine, School of Traditional Medicine, Tehran University of Medical Science, Tehran, Iran
- Mehrshad Mokhtaran
- Assistant Professor of Medical Informatics, Tehran University of Medical Sciences, Tehran, Iran
- Mohammad Hossein Ayati
- Department of Traditional Medicine, School of Traditional Medicine, Tehran University of Medical Science, Tehran, Iran
|
38
|
Benhar H, Idri A, Fernández-Alemán JL. A Systematic Mapping Study of Data Preparation in Heart Disease Knowledge Discovery. J Med Syst 2018; 43:17. [PMID: 30542772 DOI: 10.1007/s10916-018-1134-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 06/26/2018] [Accepted: 12/03/2018] [Indexed: 01/25/2023]
Abstract
The increasing amount of data produced by various biomedical and healthcare systems has led to a need for methodologies related to knowledge data discovery. Data mining (DM) offers a set of powerful techniques that allow the identification and extraction of relevant information from medical datasets, thus enabling doctors and patients to greatly benefit from DM, particularly in the case of diseases with high mortality and morbidity rates, such as heart disease (HD). Nonetheless, the use of raw medical data implies several challenges, such as missing data, noise, redundancy and high dimensionality, which make the extraction of useful and relevant information difficult. Intensive research has, therefore, recently begun in order to prepare raw healthcare data before knowledge extraction. In any knowledge data discovery (KDD) process, data preparation is the step prior to DM that deals with data imperfections in order to improve data quality, so as to satisfy the requirements and improve the performance of DM techniques. The objective of this paper is to perform a systematic mapping study (SMS) on data preparation for KDD in cardiology so as to provide an overview of the quantity and type of research carried out in this respect. The SMS covered a set of 58 selected papers published between January 2000 and December 2017. The selected studies were analyzed according to six criteria: year and channel of publication, preparation task, medical task, DM objective, research type and empirical type. The results show that a large amount of data preparation research was carried out in order to improve the performance of DM-based decision support systems in cardiology. Researchers were mainly interested in the data reduction preparation task and particularly in feature selection. Moreover, the majority of the selected studies focused on classification for the diagnosis of HD.
Two main research types were identified in the selected studies: solution proposal and evaluation research, and the most frequently used empirical type was that of historical-based evaluation.
|
39
|
Ostaszewski M, Kieffer E, Danoy G, Schneider R, Bouvry P. Clustering approaches for visual knowledge exploration in molecular interaction networks. BMC Bioinformatics 2018; 19:308. [PMID: 30157777 PMCID: PMC6116538 DOI: 10.1186/s12859-018-2314-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 10/20/2017] [Accepted: 08/14/2018] [Indexed: 12/02/2022] Open
Abstract
BACKGROUND Biomedical knowledge grows in complexity, and becomes encoded in network-based repositories, which include focused, expert-drawn diagrams, networks of evidence-based associations and established ontologies. Combining these structured information sources is an important computational challenge, as large graphs are difficult to analyze visually. RESULTS We investigate knowledge discovery in manually curated and annotated molecular interaction diagrams. To evaluate similarity of content we use: i) Euclidean distance in expert-drawn diagrams, ii) shortest path distance using the underlying network and iii) ontology-based distance. We employ clustering with these metrics used separately and in pairwise combinations. We propose a novel bi-level optimization approach together with an evolutionary algorithm for informative combination of distance metrics. We compare the enrichment of the obtained clusters between the solutions and with expert knowledge. We calculate the number of Gene and Disease Ontology terms discovered by different solutions as a measure of cluster quality. Our results show that combining distance metrics can improve clustering accuracy, based on the comparison with expert-provided clusters. Also, the performance of specific combinations of distance functions depends on the clustering depth (number of clusters). By employing bi-level optimization approach we evaluated relative importance of distance functions and we found that indeed the order by which they are combined affects clustering performance. Next, with the enrichment analysis of clustering results we found that both hierarchical and bi-level clustering schemes discovered more Gene and Disease Ontology terms than expert-provided clusters for the same knowledge repository. Moreover, bi-level clustering found more enriched terms than the best hierarchical clustering solution for three distinct distance metric combinations in three different instances of disease maps. 
CONCLUSIONS In this work we examined the impact of different distance functions on clustering of a visual biomedical knowledge repository. We found that combining distance functions may be beneficial for clustering, and improve exploration of such repositories. We proposed bi-level optimization to evaluate the importance of order by which the distance functions are combined. Both combination and order of these functions affected clustering quality and knowledge recognition in the considered benchmarks. We propose that multiple dimensions can be utilized simultaneously for visual knowledge exploration.
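The idea of clustering with a combination of distance metrics can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two toy matrices stand in for a layout (Euclidean) distance and a network (shortest-path) distance, the combination is a plain weighted sum, and single-linkage merging is used. It is not the bi-level optimization or evolutionary algorithm described in the abstract, and all function names are hypothetical.

```python
def normalize(matrix):
    """Scale a distance matrix so that its largest entry equals 1.

    Assumes the matrix contains at least one non-zero distance.
    """
    largest = max(max(row) for row in matrix)
    return [[d / largest for d in row] for row in matrix]


def combine(dist_a, dist_b, weight):
    """Weighted sum of two normalized distance matrices (an assumed scheme)."""
    a, b = normalize(dist_a), normalize(dist_b)
    n = len(a)
    return [
        [weight * a[i][j] + (1 - weight) * b[i][j] for j in range(n)]
        for i in range(n)
    ]


def single_linkage(dist, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [{i} for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[p] for j in clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] |= clusters[q]
        del clusters[q]
    return sorted(sorted(c) for c in clusters)


# Toy data for four diagram elements (illustrative values only).
layout = [[0, 1, 4, 5], [1, 0, 3, 4], [4, 3, 0, 1], [5, 4, 1, 0]]
network = [[0, 1, 3, 3], [1, 0, 2, 2], [3, 2, 0, 1], [3, 2, 1, 0]]

combined = combine(layout, network, weight=0.5)
print(single_linkage(combined, k=2))
```

Changing `weight` shifts the influence from one metric to the other, which is the simplest way to see why the choice and combination of distance functions can change the resulting clusters.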
Affiliation(s)
- Marek Ostaszewski: Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 7, Avenue des Hauts-Fourneaux, Esch-Belval, Luxembourg
- Emmanuel Kieffer: Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg, 6, Avenue de la Fonte, Esch-Belval, Luxembourg
- Grégoire Danoy: Computer Science and Communications Research Unit, University of Luxembourg, 6, Avenue de la Fonte, Esch-Belval, Luxembourg
- Reinhard Schneider: Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 7, Avenue des Hauts-Fourneaux, Esch-Belval, Luxembourg
- Pascal Bouvry: Computer Science and Communications Research Unit, University of Luxembourg, 6, Avenue de la Fonte, Esch-Belval, Luxembourg
|
40
|
Xin J, Afrasiabi C, Lelong S, Adesara J, Tsueng G, Su AI, Wu C. Cross-linking BioThings APIs through JSON-LD to facilitate knowledge exploration. BMC Bioinformatics 2018; 19:30. [PMID: 29390967 PMCID: PMC5796402 DOI: 10.1186/s12859-018-2041-5] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 07/21/2017] [Accepted: 01/24/2018] [Indexed: 01/25/2023] Open
Abstract
BACKGROUND Application Programming Interfaces (APIs) are now widely used to distribute biological data, and many popular biological APIs developed by different research teams have adopted JavaScript Object Notation (JSON) as their primary data format. While usage of a common data format offers significant advantages, that alone is not sufficient for rich integrative queries across APIs. RESULTS Here, we have implemented JSON for Linking Data (JSON-LD) technology on the BioThings APIs that we have developed: MyGene.info, MyVariant.info and MyChem.info. JSON-LD provides a standard way to add semantic context to the existing JSON data structure, for the purpose of enhancing the interoperability between APIs. We demonstrated several use cases that were facilitated by semantic annotations using JSON-LD, including simpler and more precise query capabilities as well as API cross-linking. CONCLUSIONS We believe that this pattern offers a generalizable solution for interoperability of APIs in the life sciences.
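As a rough illustration of how a JSON-LD context adds semantic meaning to plain JSON, the Python sketch below attaches an `@context` entry that maps field names to IRIs. The record, its field names, and the `example.org` IRIs are illustrative assumptions; they are not the actual output or context of the BioThings APIs.

```python
import json

# Hypothetical API record; field names and values are illustrative only.
record = {"gene_id": "1017", "symbol": "CDK2"}

# Hypothetical JSON-LD context: maps each plain JSON key to an IRI so that
# two APIs using different key names can agree on what a field means.
context = {
    "gene_id": "http://example.org/vocab/ncbi_gene_id",
    "symbol": "http://example.org/vocab/gene_symbol",
}


def attach_context(doc, ctx):
    """Return a copy of doc with a JSON-LD '@context' entry added."""
    annotated = dict(doc)
    annotated["@context"] = ctx
    return annotated


def fields_by_iri(doc):
    """Index the values of a context-annotated document by IRI, not by key."""
    ctx = doc.get("@context", {})
    return {ctx[key]: value for key, value in doc.items() if key in ctx}


annotated = attach_context(record, context)
print(json.dumps(annotated, indent=2))
print(fields_by_iri(annotated))
```

Once values are keyed by IRI, a client could match a field from one API against a field from another even when the two use different JSON key names, which is the kind of cross-linking the abstract describes.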
Affiliation(s)
- Jiwen Xin: Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
- Cyrus Afrasiabi: Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
- Sebastien Lelong: Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
- Julee Adesara: Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
- Ginger Tsueng: Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
- Andrew I Su: Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
- Chunlei Wu: Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
|
41
|
Silva JCF, Carvalho TFM, Basso MF, Deguchi M, Pereira WA, Sobrinho RR, Vidigal PMP, Brustolini OJB, Silva FF, Dal-Bianco M, Fontes RLF, Santos AA, Zerbini FM, Cerqueira FR, Fontes EPB. Geminivirus data warehouse: a database enriched with machine learning approaches. BMC Bioinformatics 2017; 18:240. [PMID: 28476106 PMCID: PMC5420152 DOI: 10.1186/s12859-017-1646-4] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 12/23/2016] [Accepted: 04/25/2017] [Indexed: 03/28/2023] Open
Abstract
BACKGROUND The Geminiviridae family encompasses a group of single-stranded DNA viruses with twinned and quasi-isometric virions, which infect a wide range of dicotyledonous and monocotyledonous plants and are responsible for significant economic losses worldwide. Geminiviruses are divided into nine genera, according to their insect vector, host range, genome organization, and phylogeny reconstruction. Using rolling-circle amplification approaches along with high-throughput sequencing technologies, thousands of full-length geminivirus and satellite genome sequences were amplified and have become available in public databases. As a consequence, many important challenges have emerged, namely, how to classify, store, and analyze massive datasets as well as how to extract information or new knowledge. Data mining approaches, mainly supported by machine learning (ML) techniques, are a natural means for high-throughput data analysis in the context of genomics, transcriptomics, proteomics, and metabolomics. RESULTS Here, we describe the development of a data warehouse enriched with ML approaches, designated geminivirus.org. We implemented search modules, bioinformatics tools, and ML methods to retrieve high precision information, demarcate species, and create classifiers for genera and open reading frames (ORFs) of geminivirus genomes. CONCLUSIONS The use of data mining techniques such as ETL (Extract, Transform, Load) to feed our database, as well as algorithms based on machine learning for knowledge extraction, allowed us to obtain a database with quality data and suitable tools for bioinformatics analysis. The Geminivirus Data Warehouse (geminivirus.org) offers a simple and user-friendly environment for information retrieval and knowledge discovery related to geminiviruses.
Affiliation(s)
- Jose Cleydson F Silva: Departamento de Informática, Universidade Federal de Viçosa, Viçosa, Brazil; National Institute of Science and Technology in Plant-Pest Interactions/BIOAGRO, Universidade Federal de Viçosa, Viçosa, Brazil
- Marcos F Basso: National Institute of Science and Technology in Plant-Pest Interactions/BIOAGRO, Universidade Federal de Viçosa, Viçosa, Brazil
- Michihito Deguchi: National Institute of Science and Technology in Plant-Pest Interactions/BIOAGRO, Universidade Federal de Viçosa, Viçosa, Brazil
- Welison A Pereira: National Institute of Science and Technology in Plant-Pest Interactions/BIOAGRO, Universidade Federal de Viçosa, Viçosa, Brazil
- Roberto R Sobrinho: National Institute of Science and Technology in Plant-Pest Interactions/BIOAGRO, Universidade Federal de Viçosa, Viçosa, Brazil
- Pedro M P Vidigal: Núcleo de Biomoléculas, Universidade Federal de Viçosa, Viçosa, MG, Brazil
- Otávio J B Brustolini: National Institute of Science and Technology in Plant-Pest Interactions/BIOAGRO, Universidade Federal de Viçosa, Viçosa, Brazil
- Fabyano F Silva: Departamento de Zootecnia, Universidade Federal de Viçosa, Viçosa, Brazil
- Maximiller Dal-Bianco: National Institute of Science and Technology in Plant-Pest Interactions/BIOAGRO, Universidade Federal de Viçosa, Viçosa, Brazil
- Anésia A Santos: National Institute of Science and Technology in Plant-Pest Interactions/BIOAGRO, Universidade Federal de Viçosa, Viçosa, Brazil; Departamento de Biologia Geral, Universidade Federal de Viçosa, Viçosa, Brazil
- Francisco Murilo Zerbini: National Institute of Science and Technology in Plant-Pest Interactions/BIOAGRO, Universidade Federal de Viçosa, Viçosa, Brazil; Departamento de Fitopatologia, Universidade Federal de Viçosa, Viçosa, MG, Brazil
- Fabio R Cerqueira: Departamento de Informática, Universidade Federal de Viçosa, Viçosa, Brazil; Departamento de Engenharia de Produção, Universidade Federal Fluminense, Petrópolis, Rio de Janeiro, Brazil
- Elizabeth P B Fontes: National Institute of Science and Technology in Plant-Pest Interactions/BIOAGRO, Universidade Federal de Viçosa, Viçosa, Brazil; Departamento de Bioquímica e Biologia Molecular, Universidade Federal de Viçosa, Viçosa, Brazil
|
42
|
Abstract
Background Most hydrophilic and hydrophobic residues are thought to be exposed and buried in proteins, respectively. In contrast to the majority of the existing studies on protein folding characteristics using protein structures, in this study, our aim was to design predictors for estimating relative solvent accessibility (RSA) of amino acid residues to discover protein folding characteristics from sequences. Methods The proposed 20 real-value RSA predictors were designed on the basis of the support vector regression method with a set of informative physicochemical properties (PCPs) obtained by means of an optimal feature selection algorithm. Then, molecular dynamics simulations were performed for validating the knowledge discovered by analysis of the selected PCPs. Results The RSA predictors had a mean absolute error of 14.11% and a correlation coefficient of 0.69, better than the existing predictors. The hydrophilic-residue predictors preferred PCPs of buried amino acid residues to PCPs of exposed ones as prediction features. A hydrophobic spine composed of exposed hydrophobic residues of an α-helix was discovered by analyzing the PCPs of RSA predictors corresponding to hydrophobic residues. For example, the results of a molecular dynamics simulation of wild-type sequences and their mutants showed that proteins 1MOF and 2WRP_H16I (Protein Data Bank IDs), which have a perfectly hydrophobic spine, have more stable structures than 1MOF_I54D and 2WRP do (which do not have a perfectly hydrophobic spine). Conclusions We identified informative PCPs to design high-performance RSA predictors and to analyze these PCPs for identification of novel protein folding characteristics. A hydrophobic spine in a protein can help to stabilize exposed α-helices. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1368-z) contains supplementary material, which is available to authorized users.
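The two figures of merit quoted above, mean absolute error and correlation coefficient, can be computed as in the following Python sketch. The abstract does not say which correlation coefficient was used, so Pearson's is an assumption here, and the RSA values are made-up numbers for illustration, not data from the study.

```python
from math import sqrt


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Made-up RSA values in percent (illustrative only).
observed_rsa = [10.0, 35.0, 60.0, 80.0]
predicted_rsa = [20.0, 30.0, 50.0, 90.0]

print(mean_absolute_error(observed_rsa, predicted_rsa))
print(pearson_correlation(observed_rsa, predicted_rsa))
```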
Affiliation(s)
- Yi-Fan Liou: Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Hui-Ling Huang: Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan; Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan
- Shinn-Ying Ho: Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan; Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan
|
43
|
Abstract
This paper uses examples from Australia to argue for a new approach to integrative research in the Earth's near surface environment where the pedosphere, atmosphere, hydrosphere, and biosphere interact, the so-called 'Critical Zone'. In Australia, for around 25 years, environmental data layers presented through Geographical Information Systems software have been combined with field-based measurements and observations to produce spatially explicit predictive models for digitally mapping soils and soil properties. The availability of spatially extensive datasets representing different factors of landscape evolution and their exploration with machine learning and rule induction techniques also allow the evaluation of emergent patterns against existing domain knowledge, which in turn can lead to new insights and can facilitate their extrapolation over large areas. Thus the data-driven approach is complementary to the hypothesis-driven scientific inquiry in Critical Zone observatories.
Affiliation(s)
- Elisabeth N Bui: CSIRO Land and Water, GPO Box 1666, Canberra ACT 2601, Australia
|
44
|
Rodriguez JC, González GA, Fresno C, Llera AS, Fernández EA. Improving information retrieval in functional analysis. Comput Biol Med 2016; 79:10-20. [PMID: 27723507 DOI: 10.1016/j.compbiomed.2016.09.017] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 04/20/2016] [Revised: 09/21/2016] [Accepted: 09/22/2016] [Indexed: 12/20/2022]
Abstract
Transcriptome analysis is essential to understand the mechanisms regulating key biological processes and functions. The first step usually consists of identifying candidate genes; to find out which pathways are affected by those genes, however, functional analysis (FA) is mandatory. The most frequently used strategies for this purpose are Gene Set and Singular Enrichment Analysis (GSEA and SEA) over Gene Ontology. Several statistical methods have been developed and compared in terms of computational efficiency and/or statistical appropriateness. However, whether their results are similar or complementary, the sensitivity to parameter settings, or possible bias in the analyzed terms has not been addressed so far. Here, two GSEA and four SEA methods and their parameter combinations were evaluated in six datasets by comparing two breast cancer subtypes with well-known differences in genetic background and patient outcomes. We show that GSEA and SEA lead to different results depending on the chosen statistic, model and/or parameters. Both approaches provide complementary results from a biological perspective. Hence, an Integrative Functional Analysis (IFA) tool is proposed to improve information retrieval in FA. It provides a common gene expression analytic framework that grants a comprehensive and coherent analysis. Only a minimal user parameter setting is required, since the best SEA/GSEA alternatives are integrated. IFA utility was demonstrated by evaluating four prostate cancer and the TCGA breast cancer microarray datasets, which showed its biological generalization capabilities.
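For readers unfamiliar with Singular Enrichment Analysis, a common statistic in this family is the hypergeometric over-representation test, sketched below in Python. The abstract does not name the four SEA methods that were evaluated, so the choice of this test is an assumption, and the function name and the gene counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(population, annotated, sample, hits):
    """Upper-tail hypergeometric probability P(X >= hits).

    population: number of genes in the background set
    annotated:  background genes annotated with the ontology term
    sample:     number of candidate genes drawn (without replacement)
    hits:       candidate genes annotated with the term
    """
    max_hits = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, max_hits + 1)
    )
    return tail / total


# Illustrative numbers only: 20000 background genes, 100 carry the term,
# 50 candidate genes, 5 of which carry the term.
p_value = hypergeom_enrichment_pvalue(20000, 100, 50, 5)
print(f"{p_value:.3e}")
```

A small p-value suggests the term is over-represented among the candidate genes. In practice the p-values for many terms would also need a multiple-testing correction, which this sketch leaves out.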
Affiliation(s)
- Juan C Rodriguez: UA AREA CS. AGR. ING. BIO. Y S, Universidad Católica de Córdoba, CONICET, Córdoba, Argentina; Facultad de Matemática, Astronomía y Física, Universidad Nacional de Córdoba, Córdoba, Argentina
- Germán A González: UA AREA CS. AGR. ING. BIO. Y S, Universidad Católica de Córdoba, CONICET, Córdoba, Argentina; Instituto Nacional de Cáncer, MinSal, Córdoba, Argentina
- Cristóbal Fresno: UA AREA CS. AGR. ING. BIO. Y S, Universidad Católica de Córdoba, CONICET, Córdoba, Argentina
- Andrea S Llera: IIBBA, Fund. Instituto Leloir, CONICET, Buenos Aires, Argentina
- Elmer A Fernández: UA AREA CS. AGR. ING. BIO. Y S, Universidad Católica de Córdoba, CONICET, Córdoba, Argentina; Facultad de Ciencias Exactas, Físicas y Naturales, Universidad Nacional de Córdoba, Córdoba, Argentina
|
45
|
Yimam SM, Biemann C, Majnaric L, Šabanović Š, Holzinger A. An adaptive annotation approach for biomedical entity and relation recognition. Brain Inform 2016; 3:157-168. [PMID: 27747591 PMCID: PMC4999566 DOI: 10.1007/s40708-016-0036-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 11/25/2015] [Accepted: 01/25/2016] [Indexed: 12/14/2022] Open
Abstract
In this article, we demonstrate the impact of interactive machine learning: we develop a biomedical entity recognition dataset using a human-in-the-loop approach. In contrast to classical machine learning, human-in-the-loop approaches do not operate on predefined training or test sets, but assume that human input regarding system improvement is supplied iteratively. Here, during annotation, a machine learning model is built on previous annotations and used to propose labels for subsequent annotation. To demonstrate that such interactive and iterative annotation speeds up the development of quality dataset annotation, we conduct three experiments. In the first experiment, we carry out an iterative annotation experimental simulation and show that only a handful of medical abstracts need to be annotated to produce suggestions that increase annotation speed. In the second experiment, clinical doctors conducted a case study in annotating medical terms in documents relevant for their research. The third experiment explores the annotation of semantic relations with relation instance learning across documents. The experiments validate our method qualitatively and quantitatively, and give rise to a more personalized, responsive information extraction technology.
Affiliation(s)
- Seid Muhie Yimam: TU Darmstadt CS Department, FG Language Technology, 64289 Darmstadt, Germany
- Chris Biemann: TU Darmstadt CS Department, FG Language Technology, 64289 Darmstadt, Germany
- Ljiljana Majnaric: Josip Juraj Strossmayer University of Osijek Faculty of Medicine Osijek, Osijek, Croatia
- Šefket Šabanović: Josip Juraj Strossmayer University of Osijek Faculty of Medicine Osijek, Osijek, Croatia
- Andreas Holzinger: Research Unit HCI-KDD Institute for Medical Informatics, Statistics and Documentation Medical University Graz, Auenbruggerplatz 2, 8036 Graz, Austria
|
46
|
Zaslavsky L, Ciufo S, Fedorov B, Tatusova T. Clustering analysis of proteins from microbial genomes at multiple levels of resolution. BMC Bioinformatics 2016; 17 Suppl 8:276. [PMID: 27586436 PMCID: PMC5009818 DOI: 10.1186/s12859-016-1112-8] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 11/27/2022] Open
Abstract
Background Microbial genomes at the National Center for Biotechnology Information (NCBI) represent a large collection of more than 35,000 assemblies. There are several complexities associated with the data: a great variation in sampling density since human pathogens are densely sampled while other bacteria are less represented; different protein families occur in annotations with different frequencies; and the quality of genome annotation varies greatly. In order to extract useful information from these sophisticated data, the analysis needs to be performed at multiple levels of phylogenomic resolution and protein similarity, with an adequate sampling strategy. Results Protein clustering is used to construct meaningful and stable groups of similar proteins to be used for analysis and functional annotation. Our approach is to create protein clusters at three levels. First, tight clusters in groups of closely-related genomes (species-level clades) are constructed using a combined approach that takes into account both sequence similarity and genome context. Second, clustroids of conservative in-clade clusters are organized into seed global clusters. Finally, global protein clusters are built around the seed clusters. We propose filtering strategies that allow limiting the protein set included in global clustering. The in-clade clustering procedure, subsequent selection of clustroids and organization into seed global clusters provide a robust representation and a high rate of compression. Seed protein clusters are further extended by adding related proteins. Extended seed clusters include a significant part of the data and represent all major known cell machinery. The remaining part, coming from either non-conservative (unique) or rapidly evolving proteins, from rare genomes, or resulting from low-quality annotation, does not group together well.
Processing these proteins requires significant computational resources and results in a large number of questionable clusters. Conclusion The developed filtering strategies make it possible to identify and exclude such peripheral proteins, limiting the protein dataset in global clustering. Overall, the proposed methodology allows the relevant data at different levels of detail to be obtained and data redundancy to be eliminated while keeping biologically interesting variations. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1112-8) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Leonid Zaslavsky: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, 20894, MD, USA
- Stacy Ciufo: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, 20894, MD, USA
- Boris Fedorov: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, 20894, MD, USA
- Tatiana Tatusova: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, 20894, MD, USA
|
47
|
Papanikolaou N, Pavlopoulos GA, Theodosiou T, Vizirianakis IS, Iliopoulos I. DrugQuest - a text mining workflow for drug association discovery. BMC Bioinformatics 2016; 17 Suppl 5:182. [PMID: 27295093 PMCID: PMC4905607 DOI: 10.1186/s12859-016-1041-6] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Indexed: 11/22/2022] Open
Abstract
Background Text mining and data integration methods are gaining ground in the field of health sciences due to the exponential growth of bio-medical literature and information stored in biological databases. While such methods mostly try to extract bioentity associations from PubMed, very few of them are dedicated to mining other types of repositories such as chemical databases. Results Herein, we apply a text mining approach on the DrugBank database in order to explore drug associations based on the DrugBank “Description”, “Indication”, “Pharmacodynamics” and “Mechanism of Action” text fields. We apply Named Entity Recognition (NER) techniques on these fields to identify chemicals, proteins, genes, pathways and diseases, and we utilize the TextQuest algorithm to find additional biologically significant words. Using a plethora of similarity and partitional clustering techniques, we group the DrugBank records based on their common terms and investigate possible reasons why these records are clustered together. Different views, such as clustered chemicals based on their textual information and tag clouds consisting of Significant Terms along with the terms that were used for clustering, are delivered to the user through a user-friendly web interface. Conclusions DrugQuest is a text mining tool for knowledge discovery: it is designed to cluster DrugBank records based on text attributes in order to find new associations between drugs. The service is freely available at http://bioinformatics.med.uoc.gr/drugquest.
Affiliation(s)
- Nikolas Papanikolaou: Division of Basic Sciences, University of Crete, Medical School, Gouves, 71003, Heraklion, Crete, Greece
- Georgios A Pavlopoulos: Division of Basic Sciences, University of Crete, Medical School, Gouves, 71003, Heraklion, Crete, Greece
- Theodosios Theodosiou: Division of Basic Sciences, University of Crete, Medical School, Gouves, 71003, Heraklion, Crete, Greece
- Ioannis S Vizirianakis: School of Pharmacy, Laboratory of Pharmacology, Aristotle University of Thessaloniki, University Campus, 54124, Thessaloniki, Greece
- Ioannis Iliopoulos: Division of Basic Sciences, University of Crete, Medical School, Gouves, 71003, Heraklion, Crete, Greece
|
48
|
Domeniconi G, Masseroli M, Moro G, Pinoli P. Cross-organism learning method to discover new gene functionalities. Comput Methods Programs Biomed 2016; 126:20-34. [PMID: 26724853 DOI: 10.1016/j.cmpb.2015.12.002] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/04/2015] [Revised: 11/16/2015] [Accepted: 12/08/2015] [Indexed: 06/05/2023]
Abstract
BACKGROUND Knowledge of gene and protein functions is paramount for understanding physiological and pathological biological processes, as well as for developing new drugs and therapies. Analyses for biomedical knowledge discovery greatly benefit from the availability of gene and protein functional feature descriptions expressed through controlled terminologies and ontologies, i.e., of gene and protein biomedical controlled annotations. In recent years, several databases of such annotations have become available; yet these valuable annotations are incomplete, include errors, and only some of them represent highly reliable human-curated information. Computational techniques that can reliably predict new gene or protein annotations with an associated likelihood value are thus paramount. METHODS Here, we propose a novel cross-organism learning approach to reliably predict new functionalities for the genes of an organism based on the known controlled annotations of the genes of another, evolutionarily related and better studied, organism. We leverage a new representation of the annotation discovery problem and a random perturbation of the available controlled annotations to allow supervised algorithms to predict unknown gene annotations with good accuracy. Taking advantage of the numerous gene annotations available for a well-studied organism, our cross-organism learning method creates and trains better prediction models, which can then be applied to predict new gene annotations of a target organism. RESULTS We tested our method and compared it with the equivalent single-organism approach on different gene annotation datasets of five evolutionarily related organisms (Homo sapiens, Mus musculus, Bos taurus, Gallus gallus and Dictyostelium discoideum). Results show both the usefulness of perturbing the available annotations for better prediction model training and a great improvement of the cross-organism models over the single-organism ones, regardless of the evolutionary distance between the considered organisms. The generated ranked lists of reliably predicted annotations, which describe novel gene functionalities and have an associated likelihood value, are valuable both to complement available annotations, for better coverage in biomedical knowledge discovery analyses, and to speed up the annotation curation process, by focusing it on the prioritized novel annotations predicted.
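The perturbation idea in the METHODS part of this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the pairwise frequency model stands in for whichever supervised algorithm the paper uses, and the function names (`train`, `predict`), the drop probability, the number of rounds, and the toy gene and term identifiers are all assumptions made for illustration. Known annotations of a source organism are randomly hidden, a model learns how often a hidden term is a true annotation given a visible term, and the model then ranks candidate annotations for a target organism.

```python
import random
from collections import defaultdict


def train(source, drop_prob=0.3, rounds=50, seed=0):
    """Estimate P(t is a true annotation | s is visible and t is not visible)."""
    rng = random.Random(seed)
    terms = sorted(set().union(*source.values()))
    hits = defaultdict(int)    # examples where t was hidden by the perturbation
    total = defaultdict(int)   # examples where t was hidden or truly absent
    for _ in range(rounds):
        for gene in sorted(source):
            # Random perturbation: hide each known annotation with drop_prob.
            visible = {t for t in sorted(source[gene]) if rng.random() >= drop_prob}
            for s in visible:
                for t in terms:
                    if t in visible:
                        continue  # only non-visible terms are training examples
                    total[(s, t)] += 1
                    hits[(s, t)] += t in source[gene]
    return {pair: hits[pair] / n for pair, n in total.items()}


def predict(model, target):
    """Rank (likelihood, gene, term) candidates for terms the target gene lacks."""
    ranked = []
    for gene in sorted(target):
        known = target[gene]
        best = defaultdict(float)
        for (s, t), p in model.items():
            if s in known and t not in known and p > 0:
                best[t] = max(best[t], p)
        ranked.extend((p, gene, t) for t, p in best.items())
    return sorted(ranked, key=lambda r: (-r[0], r[1], r[2]))


# Hypothetical toy data: a well-annotated source organism and a sparse target.
source = {
    "h1": {"A", "B"}, "h2": {"A", "B"}, "h3": {"A", "B"}, "h4": {"A", "B"},
    "h5": {"A"}, "h6": {"C", "D"}, "h7": {"C", "D"},
}
target = {"g1": {"A"}, "g2": {"C"}}

model = train(source)
ranked = predict(model, target)
for likelihood, gene, term in ranked:
    print(f"{gene} -> {term}: {likelihood:.2f}")
```

In this toy setting the perturbation is what produces positive training examples at all: without it, no gene would ever show a term as missing while actually carrying it. The drop probability therefore acts as an assumed rate of missing annotations.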
Affiliation(s)
- Giacomo Domeniconi
  - DISI, Università degli Studi di Bologna, Via Venezia 52, 47521 Cesena, Italy.
- Marco Masseroli
  - DEIB, Politecnico di Milano, Piazza L. Da Vinci 32, 20133 Milan, Italy.
- Gianluca Moro
  - DISI, Università degli Studi di Bologna, Via Venezia 52, 47521 Cesena, Italy.
- Pietro Pinoli
  - DEIB, Politecnico di Milano, Piazza L. Da Vinci 32, 20133 Milan, Italy.
|
49
|
Girardi D, Küng J, Kleiser R, Sonnberger M, Csillag D, Trenkler J, Holzinger A. Interactive knowledge discovery with the doctor-in-the-loop: a practical example of cerebral aneurysms research. Brain Inform 2016; 3:133-43. [PMID: 27747590 DOI: 10.1007/s40708-016-0038-2] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 11/15/2015] [Accepted: 02/03/2016] [Indexed: 12/02/2022] Open
Abstract
Established process models for knowledge discovery find the domain-expert in a customer-like and supervising role. In the field of biomedical research, it is necessary to move the domain-experts into the center of this process with far-reaching consequences for both their research output and the process itself. In this paper, we revise the established process models for knowledge discovery and propose a new process model for domain-expert-driven interactive knowledge discovery. Furthermore, we present a research infrastructure which is adapted to this new process model and demonstrate how the domain-expert can be deeply integrated even into the highly complex data-mining process and data-exploration tasks. We evaluated this approach in the medical domain for the case of cerebral aneurysms research.
|
50
|
Zare Hosseini Z, Mohammadzadeh M. Knowledge discovery from patients' behavior via clustering-classification algorithms based on weighted eRFM and CLV model: An empirical study in public health care services. Iran J Pharm Res 2016; 15:355-67. [PMID: 27610177 PMCID: PMC4986115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2022]
Abstract
The rapid growth of information technology (IT) drives and creates competitive advantages in the health care industry. Nowadays, many hospitals try to build successful customer relationship management (CRM) to recognize target and potential patients, increase patient loyalty and satisfaction, and ultimately maximize their profitability. Many hospitals have large data warehouses containing customer demographic and transaction information. Data mining techniques can be used to analyze these data and discover hidden knowledge about customers. This research develops an extended RFM model, namely RFML (added parameter: Length), based on health care services for a public sector hospital in Iran, on the premise that patient loyalty differs from customer loyalty, to estimate customer lifetime value (CLV) for each patient. We used Two-step and K-means algorithms as clustering methods and a decision tree (CHAID) as the classification technique to segment the patients and identify target, potential, and loyal customers in order to strengthen CRM. Two approaches are used for classification: in the first, the result of clustering is taken as the decision attribute in the classification process; in the second, the result of segmentation based on the CLV of patients (estimated by RFML) is taken as the decision attribute. Finally, the results of the CHAID algorithm reveal significant hidden rules and identify existing patterns among hospital consumers.
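The RFML-plus-clustering pipeline described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the definitions of Recency, Frequency, Monetary and Length, the equal weights, the use of a weighted sum as a CLV proxy, the choice of starting centroids, and the toy visit records are all hypothetical and are not taken from the paper. The Two-step clustering and the CHAID classification step are not reproduced here.

```python
from datetime import date


def rfml(visits, today):
    """visits: iterable of (patient_id, visit_date, amount) -> id: (R, F, M, L)."""
    by_patient = {}
    for pid, day, amount in visits:
        by_patient.setdefault(pid, []).append((day, amount))
    table = {}
    for pid, rows in by_patient.items():
        days = [d for d, _ in rows]
        recency = (today - max(days)).days     # days since the last visit
        frequency = len(rows)                  # number of visits
        monetary = sum(a for _, a in rows)     # total amount paid
        length = (max(days) - min(days)).days  # span between first and last visit
        table[pid] = (recency, frequency, monetary, length)
    return table


def normalise(table):
    """Min-max scale each column to [0, 1]; invert recency so higher is better."""
    cols = list(zip(*table.values()))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    scaled = {}
    for pid, row in table.items():
        vec = [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        vec[0] = 1.0 - vec[0]
        scaled[pid] = vec
    return scaled


def clv_score(scaled, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of scaled R, F, M, L as a simple CLV proxy (assumed weights)."""
    return {pid: sum(w * v for w, v in zip(weights, vec)) for pid, vec in scaled.items()}


def kmeans(points, centroids, iters=10):
    """Plain Lloyd's algorithm with caller-supplied starting centroids."""
    centroids = [list(c) for c in centroids]
    labels = {}
    for _ in range(iters):
        for pid, p in points.items():
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[pid] = dists.index(min(dists))
        for j in range(len(centroids)):
            members = [points[pid] for pid, lab in labels.items() if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical visit records for four patients.
visits = [
    ("p1", date(2015, 1, 10), 120), ("p1", date(2015, 6, 1), 150),
    ("p1", date(2015, 12, 20), 100),
    ("p2", date(2015, 2, 1), 80), ("p2", date(2015, 3, 1), 110),
    ("p2", date(2015, 12, 1), 90),
    ("p3", date(2015, 2, 1), 30),
    ("p4", date(2015, 3, 15), 20),
]
table = rfml(visits, today=date(2016, 1, 1))
scaled = normalise(table)
scores = clv_score(scaled)
# Start from the lowest- and highest-scoring patients so the run is deterministic.
start = [scaled[min(scores, key=scores.get)], scaled[max(scores, key=scores.get)]]
labels = kmeans(scaled, start)
print(labels)
```

The resulting cluster labels (or, alternatively, segments cut from the CLV scores) would play the role of the decision attribute that the abstract feeds into the classification step.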
Affiliation(s)
- Mahdi Mohammadzadeh (corresponding author)
  - Shahid Beheshti University of Medical Sciences, Faculty of Pharmacy, Tehran, Iran.
|