1
Yang X, Huang K, Yang D, Zhao W, Zhou X. Biomedical Big Data Technologies, Applications, and Challenges for Precision Medicine: A Review. GLOBAL CHALLENGES (HOBOKEN, NJ) 2024; 8:2300163. [PMID: 38223896 PMCID: PMC10784210 DOI: 10.1002/gch2.202300163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2023] [Revised: 09/20/2023] [Indexed: 01/16/2024]
Abstract
The explosive growth of biomedical Big Data presents both significant opportunities and challenges in the realm of knowledge discovery and translational applications within precision medicine. Efficient management, analysis, and interpretation of big data can pave the way for groundbreaking advancements in precision medicine. However, the unprecedented strides in the automated collection of large-scale molecular and clinical data have also introduced formidable challenges in terms of data analysis and interpretation, necessitating the development of novel computational approaches. Some potential challenges include the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues. This overview article focuses on the recent progress and breakthroughs in the application of big data within precision medicine. Key aspects are summarized, including content, data sources, technologies, tools, challenges, and existing gaps. Nine fields are discussed: data warehousing and data management, electronic medical records, biomedical imaging informatics, artificial intelligence-aided surgical design and surgery optimization, omics data, health monitoring data, knowledge graphs, public health informatics, and security and privacy.
Affiliation(s)
- Xue Yang
  - Department of Pancreatic Surgery and West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu 610041, China
- Kexin Huang
  - Department of Pancreatic Surgery and West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu 610041, China
- Dewei Yang
  - College of Advanced Manufacturing Engineering, Chongqing University of Posts and Telecommunications, Chongqing, Chongqing 400000, China
- Weiling Zhao
  - Center for Systems Medicine, School of Biomedical Informatics, UTHealth at Houston, Houston, TX 77030, USA
- Xiaobo Zhou
  - Center for Systems Medicine, School of Biomedical Informatics, UTHealth at Houston, Houston, TX 77030, USA
2
Hu X, Li XK, Wen S, Li X, Zeng TS, Zhang JY, Wang W, Bi Y, Zhang Q, Tian SH, Min J, Wang Y, Liu G, Huang H, Peng M, Zhang J, Wu C, Li YM, Sun H, Ning G, Chen LL. Predictive modeling the probability of suffering from metabolic syndrome using machine learning: A population-based study. Heliyon 2022; 8:e12343. [PMID: 36643319 PMCID: PMC9834713 DOI: 10.1016/j.heliyon.2022.e12343] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2022] [Revised: 06/16/2022] [Accepted: 12/06/2022] [Indexed: 12/14/2022]
Abstract
Background
The prevalence of metabolic syndrome (MetS), an important contributor to cardiovascular disease (CVD), cancers, and diabetes, is increasing. However, MetS often has a long asymptomatic phase, so it is not diagnosed and treated as promptly as needed. Tools that predict the probability of having MetS in daily life or in routine clinical practice would therefore be very helpful.
Objective
To develop models that predict, in a timely manner and with high efficacy, an individual's probability of having MetS in the general population.
Methods
The present study enrolled 8964 individuals aged 40-75 years without severe diseases, as part of the REACTION study, from October 2011 to February 2012. We developed three prediction models for different scenarios, in hospital (Models 1 and 2) or at home (Model 3), based on the LightGBM (LGBM) technique; corresponding logistic regression (LR) models were also constructed for comparison. Model 1 included laboratory test, lifestyle, and anthropometric variables; Model 2 was derived from Model 1 by excluding the components of MetS; and Model 3 was derived from Model 2 by removing the blood biochemical indexes. Additionally, we investigated the strength of association between the predictive factors and MetS, as well as between the predictors and each component of MetS.
Results
In this study, 2714 (30.3%) participants had MetS. The LGBM models performed well in predicting the probability of MetS: Model 1 had an area under the curve (AUC) of 0.993 and Model 2 an AUC of 0.885. Model 3 had an AUC of 0.859, close to that of Model 2. The AUC values of the LR models were 0.938 (Model 1, in hospital), 0.839 (Model 2, in hospital), and 0.820 (Model 3, at home), each lower than that of the corresponding LGBM model. In both the LGBM and LR models, gender, height, and resting pulse rate (RPR) were predictors of MetS. Women had a higher risk of MetS than men (OR 8.84, CI: 6.70-11.66); each 1-cm increase in height indicated a 3.8% higher risk of MetS in people over 58 years, whereas each 1-beat-per-minute (bpm) increase in RPR indicated a 1.0% higher risk in individuals younger than 62 years.
Conclusion
The present study showed that prediction models developed by machine learning were effective in evaluating the probability of having MetS and achieved high predictive efficacy and accuracy. Additionally, women showed a higher risk of MetS than men, height was an important predictor of MetS in individuals over 58 years, and RPR was of vital importance in people aged 40-62 years.
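The two headline quantities in this abstract, prevalence and AUC, can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `prevalence` and `auc_from_scores` are hypothetical, and the toy scores and labels are made up for illustration. Only the counts 2714 and 8964 come from the abstract. The AUC is computed here from its rank-based definition, the probability that a randomly chosen positive case scores higher than a randomly chosen negative one (ties count as half).

```python
def prevalence(cases: int, total: int) -> float:
    """Fraction of participants with the condition."""
    if total <= 0:
        raise ValueError("total must be positive")
    return cases / total


def auc_from_scores(scores, labels) -> float:
    """Rank-based AUC: P(score of a positive > score of a negative).

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Counts reported in the abstract: 2714 of 8964 participants had MetS.
print(f"prevalence: {prevalence(2714, 8964):.1%}")

# Toy (made-up) predicted probabilities and true labels, for illustration only.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0]
print(f"toy AUC: {auc_from_scores(toy_scores, toy_labels):.3f}")
```

The pairwise loop is quadratic in the number of samples, which is fine for a toy example; for a cohort of several thousand participants a sorted-rank implementation or a library routine would be the more practical choice.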
Affiliation(s)
- Xiang Hu
  - Department of Endocrinology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Hubei Provincial Clinical Research Center for Diabetes and Metabolic Disorders, Wuhan, China
- Xue-Ke Li
  - Department of Endocrinology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Hubei Provincial Clinical Research Center for Diabetes and Metabolic Disorders, Wuhan, China
- Shiping Wen
  - Centre for Artificial Intelligence, Faculty of Engineering Information Technology, University of Technology Sydney, Ultimo, NSW, 2007, Australia
- Xingyu Li
  - School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
- Tian-Shu Zeng
  - Department of Endocrinology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Hubei Provincial Clinical Research Center for Diabetes and Metabolic Disorders, Wuhan, China
- Jiao-Yue Zhang
  - Department of Endocrinology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Hubei Provincial Clinical Research Center for Diabetes and Metabolic Disorders, Wuhan, China
- Weiqing Wang
  - Department of Endocrinology and Metabolism, State Key Laboratory of Medical Genomes, National Clinical Research Center for Metabolic Diseases, Shanghai Clinical Center for Endocrine and Metabolic Diseases, Shanghai Institute of Endocrine and Metabolic Diseases, Ruijin Hospital, Shanghai Jiao-Tong University School of Medicine, Shanghai, China
- Yufang Bi
  - Department of Endocrinology and Metabolism, State Key Laboratory of Medical Genomes, National Clinical Research Center for Metabolic Diseases, Shanghai Clinical Center for Endocrine and Metabolic Diseases, Shanghai Institute of Endocrine and Metabolic Diseases, Ruijin Hospital, Shanghai Jiao-Tong University School of Medicine, Shanghai, China
- Qiao Zhang
  - Department of Cardiovascular Surgery, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Sheng-Hua Tian
  - Department of Endocrinology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Hubei Provincial Clinical Research Center for Diabetes and Metabolic Disorders, Wuhan, China
- Jie Min
  - Department of Endocrinology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Hubei Provincial Clinical Research Center for Diabetes and Metabolic Disorders, Wuhan, China
- Ying Wang
  - Department of Endocrinology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Hubei Provincial Clinical Research Center for Diabetes and Metabolic Disorders, Wuhan, China
- Geng Liu
  - Department of Endocrinology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Hubei Provincial Clinical Research Center for Diabetes and Metabolic Disorders, Wuhan, China
- Miaomiao Peng
  - Department of Endocrinology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Hubei Provincial Clinical Research Center for Diabetes and Metabolic Disorders, Wuhan, China
- Chaodong Wu
  - Department of Nutrition and Food Science, Texas A&M University, College Station, TX, USA
- Yu-Ming Li
  - Department of Endocrinology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Hubei Provincial Clinical Research Center for Diabetes and Metabolic Disorders, Wuhan, China
- Hui Sun
  - Department of Endocrinology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Hubei Provincial Clinical Research Center for Diabetes and Metabolic Disorders, Wuhan, China
- Guang Ning
  - Department of Endocrinology and Metabolism, State Key Laboratory of Medical Genomes, National Clinical Research Center for Metabolic Diseases, Shanghai Clinical Center for Endocrine and Metabolic Diseases, Shanghai Institute of Endocrine and Metabolic Diseases, Ruijin Hospital, Shanghai Jiao-Tong University School of Medicine, Shanghai, China
- Lu-Lu Chen (corresponding author)
  - Department of Endocrinology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Hubei Provincial Clinical Research Center for Diabetes and Metabolic Disorders, Wuhan, China
3
How to Determine the Early Warning Threshold Value of Meteorological Factors on Influenza through Big Data Analysis and Machine Learning. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2020; 2020:8845459. [PMID: 33343686 PMCID: PMC7725585 DOI: 10.1155/2020/8845459] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/11/2020] [Revised: 10/27/2020] [Accepted: 11/23/2020] [Indexed: 12/26/2022]
Abstract
Infectious diseases are a major health challenge for the worldwide population. Because their rapid spread can cause great harm, curbing an outbreak once it has begun is not enough: accurate prediction and early warning before an outbreak can provide an important basis for an early and reasonable response by the government health sector, reduce morbidity and mortality, and greatly reduce national losses. However, if only traditional medical data are involved, it may be too late or too difficult to implement prediction and early warning of an infectious outbreak. Recently, medical big data has become a research hotspot and has played an increasingly important role in public health, precision medicine, and disease prediction. In this paper, we focus on exploring a prediction and early warning method for influenza with the help of medical big data. It is well known that meteorological conditions influence influenza outbreaks. We therefore try to find a way to determine the early warning threshold value of influenza outbreaks through big data analysis of meteorological factors. Results show that, based on analysis of meteorological conditions combined with historical influenza outbreak data, the early warning threshold of influenza outbreaks can be established with reasonably high accuracy.
4
Wehrens R, Sihag V, Sülz S, van Elten H, van Raaij E, de Bont A, Weggelaar-Jansen AM. Understanding the Uptake of Big Data in Health Care: Protocol for a Multinational Mixed-Methods Study. JMIR Res Protoc 2020; 9:e16779. [PMID: 33090113 PMCID: PMC7644380 DOI: 10.2196/16779] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 10/24/2019] [Revised: 07/17/2020] [Accepted: 07/21/2020] [Indexed: 11/25/2022]
Abstract
Background
Despite the high potential of big data, their applications in health care face many organizational, social, financial, and regulatory challenges. The societal dimensions of big data are underrepresented in much medical research. Little is known about integrating big data applications in the corporate routines of hospitals and other care providers. Equally little is understood about embedding big data applications in daily work practices and how they lead to actual improvements for health care actors, such as patients, care professionals, care providers, information technology companies, payers, and the society.
Objective
This planned study aims to provide an integrated analysis of big data applications, focusing on the interrelations among concrete big data experiments, organizational routines, and relevant systemic and societal dimensions. To understand the similarities and differences between interactions in various contexts, the study covers 12 big data pilot projects in eight European countries, each with its own health care system. Workshops will be held with stakeholders to discuss the findings, our recommendations, and the implementation. Dissemination is supported by visual representations developed to share the knowledge gained.
Methods
This study will utilize a mixed-methods approach that combines performance measurements, interviews, document analysis, and cocreation workshops. Analysis will be structured around the following four key dimensions: performance, embedding, legitimation, and value creation. Data and their interrelations across the dimensions will be synthesized per application and per country.
Results
The study was funded in August 2017. Data collection started in April 2018 and will continue until September 2021. The multidisciplinary focus of this study enables us to combine insights from several social sciences (health policy analysis, business administration, innovation studies, organization studies, ethics, and health services research) to advance a holistic understanding of big data value realization. The multinational character enables comparative analysis across the following eight European countries: Austria, France, Germany, Ireland, the Netherlands, Spain, Sweden, and the United Kingdom. Given that national and organizational contexts change over time, it will not be possible to isolate the factors and actors that explain the implementation of big data applications. The visual representations developed for dissemination purposes will help to reduce complexity and clarify the relations between the various dimensions.
Conclusions
This study will develop an integrated approach to big data applications that considers the interrelations among concrete big data experiments, organizational routines, and relevant systemic and societal dimensions.
International Registered Report Identifier (IRRID)
DERR1-10.2196/16779
Affiliation(s)
- Rik Wehrens
  - Erasmus School of Health Policy & Management, Erasmus University Rotterdam, Rotterdam, Netherlands
- Vikrant Sihag
  - Rotterdam School of Management, Erasmus University Rotterdam, Rotterdam, Netherlands; Department of Industrial Engineering & Innovation Sciences, Eindhoven University of Technology, Eindhoven, Netherlands
- Sandra Sülz
  - Erasmus School of Health Policy & Management, Erasmus University Rotterdam, Rotterdam, Netherlands
- Hilco van Elten
  - Erasmus School of Health Policy & Management, Erasmus University Rotterdam, Rotterdam, Netherlands
- Erik van Raaij
  - Erasmus School of Health Policy & Management, Erasmus University Rotterdam, Rotterdam, Netherlands; Rotterdam School of Management, Erasmus University Rotterdam, Rotterdam, Netherlands
- Antoinette de Bont
  - Erasmus School of Health Policy & Management, Erasmus University Rotterdam, Rotterdam, Netherlands
- Anne Marie Weggelaar-Jansen
  - Erasmus School of Health Policy & Management, Erasmus University Rotterdam, Rotterdam, Netherlands; School of Medical Physics and Engineering, University of Technology Eindhoven, Eindhoven, Netherlands
5
Gagalova KK, Leon Elizalde MA, Portales-Casamar E, Görges M. What You Need to Know Before Implementing a Clinical Research Data Warehouse: Comparative Review of Integrated Data Repositories in Health Care Institutions. JMIR Form Res 2020; 4:e17687. [PMID: 32852280 PMCID: PMC7484778 DOI: 10.2196/17687] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 01/03/2020] [Revised: 06/09/2020] [Accepted: 07/17/2020] [Indexed: 12/23/2022]
Abstract
Background
Integrated data repositories (IDRs), also referred to as clinical data warehouses, are platforms used for the integration of several data sources through specialized analytical tools that facilitate data processing and analysis. IDRs offer several opportunities for clinical data reuse, and the number of institutions implementing an IDR has grown steadily in the past decade.
Objective
The architectural choices of major IDRs are highly diverse and determining their differences can be overwhelming. This review aims to explore the underlying models and common features of IDRs, provide a high-level overview for those entering the field, and propose a set of guiding principles for small- to medium-sized health institutions embarking on IDR implementation.
Methods
We reviewed manuscripts published in peer-reviewed scientific literature between 2008 and 2020, and selected those that specifically describe IDR architectures. Of 255 shortlisted articles, we found 34 articles describing 29 different architectures. The different IDRs were analyzed for common features and classified according to their data processing and integration solution choices.
Results
Despite common trends in the selection of standard terminologies and data models, the IDRs examined showed heterogeneity in the underlying architecture design. We identified 4 common architecture models that use different approaches for data processing and integration. These different approaches were driven by a variety of features such as data sources, whether the IDR was for a single institution or a collaborative project, the intended primary data user, and purpose (research-only or including clinical or operational decision making).
Conclusions
IDR implementations are diverse and complex undertakings, which benefit from being preceded by an evaluation of requirements and definition of scope in the early planning stage. Factors such as data source diversity and intended users of the IDR influence data flow and synchronization, both of which are crucial factors in IDR architecture planning.
Affiliation(s)
- Kristina K Gagalova
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, Canada; Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, Canada; Research Institute, BC Children's Hospital, Vancouver, BC, Canada
- M Angelica Leon Elizalde
  - Research Institute, BC Children's Hospital, Vancouver, BC, Canada; School of Population and Public Health, University of British Columbia, Vancouver, BC, Canada
- Elodie Portales-Casamar
  - Research Institute, BC Children's Hospital, Vancouver, BC, Canada; Department of Pediatrics, University of British Columbia, Vancouver, BC, Canada
- Matthias Görges
  - Research Institute, BC Children's Hospital, Vancouver, BC, Canada; Department of Anesthesiology, Pharmacology and Therapeutics, University of British Columbia, Vancouver, BC, Canada
6
Mohaghegh N, Magierowski S, Ghafar-Zadeh E. A Novel Method for Detection and Progress Assessment of Visual Distortion Caused by Macular Disorder: A Central Serous Chorioretinopathy (CSR) Case Study. Vision (Basel) 2019; 3:vision3040068. [PMID: 31835894 PMCID: PMC6969906 DOI: 10.3390/vision3040068] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 11/05/2019] [Revised: 12/08/2019] [Accepted: 12/09/2019] [Indexed: 11/16/2022]
Abstract
This paper presents a new mathematical model along with a measurement platform for accurate detection and monitoring of various visual distortions (VD) caused by macular disorders such as central serous chorioretinopathy (CSR) and age-related macular degeneration (AMD). This platform projects a series of graphical patterns on the patient’s retina and calculates the severity of VDs accordingly. The accuracy of this technique relies on the accurate detection of distorted lines by the patient. We also propose a simple mathematical model to evaluate the VD created by CSR. The model is used as a control for the test results achieved from the proposed platform. The proposed platform consists of the required hardware and software for the generation and projection of patterns, along with the collection and processing of patients' responses, which are compared against their standard optical coherence tomography (OCT) images. Based on these results, the OCT images agree with the VD test results, and the proposed platform can be used as an alternative home monitoring method for various macular disorders.
7
Challenges of big data integration in the life sciences. Anal Bioanal Chem 2019; 411:6791-6800. [PMID: 31463515 DOI: 10.1007/s00216-019-02074-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 05/21/2019] [Revised: 07/08/2019] [Accepted: 08/06/2019] [Indexed: 10/26/2022]
Abstract
Big data has been reported to be revolutionizing many areas of life, including science. It summarizes data that is unprecedentedly large, rapidly generated, heterogeneous, and hard to accurately interpret. This availability has also brought new challenges: How to properly annotate data to make it searchable? What are the legal and ethical hurdles when sharing data? How to store data securely, preventing loss and corruption? The life sciences are not the only disciplines that must align themselves with big data requirements to keep up with the latest developments. The large hadron collider, for instance, generates research data at a pace beyond any current biomedical research center. There are three recent major coinciding events that explain the emergence of big data in the context of research: the technological revolution for data generation, the development of tools for data analysis, and a conceptual change towards open science and data. The true potential of big data lies in pattern discovery in large datasets, as well as the formulation of new models and hypotheses. Confirmation of the existence of the Higgs boson, for instance, is one of the most recent triumphs of big data analysis in physics. Digital representations of biological systems have become more comprehensive. This, in combination with advances in machine learning, creates exciting new research possibilities. In this paper, we review the state of big data in bioanalytical research and provide an overview of the guidelines for its proper usage.
8
NGRID: A novel platform for detection and progress assessment of visual distortion caused by macular disorders. Comput Biol Med 2019; 111:103340. [PMID: 31279165 DOI: 10.1016/j.compbiomed.2019.103340] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/08/2018] [Revised: 06/20/2019] [Accepted: 06/20/2019] [Indexed: 01/11/2023]
Abstract
This paper presents a new graphical macular interface system (GMIS) for accurate, rapid, and quantitative measurement of visual distortion (VD) in the central vision of patients suffering from macular disorders. In this system, a series of predefined graphical patterns or multiple grids (NGRID) are randomly selected from a library of patterns and visualized on the screen, then the VDs identified by the patient are recorded as binary codes using various control methods including speech recognition. Scalable Vector Graphics (SVG) is used to generate the patterns and save them into a central library. Based on the projected patterns and the patients' responses, a VD graph or so-called heatmap is generated for eye-care purposes. We demonstrate and discuss the functionality of the proposed system for the detection and progress assessment of a macular condition in patients suffering from Central Serous Chorioretinopathy (CSR). Also, we characterize the proposed technique to evaluate the systematic error and response time on healthy human subjects with normal vision. Based on these results, the voice recognition input method exhibits a lower error but a higher response time compared to other input devices. We run the proposed NGRID VD technique to evaluate the effect of CSR on the visual field of a CSR patient. The generated heatmaps are in agreement with standard Optical Coherence Tomography (OCT) images obtained at different times from both the left and right eyes. These results reveal the applicability of the proposed technique for the detection and assessment of macular disorders. Based on these results, the proposed NGRID platform shows great promise for use as an alternative solution for in-home monitoring of various macular disorders and as a means of forwarding responses to secured cloud facilities for future data analysis.
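The idea of turning binary patient responses into a VD heatmap can be sketched in Python as follows. This is an illustrative guess at one possible aggregation, not the NGRID implementation: the abstract does not describe how responses are combined, so the grid layout, the `build_heatmap` function, and the rule "cell value = fraction of presentations covering that cell that were reported as distorted" are all assumptions made here for illustration.

```python
def build_heatmap(rows, cols, trials):
    """Aggregate binary responses into a per-cell distortion frequency.

    trials: iterable of (cells, response) pairs, where `cells` is a list of
    (row, col) grid positions covered by the displayed pattern (hypothetical
    encoding) and `response` is 1 if the patient reported distortion, else 0.
    """
    shown = [[0] * cols for _ in range(rows)]
    flagged = [[0] * cols for _ in range(rows)]
    for cells, response in trials:
        if response not in (0, 1):
            raise ValueError("response must be 0 or 1")
        for r, c in cells:
            shown[r][c] += 1
            flagged[r][c] += response
    # Cells that were never covered by any pattern are reported as 0.0.
    return [
        [flagged[r][c] / shown[r][c] if shown[r][c] else 0.0 for c in range(cols)]
        for r in range(rows)
    ]


# Two made-up trials on a 2x2 grid: the first pattern was reported as
# distorted, the second was not.
example_trials = [
    ([(0, 0), (0, 1)], 1),
    ([(0, 1), (1, 1)], 0),
]
for row in build_heatmap(2, 2, example_trials):
    print(row)
```

Normalizing by the number of presentations keeps cells that happen to be covered by more patterns from looking more distorted than they are; a real system would also need to handle response errors, which the abstract reports vary with the input method.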
9
Comment on: “A Bibliometric Analysis and Visualization of Medical Big Data Research” Sustainability 2018, 10, 166. SUSTAINABILITY 2018. [DOI: 10.3390/su10124851] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Indexed: 11/17/2022]
Abstract
Liao et al. [...]
10
An Efficient Middle Layer Platform for Medical Imaging Archives. JOURNAL OF HEALTHCARE ENGINEERING 2018; 2018:3984061. [PMID: 30034674 PMCID: PMC6033252 DOI: 10.1155/2018/3984061] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 08/14/2017] [Revised: 04/29/2018] [Accepted: 05/09/2018] [Indexed: 11/17/2022]
Abstract
Digital medical images are widely used in health services and clinics. These data are vitally important for diagnosis and treatment; therefore, their preservation, protection, and archiving are a challenge. Rapidly growing file sizes, differing data formats, and an increasing number of files constitute big data that traditional systems are not capable of processing and storing. This study investigates an efficient middle layer platform based on Hadoop and MongoDB architecture using state-of-the-art technologies from the literature. We developed this system to improve our previously developed medical image compression method and to create a middle layer platform that performs data compression and archiving operations. In this study, a scalable platform using the MapReduce programming model on Hadoop has been developed. MongoDB, a NoSQL database, has been used to satisfy the performance requirements of the platform. A four-node Hadoop cluster has been built to evaluate the developed platform and execute distributed MapReduce algorithms. Actual patient medical images have been used to validate the performance of the platform. Processing the test images takes 15,599 seconds on a single node, whereas on the developed platform it takes 8,153 seconds. Moreover, due to the medical imaging processing package used in the proposed method, the compression ratio values produced for the non-ROI image are between 92.12% and 97.84%. In conclusion, the proposed platform provides a cloud-based integrated solution to the medical image archiving problem.
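The two reported timings can be turned into a speedup figure with a few lines of Python. This is a minimal sketch, not part of the cited study: the function names are hypothetical, and the efficiency calculation assumes the 8,153-second run used all four nodes of the cluster, which the abstract implies but does not state explicitly.

```python
def speedup(t_single_s: float, t_cluster_s: float) -> float:
    """Ratio of single-node time to cluster time."""
    if t_single_s <= 0 or t_cluster_s <= 0:
        raise ValueError("timings must be positive")
    return t_single_s / t_cluster_s


def parallel_efficiency(speedup_factor: float, nodes: int) -> float:
    """Speedup divided by node count; 1.0 would be ideal linear scaling."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return speedup_factor / nodes


T_SINGLE_S = 15_599.0   # single-node time reported in the abstract
T_CLUSTER_S = 8_153.0   # time on the developed platform reported in the abstract
NODES = 4               # assumption: the platform run used the whole four-node cluster

s = speedup(T_SINGLE_S, T_CLUSTER_S)
e = parallel_efficiency(s, NODES)
print(f"speedup: {s:.2f}x")
print(f"efficiency on {NODES} nodes: {e:.0%}")
```

Under that assumption the platform runs a little under twice as fast as a single node, so roughly half of the theoretical four-node capacity is realized; the abstract does not break down where the remaining time goes.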
11
Ristevski B, Chen M. Big Data Analytics in Medicine and Healthcare. J Integr Bioinform 2018; 15:/j/jib.ahead-of-print/jib-2017-0030/jib-2017-0030.xml. [PMID: 29746254 PMCID: PMC6340124 DOI: 10.1515/jib-2017-0030] [Citation(s) in RCA: 95] [Impact Index Per Article: 15.8] [Received: 04/07/2017] [Accepted: 03/20/2018] [Indexed: 12/28/2022]
Abstract
This paper surveys big data, highlighting big data analytics in medicine and healthcare. The characteristics of big data (value, volume, velocity, variety, veracity, and variability) are described. Big data analytics in medicine and healthcare covers the integration and analysis of large amounts of complex heterogeneous data, such as various -omics data (genomics, epigenomics, transcriptomics, proteomics, metabolomics, interactomics, pharmacogenomics, diseasomics), biomedical data, and electronic health records data. We underline the challenging issues of big data privacy and security. With regard to these big data characteristics, some directions for using suitable and promising open-source distributed data processing software platforms are given.
Affiliation(s)
- Blagoj Ristevski
  - "St. Kliment Ohridski" University - Bitola, Faculty of Information and Communication Technologies, ul. Partizanska bb, 7000 Bitola, Republic of Macedonia
- Ming Chen
  - Department of Bioinformatics, College of Life Sciences, Zhejiang University Zijingang Campus, Hangzhou, P.R. China
12
Abstract
With the advancement of technology in data science and network technology, the world has stepped into the Era of Big Data, and the medical field is rich in data suitable for analysis. Thus, in recent years, there has been much research in medical big data, mainly targeting data collection, data analysis, and visualization. However, very few works provide a full survey of the medical big data on chronic diseases and health monitoring. This review investigates recent research efforts and conducts a comprehensive overview of the work on medical big data, especially as related to chronic diseases and health monitoring. It focuses on the full cycles of the big data processing, which includes medical big data preprocessing, big data tools and algorithms, big data visualization, and security issues in big data. It also attempts to combine common big data technologies with special medical needs by analyzing in detail existing works of medical big data. To the best of our knowledge, this is the first survey that targets chronic diseases and health monitoring big data technologies.
13
Simović A. A Big Data smart library recommender system for an educational institution. LIBRARY HI TECH 2018. [DOI: 10.1108/lht-06-2017-0131] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Indexed: 11/17/2022]
Abstract
Purpose
With the exponential growth of the amount of data, the most sophisticated systems of traditional libraries are not able to fulfill the demands of modern business and user needs. The purpose of this paper is to present the possibility of creating a Big Data smart library as an integral and enhanced part of the educational system that will improve user service and increase motivation in the continuous learning process through content-aware recommendations.
Design/methodology/approach
This paper presents an approach to the design of a Big Data system for collecting, analyzing, processing and visualizing data from different sources to a smart library specifically suitable for application in educational institutions.
Findings
As an integrated recommender system of the educational institution, the practical application of Big Data smart library meets the user needs and assists in finding personalized content from several sources, resulting in economic benefits for the institution and user long-term satisfaction.
Social implications
The need for continuous education alters business processes in libraries with requirements to adopt new technologies, business demands, and interactions with users. To be able to engage in a new era of business in the Big Data environment, librarians need to modernize their infrastructure for data collection, data analysis, and data visualization.
Originality/value
A unique value of this paper is its perspective of the implementation of a Big Data solution for smart libraries as a part of a continuous learning process, with the aim to improve the results of library operations by integrating traditional systems with Big Data technology. The paper presents a Big Data smart library system that has the potential to create new values and data-driven decisions by incorporating multiple sources of differential data.
14
Pashazadeh A, Navimipour NJ. Big data handling mechanisms in the healthcare applications: A comprehensive and systematic literature review. J Biomed Inform 2018; 82:47-62. [PMID: 29655946 DOI: 10.1016/j.jbi.2018.03.014] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Received: 06/22/2017] [Revised: 11/19/2017] [Accepted: 03/23/2018] [Indexed: 01/08/2023]
Abstract
Healthcare provides many services, such as the diagnosis, treatment, and prevention of diseases, illnesses, injuries, and other physical and mental disorders. Large-scale distributed data processing applications in healthcare operate, by their nature, on large amounts of data. Big data application functions are therefore a main part of healthcare operations, but there has been no comprehensive and systematic survey studying and evaluating the important techniques in this field. This paper therefore aims to provide a comprehensive, detailed, and systematic study of the state-of-the-art mechanisms in big data related to healthcare applications in five categories: machine learning, cloud-based, heuristic-based, agent-based, and hybrid mechanisms. The paper also presents a systematic literature review (SLR) of big data applications in the healthcare literature up to the end of 2016. Initially, 205 papers were identified, but a paper selection process reduced the number to 29 important studies.
Affiliation(s)
- Asma Pashazadeh
- Department of Computer Engineering, Tabriz Branch, Islamic Azad University, Tabriz, Iran
- Nima Jafari Navimipour
- Department of Computer Engineering, Tabriz Branch, Islamic Azad University, Tabriz, Iran.
15
Chen J, Li K, Rong H, Bilal K, Yang N, Li K. A disease diagnosis and treatment recommendation system based on big data mining and cloud computing. Inf Sci (N Y) 2018. [DOI: 10.1016/j.ins.2018.01.001] [Citation(s) in RCA: 52] [Impact Index Per Article: 8.7] [Indexed: 10/18/2022]
16
Sebaa A, Chikh F, Nouicer A, Tari A. Medical Big Data Warehouse: Architecture and System Design, a Case Study: Improving Healthcare Resources Distribution. J Med Syst 2018; 42:59. [DOI: 10.1007/s10916-018-0894-9] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 09/23/2016] [Accepted: 01/08/2018] [Indexed: 10/18/2022]
17
Using Distributed Data over HBase in Big Data Analytics Platform for Clinical Services. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2017; 2017:6120820. [PMID: 29375652 PMCID: PMC5742497 DOI: 10.1155/2017/6120820] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 03/01/2017] [Accepted: 11/01/2017] [Indexed: 02/01/2023]
Abstract
Big data analytics (BDA) is important to reduce healthcare costs. However, there are many challenges of data aggregation, maintenance, integration, translation, analysis, and security/privacy. The study objective, to establish an interactive BDA platform with simulated patient data using open-source software technologies, was achieved by constructing a platform framework with the Hadoop Distributed File System (HDFS) using HBase (a key-value NoSQL database). Distributed data structures were generated from benchmarked hospital-specific metadata of nine billion patient records. At optimized iteration, HDFS ingestion of HFiles to HBase store files revealed sustained availability over hundreds of iterations; however, completing MapReduce to HBase required a week (for 10 TB) and a month for three billion (30 TB) indexed patient records, respectively. Inconsistencies found in MapReduce limited the capacity to generate and replicate data efficiently. Apache Spark and Drill showed high performance with high usability for technical support but poor usability for clinical services. A hospital system based on patient-centric data was challenging to implement with HBase, in that not all data profiles were fully integrated with the complex patient-to-hospital relationships. However, we recommend using HBase to achieve secured patient data while querying entire hospital volumes in a simplified clinical event model across clinical services.
18
A cloud-based framework for large-scale traditional Chinese medical record retrieval. J Biomed Inform 2017; 77:21-33. [PMID: 29175431 DOI: 10.1016/j.jbi.2017.11.013] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 04/12/2017] [Revised: 11/02/2017] [Accepted: 11/20/2017] [Indexed: 11/21/2022]
Abstract
Introduction
Electronic medical records are increasingly common in medical practice, and their secondary use has become increasingly important. Secondary use relies on the ability to retrieve complete information about desired patient populations. Retrieving relevant medical records effectively and accurately from large-scale medical big data is becoming a major challenge. Therefore, we propose an efficient and robust cloud-based framework for large-scale Traditional Chinese Medical Records (TCMRs) retrieval.
Methods
First, we propose a parallel index building method and build a distributed search cluster; the former improves the performance of index building, and the latter provides highly concurrent online TCMRs retrieval. Second, a real-time multi-indexing model is proposed to ensure the latest relevant TCMRs are indexed and retrieved in real time, and a semantics-based query expansion method and a multi-factor ranking model are proposed to improve retrieval quality. Third, we implement a template-based visualization method for displaying medical reports.
Results
The proposed parallel indexing method and distributed search cluster can improve the performance of index building and provide highly concurrent online TCMRs retrieval. The multi-indexing model can ensure the latest relevant TCMRs are indexed and retrieved in real time. The semantics expansion method and the multi-factor ranking model can enhance retrieval quality. The template-based visualization method can enhance availability and universality, with medical reports displayed via a friendly web interface.
Conclusions
Compared with current medical record retrieval systems, our system provides some advantages that are useful in improving the secondary use of large-scale traditional Chinese medical records in a cloud environment. The proposed system is more easily integrated with existing clinical systems and can be used in various scenarios.
19
Li L, Lu J, Xue W, Wang L, Zhai Y, Fan Z, Wu G, Fan F, Li J, Zhang C, Zhang Y, Zhao J. Target of obstructive sleep apnea syndrome merge lung cancer: based on big data platform. Oncotarget 2017; 8:21567-21578. [PMID: 28423489 PMCID: PMC5400607 DOI: 10.18632/oncotarget.15372] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.0] [Received: 10/20/2016] [Accepted: 01/16/2017] [Indexed: 11/26/2022]
Abstract
Based on our hospital database, the incidence of lung cancer diagnoses was similar in patients with obstructive sleep apnea syndrome (OSAS) and in the general hospital population; among individuals with a diagnosis of lung cancer, the presence of OSAS was associated with an increased risk of mortality. At the level of gene expression and network information, we revealed significant alterations of molecules related to HIF1 and metabolic pathways in hypoxia-conditioned lung cancer cells. We also observed that GBE1 and HK2 are downstream of the HIF1 pathway important in hypoxia-conditioned lung cancer cells. Furthermore, we used publicly available datasets to validate that late-stage lung adenocarcinoma patients showed higher expression of HK2 and GBE1 than early-stage ones. In terms of prognostic features, a survival analysis revealed that the group with high GBE1 and HK2 expression exhibited poorer survival among lung adenocarcinoma patients. By analyzing and integrating multiple datasets, we identify molecular convergence between hypoxia and lung cancer that reflects their clinical profiles and reveals molecular pathways involved in hypoxia-induced lung cancer progression. In conclusion, we show that OSAS severity appears to increase the risk of lung cancer mortality.
Affiliation(s)
- Lifeng Li
- Biotherapy Center, The First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, Henan, China; Department of Oncology, The First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, Henan, China; Department of Pharmacy, The First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, Henan, China
- Jingli Lu
- Department of Pharmacy, The First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, Henan, China
- Wenhua Xue
- Department of Pharmacy, The First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, Henan, China
- Liping Wang
- Department of Oncology, The First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, Henan, China
- Yunkai Zhai
- Engineering Research Center of Digital Medicine, Zhengzhou 450052, Henan, China; Engineering Laboratory for Digital Telemedicine Service, Zhengzhou 450052, Henan, China
- Zhirui Fan
- Department of Oncology, The First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, Henan, China
- Ge Wu
- Engineering Research Center of Digital Medicine, Zhengzhou 450052, Henan, China; Engineering Laboratory for Digital Telemedicine Service, Zhengzhou 450052, Henan, China
- Feifei Fan
- Biotherapy Center, The First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, Henan, China; Department of Oncology, The First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, Henan, China; Department of Respiratory and Sleep Disease, The First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, Henan, China
- Jieyao Li
- Biotherapy Center, The First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, Henan, China; Department of Oncology, The First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, Henan, China
- Chaoqi Zhang
- Biotherapy Center, The First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, Henan, China; Department of Oncology, The First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, Henan, China
- Yi Zhang
- Biotherapy Center, The First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, Henan, China; Department of Oncology, The First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, Henan, China
- Jie Zhao
- Department of Pharmacy, The First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, Henan, China; Engineering Research Center of Digital Medicine, Zhengzhou 450052, Henan, China; Engineering Laboratory for Digital Telemedicine Service, Zhengzhou 450052, Henan, China
20
Harpaz D, Eltzov E, Seet RCS, Marks RS, Tok AIY. Point-of-Care-Testing in Acute Stroke Management: An Unmet Need Ripe for Technological Harvest. BIOSENSORS 2017; 7:E30. [PMID: 28771209 PMCID: PMC5618036 DOI: 10.3390/bios7030030] [Citation(s) in RCA: 30] [Impact Index Per Article: 4.3] [Received: 07/04/2017] [Revised: 07/25/2017] [Accepted: 07/26/2017] [Indexed: 12/20/2022]
Abstract
Stroke, the second leading cause of death, is caused by an abrupt interruption of blood flow to the brain. The blood supply needs to be promptly restored to salvage brain tissue from irreversible neuronal death. Existing assessment of stroke patients is based largely on detailed clinical evaluation that is complemented by neuroimaging methods. However, emerging data point to the potential use of blood-derived biomarkers in aiding clinical decision-making, especially in the diagnosis of ischemic stroke, in triaging patients for acute reperfusion therapies, and in informing stroke mechanisms and prognosis. The demand for newer techniques that deliver individualized information on-site for incorporation into a time-sensitive workflow has grown. In this review, we examine the roles of a portable and easy-to-use point-of-care test (POCT) in shortening the time to treatment, classifying stroke subtypes and improving patient outcomes. We first examine the conventional stroke management workflow, then highlight situations where a bedside biomarker assessment might aid clinical decision-making. A novel stroke POCT approach is presented, which combines the use of quantitative and multiplex POCT platforms for the detection of specific stroke biomarkers, as well as data-mining tools to drive analytical processes. Further work is needed in the development of POCTs to fulfill an unmet need in acute stroke management.
Affiliation(s)
- Dorin Harpaz
- Department of Biotechnology Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel.
- School of Material Science & Engineering, Nanyang Technology University, 50 Nanyang Avenue, Singapore 639798, Singapore.
- Institute for Sports Research (ISR), Nanyang Technology University and Loughborough University, Nanyang Avenue, Singapore 639798, Singapore.
- Evgeni Eltzov
- Agriculture Research Organization (ARO), Volcani Centre, Rishon LeTsiyon 15159, Israel.
- Raymond C S Seet
- Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore, NUHS Tower Block, 1E Kent Ridge Road, Singapore 119228, Singapore.
- Robert S Marks
- Department of Biotechnology Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel.
- School of Material Science & Engineering, Nanyang Technology University, 50 Nanyang Avenue, Singapore 639798, Singapore.
- The National Institute for Biotechnology in the Negev, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel.
- The Ilse Katz Centre for Meso and Nanoscale Science and Technology, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel.
- Alfred I Y Tok
- School of Material Science & Engineering, Nanyang Technology University, 50 Nanyang Avenue, Singapore 639798, Singapore.
- Institute for Sports Research (ISR), Nanyang Technology University and Loughborough University, Nanyang Avenue, Singapore 639798, Singapore.
21
Delussu G, Lianas L, Frexia F, Zanetti G. A Scalable Data Access Layer to Manage Structured Heterogeneous Biomedical Data. PLoS One 2016; 11:e0168004. [PMID: 27936191 PMCID: PMC5148592 DOI: 10.1371/journal.pone.0168004] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 08/04/2016] [Accepted: 11/24/2016] [Indexed: 01/10/2023]
Abstract
This work presents a scalable data access layer, called PyEHR, designed to support the implementation of data management systems for secondary use of structured heterogeneous biomedical and clinical data. PyEHR adopts the openEHR's formalisms to guarantee the decoupling of data descriptions from implementation details and exploits structure indexing to accelerate searches. Data persistence is guaranteed by a driver layer with a common driver interface. Interfaces for two NoSQL Database Management Systems are already implemented: MongoDB and Elasticsearch. We evaluated the scalability of PyEHR experimentally through two types of tests, called "Constant Load" and "Constant Number of Records", with queries of increasing complexity on synthetic datasets of ten million records each, containing very complex openEHR archetype structures, distributed on up to ten computing nodes.
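The idea of a driver layer behind a common interface, combined with an index keyed on record structure, can be sketched roughly as follows. This is a minimal illustration only, not PyEHR's actual API: the names `RecordDriver`, `InMemoryDriver`, `structure_key` and the toy records are all hypothetical, and an in-memory store stands in for the MongoDB and Elasticsearch back ends named in the abstract.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


def structure_paths(node: Any, prefix: str = "") -> List[str]:
    """List the nested field paths of a record, ignoring the values."""
    paths: List[str] = []
    if isinstance(node, dict):
        for key in sorted(node):
            path = f"{prefix}/{key}"
            paths.append(path)
            paths.extend(structure_paths(node[key], path))
    return paths


def structure_key(record: Dict[str, Any]) -> str:
    """Canonical key shared by all records that have the same structure."""
    return "|".join(structure_paths(record))


class RecordDriver(ABC):
    """Hypothetical common driver interface (an assumption, not PyEHR's)."""

    @abstractmethod
    def save(self, record_id: str, record: Dict[str, Any]) -> None:
        ...

    @abstractmethod
    def find_by_structure(self, key: str) -> List[Dict[str, Any]]:
        ...


class InMemoryDriver(RecordDriver):
    """Stand-in back end; a real driver would wrap a database client."""

    def __init__(self) -> None:
        self._records: Dict[str, Dict[str, Any]] = {}
        self._index: Dict[str, List[str]] = {}

    def save(self, record_id: str, record: Dict[str, Any]) -> None:
        self._records[record_id] = record
        # Index the record under its structure so a query can skip
        # every record whose shape cannot match.
        self._index.setdefault(structure_key(record), []).append(record_id)

    def find_by_structure(self, key: str) -> List[Dict[str, Any]]:
        return [self._records[rid] for rid in self._index.get(key, [])]


driver: RecordDriver = InMemoryDriver()
driver.save("r1", {"bp": {"systolic": 120, "diastolic": 80}})
driver.save("r2", {"bp": {"systolic": 135, "diastolic": 85}})
driver.save("r3", {"pulse": {"rate": 70}})

bp_key = structure_key({"bp": {"systolic": 0, "diastolic": 0}})
bp_records = driver.find_by_structure(bp_key)
```

Because calling code depends only on `RecordDriver`, swapping the storage back end would not change the query logic, which is the decoupling the abstract describes.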
Affiliation(s)
- Luca Lianas
- Data-Intensive Computing Group, CRS4, Pula, Italy
22
Cui Y, Wu Z, Lu Y, Jin W, Dai X, Bai J. Effects of the performance management information system in improving performance: an empirical study in Shanghai Ninth People’s Hospital. SPRINGERPLUS 2016; 5:1785. [PMID: 27795927 PMCID: PMC5063828 DOI: 10.1186/s40064-016-3436-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/23/2016] [Accepted: 09/29/2016] [Indexed: 11/10/2022]
Abstract
Improving the performance of clinical departments is not only an important part of healthcare system reform in China but also an essential approach to better satisfying the growing Chinese demand for medical services. Performance management is vital to public hospitals in China. Several studies have been conducted on internal hospital performance management, but almost none of them consider the effects of informational tools. Therefore, we carried out an empirical study on the effects of using a performance management information system in Shanghai Ninth People’s Hospital. The main feature of the system is that it provides a real-time query platform for users to analyze and dynamically monitor the key performance indexes, detect problems promptly and make adjustments. We collected pivotal medical data on 35 clinical departments of this hospital from January 2013 until December 2014, 1 year before and after applying the performance management information system. Comparative analysis was conducted by statistical methods. The results show that the system helps improve the performance scores of clinical departments, lower the proportion of drug expenses, shorten the average length of hospital stay and increase the bed turnover rate. That is, as the volume of medical services increased, quality and efficiency greatly improved. In short, application of the performance management information system has a positive effect on improving the performance of clinical departments.
23
Big Data in Health: a Literature Review from the Year 2005. J Med Syst 2016; 40:209. [PMID: 27520614 DOI: 10.1007/s10916-016-0565-7] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 06/08/2016] [Accepted: 08/02/2016] [Indexed: 12/24/2022]
Abstract
The information stored in healthcare systems has increased over the last ten years, leading it to be considered Big Data. There is a wealth of health information ready to be analysed. However, the sheer volume raises a challenge for traditional methods. The aim of this article is to conduct a cutting-edge study on Big Data in healthcare from 2005 to the present. This literature review will help researchers to know how Big Data has developed in the health industry and open up new avenues for research. Information searches have been made on various scientific databases such as Pubmed, Science Direct, Scopus and Web of Science for Big Data in healthcare. The search criteria were "Big Data" and "health" with a date range from 2005 to the present. A total of 9724 articles were found on the databases. 9515 articles were discarded as duplicates or for not having a title of interest to the study. 209 articles were read, with the resulting decision that 46 were useful for this study. 52.6 % of the articles used were found in Science Direct, 23.7 % in Pubmed, 22.1 % through Scopus and the remaining 2.6 % through the Web of Science. Big Data has undergone extremely high growth since 2011 and its use is becoming compulsory in developed nations and in an increasing number of developing nations. Big Data is a step forward and a cost reducer for public and private healthcare.
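The screening counts reported above can be checked with a few lines of arithmetic. The sketch below uses only the figures stated in the abstract; the variable names and the `pct` helper are illustrative assumptions.

```python
# Counts as reported in the abstract.
found = 9724       # articles returned by the database searches
discarded = 9515   # duplicates or titles not of interest
included = 46      # articles judged useful after full reading

# Articles left for full reading after the first screening step.
read_in_full = found - discarded


def pct(part: int, whole: int) -> float:
    """Share of `whole` represented by `part`, as a percentage (1 decimal)."""
    return round(100 * part / whole, 1)


screening_rate = pct(read_in_full, found)     # share of hits read in full
inclusion_rate = pct(included, read_in_full)  # share of read articles kept

print(read_in_full, screening_rate, inclusion_rate)
```

The subtraction reproduces the 209 articles the abstract says were read, so the reported funnel is internally consistent at that step.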
24
Griss J, Perez-Riverol Y, Lewis S, Tabb DL, Dianes JA, del-Toro N, Rurik M, Walzer MW, Kohlbacher O, Hermjakob H, Wang R, Vizcaíno JA. Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets. Nat Methods 2016; 13:651-656. [PMID: 27493588 PMCID: PMC4968634 DOI: 10.1038/nmeth.3902] [Citation(s) in RCA: 114] [Impact Index Per Article: 14.3] [Received: 12/11/2015] [Accepted: 05/24/2016] [Indexed: 12/13/2022]
Abstract
Mass spectrometry (MS) is the main technology used in proteomics approaches. However, on average 75% of spectra analysed in an MS experiment remain unidentified. We propose to use spectrum clustering at a large scale to shed light on these unidentified spectra. The PRoteomics IDEntifications database (PRIDE) Archive is one of the largest MS proteomics public data repositories worldwide. By clustering all tandem MS spectra publicly available in PRIDE Archive, coming from hundreds of datasets, we were able to consistently characterize three distinct groups of spectra: 1) incorrectly identified spectra, 2) spectra correctly identified but below the set scoring threshold, and 3) truly unidentified spectra. Using a multitude of complementary analysis approaches, we were able to identify less than 20% of the consistently unidentified spectra. The complete spectrum clustering results are available through the new version of the PRIDE Cluster resource (http://www.ebi.ac.uk/pride/cluster). This resource is intended, among other aims, to encourage and simplify further investigation into these unidentified spectra.
Affiliation(s)
- Johannes Griss
- Division of Immunology, Allergy and Infectious Diseases, Department of Dermatology, Medical University of Vienna, Austria
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
- Yasset Perez-Riverol
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
- Steve Lewis
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
- David L. Tabb
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville
- José A. Dianes
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
- Noemi del-Toro
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
- Marc Rurik
- Dept. of Computer Science, University of Tübingen, Germany
- Center for Bioinformatics, University of Tübingen, Germany
- Mathias W. Walzer
- Dept. of Computer Science, University of Tübingen, Germany
- Center for Bioinformatics, University of Tübingen, Germany
- Oliver Kohlbacher
- Dept. of Computer Science, University of Tübingen, Germany
- Center for Bioinformatics, University of Tübingen, Germany
- Quantitative Biology Center, University of Tübingen, Germany
- Max Planck Institute for Developmental Biology, Germany
- Henning Hermjakob
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
- National Center for Protein Sciences, Beijing, China
- Rui Wang
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
- Juan Antonio Vizcaíno
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom