1
Chen H, Chen X, Peng L, Bai Y. Personalized Fair Split Learning for Resource-Constrained Internet of Things. Sensors (Basel) 2023; 24:88. [PMID: 38202949 PMCID: PMC10781178 DOI: 10.3390/s24010088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2023] [Revised: 12/03/2023] [Accepted: 12/20/2023] [Indexed: 01/12/2024]
Abstract
With the flourishing development of the Internet of Things (IoT), federated learning has garnered significant attention as a distributed learning method aimed at preserving the privacy of participant data. However, certain IoT devices, such as sensors, face challenges in effectively employing conventional federated learning approaches due to limited computational and storage resources, which hinder their ability to train complex local models. Additionally, in IoT environments, devices often face problems of data heterogeneity and uneven benefit distribution between them. To address these challenges, a personalized and fair split learning framework is proposed for resource-constrained clients. This framework first adopts a U-shaped structure, dividing the model to enable resource-constrained clients to offload subsets of the foundational model to a central server while retaining personalized model subsets locally to meet the specific personalized requirements of different clients. Furthermore, to ensure fair benefit distribution, a model-aggregation method with optimized aggregation weights is used. This method reasonably allocates model-aggregation weights based on the contributions of clients, thereby achieving collaborative fairness. Experimental results demonstrate that, in three distinct data heterogeneity scenarios, employing personalized training through this framework exhibits higher accuracy compared to existing baseline methods. Simultaneously, the framework ensures collaborative fairness, fostering a more balanced and sustainable cooperation among IoT devices.
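The contribution-based aggregation described in this abstract can be illustrated with a minimal sketch. The abstract does not say how the weights are optimized, so the proportional normalization below is an assumption made only for illustration, and the names `contribution_weights` and `aggregate` are hypothetical rather than taken from the paper.

```python
from typing import List


def contribution_weights(contributions: List[float]) -> List[float]:
    """Turn non-negative contribution scores into weights that sum to 1.

    Assumption: weights are simply proportional to the scores; the paper's
    actual weight optimization is not described in the abstract.
    """
    total = sum(contributions)
    if total <= 0:
        # No usable contribution signal: fall back to uniform weights.
        return [1.0 / len(contributions)] * len(contributions)
    return [c / total for c in contributions]


def aggregate(client_params: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted average of flat parameter vectors, one vector per client."""
    if len(client_params) != len(weights):
        raise ValueError("need exactly one weight per client")
    size = len(client_params[0])
    return [
        sum(w * params[i] for w, params in zip(weights, client_params))
        for i in range(size)
    ]


# Hypothetical example: the second client contributed three times as much.
weights = contribution_weights([1.0, 3.0])
merged = aggregate([[0.0, 4.0], [4.0, 0.0]], weights)
```

In this sketch a client with a larger contribution score pulls the aggregated parameters closer to its own, which is one simple way to read "collaborative fairness".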
Affiliation(s)
- Haitian Chen
  - College of Science, North China University of Science and Technology, Tangshan 063210, China; (H.C.)
  - Hebei Key Laboratory of Data Science and Application, Tangshan 063210, China
  - Tangshan Key Laboratory of Data Science, Tangshan 063210, China
- Xuebin Chen
  - College of Science, North China University of Science and Technology, Tangshan 063210, China; (H.C.)
  - Hebei Key Laboratory of Data Science and Application, Tangshan 063210, China
  - Tangshan Key Laboratory of Data Science, Tangshan 063210, China
- Lulu Peng
  - College of Science, North China University of Science and Technology, Tangshan 063210, China; (H.C.)
  - Hebei Key Laboratory of Data Science and Application, Tangshan 063210, China
  - Tangshan Key Laboratory of Data Science, Tangshan 063210, China
- Yuntian Bai
  - College of Science, North China University of Science and Technology, Tangshan 063210, China; (H.C.)
2
Tang J, Ding X, Hu D, Guo B, Shen Y, Ma P, Jiang Y. FedRAD: Heterogeneous Federated Learning via Relational Adaptive Distillation. Sensors (Basel) 2023; 23:6518. [PMID: 37514811 PMCID: PMC10385861 DOI: 10.3390/s23146518] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/18/2023] [Revised: 07/05/2023] [Accepted: 07/17/2023] [Indexed: 07/30/2023]
Abstract
As the development of the Internet of Things (IoT) continues, Federated Learning (FL) is gaining popularity as a distributed machine learning framework that does not compromise the data privacy of each participant. However, the data held by enterprises and factories in the IoT often have different distribution properties (Non-IID), leading to poor results in their federated learning. This problem causes clients to forget global knowledge during their local training phase, which tends to slow convergence and degrade accuracy. In this work, we propose a method named FedRAD, which is based on relational knowledge distillation and further enhances the mining of high-quality global knowledge by local models from a higher-dimensional perspective during their local training phase, so that they better retain global knowledge and avoid forgetting. At the same time, we devise an entropy-wise adaptive weights module (EWAW) to better regulate the proportion of loss in single-sample knowledge distillation versus relational knowledge distillation so that students can weigh losses based on predicted entropy and learn global knowledge more effectively. A series of experiments on CIFAR10 and CIFAR100 show that FedRAD has better performance in terms of convergence speed and classification accuracy compared to other advanced FL methods.
Affiliation(s)
- Jianwu Tang
  - College of Computer Science, Sichuan University, Chengdu 610065, China
  - Big Data Analysis and Fusion Application Technology Engineering Laboratory of Sichuan Province, Chengdu 610065, China
- Xuefeng Ding
  - College of Computer Science, Sichuan University, Chengdu 610065, China
  - Big Data Analysis and Fusion Application Technology Engineering Laboratory of Sichuan Province, Chengdu 610065, China
- Dasha Hu
  - College of Computer Science, Sichuan University, Chengdu 610065, China
  - Big Data Analysis and Fusion Application Technology Engineering Laboratory of Sichuan Province, Chengdu 610065, China
- Bing Guo
  - College of Computer Science, Sichuan University, Chengdu 610065, China
  - Big Data Analysis and Fusion Application Technology Engineering Laboratory of Sichuan Province, Chengdu 610065, China
- Yuncheng Shen
  - College of Physics and Information Engineering, Zhaotong University, Zhaotong 657000, China
- Pan Ma
  - College of Computer Science, Sichuan University, Chengdu 610065, China
  - Big Data Analysis and Fusion Application Technology Engineering Laboratory of Sichuan Province, Chengdu 610065, China
- Yuming Jiang
  - College of Computer Science, Sichuan University, Chengdu 610065, China
  - Big Data Analysis and Fusion Application Technology Engineering Laboratory of Sichuan Province, Chengdu 610065, China
3
Wei Y, Li L, Zhao X, Yang H, Sa J, Cao H, Cui Y. Cancer subtyping with heterogeneous multi-omics data via hierarchical multi-kernel learning. Brief Bioinform 2023; 24:6847203. [PMID: 36433785 DOI: 10.1093/bib/bbac488] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2022] [Revised: 09/14/2022] [Accepted: 10/15/2022] [Indexed: 11/27/2022]
Abstract
Differentiating cancer subtypes is crucial to guide personalized treatment and improve the prognosis for patients. Integrating multi-omics data can offer a comprehensive landscape of cancer biological processes and provide promising ways for cancer diagnosis and treatment. Taking the heterogeneity of different omics data types into account, we propose a hierarchical multi-kernel learning (hMKL) approach, a novel cancer molecular subtyping method to identify cancer subtypes by adopting a two-stage kernel learning strategy. In stage 1, we obtain a composite kernel, borrowing the cancer integration via multi-kernel learning (CIMLR) idea, by optimizing the kernel parameters for each individual omics data type. In stage 2, we obtain a final fused kernel through a weighted linear combination of the individual kernels learned in stage 1, using an unsupervised multiple kernel learning method. Based on the final fused kernel, k-means clustering is applied to identify cancer subtypes. Simulation studies show that hMKL outperforms the one-stage CIMLR method when there is data heterogeneity. hMKL can estimate the number of clusters correctly, which is the key challenge in subtyping. Application to two real data sets shows that hMKL identified meaningful subtypes and key cancer-associated biomarkers. The proposed method provides a novel toolkit for heterogeneous multi-omics data integration and cancer subtype identification.
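The stage 2 step, a weighted linear combination of per-omics kernels, can be sketched as follows. In hMKL the weights come from an unsupervised multiple kernel learning method; here they are simply passed in, and the function name `fuse_kernels` and the toy matrices are hypothetical illustrations, not the authors' code.

```python
from typing import List

Matrix = List[List[float]]


def fuse_kernels(kernels: List[Matrix], weights: List[float]) -> Matrix:
    """Return sum_m w_m * K_m with the weights normalized to sum to 1.

    Assumption: all kernels are square matrices over the same samples, and
    the weights are given rather than learned.
    """
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    normalized = [w / total for w in weights]
    n = len(kernels[0])
    fused = [[0.0] * n for _ in range(n)]
    for w, kernel in zip(normalized, kernels):
        for i in range(n):
            for j in range(n):
                fused[i][j] += w * kernel[i][j]
    return fused


# Toy kernels for two samples from two hypothetical omics layers.
k_expression = [[1.0, 0.0], [0.0, 1.0]]
k_methylation = [[1.0, 1.0], [1.0, 1.0]]
fused = fuse_kernels([k_expression, k_methylation], [1.0, 3.0])
```

The fused matrix would then be handed to a clustering step (k-means in the abstract), which is left out of this sketch.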
Affiliation(s)
- Yifang Wei
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi 030001, PR China
- Lingmei Li
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi 030001, PR China
- Xin Zhao
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi 030001, PR China
- Haitao Yang
  - Division of Health Statistics, School of Public Health, Hebei Medical University, Shijiazhuang, Hebei 050017, PR China
- Jian Sa
  - Department of Science and Technology, Shanxi Provincial Key Laboratory of Major Disease Risk Assessment, Shanxi Medical University, Taiyuan, Shanxi 030001, PR China
- Hongyan Cao
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi 030001, PR China
  - Department of Mathematics, Shanxi Medical University, Taiyuan, Shanxi 030001, PR China
- Yuehua Cui
  - Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
4
Huling JD, Yu M. Sufficient dimension reduction for populations with structured heterogeneity. Biometrics 2022; 78:1626-1638. [PMID: 34520573 DOI: 10.1111/biom.13546] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2020] [Revised: 07/27/2021] [Accepted: 08/06/2021] [Indexed: 12/30/2022]
Abstract
A key challenge in building effective regression models for large and diverse populations is accounting for patient heterogeneity. An example of such heterogeneity is in health system risk modeling efforts where different combinations of comorbidities fundamentally alter the relationship between covariates and health outcomes. Accounting for heterogeneity arising from combinations of factors can yield more accurate and interpretable regression models. Yet, in the presence of high-dimensional covariates, accounting for this type of heterogeneity can exacerbate estimation difficulties even with large sample sizes. To handle these issues, we propose a flexible and interpretable risk modeling approach based on semiparametric sufficient dimension reduction. The approach accounts for patient heterogeneity, borrows strength in estimation across related subpopulations to improve both estimation efficiency and interpretability, and can serve as a useful exploratory tool or as a powerful predictive model. In simulated examples, we show that our approach often improves estimation performance in the presence of heterogeneity and is quite robust to deviations from its key underlying assumptions. We demonstrate our approach in an analysis of hospital admission risk for a large health system and demonstrate its predictive power when tested on further follow-up data.
Affiliation(s)
- Jared D Huling
  - Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, USA
- Menggang Yu
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin, USA
5
Zhang M, Qu L, Singh P, Kalpathy-Cramer J, Rubin DL. SplitAVG: A Heterogeneity-Aware Federated Deep Learning Method for Medical Imaging. IEEE J Biomed Health Inform 2022; 26:4635-4644. [PMID: 35749336 PMCID: PMC9749741 DOI: 10.1109/jbhi.2022.3185956] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/08/2022]
Abstract
Federated learning is an emerging research paradigm for collaboratively training deep learning models without sharing patient data. However, data are usually heterogeneous across institutions, which may reduce the performance of models trained using federated learning. In this study, we propose a novel heterogeneity-aware federated learning method, SplitAVG, to overcome the performance drops from data heterogeneity in federated learning. Unlike previous federated methods that require complex heuristic training or hyperparameter tuning, our SplitAVG leverages simple network split and feature map concatenation strategies to encourage the federated model to train an unbiased estimator of the target data distribution. We compare SplitAVG with seven state-of-the-art federated learning methods, using centrally hosted training data as the baseline on a suite of both synthetic and real-world federated datasets. We find that the performance of models trained using all the comparison federated learning methods degraded significantly with increasing degrees of data heterogeneity. In contrast, the SplitAVG method achieves comparable results to the baseline method under all heterogeneous settings: it achieves 96.2% of the accuracy and 110.4% of the mean absolute error obtained by the baseline in a diabetic retinopathy binary classification dataset and a bone age prediction dataset, respectively, on highly heterogeneous data partitions. We conclude that the SplitAVG method can effectively overcome the performance drops from variability in data distributions across institutions. Experimental results also show that SplitAVG can be adapted to different base convolutional neural networks (CNNs) and generalized to various types of medical imaging tasks. The code is publicly available at https://github.com/zm17943/SplitAVG.
6
Kiser AC, Eilbeck K, Ferraro JP, Skarda DE, Samore MH, Bucher B. Standard Vocabularies to Improve Machine Learning Model Transferability With Electronic Health Record Data: Retrospective Cohort Study Using Health Care-Associated Infection. JMIR Med Inform 2022; 10:e39057. [PMID: 36040784 PMCID: PMC9472055 DOI: 10.2196/39057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2022] [Revised: 08/09/2022] [Accepted: 08/15/2022] [Indexed: 11/13/2022]
Abstract
BACKGROUND With the widespread adoption of electronic healthcare records (EHRs) by US hospitals, there is an opportunity to leverage this data for the development of predictive algorithms to improve clinical care. A key barrier in model development and implementation includes the external validation of model discrimination, which is rare and often results in worse performance. One reason why machine learning models are not externally generalizable is data heterogeneity. A potential solution to address the substantial data heterogeneity between health care systems is to use standard vocabularies to map EHR data elements. The advantage of these vocabularies is a hierarchical relationship between elements, which allows the aggregation of specific clinical features to more general grouped concepts. OBJECTIVE This study aimed to evaluate grouping EHR data using standard vocabularies to improve the transferability of machine learning models for the detection of postoperative health care-associated infections across institutions with different EHR systems. METHODS Patients who underwent surgery from the University of Utah Health and Intermountain Healthcare from July 2014 to August 2017 with complete follow-up data were included. The primary outcome was a health care-associated infection within 30 days of the procedure. EHR data from 0-30 days after the operation were mapped to standard vocabularies and grouped using the hierarchical relationships of the vocabularies. Model performance was measured using the area under the receiver operating characteristic curve (AUC) and F1-score in internal and external validations. To evaluate model transferability, a difference-in-difference metric was defined as the difference in performance drop between internal and external validations for the baseline and grouped models. RESULTS A total of 5775 patients from the University of Utah and 15,434 patients from Intermountain Healthcare were included. 
The prevalence of selected outcomes was from 4.9% (761/15,434) to 5% (291/5775) for surgical site infections, from 0.8% (44/5775) to 1.1% (171/15,434) for pneumonia, from 2.6% (400/15,434) to 3% (175/5775) for sepsis, and from 0.8% (125/15,434) to 0.9% (50/5775) for urinary tract infections. In all outcomes, the grouping of data using standard vocabularies resulted in a reduced drop in AUC and F1-score in external validation compared to baseline features (all P<.001, except urinary tract infection AUC: P=.002). The difference-in-difference metrics ranged from 0.005 to 0.248 for AUC and from 0.075 to 0.216 for F1-score. CONCLUSIONS We demonstrated that grouping machine learning model features based on standard vocabularies improved model transferability between data sets across 2 institutions. Improving model transferability using standard vocabularies has the potential to improve the generalization of clinical prediction models across the health care system.
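The difference-in-difference metric defined in METHODS is simple arithmetic and can be written out as a short sketch. The sign convention (a positive value means the grouped model lost less performance externally than the baseline model) is an assumption consistent with the positive ranges reported in RESULTS; the function names and the scores in the example are hypothetical, not values from the study.

```python
def performance_drop(internal: float, external: float) -> float:
    """Loss of performance (e.g. AUC or F1-score) when moving from internal to external validation."""
    return internal - external


def difference_in_difference(
    baseline_internal: float,
    baseline_external: float,
    grouped_internal: float,
    grouped_external: float,
) -> float:
    """Baseline drop minus grouped drop.

    Assumed sign convention: positive means the grouped features
    transferred better than the baseline features.
    """
    baseline_drop = performance_drop(baseline_internal, baseline_external)
    grouped_drop = performance_drop(grouped_internal, grouped_external)
    return baseline_drop - grouped_drop


# Hypothetical AUC values, chosen only to illustrate the arithmetic.
did = difference_in_difference(
    baseline_internal=0.75,
    baseline_external=0.5,
    grouped_internal=0.75,
    grouped_external=0.625,
)
```

With these made-up numbers the baseline model drops by 0.25 and the grouped model by 0.125, so the metric is 0.125.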
Affiliation(s)
- Amber C Kiser
  - Department of Biomedical Informatics, School of Medicine, University of Utah, Salt Lake City, UT, United States
- Karen Eilbeck
  - Department of Biomedical Informatics, School of Medicine, University of Utah, Salt Lake City, UT, United States
- Jeffrey P Ferraro
  - Department of Medicine, School of Medicine, University of Utah, Salt Lake City, UT, United States
- David E Skarda
  - Center for Value-Based Surgery, Intermountain Healthcare, Salt Lake City, UT, United States
  - Department of Surgery, School of Medicine, University of Utah, Salt Lake City, UT, United States
- Matthew H Samore
  - Department of Medicine, School of Medicine, University of Utah, Salt Lake City, UT, United States
  - Informatics, Decision-Enhancement and Analytic Sciences Center 2.0, Veterans Affairs Salt Lake City Health Care System, Salt Lake City, UT, United States
- Brian Bucher
  - Department of Biomedical Informatics, School of Medicine, University of Utah, Salt Lake City, UT, United States
  - Department of Surgery, School of Medicine, University of Utah, Salt Lake City, UT, United States
7
Rocha Filho GP, Brandão AH, Nobre RA, Meneguette RI, Freitas H, Gonçalves VP. HOsT: Towards a Low-Cost Fog Solution via Smart Objects to Deal with the Heterogeneity of Data in a Residential Environment. Sensors (Basel) 2022; 22:6257. [PMID: 36016017 PMCID: PMC9414299 DOI: 10.3390/s22166257] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/28/2022] [Revised: 07/11/2022] [Accepted: 07/20/2022] [Indexed: 06/15/2023]
Abstract
With the fast and unstoppable development of technology, the number of available technological devices and the amount of data they produce are overwhelming. In the context of a smart home, a diverse group of intelligent devices generating constant reports of environmental information is needed for the proper control of the house. Due to this demand, many possible solutions have been developed in the literature to address the need for processing power and storage capacity. This work proposes HOsT (home-context-aware fog-computing solution), a solution that addresses the problems of data heterogeneity and the interoperability of smart objects in the context of a smart home. HOsT was modeled to compose a set of intelligent objects to form a computational infrastructure in fog. A publish/subscribe communication module was implemented to abstract the details of communication between objects to disseminate heterogeneous information. A performance evaluation was carried out to validate HOsT. The results show evidence of efficiency in the communication infrastructure and in the impact of HOsT compared with a cloud infrastructure. Furthermore, HOsT scales with the number of devices acting simultaneously and demonstrates its ability to work with different devices.
Affiliation(s)
- Artur H. Brandão
  - Department of Computer Science, University of Brasília, Brasília 70910-900, Brazil
- Renato A. Nobre
  - Department of Computer Science, University of Brasília, Brasília 70910-900, Brazil
  - Computer Science Department, Università degli Studi di Milano, 20122 Milano, Italy
- Rodolfo I. Meneguette
  - Institute of Mathematical and Computer Sciences, University of São Paulo, São Carlos 13560-970, Brazil
- Vinícius P. Gonçalves
  - Electrical Engineering Department, University of Brasília, Brasília 70910-900, Brazil
8
Schreiner P, Velasquez MP, Gottschalk S, Zhang J, Fan Y. Unifying heterogeneous expression data to predict targets for CAR-T cell therapy. Oncoimmunology 2021; 10:2000109. [PMID: 34858726 PMCID: PMC8632331 DOI: 10.1080/2162402x.2021.2000109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2021] [Revised: 10/08/2021] [Accepted: 10/26/2021] [Indexed: 10/29/2022]
Abstract
Chimeric antigen receptor (CAR) T-cell therapy combines antigen-specific properties of monoclonal antibodies with the lytic capacity of T cells. An effective and safe CAR-T cell therapy strategy relies on identifying an antigen that has high expression and is tumor specific. This strategy has been successfully used to treat patients with CD19+ B-cell acute lymphoblastic leukemia (B-ALL). Finding a suitable target antigen for other cancers such as acute myeloid leukemia (AML) has proven challenging, as the majority of currently targeted AML antigens are also expressed on hematopoietic progenitor cells (HPCs) or mature myeloid cells. Herein, we developed a computational method to perform a data transformation to enable the comparison of publicly available gene expression data across different datasets or assay platforms. The resulting transformed expression values (TEVs) were used in our antigen prediction algorithm to assess suitable tumor-associated antigens (TAAs) that could be targeted with CAR-T cells. We validated this method by identifying B-ALL antigens with known clinical effectiveness, such as CD19 and CD22. Our algorithm predicted TAAs being currently explored preclinically and in clinical CAR-T AML therapy trials, as well as novel TAAs in pediatric megakaryoblastic AML. Thus, this analytical approach presents a promising new strategy to mine diverse datasets for identifying TAAs suitable for immunotherapy.
Affiliation(s)
- Patrick Schreiner
  - The Center for Applied Bioinformatics, St. Jude Children’s Research Hospital, Memphis, TN, USA
- Mireya Paulina Velasquez
  - Department of Bone Marrow Transplantation and Cell Therapy, St. Jude Children’s Research Hospital, Memphis, TN, USA
- Stephen Gottschalk
  - Department of Bone Marrow Transplantation and Cell Therapy, St. Jude Children’s Research Hospital, Memphis, TN, USA
- Jinghui Zhang
  - Department of Computational Biology, St. Jude Children’s Research Hospital, Memphis, TN, USA
- Yiping Fan
  - The Center for Applied Bioinformatics, St. Jude Children’s Research Hospital, Memphis, TN, USA
9
Chui KT, Gupta BB, Liu RW, Vasant P. Handling Data Heterogeneity in Electricity Load Disaggregation via Optimized Complete Ensemble Empirical Mode Decomposition and Wavelet Packet Transform. Sensors (Basel) 2021; 21:3133. [PMID: 33946443 DOI: 10.3390/s21093133] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/07/2021] [Revised: 04/24/2021] [Accepted: 04/26/2021] [Indexed: 11/16/2022]
Abstract
Global warming is a leading world issue driving the common social objective of reducing carbon emissions. People have witnessed the melting of ice and abrupt changes in climate. Reducing electricity usage is one possible method of slowing these changes. In recent decades, there have been massive worldwide rollouts of smart meters that automatically capture the total electricity usage of houses and buildings. Electricity load disaggregation (ELD) helps to break down total electricity usage into that of individual appliances. Studies have implemented ELD models based on various artificial intelligence techniques using a single ELD dataset. In this paper, a powerline noise transformation approach based on optimized complete ensemble empirical mode decomposition and wavelet packet transform (OCEEMD-WPT) is proposed to merge the ELD datasets. The practical implications are that the method increases the size of training datasets and provides mutual benefits when utilizing datasets collected from other sources (especially from different countries). To reveal the effectiveness of the proposed method, it was compared with CEEMD-WPT (fixed controlled coefficients), standalone CEEMD, standalone WPT, and other existing works. The results show that the proposed approach improves the signal-to-noise ratio (SNR) significantly.
10
Krassowski M, Das V, Sahu SK, Misra BB. State of the Field in Multi-Omics Research: From Computational Needs to Data Mining and Sharing. Front Genet 2020; 11:610798. [PMID: 33362867 PMCID: PMC7758509 DOI: 10.3389/fgene.2020.610798] [Citation(s) in RCA: 126] [Impact Index Per Article: 31.5] [Received: 09/27/2020] [Accepted: 11/20/2020] [Indexed: 12/24/2022]
Abstract
Multi-omics, variously called integrated omics, pan-omics, and trans-omics, aims to combine two or more omics data sets to aid in data analysis, visualization and interpretation to determine the mechanism of a biological process. Multi-omics efforts have taken center stage in biomedical research leading to the development of new insights into biological events and processes. However, the mushrooming of a myriad of tools, datasets, and approaches tends to inundate the literature and overwhelm researchers new to the field. The aims of this review are to provide an overview of the current state of the field, inform on available reliable resources, discuss the application of statistics and machine/deep learning in multi-omics analyses, discuss findable, accessible, interoperable, reusable (FAIR) research, and point to best practices in benchmarking. Thus, we provide guidance to interested users of the domain by addressing challenges of the underlying biology, giving an overview of the available toolset, addressing common pitfalls, and acknowledging current methods' limitations. We conclude with practical advice and recommendations on software engineering and reproducibility practices to share a comprehensive awareness with new researchers in multi-omics for end-to-end workflow.
Affiliation(s)
- Michal Krassowski
  - Nuffield Department of Women’s & Reproductive Health, University of Oxford, Oxford, United Kingdom
- Vivek Das
  - Novo Nordisk Research Center Seattle, Inc, Seattle, WA, United States
11
Chitoiu L, Dobranici A, Gherghiceanu M, Dinescu S, Costache M. Multi-Omics Data Integration in Extracellular Vesicle Biology-Utopia or Future Reality? Int J Mol Sci 2020; 21:ijms21228550. [PMID: 33202771 PMCID: PMC7697477 DOI: 10.3390/ijms21228550] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 10/15/2020] [Revised: 11/10/2020] [Accepted: 11/11/2020] [Indexed: 12/15/2022]
Abstract
Extracellular vesicles (EVs) are membranous structures derived from the endosomal system or generated by plasma membrane shedding. Due to their composition of DNA, RNA, proteins, and lipids, EVs have garnered a lot of attention as an essential mechanism of cell-to-cell communication, with various implications in physiological and pathological processes. EVs are not only a highly heterogeneous population by means of size and biogenesis, but they are also a source of diverse, functionally rich biomolecules. Recent advances in high-throughput processing of biological samples have facilitated the development of databases comprised of characteristic genomic, transcriptomic, proteomic, metabolomic, and lipidomic profiles for EV cargo. Despite the in-depth approach used to map functional molecules in EV-mediated cellular cross-talk, few integrative methods have been applied to analyze the molecular interplay in these targeted delivery systems. New perspectives arise from the field of systems biology, where accounting for heterogeneity may lead to finding patterns in an apparently random pool of data. In this review, we map the biological and methodological causes of heterogeneity in EV multi-omics data and present current applications or possible statistical methods for integrating such data while keeping track of the current bottlenecks in the field.
Affiliation(s)
- Leona Chitoiu
  - Ultrastructural Pathology and Bioimaging Laboratory, ‘Victor Babeș’ National Institute of Pathology, Bucharest 050096, Romania; (L.C.); (M.G.)
- Alexandra Dobranici
  - Department of Biochemistry and Molecular Biology, University of Bucharest, Bucharest 050095, Romania; (A.D.); (M.C.)
- Mihaela Gherghiceanu
  - Ultrastructural Pathology and Bioimaging Laboratory, ‘Victor Babeș’ National Institute of Pathology, Bucharest 050096, Romania; (L.C.); (M.G.)
  - Department of Cellular, Molecular Biology and Histology, ‘Carol Davila’ University of Medicine and Pharmacy, Bucharest 050474, Romania
- Sorina Dinescu
  - Department of Biochemistry and Molecular Biology, University of Bucharest, Bucharest 050095, Romania; (A.D.); (M.C.)
  - Research Institute of the University of Bucharest, University of Bucharest, Bucharest 050663, Romania
- Marieta Costache
  - Department of Biochemistry and Molecular Biology, University of Bucharest, Bucharest 050095, Romania; (A.D.); (M.C.)
  - Research Institute of the University of Bucharest, University of Bucharest, Bucharest 050663, Romania
12
Mavrogiorgou A, Kiourtis A, Perakis K, Pitsios S, Kyriazis D. IoT in Healthcare: Achieving Interoperability of High-Quality Data Acquired by IoT Medical Devices. Sensors (Basel) 2019; 19:s19091978. [PMID: 31035612 PMCID: PMC6539021 DOI: 10.3390/s19091978] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 03/27/2019] [Revised: 04/23/2019] [Accepted: 04/24/2019] [Indexed: 11/28/2022]
Abstract
It is an undeniable fact that Internet of Things (IoT) technologies have become a milestone advancement in the digital healthcare domain, since the number of IoT medical devices has grown exponentially, and it is now anticipated that by 2020 there will be over 161 million of them connected worldwide. Therefore, in an era of continuous growth, IoT healthcare faces various challenges, such as the collection, the quality estimation, as well as the interpretation and the harmonization of the data that derive from the existing huge numbers of heterogeneous IoT medical devices. Even though various approaches have been developed so far for solving each one of these challenges, none of them proposes a holistic approach for successfully achieving data interoperability between high-quality data that derive from heterogeneous devices. For that reason, in this manuscript a mechanism is proposed for effectively addressing the intersection of these challenges. Through this mechanism, the different devices’ datasets are first collected and then cleaned. Subsequently, the cleaning results are used to capture the overall data quality level of each dataset, in combination with measurements of the availability and reliability of the device that produced it. Consequently, only the high-quality data are kept and translated into a common format, ready for further use. The proposed mechanism is evaluated through a specific scenario, producing reliable results, achieving data interoperability of 100% accuracy, and data quality of more than 90% accuracy.
Affiliation(s)
- Argyro Mavrogiorgou
- Department of Digital Systems, University of Piraeus, M. Karaoli & A. Dimitriou 80, 18534 Piraeus, Greece.
- Athanasios Kiourtis
- Department of Digital Systems, University of Piraeus, M. Karaoli & A. Dimitriou 80, 18534 Piraeus, Greece.
- Stamatios Pitsios
- Singular Logic EU Projects Department, Achaias 3, 14564 Kifisia, Greece.
- Dimosthenis Kyriazis
- Department of Digital Systems, University of Piraeus, M. Karaoli & A. Dimitriou 80, 18534 Piraeus, Greece.
13
Abstract
Next-generation RNA sequencing (RNA-seq) technology has been widely used to assess full-length RNA isoform abundance in a high-throughput manner. RNA-seq data offer insight into gene expression levels and transcriptome structures, enabling us to better understand the regulation of gene expression and fundamental biological processes. Accurate isoform quantification from RNA-seq data is challenging due to the information loss in sequencing experiments. A recent accumulation of multiple RNA-seq data sets from the same tissue or cell type provides new opportunities to improve the accuracy of isoform quantification. However, existing statistical or computational methods for multiple RNA-seq samples either pool the samples into one sample or assign equal weights to the samples when estimating isoform abundance. These methods ignore the possible heterogeneity in the quality of different samples and could result in biased and unrobust estimates. In this article, we develop a method, which we call "joint modeling of multiple RNA-seq samples for accurate isoform quantification" (MSIQ), for more accurate and robust isoform quantification by integrating multiple RNA-seq samples under a Bayesian framework. Our method aims to (1) identify a consistent group of samples with homogeneous quality and (2) improve isoform quantification accuracy by jointly modeling multiple RNA-seq samples by allowing for higher weights on the consistent group. We show that MSIQ provides a consistent estimator of isoform abundance, and we demonstrate the accuracy and effectiveness of MSIQ compared with alternative methods through simulation studies on D. melanogaster genes. We justify MSIQ's advantages over existing approaches via application studies on real RNA-seq data from human embryonic stem cells, brain tissues, and the HepG2 immortalized cell line. We also perform a comprehensive analysis of how the isoform quantification accuracy would be affected by RNA-seq sample heterogeneity and different experimental protocols.
14
Di Salle P, Incerti G, Colantuono C, Chiusano ML. Gene co-expression analyses: an overview from microarray collections in Arabidopsis thaliana. Brief Bioinform 2017; 18:215-225. [PMID: 26891982 DOI: 10.1093/bib/bbw002] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 10/02/2015] [Indexed: 01/08/2023]
Abstract
Bioinformatics web-based resources and databases are precious references for most biological laboratories worldwide. However, the quality and reliability of the information they provide depends on them being used in an appropriate way that takes into account their specific features. Huge collections of gene expression data are currently publicly available, ready to support the understanding of gene and genome functionalities. In this context, tools and resources for gene co-expression analyses have flourished to exploit the 'guilty by association' principle, which assumes that genes with correlated expression profiles are functionally related. In the case of Arabidopsis thaliana, the reference species in plant biology, the resources available mainly consist of microarray results. After a general overview of such resources, we tested and compared the results they offer for gene co-expression analysis. We also discuss the effect on the results when using different data sets, as well as different data normalization approaches and parameter settings, which often consider different metrics for establishing co-expression. A dedicated example analysis of different gene pools, implemented by including/excluding mutant samples in a reference data set, showed significant variation of gene co-expression occurrence, magnitude and direction. We conclude that, as the heterogeneity of the resources and methods may produce different results for the same query genes, the exploration of more than one of the available resources is strongly recommended. The aim of this article is to show how best to integrate data sources and/or merge outputs to achieve robust analyses and reliable interpretations, thereby making use of diverse data resources an opportunity for added value.
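To illustrate the effect the abstract describes (co-expression changing in magnitude and direction when mutant samples are included or excluded), here is a minimal Python sketch for a single gene pair. It assumes Pearson correlation as the co-expression metric, which is only one of the metrics such resources may use. The expression values, the wild-type/mutant split, and the `pearson` helper are hypothetical and are not taken from the article.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression values for two genes (illustration only).
wild_type_a = [1.0, 2.0, 3.0, 4.0]
wild_type_b = [2.0, 3.0, 4.0, 5.0]

# Hypothetical mutant samples in which gene A is high and gene B is low.
mutant_a = [10.0, 11.0]
mutant_b = [0.5, 0.2]

# Co-expression computed on the wild-type samples only.
r_wild_type = pearson(wild_type_a, wild_type_b)

# Co-expression computed on the pooled reference set (wild type + mutants).
r_all = pearson(wild_type_a + mutant_a, wild_type_b + mutant_b)

print(f"wild type only: r = {r_wild_type:.2f}")
print(f"with mutants:   r = {r_all:.2f}")
```

In this toy data set the gene pair is positively correlated across the wild-type samples, but the correlation turns negative once the two mutant samples are pooled in. This is the kind of data-set dependence that motivates consulting more than one resource.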
Affiliation(s)
- Pasquale Di Salle
- Department of Agriculture, University of Naples Federico II, Portici, Italy
- Guido Incerti
- Dipartimento di Agraria, University of Naples Federico II, via Università, Portici (NA), Italy
- Chiara Colantuono
- Department of Agriculture, University of Naples Federico II, Portici, Italy
15
Lamm SH, Li J, Robbins SA, Dissen E, Chen R, Feinleib M. Are residents of mountain-top mining counties more likely to have infants with birth defects? The West Virginia experience. Birth Defects Res A Clin Mol Teratol 2015. [PMID: 25388330 DOI: 10.1002/bdra.23322] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 04/25/2023]
Abstract
BACKGROUND Pooled 1996 to 2003 birth certificate data for four central states in Appalachia indicated higher rates of infants with birth defects born to residents of counties with mountain-top mining (MTM) than born to residents of non-mining-counties (Ahern 2011). However, those analyses did not consider sources of uncertainty such as unbalanced distributions or quality of data. Quality issues have been a continuing problem with birth certificate analyses. We used 1990 to 2009 live birth certificate data for West Virginia to reassess this hypothesis. METHODS Forty-four hospitals contributed 98% of the MTM-county births and 95% of the non-mining-county births, of which six had more than 1000 births from both MTM and nonmining counties. Adjusted and stratified prevalence rate ratios (PRRs) were computed both by using Poisson regression and Mantel-Haenszel analysis. RESULTS Unbalanced distribution of hospital births was observed by mining groups. The prevalence rate of infants with reported birth defects, higher in MTM-counties (0.021) than in non-mining-counties (0.015), yielded a significant crude PRR (cPRR = 1.43; 95% confidence interval [CI] = 1.36-1.52) but a nonsignificant hospital-adjusted PRR (adjPRR = 1.08; 95% CI = 0.97-1.20; p = 0.16) for the 44 hospitals. So did the six hospital data analysis ([cPRR = 2.39; 95% CI = 2.15-2.65] and [adjPRR = 1.01; 95% CI, 0.89-1.14; p = 0.87]). CONCLUSION No increased risk of birth defects was observed for births from MTM-counties after adjustment for, or stratification by, hospital of birth. These results have consistently demonstrated that the reported association between birth defect rates and MTM coal mining was a consequence of data heterogeneity. The data do not demonstrate evidence of a "Mountain-top Mining" effect on the prevalence of infants with reported birth defects in WV.
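The gap between the crude and hospital-adjusted ratios can be illustrated with a small worked example. The rounded rates in the abstract give 0.021 / 0.015 = 1.4, in line with the reported crude PRR of 1.43 (which was presumably computed from unrounded counts). The Python sketch below uses invented counts for two hypothetical hospitals, not the study data, and applies the standard Mantel-Haenszel estimator for a stratified ratio. The `Stratum` class and function names are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    """Birth counts for one hospital (all numbers below are hypothetical)."""
    exposed_cases: int     # births with reported defects, MTM counties
    exposed_total: int     # all births, MTM counties
    unexposed_cases: int   # births with reported defects, non-mining counties
    unexposed_total: int   # all births, non-mining counties


def crude_prr(strata):
    """Prevalence rate ratio after pooling all strata, ignoring hospital."""
    exposed_rate = (sum(s.exposed_cases for s in strata)
                    / sum(s.exposed_total for s in strata))
    unexposed_rate = (sum(s.unexposed_cases for s in strata)
                      / sum(s.unexposed_total for s in strata))
    return exposed_rate / unexposed_rate


def mantel_haenszel_prr(strata):
    """Mantel-Haenszel prevalence rate ratio, stratified by hospital."""
    numerator = 0.0
    denominator = 0.0
    for s in strata:
        total = s.exposed_total + s.unexposed_total
        numerator += s.exposed_cases * s.unexposed_total / total
        denominator += s.unexposed_cases * s.exposed_total / total
    return numerator / denominator


# Hospital A reports defects at 4% for both groups and serves mostly MTM births.
# Hospital B reports defects at 1% for both groups and serves mostly non-mining births.
hospitals = [
    Stratum(exposed_cases=320, exposed_total=8000,
            unexposed_cases=80, unexposed_total=2000),
    Stratum(exposed_cases=20, exposed_total=2000,
            unexposed_cases=80, unexposed_total=8000),
]

crude = crude_prr(hospitals)
adjusted = mantel_haenszel_prr(hospitals)
print(f"crude PRR = {crude:.3f}, hospital-adjusted PRR = {adjusted:.3f}")
```

In this toy example, each hospital reports the same rate for both groups. The crude ratio exceeds 2 only because the high-reporting hospital handles most of the MTM births, and stratifying by hospital returns a ratio of 1. This mirrors the unbalanced distribution of hospital births that the abstract describes.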
Affiliation(s)
- Steven H Lamm
- Consultants in Epidemiology and Occupational Health (CEOH), LLC, Washington, District of Columbia, USA; Department of Health Policy and Management, Johns Hopkins University- Bloomberg School of Public Health, Baltimore, Maryland, USA; Department of Pediatrics, Georgetown University School of Medicine, Washington, District of Columbia, USA
16
Lamm SH, Li J, Robbins SA, Dissen E, Chen R, Feinleib M. Are residents of mountain-top mining counties more likely to have infants with birth defects? The West Virginia experience. Birth Defects Res A Clin Mol Teratol 2014; 103:76-84. [PMID: 25388330 DOI: 10.1002/bdra.23322] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Indexed: 01/09/2023]