1
|
Ngo H, Fang H, Rumbut J, Wang H. Federated Fuzzy Clustering for Decentralized Incomplete Longitudinal Behavioral Data. IEEE Internet of Things Journal 2024; 11:14657-14670. [PMID: 38605934 PMCID: PMC11006372 DOI: 10.1109/jiot.2023.3343719] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/13/2024]
Abstract
The use of medical data for machine learning, including unsupervised methods such as clustering, is often restricted by privacy regulations such as the Health Insurance Portability and Accountability Act (HIPAA). Medical data is sensitive and highly regulated and anonymization is often insufficient to protect a patient's identity. Traditional clustering algorithms are also unsuitable for longitudinal behavioral health trials, which often have missing data and observe individual behaviors over varying time periods. In this work, we develop a new decentralized federated multiple imputation-based fuzzy clustering algorithm for complex longitudinal behavioral trial data collected from multisite randomized controlled trials over different time periods. Federated learning (FL) preserves privacy by aggregating model parameters instead of data. Unlike previous FL methods, this proposed algorithm requires only two rounds of communication and handles clients with varying numbers of time points for incomplete longitudinal data. The model is evaluated on both empirical longitudinal dietary health data and simulated clusters with different numbers of clients, effect sizes, correlations, and sample sizes. The proposed algorithm converges rapidly and achieves desirable performance on multiple clustering metrics. This new method allows for targeted treatments for various patient groups while preserving their data privacy and enables the potential for broader applications in the Internet of Medical Things.
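The idea of aggregating model parameters instead of data can be illustrated with a minimal sketch. This is not the authors' algorithm (which also involves multiple imputation and clients with differing numbers of time points); it only shows, under simplifying assumptions, how fuzzy cluster centers could be combined from per-client summary statistics. The function names `local_summary` and `aggregate_centers` are hypothetical.

```python
from typing import List, Tuple

# A client shares only per-cluster weighted sums, never raw records.
ClientSummary = Tuple[List[List[float]], List[float]]


def local_summary(points: List[List[float]],
                  memberships: List[List[float]],
                  m: float = 2.0) -> ClientSummary:
    """Compute per-cluster sums of u^m * x and u^m on one client."""
    n_clusters = len(memberships[0])
    dim = len(points[0])
    num = [[0.0] * dim for _ in range(n_clusters)]
    den = [0.0] * n_clusters
    for x, u in zip(points, memberships):
        for c in range(n_clusters):
            w = u[c] ** m
            den[c] += w
            for d in range(dim):
                num[c][d] += w * x[d]
    return num, den


def aggregate_centers(summaries: List[ClientSummary]) -> List[List[float]]:
    """Server side: pool the client sums into global fuzzy cluster centers."""
    n_clusters = len(summaries[0][1])
    dim = len(summaries[0][0][0])
    centers = []
    for c in range(n_clusters):
        total_w = sum(s[1][c] for s in summaries)
        centers.append(
            [sum(s[0][c][d] for s in summaries) / total_w for d in range(dim)]
        )
    return centers
```

Because the pooled numerators and denominators are plain sums, the server obtains the same centers it would get from the combined data, without any client revealing individual observations. The sketch assumes every cluster receives nonzero total weight.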
Affiliation(s)
- Hieu Ngo
- College of Engineering, University of Massachusetts Dartmouth, North Dartmouth, MA, 02747
| | - Hua Fang
- Department of Computer and Information Science, University of Massachusetts Dartmouth, North Dartmouth, MA, 02747 and the Department of Population and Quantitative Health Science, University of Massachusetts Chan Medical School, Worcester, MA 01655 USA
| | - Joshua Rumbut
- College of Engineering, University of Massachusetts Dartmouth, North Dartmouth, MA, 02747 and the Department of Population and Quantitative Health Science, University of Massachusetts Chan Medical School, Worcester, MA 01655 USA
| | - Honggang Wang
- Department of Graduate Computer Science and Engineering, Katz School of Science and Health, Yeshiva University, New York City, NY, 10033
| |
|
2
|
Muludi K, Setianingsih R, Sholehurrohman R, Junaidi A. Exploiting nearest neighbor data and fuzzy membership function to address missing values in classification. PeerJ Comput Sci 2024; 10:e1968. [PMID: 38660203 PMCID: PMC11042039 DOI: 10.7717/peerj-cs.1968] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2023] [Accepted: 03/07/2024] [Indexed: 04/26/2024]
Abstract
The accuracy of most classification methods is significantly affected by missing values. Therefore, this study aimed to propose a data imputation method to handle missing values through the application of nearest neighbor data and a fuzzy membership function, and to compare the results with standard methods. A total of five datasets related to classification problems obtained from the UCI Machine Learning Repository were used. The results showed that the proposed method had higher accuracy than standard imputation methods. Moreover, the triangular membership function performed better than the Gaussian fuzzy membership function. This showed that the combination of nearest neighbor data and a fuzzy membership function was more effective in handling missing values and improving classification accuracy.
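To make the two membership functions concrete, here is a minimal sketch of how triangular and Gaussian memberships could weight neighbor values during imputation. It is an illustration under assumptions, not the method evaluated in the paper: the helper `fuzzy_weighted_impute` and the choice to apply the membership function to neighbor distances are hypothetical.

```python
import math
from typing import Callable, List


def triangular(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership: 0 outside (a, c), rising linearly to 1 at b.

    Assumes a < b < c.
    """
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def gaussian(x: float, mean: float, sigma: float) -> float:
    """Gaussian membership centered at `mean` with spread `sigma`."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)


def fuzzy_weighted_impute(distances: List[float],
                          values: List[float],
                          membership: Callable[[float], float]) -> float:
    """Impute a value as the membership-weighted mean of neighbor values."""
    weights = [membership(d) for d in distances]
    total = sum(weights)
    if total == 0.0:
        # No neighbor has any membership: fall back to the plain mean.
        return sum(values) / len(values)
    return sum(w * v for w, v in zip(weights, values)) / total
```

For example, `fuzzy_weighted_impute([0.0, 0.5], [10.0, 20.0], lambda d: triangular(d, -1.0, 0.0, 1.0))` weights the closer neighbor twice as heavily as the farther one. The triangular function cuts off neighbors beyond its support entirely, whereas the Gaussian gives every neighbor some nonzero weight.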
Affiliation(s)
- Kurnia Muludi
- Informatics and Business Institute Darmajaya, Bandar Lampung, Lampung Province, Indonesia
| | - Revita Setianingsih
- Computer Science Department, Faculty of Science, Lampung University, Bandar Lampung, Lampung Province, Indonesia
| | - Ridho Sholehurrohman
- Computer Science Department, Faculty of Science, Lampung University, Bandar Lampung, Lampung Province, Indonesia
| | - Akmal Junaidi
- Computer Science Department, Faculty of Science, Lampung University, Bandar Lampung, Lampung Province, Indonesia
| |
|
3
|
Emmanuel T, Maupong T, Mpoeleng D, Semong T, Mphago B, Tabona O. A survey on missing data in machine learning. Journal of Big Data 2021; 8:140. [PMID: 34722113 PMCID: PMC8549433 DOI: 10.1186/s40537-021-00516-9] [Citation(s) in RCA: 90] [Impact Index Per Article: 30.0] [Received: 05/22/2021] [Accepted: 09/12/2021] [Indexed: 05/04/2023]
Abstract
Machine learning has been the cornerstone in analysing and extracting information from data, and the problem of missing values is often encountered. Missing values occur under various mechanisms, such as missing completely at random, missing at random or missing not at random. All these may result from system malfunction during data collection or human error during data pre-processing. Nevertheless, it is important to deal with missing values before analysing data, since ignoring or omitting missing values may result in biased or misinformed analysis. In the literature there have been several proposals for handling missing values. In this paper, we aggregate some of the literature on missing data, particularly focusing on machine learning techniques. We also give insight into how the machine learning approaches work by highlighting the key features of missing values imputation techniques, how they perform, their limitations and the kind of data they are most suitable for. We propose and evaluate two methods, the k nearest neighbor and an iterative imputation method (missForest) based on the random forest algorithm. Evaluation is performed on the Iris and novel power plant fan data with induced missing values at missingness rates of 5% to 20%. We show that both missForest and the k nearest neighbor can successfully handle missing values and offer some possible future research directions.
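As a rough illustration of k nearest neighbor imputation, the following sketch fills each missing entry with the mean of that feature over the k closest rows that observe it. It is a simplified stand-in, not the survey's implementation: the names `knn_impute` and `_distance` are hypothetical, and the distance rescaling for partially observed rows is one common convention among several.

```python
import math
from typing import List, Optional

Row = List[Optional[float]]


def _distance(a: Row, b: Row) -> float:
    """Euclidean distance over features observed in both rows.

    The squared sum is rescaled by (total features / shared features) so
    rows with few shared features are not favored artificially.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    sq = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(sq * len(a) / len(shared))


def knn_impute(rows: List[Row], k: int = 2) -> List[Row]:
    """Return a copy of `rows` with missing entries (None) imputed."""
    completed: List[Row] = []
    for i, row in enumerate(rows):
        filled = list(row)
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that observe feature j.
            donors = sorted(
                (_distance(row, other), other[j])
                for idx, other in enumerate(rows)
                if idx != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d < math.inf]
            # Leave the entry missing if no comparable donor exists.
            filled[j] = sum(nearest) / len(nearest) if nearest else None
        completed.append(filled)
    return completed
```

Distances are always computed from the original rows, so the order in which entries are filled does not affect the result.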
Affiliation(s)
- Tlamelo Emmanuel
- Department of Computer Science and Information Systems, Botswana International University of Science and Technology, Palapye, Botswana
| | - Thabiso Maupong
- Department of Computer Science and Information Systems, Botswana International University of Science and Technology, Palapye, Botswana
| | - Dimane Mpoeleng
- Department of Computer Science and Information Systems, Botswana International University of Science and Technology, Palapye, Botswana
| | - Thabo Semong
- Department of Computer Science and Information Systems, Botswana International University of Science and Technology, Palapye, Botswana
| | - Banyatsang Mphago
- Department of Computer Science and Information Systems, Botswana International University of Science and Technology, Palapye, Botswana
| | - Oteng Tabona
- Department of Computer Science and Information Systems, Botswana International University of Science and Technology, Palapye, Botswana
| |
|
4
|
Mahmud MS, Fang H, Carreiro S, Wang H, Boyer EW. Wearables technology for drug abuse detection: A survey of recent advancement. Smart Health 2019. [DOI: 10.1016/j.smhl.2018.09.002] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 12/20/2022]
|
5
|
Angier H, Huguet N, Marino M, Green B, Holderness H, Gold R, Hoopes M, DeVoe J. Observational study protocol for evaluating control of hypertension and the effects of social determinants. BMJ Open 2019; 9:e025975. [PMID: 30878987 PMCID: PMC6429873 DOI: 10.1136/bmjopen-2018-025975] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/28/2018] [Revised: 02/04/2019] [Accepted: 02/07/2019] [Indexed: 11/29/2022]
Abstract
INTRODUCTION Hypertension is a common chronic health condition. Having health insurance reduces hypertension risk; health insurance coverage could improve hypertension screening, treatment and management. The Medicaid eligibility expansion of the Affordable Care Act was ruled not to be required by the US Supreme Court. Subsequently, a 'natural experiment' was produced with some states expanding Medicaid eligibility while others did not. This presents a unique opportunity to learn whether and to what extent Medicaid expansion can affect healthcare access and services for patients at risk for and diagnosed with hypertension, and patients with undiagnosed hypertension. Additionally, social determinants of health (SDH), at both the individual- and community-level, influence diagnosis and care for hypertension and it is important to understand how they interact with health insurance coverage changes. METHODS/DESIGN We will use electronic health record (EHR) data from the Accelerating Data Value Across a National Community Health Center Network clinical data research network, which has data from community health centres in 22 states, some that did and some that did not expand Medicaid. Data include information on changes in health insurance, service receipt and health outcomes from 2012 through the most recent data available. We will include patients between the ages of 19 and 64 years (n=1 524 241) with ≥1 ambulatory visit to a community health centre. We will estimate differences in outcomes using difference-in-difference and difference-in-difference-in-difference approaches. We will test three-way interactions with insurance group, time and social determinants of health factors to compare the potential effect of gaining insurance on our proposed outcomes. ETHICS AND DISSEMINATION This study uses secondary data analysis and therefore approval for consent to participate was waived. The Institutional Review Board for OHSU approved this study. 
Approval reference number is: IRB00011858. We plan to disseminate our findings at relevant conferences, meetings and through peer-reviewed journals. TRIAL REGISTRATION NUMBER NCT03545763.
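The difference-in-difference and difference-in-difference-in-difference estimators named in the protocol reduce to simple arithmetic on group means. The sketch below shows that arithmetic only; the outcome rates are made-up numbers for illustration, not results from the study, and the function and variable names are hypothetical. A real analysis would use regression models with covariates and appropriate standard errors.

```python
def diff_in_diff(pre_treated: float, post_treated: float,
                 pre_control: float, post_control: float) -> float:
    """Change in the treated group minus change in the control group."""
    return (post_treated - pre_treated) - (post_control - pre_control)


def triple_diff(did_subgroup_a: float, did_subgroup_b: float) -> float:
    """Difference between two DiD estimates across a third dimension."""
    return did_subgroup_a - did_subgroup_b


# Made-up example: share of patients with controlled blood pressure,
# before and after expansion, in expansion vs non-expansion states.
did_group_a = diff_in_diff(0.50, 0.60, 0.50, 0.53)  # hypothetical subgroup A
did_group_b = diff_in_diff(0.55, 0.61, 0.55, 0.58)  # hypothetical subgroup B
ddd = triple_diff(did_group_a, did_group_b)
```

Here the third difference asks whether the expansion effect itself differs between the two subgroups, which mirrors the three-way interaction of insurance group, time and a social determinant described above.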
Affiliation(s)
- Heather Angier
- Family Medicine, Oregon Health & Science University, Portland, Oregon, USA
| | - Nathalie Huguet
- Family Medicine, Oregon Health & Science University, Portland, Oregon, USA
| | - Miguel Marino
- Family Medicine, Oregon Health & Science University, Portland, Oregon, USA
| | - Beverly Green
- Research, Kaiser Permanente Washington Health Research Institute, Seattle, Washington, USA
| | - Heather Holderness
- Family Medicine, Oregon Health & Science University, Portland, Oregon, USA
| | - Jennifer DeVoe
- Family Medicine, Oregon Health & Science University, Portland, Oregon, USA
| |
|
6
|
Fang H, Zhang Z. An Enhanced Visualization Method to Aid Behavioral Trajectory Pattern Recognition Infrastructure for Big Longitudinal Data. IEEE Transactions on Big Data 2018; 4:289-298. [PMID: 29888298 PMCID: PMC5990046 DOI: 10.1109/tbdata.2017.2653815] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 06/06/2023]
Abstract
Big longitudinal data provide more reliable information for decision making and are common in all kinds of fields. Trajectory pattern recognition is urgently needed to discover important structures in such data. Developing better and more computationally efficient visualization tools is crucial to guide this technique. This paper proposes an enhanced projection pursuit (EPP) method to better project and visualize the structures (e.g. clusters) of big high-dimensional (HD) longitudinal data on a lower-dimensional plane. Unlike classic PP methods potentially useful for longitudinal data, EPP is built upon nonlinear mapping algorithms to compute its stress (error) function by balancing the paired weights for between- and within-structure stress while preserving original structure membership in the high-dimensional space. Specifically, EPP solves an NP-hard optimization problem by integrating gradual optimization and nonlinear mapping algorithms, and automates the search for an optimal number of iterations to display a stable structure for varying sample sizes and dimensions. Using public UCI and real longitudinal clinical trial datasets as well as simulation, EPP demonstrates better performance in visualizing big HD longitudinal data.
Affiliation(s)
- Hua Fang
- Department of Computer and Information Science, Department of Mathematics, University of Massachusetts Dartmouth, 285 Old Westport Rd, Dartmouth, MA, 02747, and Department of Quantitative Health Sciences, University of Massachusetts Medical School, Worcester, MA, 01605
| | - Zhaoyang Zhang
- College of Engineering, University of Massachusetts Dartmouth and Department of Quantitative Health Sciences, University of Massachusetts Medical School
| |
|
7
|
Gurugubelli VS, Li Z, Wang H, Fang H. eFCM: An Enhanced Fuzzy C-Means Algorithm for Longitudinal Intervention Data. International Conference on Computing, Networking and Communications 2018; 2018:912-916. [PMID: 30906794 PMCID: PMC6428443 DOI: 10.1109/iccnc.2018.8390419] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/05/2022]
Abstract
Clustering methods have become increasingly important in analyzing heterogeneity of treatment effects, especially in longitudinal behavioral intervention studies. Methods such as K-means and Fuzzy C-means (FCM) have been widely endorsed to identify distinct groups in different types of data. Building upon our MIFuzzy [1], our goal is to concurrently handle multiple methodological issues in studying high-dimensional longitudinal intervention data with missing values. In particular, this paper focuses on the initialization issue of FCM and proposes a new initialization method to overcome the local optimum problem and decrease the convergence time in handling high-dimensional data with missing values for overlapping clusters. Based on the idea of K-means++ [9], we propose an enhanced Fuzzy C-means clustering (eFCM) and incorporate it into our MIFuzzy. This method was evaluated using real longitudinal intervention data as well as classic and generic datasets. Compared to conventional FCM, our findings indicate that eFCM can improve computational efficiency and avoid local optima.
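The combination described here, K-means++ style seeding followed by fuzzy C-means updates, can be sketched as follows. This shows the generic K-means++ seeding rule and the standard FCM membership update, not the eFCM procedure itself, whose details are not given in the abstract. The function names are hypothetical.

```python
import random
from typing import List

Point = List[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points: List[Point], k: int, rng: random.Random) -> List[Point]:
    """Pick k initial centers, favoring points far from those already chosen."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0.0:
            # Every point coincides with a chosen center; pick uniformly.
            centers.append(rng.choice(points))
            continue
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def fcm_memberships(points: List[Point], centers: List[Point],
                    m: float = 2.0) -> List[List[float]]:
    """Standard FCM update: u_ic is proportional to d_ic^(-2/(m-1))."""
    result = []
    for p in points:
        d2 = [sq_dist(p, c) for c in centers]
        if 0.0 in d2:
            # The point sits exactly on one or more centers.
            zeros = d2.count(0.0)
            result.append([1.0 / zeros if d == 0.0 else 0.0 for d in d2])
            continue
        inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
        total = sum(inv)
        result.append([v / total for v in inv])
    return result
```

Seeding in proportion to squared distance spreads the initial centers across the data, which is the mechanism by which K-means++ style initialization tends to reduce the risk of poor local optima compared with uniformly random starts.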
Affiliation(s)
- Venkata Sukumar Gurugubelli
- Department of Computer and Information Science, University of Massachusetts - Dartmouth, Dartmouth, MA, 02747
- Department of Quantitative Health Sciences, University of Massachusetts Medical School, Worcester, MA
| | - Zhouzhou Li
- Department of Electrical and Computer Engineering, University of Massachusetts - Dartmouth, Dartmouth, MA, 02747
- Department of Quantitative Health Sciences, University of Massachusetts Medical School, Worcester, MA
| | - Honggang Wang
- Department of Electrical and Computer Engineering, University of Massachusetts - Dartmouth, Dartmouth, MA, 02747
| | - Hua Fang
- Department of Computer and Information Science, University of Massachusetts - Dartmouth, Dartmouth, MA, 02747
- Department of Quantitative Health Sciences, University of Massachusetts Medical School, Worcester, MA
| |
|
8
|
Reza Soroushmehr SM, Najarian K. Transforming big data into computational models for personalized medicine and health care. Dialogues in Clinical Neuroscience 2017. [PMID: 27757067 PMCID: PMC5067150 DOI: 10.31887/dcns.2016.18.3/ssoroushmehr] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Indexed: 01/10/2023]
Abstract
Health care systems generate a huge volume of different types of data. Due to the complexity and challenges inherent in studying medical information, it is not yet possible to create a comprehensive model capable of considering all the aspects of health care systems. There are different points of view regarding what the most efficient approaches toward utilization of this data would be. In this paper, we describe the potential role of big data approaches in improving health care systems and review the most common challenges facing the utilization of health care big data.
Affiliation(s)
- S M Reza Soroushmehr
- Emergency Medicine Department, University of Michigan, Ann Arbor, Michigan, USA; University of Michigan Center for Integrative Research in Critical Care (MCIRCC), University of Michigan, Ann Arbor, Michigan, USA; Department of Computational Medicine and Bio-informatics, University of Michigan, Ann Arbor, Michigan, USA
| | - Kayvan Najarian
- Emergency Medicine Department, University of Michigan, Ann Arbor, Michigan, USA; University of Michigan Center for Integrative Research in Critical Care (MCIRCC), University of Michigan, Ann Arbor, Michigan, USA; Department of Computational Medicine and Bio-informatics, University of Michigan, Ann Arbor, Michigan, USA
| |
|
9
|
He D, Wang Z, Yang L, Dai W. Study on missing data imputation and modeling for the leaching process. Chem Eng Res Des 2017. [DOI: 10.1016/j.cherd.2017.05.023] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 11/30/2022]
|
10
|
Kim SS, Fang H, Bernstein K, Zhang Z, DiFranza J, Ziedonis D, Allison J. Acculturation, Depression, and Smoking Cessation: a trajectory pattern recognition approach. Tob Induc Dis 2017; 15:33. [PMID: 28747857 PMCID: PMC5525352 DOI: 10.1186/s12971-017-0135-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 07/28/2016] [Accepted: 07/06/2017] [Indexed: 12/02/2022]
Abstract
BACKGROUND Korean Americans are known for a high smoking prevalence within the Asian American population. This study examined the effects of acculturation and depression on Korean Americans' smoking cessation and abstinence. METHODS This is a secondary data analysis of a smoking cessation study that implemented eight weekly individualized counseling sessions of a culturally adapted cessation intervention for the treatment arm and a standard cognitive behavioral therapy for the comparison arm. Both arms also received nicotine patches for 8 weeks. A newly developed non-parametric trajectory pattern recognition model (MI-Fuzzy) was used to identify cognitive and behavioral response patterns to a smoking cessation intervention among 97 Korean American smokers (81 men and 16 women). RESULTS Three distinctive response patterns were revealed: (a) Culturally Adapted (CA), since all identified members received the culturally adapted intervention; (b) More Bicultural (MB), for having higher scores of bicultural acculturation; and (c) Less Bicultural (LB), for having lower scores of bicultural acculturation. The CA smokers were those from the treatment arm, while MB and LB groups were from the comparison arm. The LB group differed in depression from the CA and MB groups and no difference was found between the CA and MB groups. Although depression did not directly affect 12-month prolonged abstinence, the LB group was most depressed and achieved the lowest rate of abstinence (LB: 1.03%; MB: 5.15%; CA: 21.65%). CONCLUSION A culturally adaptive intervention should target Korean American smokers with a high level of depression and a low level of biculturalism to assist in their smoking cessation. TRIAL REGISTRATION NCT01091363. Registered 21 March 2010.
Affiliation(s)
- Sun S Kim
- University of Massachusetts, Boston, Boston, MA 02125 USA
| | - Hua Fang
- University of Massachusetts Dartmouth and Medical School Dartmouth, Dartmouth, MA 02747 USA
- Department of Computer and Information Science, College of Engineering, University of Massachusetts Dartmouth, Dion Building, Room 317 285 Old Westport Road Dartmouth, Dartmouth, MA 02747-2300 USA
- Division of Biostatistics and Health Services Research Department of Quantitative Health Sciences, University of Massachusetts Medical School, Albert Sherman Bldg, Office: AS8-2061, 368 Plantation St. Worcester, Dartmouth, MA 01605-0002 USA
| | - Kunsook Bernstein
- Hunter College, City University of New York, New York, New York 10010 USA
| | - Zhaoyang Zhang
- University of Massachusetts Dartmouth and Medical School Dartmouth, Dartmouth, MA 02747 USA
| | - Joseph DiFranza
- University of Massachusetts Dartmouth and Medical School Dartmouth, Dartmouth, MA 02747 USA
| | - Douglas Ziedonis
- University of California San Diego, Department of Psychiatry, 9500 Gilman Drive #0602, La Jolla, CA 92093-0602 USA
| | - Jeroan Allison
- University of Massachusetts Dartmouth and Medical School Dartmouth, Dartmouth, MA 02747 USA
| |
|
11
|
Abstract
Missing data are common in longitudinal observational and randomized controlled trials in smart health studies. Multiple-imputation based fuzzy clustering is an emerging non-parametric soft computing method, used for either semi-supervised or unsupervised learning. Multiple imputation (MI) has been widely used in missing data analyses, but has not yet been scrutinized for unsupervised learning methods, although they are important for explaining the heterogeneity of treatment effects. Built upon our previous work on MIFuzzy clustering, this paper introduces the MIFuzzy concepts and performance, and theoretically, empirically and numerically demonstrates how an MI-based approach can reduce the uncertainty of clustering accuracy in comparison to non- and single-imputation based clustering approaches. This paper advances our understanding of the utility and strength of the MIFuzzy clustering approach to processing incomplete longitudinal behavioral intervention data.
Affiliation(s)
- Hua Fang
- Department of Computer and Information Science, University of Massachusetts Dartmouth, Department of Quantitative Health Sciences, University of Massachusetts Medical School, Worcester, MA 01655
| |
|
12
|
Reza Soroushmehr SM. Transforming big data into computational models for personalized medicine and health care. Dialogues in Clinical Neuroscience 2016; 18:339-343. [PMID: 27757067 PMCID: PMC5067150] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2023]
Abstract
Health care systems generate a huge volume of different types of data. Due to the complexity and challenges inherent in studying medical information, it is not yet possible to create a comprehensive model capable of considering all the aspects of health care systems. There are different points of view regarding what the most efficient approaches toward utilization of this data would be. In this paper, we describe the potential role of big data approaches in improving health care systems and review the most common challenges facing the utilization of health care big data.
Affiliation(s)
- S. M. Reza Soroushmehr
- Emergency Medicine Department, University of Michigan, Ann Arbor, Michigan, USA; University of Michigan Center for Integrative Research in Critical Care (MCIRCC), University of Michigan, Ann Arbor, Michigan, USA; Department of Computational Medicine and Bio-informatics, University of Michigan, Ann Arbor, Michigan, USA
| |
|
13
|
Zhang Z, Fang H. Multiple- vs Non- or Single-Imputation based Fuzzy Clustering for Incomplete Longitudinal Behavioral Intervention Data. IEEE International Conference on Connected Health: Applications, Systems and Engineering Technologies 2016; 2016:219-228. [PMID: 29034067 PMCID: PMC5635859 DOI: 10.1109/chase.2016.19] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Indexed: 12/11/2022]
Abstract
Disentangling patients' behavioral variations is a critical step for better understanding an intervention's effects on individual outcomes. Missing data commonly exist in longitudinal behavioral intervention studies. Multiple imputation (MI) has been well studied for missing data analyses in the statistical field, but it has not yet been scrutinized for clustering or unsupervised learning, which are important techniques for explaining the heterogeneity of treatment effects. Built upon previous work on MI fuzzy clustering, this paper theoretically, empirically and numerically demonstrates how an MI-based approach can reduce the uncertainty of clustering accuracy in comparison to non- and single-imputation based clustering approaches. This paper advances our understanding of the utility and strength of the multiple-imputation (MI) based fuzzy clustering approach to processing incomplete longitudinal behavioral intervention data.
Affiliation(s)
- Zhaoyang Zhang
- Division of Biostatistics and Health Services Research, Department of Quantitative Health Science, University of Massachusetts Medical School, Worcester, MA 01655
| | - Hua Fang
- Division of Biostatistics and Health Services Research, Department of Quantitative Health Science, University of Massachusetts Medical School, Worcester, MA 01655
| |
|
14
|
Zhang Z, Fang H, Wang H. A New MI-Based Visualization Aided Validation Index for Mining Big Longitudinal Web Trial Data. IEEE Access 2016; 4:2272-2280. [PMID: 27482473 PMCID: PMC4963037 DOI: 10.1109/access.2016.2569074] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 06/06/2023]
Abstract
Web-delivered clinical trials generate big complex data. To help untangle the heterogeneity of treatment effects, unsupervised learning methods have been widely applied. However, identifying valid patterns is a priority but challenging issue for these methods. This paper, built upon our previous research on multiple imputation (MI)-based fuzzy clustering and validation, proposes a new MI-based Visualization-aided validation index (MIVOOS) to determine the optimal number of clusters for big incomplete longitudinal Web-trial data with inflated zeros. Different from a recently developed fuzzy clustering validation index, MIVOOS uses more suitable overlap and separation measures for Web-trial data and does not depend on the choice of fuzzifier, as the widely used Xie and Beni (XB) index does. Through optimizing the view angles of 3-D projections using Sammon mapping, the optimal 2-D projection-guided MIVOOS is obtained to better visualize and verify the patterns in conjunction with trajectory patterns. Compared with XB and VOS, our newly proposed MIVOOS shows its robustness in validating big Web-trial data under different missing data mechanisms using real and simulated Web-trial data.
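For reference, the Xie and Beni (XB) index that MIVOOS is compared against can be computed as below. This sketch implements only the standard XB definition (fuzzy within-cluster compactness divided by the sample size times the minimum squared separation between centers); it does not implement MIVOOS or VOS, whose formulas are not given in the abstract. The function names are hypothetical.

```python
from typing import List

Point = List[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def xie_beni(points: List[Point], centers: List[Point],
             memberships: List[List[float]], m: float = 2.0) -> float:
    """Xie-Beni index: lower values suggest compact, well-separated clusters.

    `memberships[k][c]` is the membership of point k in cluster c, and `m`
    is the fuzzifier. Requires at least two distinct centers.
    """
    compactness = sum(
        (u[c] ** m) * sq_dist(p, centers[c])
        for p, u in zip(points, memberships)
        for c in range(len(centers))
    )
    separation = min(
        sq_dist(centers[i], centers[j])
        for i in range(len(centers))
        for j in range(i + 1, len(centers))
    )
    return compactness / (len(points) * separation)
```

The fuzzifier `m` enters the numerator directly, so the same partition can score differently under different values of `m`; this is the dependence on the fuzzifier that the abstract contrasts with MIVOOS.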
Affiliation(s)
- Zhaoyang Zhang
- Department of Quantitative Health Science, University of Massachusetts Medical School, Worcester, MA 01605, USA
| | - Hua Fang
- Department of Quantitative Health Science, University of Massachusetts Medical School, Worcester, MA 01605, USA
| | - Honggang Wang
- Department of Electrical and Computer Engineering, University of Massachusetts Dartmouth, North Dartmouth, MA 02747, USA
| |
|