1
Pastorini M, Rodríguez R, Etcheverry L, Castro A, Gorgoglione A. Enhancing environmental data imputation: A physically-constrained machine learning framework. Sci Total Environ 2024; 926:171773. [PMID: 38522546 DOI: 10.1016/j.scitotenv.2024.171773] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2023] [Revised: 03/14/2024] [Accepted: 03/15/2024] [Indexed: 03/26/2024]
Abstract
In water resources management, new computational capabilities have made it possible to develop integrated models to jointly analyze climatic conditions and water quantity/quality of the entire watershed system. Although the value of this integrated approach has been demonstrated so far, the limited availability of field data may hinder its applicability by causing high uncertainty in the model response. In this context, before collecting additional data, it is recommended first to recognize what improvement in model performance would occur if all available records could be well exploited. This work proposes a novel machine learning framework with physical constraints capable of successfully imputing a high percentage of missing data belonging to several environmental domains (meteorology, water quantity, water quality), yielding satisfactory results. In particular, the minimum Nash-Sutcliffe efficiency (NSE) computed for meteorologic variables is 0.72. For hydrometric variables, NSE is always >0.97. More than 78 % of the physical water-quality variables are characterized by NSE > 0.45, and >66 % of the chemical water-quality variables reach NSE > 0.35. This work's results demonstrate the proposed framework's effectiveness as a data augmentation tool to improve the performance of integrated environmental modeling.
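The NSE values quoted in this abstract can be read against the standard definition of the Nash-Sutcliffe efficiency, NSE = 1 - Σ(obs - sim)² / Σ(obs - mean(obs))². The following is a minimal sketch of that textbook formula, not the authors' code; the function name `nash_sutcliffe_efficiency` and its arguments are assumptions made for illustration.

```python
from statistics import fmean


def nash_sutcliffe_efficiency(observed, simulated):
    """Return 1 - SSE / (sum of squared deviations of observations from their mean).

    Hypothetical helper: 1.0 means a perfect match, 0.0 means the imputed
    series is no better than always predicting the observed mean.
    """
    if len(observed) != len(simulated) or not observed:
        raise ValueError("observed and simulated must be non-empty and equal in length")
    mean_obs = fmean(observed)
    squared_error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    if spread == 0:
        raise ValueError("NSE is undefined when all observations are identical")
    return 1.0 - squared_error / spread
```

Under this definition, a threshold such as NSE > 0.45 means the imputed values explain a clear share of the variance in the held-out observations, while still leaving considerable error.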
Affiliation(s)
- Marcos Pastorini
- Department of Computer Science, School of Engineering, Universidad de la República, Herreira y Reissig, 565, Montevideo 11300, Uruguay.
- Rafael Rodríguez
- Department of Fluid Mechanics and Environmental Engineering, School of Engineering, Universidad de la República, Herreira y Reissig, 565, Montevideo 11300, Uruguay.
- Lorena Etcheverry
- Department of Computer Science, School of Engineering, Universidad de la República, Herreira y Reissig, 565, Montevideo 11300, Uruguay.
- Alberto Castro
- Department of Computer Science, School of Engineering, Universidad de la República, Herreira y Reissig, 565, Montevideo 11300, Uruguay.
- Angela Gorgoglione
- Department of Fluid Mechanics and Environmental Engineering, School of Engineering, Universidad de la República, Herreira y Reissig, 565, Montevideo 11300, Uruguay.
2
Nandagopal M, Seerangan K, Govindaraju T, Abi NE, Balusamy B, Selvarajan S. A Deep Auto-Optimized Collaborative Learning (DACL) model for disease prognosis using AI-IoMT systems. Sci Rep 2024; 14:10280. [PMID: 38704423 DOI: 10.1038/s41598-024-59846-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2023] [Accepted: 04/16/2024] [Indexed: 05/06/2024] Open
Abstract
In modern healthcare, integrating Artificial Intelligence (AI) and Internet of Medical Things (IoMT) is highly beneficial and has made it possible to effectively control disease using networks of interconnected sensors worn by individuals. The purpose of this work is to develop an AI-IoMT framework for identifying several chronic diseases from the patients' medical records. For that, the Deep Auto-Optimized Collaborative Learning (DACL) Model, a brand-new AI-IoMT framework, has been developed for rapid diagnosis of chronic diseases like heart disease, diabetes, and stroke. Then, a Deep Auto-Encoder Model (DAEM) is used in the proposed framework to formulate the imputed and preprocessed data by determining the fields of characteristics or information that are lacking. To speed up classification training and testing, the Golden Flower Search (GFS) approach is then utilized to choose the best features from the imputed data. In addition, the cutting-edge Collaborative Bias Integrated GAN (ColBGaN) model has been created for precisely recognizing and classifying the types of chronic diseases from the medical records of patients. The loss function is optimally estimated during classification using the Water Drop Optimization (WDO) technique, reducing the classifier's error rate. Using some of the well-known benchmarking datasets and performance measures, the proposed DACL's effectiveness and efficiency in identifying diseases are evaluated and compared.
Affiliation(s)
- Malarvizhi Nandagopal
- Department of CSE, School of Computing, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai, Tamil Nadu, 600062, India
- Koteeswaran Seerangan
- Department of CSE (AI&ML), S.A. Engineering College (Autonomous), Chennai, Tamil Nadu, 600077, India
- Tamilmani Govindaraju
- Department of Computational Intelligence, SRM Institute of Science and Technology, Kattankulathur, Chennai, Tamil Nadu, 603203, India
- Neeba Eralil Abi
- Department of Information Technology, Rajagiri School of Engineering and Technology, Kochi, Kerala, 682039, India
- Balamurugan Balusamy
- Shiv Nadar (Institution of Eminence Deemed to be University), Greater Noida, Uttar Pradesh, 201314, India
- Shitharth Selvarajan
- Department of Computer Science, Kebri Dehar University, 250, Kebri Dehar, Ethiopia.
- School of Built Environment, Engineering and Computing, Leeds Beckett University, LS1 3HE, Leeds, UK.
3
Tao Z, Tanaka T, Zhao Q. Nonparametric tensor ring decomposition with scalable amortized inference. Neural Netw 2024; 169:431-441. [PMID: 37931474 DOI: 10.1016/j.neunet.2023.10.031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2023] [Revised: 10/15/2023] [Accepted: 10/22/2023] [Indexed: 11/08/2023]
Abstract
Multi-dimensional data are common in many applications, such as videos and multi-variate time series. While tensor decomposition (TD) provides promising tools for analyzing such data, there still remain several limitations. First, traditional TDs assume multi-linear structures of the latent embeddings, which greatly limits their expressive power. Second, TDs cannot be straightforwardly applied to datasets with massive samples. To address these issues, we propose a nonparametric TD with amortized inference networks. Specifically, we establish a non-linear extension of tensor ring decomposition, using neural networks, to model complex latent structures. To jointly model the cross-sample correlations and physical structures, a matrix Gaussian process (GP) prior is imposed over the core tensors. From a learning perspective, we develop a VAE-like amortized inference network to infer the posterior of core tensors corresponding to new tensor data, which enables TDs to be applied to large datasets. Our model can also be viewed as a kind of decomposition of VAE, which can additionally capture hidden tensor structure and enhance the expressive power. Finally, we derive an evidence lower bound such that a scalable optimization algorithm is developed. The advantages of our method have been evaluated extensively by data imputation on the Healing MNIST dataset and four multi-variate time series datasets.
Affiliation(s)
- Zerui Tao
- Department of Electronic and Information Engineering, Tokyo University of Agriculture and Technology, 184-8588, Tokyo, Japan; RIKEN Center for Advanced Intelligence Project (AIP), 103-0027, Tokyo, Japan.
- Toshihisa Tanaka
- Department of Electronic and Information Engineering, Tokyo University of Agriculture and Technology, 184-8588, Tokyo, Japan; RIKEN Center for Advanced Intelligence Project (AIP), 103-0027, Tokyo, Japan.
- Qibin Zhao
- Department of Electronic and Information Engineering, Tokyo University of Agriculture and Technology, 184-8588, Tokyo, Japan; RIKEN Center for Advanced Intelligence Project (AIP), 103-0027, Tokyo, Japan.
4
Su J, Reynier JB, Fu X, Zhong G, Jiang J, Escalante RS, Wang Y, Aparicio L, Izar B, Knowles DA, Rabadan R. Smoother: a unified and modular framework for incorporating structural dependency in spatial omics data. Genome Biol 2023; 24:291. [PMID: 38110959 PMCID: PMC10726548 DOI: 10.1186/s13059-023-03138-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2023] [Accepted: 12/04/2023] [Indexed: 12/20/2023] Open
Abstract
Spatial omics technologies can help identify spatially organized biological processes, but existing computational approaches often overlook structural dependencies in the data. Here, we introduce Smoother, a unified framework that integrates positional information into non-spatial models via modular priors and losses. In simulated and real datasets, Smoother enables accurate data imputation, cell-type deconvolution, and dimensionality reduction with remarkable efficiency. In colorectal cancer, Smoother-guided deconvolution reveals plasma cell and fibroblast subtype localizations linked to tumor microenvironment restructuring. Additionally, joint modeling of spatial and single-cell human prostate data with Smoother allows for spatial mapping of reference populations with significantly reduced ambiguity.
Affiliation(s)
- Jiayu Su
- Program for Mathematical Genomics, Columbia University, New York, NY, USA.
- Department of Systems Biology, Columbia University, New York, NY, USA.
- New York Genome Center, New York, NY, USA.
- Jean-Baptiste Reynier
- Program for Mathematical Genomics, Columbia University, New York, NY, USA
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Xi Fu
- Program for Mathematical Genomics, Columbia University, New York, NY, USA
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Guojie Zhong
- Department of Systems Biology, Columbia University, New York, NY, USA
- Jiahao Jiang
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
- Yiping Wang
- Program for Mathematical Genomics, Columbia University, New York, NY, USA
- Division of Hematology/Oncology, Department of Medicine, Herbert Irving Comprehensive Cancer Center, Columbia University Irving Medical Center, New York, NY, USA
- Luis Aparicio
- Program for Mathematical Genomics, Columbia University, New York, NY, USA
- Department of Systems Biology, Columbia University, New York, NY, USA
- Benjamin Izar
- Program for Mathematical Genomics, Columbia University, New York, NY, USA
- Division of Hematology/Oncology, Department of Medicine, Herbert Irving Comprehensive Cancer Center, Columbia University Irving Medical Center, New York, NY, USA
- David A Knowles
- Department of Systems Biology, Columbia University, New York, NY, USA
- New York Genome Center, New York, NY, USA
- Department of Computer Science, Columbia University, New York, NY, USA
- Raul Rabadan
- Program for Mathematical Genomics, Columbia University, New York, NY, USA.
- Department of Systems Biology, Columbia University, New York, NY, USA.
- Department of Biomedical Informatics, Columbia University, New York, NY, USA.
5
Sabat NK, Pati UC, Das SK. ABTCN: an efficient hybrid deep learning approach for atmospheric temperature prediction. Environ Sci Pollut Res Int 2023; 30:125295-125312. [PMID: 37418192 DOI: 10.1007/s11356-023-27985-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 12/28/2022] [Accepted: 05/25/2023] [Indexed: 07/08/2023]
Abstract
Temperature prediction is an important step for monitoring global warming and the environment to save and protect human lives. Climatological parameters such as temperature, pressure, and wind speed are time-series data and are well predicted with data-driven models. However, data-driven models have certain constraints, due to which these models are unable to handle the missing values and erroneous data caused by factors like sensor failure and natural disasters. In order to solve this issue, an efficient hybrid model, i.e., an attention-based bidirectional long short-term memory temporal convolution network (ABTCN) architecture, is proposed. ABTCN uses the k-nearest neighbor (KNN) imputation method for handling the missing data. A bidirectional long short-term memory (Bi-LSTM) network with a self-attention mechanism is combined with a temporal convolutional network (TCN) model, which aids in the extraction of features from complex data and the prediction of long data sequences. The performance of the proposed model is evaluated in comparison to various state-of-the-art deep learning models using error metrics such as MAE, MSE, RMSE, and R2 score. It is observed that our proposed model is superior to the other models, with high accuracy.
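As a rough illustration of the KNN imputation step mentioned in this abstract, the sketch below fills each missing entry with the mean of that feature over the k most similar rows. It is a simplified, dependency-free version written for this listing, not the ABTCN implementation; the function name `knn_impute`, the use of `None` for missing values, and the distance normalization are all assumptions.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the column mean over the k nearest donor rows.

    Distance is the root mean squared difference over features that both
    rows have observed (an assumed convention for this sketch).
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf  # no common features: rows are not comparable
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that observed feature j.
            donors = [
                (distance(row, other), other[j])
                for n, other in enumerate(rows)
                if n != i and other[j] is not None
            ]
            donors = [d for d in donors if d[0] < math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:  # leave the gap in place if no donor is comparable
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed
```

For example, with rows `[[1.0, 10.0], [2.0, 20.0], [1.1, None], [9.0, 90.0]]` and `k=2`, the missing entry is filled from the two rows whose first feature is closest to 1.1.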
Affiliation(s)
- Naba Krushna Sabat
- Department of Electronics and Communication Engineering, National Institute of Technology, Rourkela, Sector-1, Rourkela, 769008, Odisha, India
- Umesh Chandra Pati
- Department of Electronics and Communication Engineering, National Institute of Technology, Rourkela, Sector-1, Rourkela, 769008, Odisha, India
- Santos Kumar Das
- Department of Electronics and Communication Engineering, National Institute of Technology, Rourkela, Sector-1, Rourkela, 769008, Odisha, India.
6
Ferri P, Romero-Garcia N, Badenes R, Lora-Pablos D, Morales TG, Gómez de la Cámara A, García-Gómez JM, Sáez C. Extremely missing numerical data in Electronic Health Records for machine learning can be managed through simple imputation methods considering informative missingness: A comparative of solutions in a COVID-19 mortality case study. Comput Methods Programs Biomed 2023; 242:107803. [PMID: 37703700 DOI: 10.1016/j.cmpb.2023.107803] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Revised: 08/28/2023] [Accepted: 09/05/2023] [Indexed: 09/15/2023]
Abstract
BACKGROUND AND OBJECTIVE: Reusing Electronic Health Records (EHRs) for Machine Learning (ML) leads on many occasions to extremely incomplete and sparse tabular datasets, which can hinder the model development processes and limit their performance and generalization. In this study, we aimed to characterize the most effective data imputation techniques and ML models for dealing with highly missing numerical data in EHRs, in the case where only a very limited number of data are complete, as opposed to the usual case of having a reduced number of missing values. METHODS: We used a case study including full blood count laboratory data, demographic and survival data in the context of COVID-19 hospital admissions and evaluated 30 processing pipelines combining imputation methods with ML classifiers. The imputation methods included missing mask, translation and encoding, mean imputation, k-nearest neighbors imputation, Bayesian ridge regression imputation and generative adversarial imputation networks. The classifiers included k-nearest neighbors, logistic regression, random forest, gradient boosting and deep multilayer perceptron. RESULTS: Our results suggest that in the presence of highly missing data, combining translation and encoding imputation (which considers informative missingness) with tree ensemble classifiers (random forest and gradient boosting) is a sensible choice when aiming to maximize performance, in terms of area under curve. CONCLUSIONS: Based on our findings, we recommend the consideration of this imputer-classifier configuration when constructing models in the presence of extremely incomplete numerical data in EHR.
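The idea of informative missingness, where the fact that a value is absent is itself a useful signal, can be illustrated by pairing mean imputation with a missing mask, two of the methods named in this abstract. The sketch below is a generic illustration under that reading; it does not reproduce the paper's translation and encoding method, and the function name `mean_impute_with_mask` is an assumption.

```python
def mean_impute_with_mask(rows):
    """Replace None with the column mean and append one 0/1 indicator per column.

    The appended indicators let a downstream classifier see which values
    were originally missing. Columns with no observed value fall back to 0.0
    (an arbitrary choice made for this sketch).
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)

    augmented = []
    for row in rows:
        values = [means[j] if row[j] is None else row[j] for j in range(n_cols)]
        mask = [1 if row[j] is None else 0 for j in range(n_cols)]
        augmented.append(values + mask)
    return augmented
```

Each output row has twice as many columns as the input: the imputed values followed by the mask, so a tree ensemble can split directly on whether a laboratory value was recorded.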
Affiliation(s)
- Pablo Ferri
- Biomedical Data Science Lab, Instituto Universitario de Tecnologías de la Información y Comunicaciones, Universitat Politècnica de València, Camino de Vera s/n, Valencia 46022, Spain.
- Rafael Badenes
- Departament de Cirugia, Universitat de València, Spain; Instituto INCLIVA, Hospital Clínico Universitario de Valencia, Spain; Department Anesthesiology, Surgical-Trauma Intensive Care and Pain Clinic, Hospital Clínic Universitari, Valencia, Spain
- David Lora-Pablos
- Instituto de Investigación imas12, Hospital 12 de Octubre, Madrid, Spain; Facultad de Estudios Estadísticos, Universidad Complutense de Madrid, Spain
- Juan M García-Gómez
- Biomedical Data Science Lab, Instituto Universitario de Tecnologías de la Información y Comunicaciones, Universitat Politècnica de València, Camino de Vera s/n, Valencia 46022, Spain
- Carlos Sáez
- Biomedical Data Science Lab, Instituto Universitario de Tecnologías de la Información y Comunicaciones, Universitat Politècnica de València, Camino de Vera s/n, Valencia 46022, Spain
7
Choi J, Lim KJ, Ji B. Robust imputation method with context-aware voting ensemble model for management of water-quality data. Water Res 2023; 243:120369. [PMID: 37499538 DOI: 10.1016/j.watres.2023.120369] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/04/2023] [Revised: 07/06/2023] [Accepted: 07/14/2023] [Indexed: 07/29/2023]
Abstract
Water-quality monitoring and management are crucial for ensuring the safety and sustainability of water resources. However, missing data is a frequent problem in water-quality datasets, which can result in biased results in hydrological modeling and data analysis. While classic statistical methods and emerging machine/deep learning methods have been applied for imputing missing values, most existing studies perform well in specific missing scenarios, but not in universal scenarios. Therefore, existing imputation methods often fail to robustly impute missing values across various scenarios. To address the problem, we propose an imputation method that uses a context-aware voting-ensemble model to dynamically select optimal weights to integrate various imputation models across different missingness scenarios. We first identify the attributes of missingness scenarios that influence imputation accuracy. Then, after introducing missing values in collected data according to the missingness scenarios, we measure the accuracy of various imputation models across the missingness scenarios. Weights of imputation models are optimized by estimating non-linear functions with a regression model that can capture relationships between missingness scenarios and imputation accuracies of models. The final imputed value of the ensemble model for a missing scenario can be determined by multiplying each imputation model's weight by its imputed value, then summing the products. The method inherits the advantages of state-of-the-art imputation models, including the ability to learn long-term dependencies in time series, as well as the flexibility of using a dynamic weighting strategy to process various missingness scenarios. To validate the superiority of our method, we evaluate it on real-world water-quality data from a river in South Korea. The proposed method achieves higher accuracy and lower variation of imputed values than baseline models across various missingness scenarios. Furthermore, we showed the applicability of our method to various hydrological environments by validating our method on an industrial water-quality dataset. This study highlights the potential value of the ensemble model with dynamic weighting in robust imputation of water-quality data.
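The final combination step described in this abstract, multiplying each imputation model's weight by its imputed value and summing the products, can be sketched as follows. The weights here are supplied directly; in the paper they come from a regression model fitted on missingness-scenario attributes, which is not reproduced. The function name `ensemble_impute` and the normalization by the weight total are assumptions added for this sketch.

```python
def ensemble_impute(imputed_values, weights):
    """Combine per-model imputed values into one value by weighted sum.

    Dividing by the weight total is an assumption so that weights which do
    not sum exactly to 1 still yield a value on the scale of the inputs.
    """
    if len(imputed_values) != len(weights) or not weights:
        raise ValueError("need one weight per imputation model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * v for w, v in zip(weights, imputed_values)) / total
```

With two hypothetical base models proposing 10.0 and 20.0 for the same gap and weights of 0.25 and 0.75, the ensemble value is pulled toward the second model's estimate.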
Affiliation(s)
- Junhyuk Choi
- Department of Industrial and Management Engineering, Pohang University of Science and Technology (POSTECH), Republic of Korea
- Kyoung Jae Lim
- Department of Regional Infrastructure Engineering, Kangwon National University, Republic of Korea
- Bongjun Ji
- Department of Regional Infrastructure Engineering, Kangwon National University, Republic of Korea.
8
S H, V MA. An idiosyncratic MIMBO-NBRF based automated system for child birth mode prediction. Artif Intell Med 2023; 143:102621. [PMID: 37673564 DOI: 10.1016/j.artmed.2023.102621] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2023] [Revised: 05/11/2023] [Accepted: 07/01/2023] [Indexed: 09/08/2023]
Abstract
Predicting the mode of child birth has remained one of the most complex and challenging tasks since ancient times. Also, no strong methodologies have been developed in the conventional works for birth mode prediction. Therefore, the proposed work aims to develop a novel and distinct optimization-based machine learning technique for creating the child birth mode prediction system. This framework includes the modules of data imputation, feature selection, classification, and prediction. Initially, the data imputation process is performed to improve the quality of the dataset by normalizing the attributes and filling the missing fields. Then, the Multivariate Intensified Mine Blast Optimization (MIMBO) technique is implemented to choose the best set of features by estimating the optimal function. After that, an integrated Naïve Bayes - Random Forest (NBRF) technique is developed by incorporating the functions of conventional NB and RF techniques. As the novel contribution of this technique, a Bird Mating (BM) optimization technique is used in the NBRF classifier for estimating the likelihood parameter to generate the Bayesian rules. The main idea of this paper is to develop a simple as well as efficient automated system with the use of a hybrid machine learning model for predicting the mode of child birth. For this purpose, advanced algorithms such as MIMBO-based feature selection and NBRF-based classification are implemented in this work. Due to the inclusion of MIMBO and BM optimization techniques, the performance of the classifier is greatly improved, with low computational burden and increased prediction accuracy. Moreover, the proposed MIMBO-NBRF technique outperforms the existing child birth prediction methods, with an average accuracy of up to 99 %. In addition, some other parameters are also estimated and compared with the existing techniques for proving the overall superiority of the proposed framework.
Affiliation(s)
- Hemalatha S
- Department of Computer Science and Engineering, Sathyabama Institute of Science and Technology, Chennai 600 119, Tamilnadu, India.
- Maria Anu V
- Department of Computer Science and Engineering, Vellore Institute of Technology, Chennai, Tamilnadu, India
9
Bernardini M, Doinychko A, Romeo L, Frontoni E, Amini MR. A novel missing data imputation approach based on clinical conditional Generative Adversarial Networks applied to EHR datasets. Comput Biol Med 2023; 163:107188. [PMID: 37393785 DOI: 10.1016/j.compbiomed.2023.107188] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/27/2023] [Revised: 06/13/2023] [Accepted: 06/19/2023] [Indexed: 07/04/2023]
Abstract
The missing data mechanism is a relevant problem in Machine Learning (ML) and biomedical informatics communities. Real-world Electronic Health Record (EHR) datasets comprise several missing values, thus revealing a high level of spatiotemporal sparsity in the predictors' matrix. Several state-of-the-art approaches tried to deal with this problem by proposing different data imputation strategies that (i) are often unrelated to the ML model, (ii) are not conceived for EHR data where laboratory exams are not prescribed uniformly over time and the percentage of missing values is high, and (iii) exploit only univariate and linear information on the observed features. Our paper proposes a data imputation strategy based on a clinical conditional Generative Adversarial Network (ccGAN) capable of imputing missing values by exploiting non-linear and multivariate information across patients. Unlike other GAN data imputation-based approaches, our method deals explicitly with the high level of missingness of routine EHR data by conditioning the imputing strategy to the observable values and those fully-annotated. We demonstrated the statistical significance of the ccGAN relative to other state-of-the-art approaches in terms of imputation (around 19.79% of gain to the best competitor) and predictive performance (up to 1.60% of gain to the best competitor) on a real multi-diabetic centers dataset. We also demonstrated its robustness across different missingness rates (up to 1.61% of gain to the best competitor in the highest missingness rates condition) on an additional benchmark EHR dataset.
Affiliation(s)
- Michele Bernardini
- Department of Information Engineering (DII), Università Politecnica delle Marche, Ancona, Italy.
- Anastasiia Doinychko
- Grenoble Informatics Laboratory, Université Grenoble Alpes, Saint-Martin-d'Hères, France.
- Luca Romeo
- Department of Economics and Law, University of Macerata, Macerata, Italy.
- Emanuele Frontoni
- Department of Political Sciences, Communication and International Relations, University of Macerata, Macerata, Italy.
- Massih-Reza Amini
- Grenoble Informatics Laboratory, Université Grenoble Alpes, Saint-Martin-d'Hères, France.
10
Xi NM, Li JJ. Exploring the optimization of autoencoder design for imputing single-cell RNA sequencing data. Comput Struct Biotechnol J 2023; 21:4079-4095. [PMID: 37671239 PMCID: PMC10475479 DOI: 10.1016/j.csbj.2023.07.041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2023] [Revised: 07/22/2023] [Accepted: 07/31/2023] [Indexed: 09/07/2023] Open
Abstract
Autoencoders are the backbones of many imputation methods that aim to relieve the sparsity issue in single-cell RNA sequencing (scRNA-seq) data. The imputation performance of an autoencoder relies on both the neural network architecture and the hyperparameter choice. So far, literature in the single-cell field lacks a formal discussion on how to design the neural network and choose the hyperparameters. Here, we conducted an empirical study to answer this question. Our study used many real and simulated scRNA-seq datasets to examine the impacts of the neural network architecture, the activation function, and the regularization strategy on imputation accuracy and downstream analyses. Our results show that (i) deeper and narrower autoencoders generally lead to better imputation performance; (ii) the sigmoid and tanh activation functions consistently outperform other commonly used functions including ReLU; (iii) regularization improves the accuracy of imputation and downstream cell clustering and DE gene analyses. Notably, our results differ from common practices in the computer vision field regarding the activation function and the regularization strategy. Overall, our study offers practical guidance on how to optimize the autoencoder design for scRNA-seq data imputation.
Affiliation(s)
- Nan Miles Xi
- Department of Mathematics and Statistics, Loyola University Chicago, Chicago, IL 60660, USA
- Jingyi Jessica Li
- Department of Statistics and Data Science, University of California, Los Angeles, CA 90095-1554, USA
- Department of Human Genetics, University of California, Los Angeles, CA 90095-7088, USA
- Department of Computational Medicine, University of California, Los Angeles, CA 90095-1766, USA
- Department of Biostatistics, University of California, Los Angeles, CA 90095-1772, USA
11
Fu J, Abdel-Aty M, Yan X. Full data imputation for freeway time-specific safety performance functions' estimation. Accid Anal Prev 2023; 190:107178. [PMID: 37364362 DOI: 10.1016/j.aap.2023.107178] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2023] [Revised: 05/25/2023] [Accepted: 06/15/2023] [Indexed: 06/28/2023]
Abstract
Time-specific Safety Performance Functions (SPFs) were proposed to achieve accurate and dynamic crash frequency predictions. Unfortunately, some states do not have or archive the needed high-resolution traffic data to develop time-specific SPFs. This study proposes a novel iterative imputation method to impute the 100% missing volume and speed data from different states with similar crash rates. First, this study calculated the crash rates for 18 states and applied the One-Way Analysis of variance (ANOVA) test to group the states with similar crash rates. Second, as an example FL and VA, which both have traffic data, were used to test the proposed iterative imputation method. The results indicated that the imputed traffic data could capture the same traffic pattern as the real-collected traffic data. Further, the Mean Absolute Error (MAE) between the imputed Ln Volume and the real-collected Ln Volume for FL is only 2.47 vehicles for each segment for three hours. The MAE between the imputed Ln AvgSpeed and the real-collected Ln AvgSpeed for FL is only 1.36 mph. The Mean Absolute Percentage Error (MAPE) between the imputed Ln Volume and the real-collected Ln Volume is 11.07%. Meanwhile, the MAPE between the imputed Ln AvgSpeed and the real-collected Ln AvgSpeed is 7.40%. Finally, this study applied the proposed iterative imputation method to develop time-specific SPFs for the state without traffic data and compared the results. The results illustrated that the time-specific SPFs developed by imputed traffic data perfectly reflected the significant variables for both morning and afternoon peak models, with a prediction accuracy of 87.1% for the morning peak model.
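The two error measures reported in this abstract follow standard definitions: MAE is the mean of |actual - imputed|, and MAPE is the mean of |actual - imputed| / |actual| expressed as a percentage. The sketch below implements those textbook formulas; the function names are assumptions, and the numbers used with them would be made up for illustration rather than taken from the study.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference, in the same units as the data."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and equal in length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average absolute error relative to the actual value, as a percentage.

    Undefined when any actual value is zero, so that case is rejected.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and equal in length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)
```

MAE depends on the scale of the variable, while MAPE is scale-free, which is why the abstract can compare volume and speed errors on a common percentage basis.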
Affiliation(s)
- Jingwan Fu
- Department of Civil, Environmental, and Construction Engineering, Department of Statistics and Data Science, University of Central Florida (UCF), Orlando, FL 32816-2450, United States
- Mohamed Abdel-Aty
- Department of Civil, Environmental, and Construction Engineering, Department of Statistics and Data Science, University of Central Florida (UCF), Orlando, FL 32816-2450, United States
- Xin Yan
- Department of Civil, Environmental, and Construction Engineering, Department of Statistics and Data Science, University of Central Florida (UCF), Orlando, FL 32816-2450, United States
12
Huang W, Meir AY, Olapeju B, Wang G, Hong X, Venkataramani M, Cheng TL, Igusa T, Liang L, Wang X. Defining longitudinal trajectory of body mass index percentile and predicting childhood obesity: methodologies and findings in the Boston Birth Cohort. Precis Nutr 2023; 2:e00037. [PMID: 37745028 PMCID: PMC10513013 DOI: 10.1097/pn9.0000000000000037] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/06/2023] [Revised: 03/11/2023] [Accepted: 03/26/2023] [Indexed: 09/26/2023]
Abstract
Background: Overweight or obesity (OWO) in school-age childhood tends to persist into adulthood. This study aims to address a critical need for early identification of children at high risk of developing OWO by defining and analyzing longitudinal trajectories of body mass index percentile (BMIPCT) during early developmental windows. Methods: We included 3029 children from the Boston Birth Cohort (BBC) with repeated BMI measurements from birth to age 18 years. We applied locally weighted scatterplot smoothing with a time-limit scheme and predefined rules for imputation of missing data. We then used time-series K-means cluster analysis and latent class growth analysis to define longitudinal trajectories of BMIPCT from infancy up to age 18 years. Then, we investigated early life determinants of the BMI trajectories. Finally, we compared whether using early BMIPCT trajectories performs better than BMIPCT at a given age for predicting future risk of OWO. Results: After imputation, the percentage of missing data decreased from 36.0% to 10.1%. We identified four BMIPCT longitudinal trajectories: early onset OWO; late onset OWO; normal stable; and low stable. Maternal OWO, smoking, and preterm birth were identified as important determinants of the two OWO trajectories. Our predictive models showed that BMIPCT trajectories in early childhood (birth to age 1 or 2 years) were more predictive of childhood OWO (age 5-10 years) than a single BMIPCT at age 1 or 2 years. Conclusions: Using longitudinal BMIPCT data from birth to age 18 years, this study identified distinct BMIPCT trajectories, examined early life determinants of these trajectories, and demonstrated their advantages in predicting childhood risk of OWO over BMIPCT at a single time point.
Affiliation(s)
- Wanyu Huang
- Department of Civil and Systems Engineering, Johns Hopkins University Whiting School of Engineering, Baltimore, MD, USA
- Anat Yaskolka Meir
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Bolanle Olapeju
- School of Medicine, Uniformed Services University of the Health Sciences, Bethesda, MD, USA
- Guoying Wang
- Center on Early Life Origins of Disease, Department of Population, Family and Reproductive Health, Johns Hopkins University Bloomberg School of Public Health, Baltimore, MD, USA
- Xiumei Hong
- Center on Early Life Origins of Disease, Department of Population, Family and Reproductive Health, Johns Hopkins University Bloomberg School of Public Health, Baltimore, MD, USA
- Maya Venkataramani
- Department of Pediatrics, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Tina L. Cheng
- Cincinnati Children’s Hospital Medical Center and Department of Pediatrics, University of Cincinnati, Cincinnati, OH, USA
- Tak Igusa
- Department of Civil and Systems Engineering, Johns Hopkins University Whiting School of Engineering, Baltimore, MD, USA
- Liming Liang
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Xiaobin Wang
- Center on Early Life Origins of Disease, Department of Population, Family and Reproductive Health, Johns Hopkins University Bloomberg School of Public Health, Baltimore, MD, USA
- Department of Pediatrics, Johns Hopkins University School of Medicine, Baltimore, MD, USA
13
Awawdeh S, Rawashdeh H, Aljalodi H, Shamleh RA, Alshorman S. Vaginal birth after cesarean section prediction model for Jordanian population. Comput Biol Chem 2023; 104:107877. [PMID: 37182360 DOI: 10.1016/j.compbiolchem.2023.107877] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2022] [Revised: 04/08/2023] [Accepted: 04/29/2023] [Indexed: 05/16/2023]
Abstract
The rate of cesarean section has increased significantly worldwide, creating a group of women with one lower segment cesarean section who are concerned about the mode of delivery in their future pregnancies. This group of mothers will face a complex discussion because the likelihood of a successful vaginal birth after cesarean section provided to them is a general one. The probability of having a successful vaginal birth is the cornerstone of the mothers' decision. Therefore, providing a case-specific likelihood that respects the characteristics of each pregnancy will refine counseling, lower decisional conflict, and improve the success rate of vaginal birth trials, eventually improving maternal and fetal outcomes. This paper aims to develop a clinical decision support system to estimate an individualized likelihood for the mode of delivery for pregnant women with a previous lower segment cesarean section based on their unique characteristics. The study included 659 pregnant women, of whom 327 had records with missing values. Various pre-processing steps, including missing data imputation and feature selection, were applied to the original dataset before model development to improve the data quality. Missing values were handled first; then a feature selection process using a genetic algorithm was applied to select the relevant features and to exclude features that may have been affected negatively by missing data imputation. After that, four machine learning classifiers, namely Decision Tree, Random Forest, K-Nearest Neighbors (KNN), and Logistic Regression, were used to build the prediction model. The results showed that imputing missing values followed by feature selection was more efficient than deleting them, since the Area Under the Curve (AUC) increased from 0.655 to 0.812 using the KNN classifier.
Affiliation(s)
- Shatha Awawdeh
- King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan; School of Information Technology, Applied Science Private University, Amman, Jordan.
- Hasan Rawashdeh
- Department of Obstetrics and Gynaecology, Jordan University of Science and Technology, Jordan
- Haneen Aljalodi
- Department of Obstetrics and Gynaecology, Jordan University of Science and Technology, Jordan
- Rafeef Abu Shamleh
- Department of Obstetrics and Gynaecology, Jordan University of Science and Technology, Jordan
- Sumyah Alshorman
- Obstetrics and Gynecology Department, King Abdullah University Hospital, Al-Ramtha, Jordan
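The abstract above compares models by Area Under the Curve (AUC). A minimal sketch of what that number measures, assuming binary labels and classifier scores: AUC equals the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as half. The labels and scores below are hypothetical and unrelated to the study's data.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical outcomes (1 = successful vaginal birth) and model scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc(labels, scores))  # 0.75
```

Under this reading, an AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking.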
14
Wang J, Gong X, Hu M, Zhao L. Improved GSimp: A Flexible Missing Value Imputation Method to Support Regulatory Bioequivalence Assessment. Ann Biomed Eng 2023; 51:163-173. [PMID: 36107365 DOI: 10.1007/s10439-022-03070-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/28/2022] [Accepted: 08/30/2022] [Indexed: 01/13/2023]
Abstract
Missing values are not uncommon in in vivo bioequivalence (BE) studies and pose non-trivial challenges for BE assessment. Missing values typically appear as a mixture of different types, such as Missing Not at Random (MNAR) and Missing Completely at Random (MCAR); however, current data imputation methods were usually developed for a certain type of missing values (e.g., MNAR). Among them, an iterative Gibbs sampler-based left-censored missing value imputation approach (GSimp) was recently developed and showed superior performance over other methods in handling MNAR data. In this study, we introduce an improved GSimp ("Improved GSimp" hereafter) that offers flexibility in handling mixed types of missing data and better imputation accuracy to support BE assessment for studies with missing values. Simulations mimicking different missing value scenarios (e.g., mixture of different missing types and proportion of missing values) were conducted to compare the performance of the Improved GSimp with other methods (e.g., original GSimp and half of minimal value). Normalized root mean square error (NRMSE) was used to evaluate imputation accuracy. Our results showed that the Improved GSimp always had the best accuracy in all simulated scenarios compared to other methods.
Affiliation(s)
- Jing Wang
- Division of Quantitative Methods and Modeling, Office of Research and Standards, Office of Generic Drugs, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, MD, USA
- Xiajing Gong
- Division of Quantitative Methods and Modeling, Office of Research and Standards, Office of Generic Drugs, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, MD, USA
- Meng Hu
- Division of Quantitative Methods and Modeling, Office of Research and Standards, Office of Generic Drugs, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, 10903 New Hampshire Ave., Bldg 75, Room 4649, Silver Spring, MD, 20993-0002, USA.
- Liang Zhao
- Division of Quantitative Methods and Modeling, Office of Research and Standards, Office of Generic Drugs, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, MD, USA
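The abstract above names "half of minimal value" as a baseline and NRMSE as the accuracy measure. The sketch below illustrates both under stated assumptions: missing entries are marked with `None`, and NRMSE is taken as RMSE divided by the population standard deviation of the true values, which is one common normalisation and may differ from the one used in the paper. It does not implement GSimp or Improved GSimp, and the data are hypothetical.

```python
import math
import statistics


def half_min_impute(values):
    """Replace None entries with half of the smallest observed value."""
    observed = [v for v in values if v is not None]
    fill = min(observed) / 2.0
    return [fill if v is None else v for v in values]


def nrmse(truth, imputed):
    """RMSE divided by the population standard deviation of the true values."""
    mse = sum((t - i) ** 2 for t, i in zip(truth, imputed)) / len(truth)
    return math.sqrt(mse) / statistics.pstdev(truth)


# Hypothetical concentrations; the lowest value is left-censored (MNAR-like).
truth = [0.5, 2.0, 4.0, 8.0]
with_gaps = [None, 2.0, 4.0, 8.0]

filled = half_min_impute(with_gaps)
print(filled)
print(nrmse(truth, filled))
```

Half-minimum imputation only makes sense for left-censored values, which is why mixed MNAR/MCAR data, as discussed in the abstract, call for a more flexible method.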
15
Meng H, Tong X, Zheng Y, Xie G, Ji W, Hei X. Railway accident prediction strategy based on ensemble learning. Accid Anal Prev 2022; 176:106817. [PMID: 36057162 DOI: 10.1016/j.aap.2022.106817] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/31/2021] [Revised: 08/19/2022] [Accepted: 08/20/2022] [Indexed: 06/15/2023]
Abstract
Railway accident prediction is of great significance for establishing an early warning mechanism and preventing the occurrence of accidents. Safety agencies rely on prediction models to design railroad risk management strategies. Based on historical railway accident data, an ensemble learning strategy for accident prediction is proposed. First, an improved K-nearest neighbors (KNN) data imputation algorithm is proposed to solve the problem of missing data in the dataset. Then, to reduce the impact of imbalanced data on prediction performance, an AdaBoost-Bagging method is presented. Finally, according to the feature importance in the prediction model, accident features are ranked to identify new insights into the cause of the accident. The AdaBoost-Bagging prediction method is applied to the Federal Railroad Administration (FRA) dataset. The application results show that, compared with Artificial Neural Network (ANN), XGBoost, GBDT, Stacking and AdaBoost methods, the AdaBoost-Bagging method has a smaller prediction error and faster inference time in predicting railway accidents. Accuracy, Precision, Recall and F1-score are 0.879, 0.879, 0.883 and 0.881 respectively, and the inference time is reduced by 23.38%, 12.15%, 6.66%, 3.17% and 11.41% relative to the five compared methods, respectively. The prediction method can effectively mine important features of railway accidents without knowing the accident mechanism or the relationship between various railway accidents and factors; e.g., the critical risk factors related to derailment and collision accidents are investigated in the prediction. The findings will be helpful for the prevention and management of railway accidents.
Affiliation(s)
- Haining Meng
- School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, Shaanxi 710048, China; Shaanxi Key Lab Network Computer and Security Technology, Xi'an, Shaanxi 710048, China.
- Xinyu Tong
- School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, Shaanxi 710048, China
- Yi Zheng
- School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, Shaanxi 710048, China
- Guo Xie
- School of Automation and Information Engineering, Xi'an University of Technology, Xi'an, Shaanxi 710048, China
- Wenjiang Ji
- School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, Shaanxi 710048, China
- Xinhong Hei
- School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, Shaanxi 710048, China.
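The abstract above reports Precision, Recall and F1-score together. Since F1 is the harmonic mean of precision and recall, the reported values can be cross-checked with a few lines; the function name is an illustrative choice, and only the two input numbers come from the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
f1 = f1_score(0.879, 0.883)
print(round(f1, 3))  # 0.881
```

The result agrees with the F1-score of 0.881 stated in the abstract.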
16
Webber JW, Elias KM. Fast and robust imputation for miRNA expression data using constrained least squares. BMC Bioinformatics 2022; 23:145. [PMID: 35459087 PMCID: PMC9027475 DOI: 10.1186/s12859-022-04656-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2021] [Accepted: 03/29/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND High dimensional transcriptome profiling, whether through next generation sequencing techniques or high-throughput arrays, may result in scattered variables with missing data. Data imputation is a common strategy to maximize the inclusion of samples by using statistical techniques to fill in missing values. However, many data imputation methods are cumbersome and risk introduction of systematic bias. RESULTS We present a new data imputation method using constrained least squares and algorithms from the inverse problems literature and present applications for this technique in miRNA expression analysis. The proposed technique is shown to offer an imputation orders of magnitude faster, with greater than or equal accuracy when compared to similar methods from the literature. CONCLUSIONS This study offers a robust and efficient algorithm for data imputation, which can be used, e.g., to improve cancer prediction accuracy in the presence of missing data.
Affiliation(s)
- James W Webber
- Department of Oncology and Gynecology, Brigham and Women's Hospital, Boston, MA, USA.
- Kevin M Elias
- Department of Oncology and Gynecology, Brigham and Women's Hospital, Boston, MA, USA
17
Pan X, Li Z, Qin S, Yu M, Hu H. ScLRTC: imputation for single-cell RNA-seq data via low-rank tensor completion. BMC Genomics 2021; 22:860. [PMID: 34844559 DOI: 10.1186/s12864-021-08101-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/28/2021] [Accepted: 10/13/2021] [Indexed: 01/04/2023] Open
Abstract
BACKGROUND With single-cell RNA sequencing (scRNA-seq) methods, gene expression patterns at single-cell resolution can be revealed. However, owing to current technical limitations, dropout events in scRNA-seq lead to missing data and noise in the gene-cell expression matrix and adversely affect downstream analyses. Accordingly, the true gene expression level should be recovered before the downstream analysis is carried out. RESULTS In this paper, a novel low-rank tensor completion-based method, termed scLRTC, is proposed to impute the dropout entries of a given scRNA-seq expression matrix. It initially exploits the similarity of single cells to build a third-order low-rank tensor and employs tensor decomposition to denoise the data. Subsequently, it reconstructs the cell expression by adopting the low-rank tensor completion algorithm, which can restore the gene-to-gene and cell-to-cell correlations. ScLRTC is compared with other state-of-the-art methods on simulated datasets and real scRNA-seq datasets with different data sizes. On simulated datasets, scLRTC outperforms other methods in imputing the dropouts closest to the original expression values, as assessed by both the sum of squared error (SSE) and Pearson correlation coefficient (PCC). On real datasets, scLRTC achieves the most accurate cell classification results regardless of the choice of clustering method (e.g., SC3 or t-SNE followed by K-means), as evaluated using the adjusted Rand index (ARI) and normalized mutual information (NMI). Lastly, scLRTC is also demonstrated to be effective in cell visualization and in inferring cell lineage trajectories. CONCLUSIONS A novel low-rank tensor completion-based method, scLRTC, gave imputation results better than the state-of-the-art tools. Source code of scLRTC can be accessed at https://github.com/jianghuaijie/scLRTC .
18
Aureli D, Bruni R, Daraio C. Optimization methods for the imputation of missing values in Educational Institutions Data. MethodsX 2021; 8:101208. [PMID: 34434731 DOI: 10.1016/j.mex.2020.101208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2020] [Accepted: 12/30/2020] [Indexed: 11/20/2022] Open
Abstract
The imputation of missing values in the detail data of Educational Institutions is a difficult task. These data contain multivariate time series, which cannot be satisfactorily imputed by many existing imputation techniques. Moreover, almost all the data of an Institution are interconnected: the number of graduates is not independent from the number of students, the expenditure is not independent from the staff, etc. In other words, each imputed value has an impact on the whole set of data of the institution. Therefore, imputation techniques for this specific case should be designed very carefully. We describe here the methods and the codes of the imputation methodology developed to impute the various patterns of missing values which appear in similar interconnected data. In particular, a first part of the proposed methodology, called "trend smoothing imputation", is designed to impute missing values in time series by respecting the trend and the other features of an Institution. The second part of the proposed methodology, called "donor imputation", is designed to impute larger chunks of missing data by using values taken from similar Institutions in order to respect again their size and trend.
- Trend smoothing imputation can handle missing subsequences in time series, and is given by a weighted combination of: (a) weighted average of the other available values of the sequence, and (b) linear regression.
- Donor imputation can handle full sequence missing in time series. It imputes the Recipient Institution using the values taken from a similar institution, called Donor, selected using optimization criteria.
- The values imputed by our techniques should respect the trend, the size and the ratios of each Institution.
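The trend smoothing step summarised above combines a weighted average of the available values with a linear regression. The following is only a minimal sketch of that idea, not the authors' code: the inverse-distance weights, the mixing parameter `alpha`, the `None` marker for missing values and the function names are all assumptions introduced for illustration. It requires at least two observed points.

```python
def linear_fit(ts, ys):
    """Ordinary least-squares line through (ts, ys); returns (slope, intercept)."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_t


def trend_smoothing_impute(series, alpha=0.5):
    """Fill None entries with alpha * weighted average + (1 - alpha) * regression value."""
    ts = [t for t, y in enumerate(series) if y is not None]
    ys = [y for y in series if y is not None]
    slope, intercept = linear_fit(ts, ys)
    out = list(series)
    for t0, value in enumerate(series):
        if value is None:
            # Assumed weighting: observations closer in time count more.
            weights = [1.0 / abs(t - t0) for t in ts]
            average = sum(w * v for w, v in zip(weights, ys)) / sum(weights)
            regression = intercept + slope * t0
            out[t0] = alpha * average + (1 - alpha) * regression
    return out


# Hypothetical yearly student counts with one missing year.
students = [10.0, 12.0, None, 16.0, 18.0]
print(trend_smoothing_impute(students))
```

On this perfectly linear toy series both components agree, so the missing year is filled with 14 (up to floating-point rounding); on real data the two components pull the estimate toward local level and overall trend respectively.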
19
Yoon JH, Dias S, Hahn S. A method for assessing robustness of the results of a star-shaped network meta-analysis under the unidentifiable consistency assumption. BMC Med Res Methodol 2021; 21:113. [PMID: 34074239 PMCID: PMC8171049 DOI: 10.1186/s12874-021-01290-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2020] [Accepted: 04/21/2021] [Indexed: 11/16/2022] Open
Abstract
Background In a star-shaped network, pairwise comparisons link treatments with a reference treatment (often placebo or standard care), but not with each other. Thus, comparisons between non-reference treatments rely on indirect evidence, and are based on the unidentifiable consistency assumption, limiting the reliability of the results. We suggest a method of performing a sensitivity analysis through data imputation to assess the robustness of results with an unknown degree of inconsistency. Methods The method involves imputation of data for randomized controlled trials comparing non-reference treatments, to produce a complete network. The imputed data simulate a situation that would allow mixed treatment comparison, with a statistically acceptable extent of inconsistency. By comparing the agreement between the results obtained from the original star-shaped network meta-analysis and the results after incorporating the imputed data, the robustness of the results of the original star-shaped network meta-analysis can be quantified and assessed. To illustrate this method, we applied it to two real datasets and some simulated datasets. Results Applying the method to the star-shaped network formed by discarding all comparisons between non-reference treatments from a real complete network, 33% of the results from the analysis incorporating imputed data under acceptable inconsistency indicated that the treatment ranking would be different from the ranking obtained from the star-shaped network. Through a simulation study, we demonstrated the sensitivity of the results after data imputation for a star-shaped network with different levels of within- and between-study variability. An extended usability of the method was also demonstrated by another example where some head-to-head comparisons were incorporated. 
Conclusions Our method will serve as a practical technique to assess the reliability of results from a star-shaped network meta-analysis under the unverifiable consistency assumption. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-021-01290-1.
Affiliation(s)
- Jeong-Hwa Yoon
- Interdisciplinary Program in Medical Informatics, Seoul National University College of Medicine, Seoul, South Korea; Institute of Health Policy and Management, Medical Research Center, Seoul National University, Seoul, South Korea
- Sofia Dias
- Centre for Reviews and Dissemination, University of York, York, UK
- Seokyung Hahn
- Institute of Health Policy and Management, Medical Research Center, Seoul National University, Seoul, South Korea; Department of Human Systems Medicine, Medical Statistics Laboratory, Seoul National University College of Medicine, 103 Daehak-ro, Jongno-gu, Seoul, 03080, South Korea.
20
Peralta M, Jannin P, Haegelen C, Baxter JSH. Data imputation and compression for Parkinson's disease clinical questionnaires. Artif Intell Med 2021; 114:102051. [PMID: 33875162 DOI: 10.1016/j.artmed.2021.102051] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/13/2019] [Revised: 01/27/2021] [Accepted: 02/21/2021] [Indexed: 10/22/2022]
Abstract
Medical questionnaires are a valuable source of information but are often difficult to analyse due to both their size and the high possibility of them having missing values. This is a problematic issue in biomedical data science as it may complicate how individual questionnaire data is represented for statistical or machine learning analysis. In this paper, we propose a deeply-learnt residual autoencoder to simultaneously perform non-linear data imputation and dimensionality reduction. We present an extensive analysis of the dynamics of the performance of this autoencoder regarding the compression rate and the proportion of missing values. This method is evaluated on motor and non-motor clinical questionnaires of the Parkinson's Progression Markers Initiative (PPMI) database and consistently outperforms linear coupled imputation and reduction approaches.
Affiliation(s)
- Maxime Peralta
- Laboratoire Traitement du Signal et de l'Image - INSERM UMR 1099, Université de Rennes 1, F-35000 Rennes, France
- Pierre Jannin
- Laboratoire Traitement du Signal et de l'Image - INSERM UMR 1099, Université de Rennes 1, F-35000 Rennes, France
- Claire Haegelen
- Laboratoire Traitement du Signal et de l'Image - INSERM UMR 1099, Université de Rennes 1, F-35000 Rennes, France; Neurosurgery Department, Centre Hospitalier Universitaire de Rennes, F-35000 Rennes, France
- John S H Baxter
- Laboratoire Traitement du Signal et de l'Image - INSERM UMR 1099, Université de Rennes 1, F-35000 Rennes, France.
21
Bruni R, Daraio C, Aureli D. Information reconstruction in educational institutions data from the European tertiary education registry. Data Brief 2021; 34:106611. [PMID: 33364267 PMCID: PMC7750319 DOI: 10.1016/j.dib.2020.106611] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/28/2020] [Revised: 11/25/2020] [Accepted: 11/26/2020] [Indexed: 11/24/2022] Open
Abstract
Universities and other organizations providing higher level education are collectively called Higher Education Institutions. Their detail data, for instance number of students, number of graduates, etc., constitute the basis for several important analyses of the educational systems. This work provides data of the European Tertiary Education Register (ETER), which describes the Educational Institutions of Europe. These data have been gathered through the National Statistical Authorities of all the Countries participating in the ETER Project. However, they include many scattered missing values. Therefore, we have developed and applied an imputation methodology (see "Imputation Techniques for the Reconstruction of Missing Interconnected Data from Higher Educational Institutions", Bruni et al. [3]) to replace the missing values with feasible values that are as similar as possible to the original values that have been lost and are now unknown. Thus, we also provide the imputed version of the same dataset, which allows more in-depth analyses of the European Higher Education Institutions. Both datasets (before and after imputation) are provided in two versions: with or without bibliometric information for the Institutions, so the user can also consider this additional information if interested.
22
Xia Y, Zhang L, Ravikumar N, Attar R, Piechnik SK, Neubauer S, Petersen SE, Frangi AF. Recovering from missing data in population imaging - Cardiac MR image imputation via conditional generative adversarial nets. Med Image Anal 2021; 67:101812. [PMID: 33129140 DOI: 10.1016/j.media.2020.101812] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 03/24/2020] [Revised: 07/05/2020] [Accepted: 08/19/2020] [Indexed: 11/21/2022]
Abstract
Accurate ventricular volume measurements are the primary indicators of normal/abnormal cardiac function and are dependent on the Cardiac Magnetic Resonance (CMR) volumes being complete. However, missing or unusable slices owing to the presence of image artefacts such as respiratory or motion ghosting, aliasing, ringing and signal loss in CMR sequences, significantly hinder accuracy of anatomical and functional cardiac quantification, and recovering from those is insufficiently addressed in population imaging. In this work, we propose a new robust approach, coined Image Imputation Generative Adversarial Network (I2-GAN), to learn key features of cardiac short axis (SAX) slices near missing information, and use them as conditional variables to infer missing slices in the query volumes. In I2-GAN, the slices are first mapped to latent vectors with position features through a regression net. The latent vector corresponding to the desired position is then projected onto the slice manifold, conditioned on intensity features through a generator net. The generator comprises residual blocks with normalisation layers that are modulated with auxiliary slice information, enabling propagation of fine details through the network. In addition, a multi-scale discriminator was implemented, along with a discriminator-based feature matching loss, to further enhance performance and encourage the synthesis of visually realistic slices. Experimental results show that our method achieves significant improvements over the state-of-the-art, in missing slice imputation for CMR, with an average SSIM of 0.872. Linear regression analysis yields good agreement between reference and imputed CMR images for all cardiac measurements, with correlation coefficients of 0.991 for left ventricular volume, 0.977 for left ventricular mass and 0.961 for right ventricular volume.
23
Abstract
The data-dependent acquisition in mass spectrometry-based proteomics combined with quantitative analysis using isobaric labeling (iTRAQ and TMT) inevitably introduces missing values in proteomic experiments where a number of LC-runs are combined, especially in the growing field of shotgun clinical proteomics, where the protein profiles from the proteomics analysis of several hundred patient samples are compared and correlated to clinical traits such as a specific disease or disease treatment in order to link specific outcomes to one or more proteins. In the context of clinical research, it is evident that missing values in such datasets reduce the power of the downstream statistical analysis and therefore may hamper the linking of the expression of disease traits to the expression of specific proteins that may be useful for prognostic, diagnostic, or predictive purposes. In our study, we tested three data imputation approaches initially developed for microarray data for the imputation of missing values in datasets that are generated by several runs of shotgun proteomic experiments and where the data were relative protein abundances based on isobaric tags (iTRAQ and TMT). Our conclusion is that imputation methods based on k Nearest Neighbors successfully impute missing values in datasets with up to 50% missing values.
Affiliation(s)
- Rune Matthiesen
- Computational and Experimental Biology Group, CEDOC, Chronic Diseases Research Centre, NOVA Medical School, Faculdade de Ciências Médicas, Universidade NOVA de Lisboa, Lisbon, Portugal
- Hans Christian Beck
- Department of Clinical Biochemistry and Pharmacology, Odense University Hospital, Odense C, Denmark.
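The abstract above concludes in favour of k Nearest Neighbors imputation. As a rough sketch of the general technique, and not of the specific implementations the study tested, the code below fills each missing entry with the mean of that column over the k rows closest to the incomplete row. The `None` marker, the distance over shared observed columns, the unweighted mean and the toy abundance matrix are all assumptions made for illustration.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the column mean over the k nearest rows."""

    def distance(a, b):
        # Root-mean-square difference over columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours must have an observed value in column j.
            candidates = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in candidates[:k] if d < math.inf]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Hypothetical relative protein abundances (rows: proteins, columns: runs).
abundances = [
    [1.0, 2.0, None],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 2.8],
    [5.0, 6.0, 9.0],
]
print(knn_impute(abundances, k=2))
```

In the toy matrix the first protein borrows its missing value from the two proteins with similar profiles and ignores the dissimilar fourth one.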
24
Fusco T, Bi Y, Wang H, Browne F. Data mining and machine learning approaches for prediction modelling of schistosomiasis disease vectors: Epidemic disease prediction modelling. INT J MACH LEARN CYB 2019; 11:1159-1178. [PMID: 33727985 PMCID: PMC7224118 DOI: 10.1007/s13042-019-01029-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 08/27/2018] [Accepted: 10/29/2019] [Indexed: 11/30/2022]
Abstract
This research presents viable solutions for prediction modelling of schistosomiasis disease based on vector density. Novel training models proposed in this work aim to address various aspects of interest in the artificial intelligence applications domain. Topics discussed include data imputation, semi-supervised labelling and synthetic instance simulation when using sparse training data. Innovative semi-supervised ensemble learning paradigms are proposed focusing on labelling threshold selection and stringency of classification confidence levels. A regression-correlation combination (RCC) data imputation method is also introduced for handling partially complete training data. Results presented in this work show that the proposed RCC improves data imputation precision over benchmark value replacement on 70% of test cases. Proposed novel incremental transductive models such as ITSVM have provided interesting findings based on threshold constraints, outperforming standard SVM application on 21% of test cases, and can be applied to alternative environment-based epidemic disease domains. The proposed incremental transductive ensemble approach model enables the combination of complementary algorithms to provide labelling for unlabelled vector density instances. Liberal (LTA) and strict training approaches provided varied results, with LTA outperforming the Stacking ensemble on 29.1% of test cases. The proposed novel synthetic minority over-sampling technique (SMOTE) equilibrium approach has yielded subtle classification performance increases, which can be further interrogated to assess classification performance and efficiency relationships with synthetic instance generation.
Affiliation(s)
- Terence Fusco
- Faculty of Computing and Engineering, University of Ulster, Newtownabbey, UK
- Yaxin Bi
- Faculty of Computing and Engineering, University of Ulster, Newtownabbey, UK
- Haiying Wang
- Faculty of Computing and Engineering, University of Ulster, Newtownabbey, UK
- Fiona Browne
- Faculty of Computing and Engineering, University of Ulster, Newtownabbey, UK
|
25
|
Chung CJ, Hsieh YY, Lin HC. Fuzzy inference system for modeling the environmental risk map of air pollutants in Taiwan. J Environ Manage 2019; 246:808-820. [PMID: 31228694 DOI: 10.1016/j.jenvman.2019.06.038] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 01/29/2019] [Revised: 06/10/2019] [Accepted: 06/10/2019] [Indexed: 06/09/2023]
Abstract
This study aimed to reduce the uncertainty in spatial data for risk assessment by using a fuzzy inference system (FIS) to construct an environmental risk map of air pollution in Taiwan. In the model, the feature inputs of the FIS were the geographic coordinates and time, while the outputs were the pollutant concentrations. The outputs supplement the concentration contours on the map and are compared with Kriging interpolation. The FIS was designed using official open data on air pollutants, including Pb and PM2.5, collected from monitoring stations in mid-southern Taiwan. The model involved data filtration and imputation in a preliminary step to extract the historical data for analysis. We used the data for Pb (2001-2013) and PM2.5 (2006-2013) for training, and the data from 2014 to 2015 for validation. Our model yielded smaller errors between inferred and measured values of Pb and PM2.5 than the conventional method. The approach was applied to deduce the PM2.5 exposure distributed over the island of Taiwan, using governmental open data from seventy-three stations during 2006-2016, in order to produce our risk map. The designed fuzzy inference model assesses potential risks of spatiotemporal exposure at unmeasured locations, with feasibility and adaptability for environmental management.
Affiliation(s)
- Chi-Jung Chung
- Department of Health Risk Management, College of Public Health, China Medical University, Taichung, Taiwan; and Department of Medical Research, China Medical University Hospital, Taichung, Taiwan.
- Yun-Yu Hsieh
- Department of Health Risk Management, College of Public Health, China Medical University, Taichung, Taiwan.
- Hsueh-Chun Lin
- Department of Health Services Administration and Department of Health Risk Management, College of Public Health, China Medical University, 91 Hsueh-Shih Rd., Taichung, 40402, Taiwan.
|
26
|
Grimes T, Walker AR, Datta S, Datta S. Predicting survival times for neuroblastoma patients using RNA-seq expression profiles. Biol Direct 2018; 13:11. [PMID: 29848365 PMCID: PMC5977759 DOI: 10.1186/s13062-018-0213-x] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 10/12/2017] [Accepted: 05/01/2018] [Indexed: 11/10/2022]
Abstract
Background: Neuroblastoma is the most common tumor of early childhood and is notorious for its high variability in clinical presentation. Accurate prognosis has remained a challenge for many patients. In this study, expression profiles from RNA-sequencing are used to predict survival times directly. Several models are investigated using various annotation levels of expression profiles (genes, transcripts, and introns), and an ensemble predictor is proposed as a heuristic for combining these different profiles. Results: The use of RNA-seq data is shown to improve accuracy in comparison to using clinical data alone for predicting overall survival times. Furthermore, clinically high-risk patients can be subclassified based on their predicted overall survival times. In this effort, the best performing model was the elastic net using both transcripts and introns together. This model separated patients into two groups with 2-year overall survival rates of 0.40±0.11 (n=22) versus 0.80±0.05 (n=68). The ensemble approach gave similar results, with groups 0.42±0.10 (n=25) versus 0.82±0.05 (n=65). This suggests that the ensemble is able to effectively combine the individual RNA-seq datasets. Conclusions: Using predicted survival times based on RNA-seq data can provide improved prognosis by subclassifying clinically high-risk neuroblastoma patients. Reviewers: This article was reviewed by Subharup Guha and Isabel Nepomuceno.
Affiliation(s)
- Tyler Grimes
- Department of Biostatistics, University of Florida, 2004 Mowry Rd, Gainesville, 32611, USA.
- Alejandro R Walker
- Department of Biostatistics, University of Florida, 2004 Mowry Rd, Gainesville, 32611, USA.
- Susmita Datta
- Department of Biostatistics, University of Florida, 2004 Mowry Rd, Gainesville, 32611, USA.
- Somnath Datta
- Department of Biostatistics, University of Florida, 2004 Mowry Rd, Gainesville, 32611, USA.
|
27
|
Wang T, Nabavi S. SigEMD: A powerful method for differential gene expression analysis in single-cell RNA sequencing data. Methods 2018; 145:25-32. [PMID: 29702224 DOI: 10.1016/j.ymeth.2018.04.017] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 02/01/2018] [Revised: 04/13/2018] [Accepted: 04/19/2018] [Indexed: 10/17/2022]
Abstract
Differential gene expression analysis is one of the significant efforts in single cell RNA sequencing (scRNAseq) analysis to discover the specific changes in expression levels of individual cell types. Since scRNAseq exhibits multimodality, large amounts of zero counts, and sparsity, it is different from the traditional bulk RNA sequencing (RNAseq) data. The new challenges of scRNAseq data promote the development of new methods for identifying differentially expressed (DE) genes. In this study, we proposed a new method, SigEMD, that combines a data imputation approach, a logistic regression model and a nonparametric method based on the Earth Mover's Distance, to precisely and efficiently identify DE genes in scRNAseq data. The regression model and data imputation are used to reduce the impact of large amounts of zero counts, and the nonparametric method is used to improve the sensitivity of detecting DE genes from multimodal scRNAseq data. By additionally employing gene interaction network information to adjust the final states of DE genes, we further reduce the false positives of calling DE genes. We used simulated datasets and real datasets to evaluate the detection accuracy of the proposed method and to compare its performance with those of other differential expression analysis methods. Results indicate that the proposed method has an overall powerful performance in terms of precision in detection, sensitivity, and specificity.
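To make the nonparametric ingredient named in this abstract concrete, the following minimal sketch computes the Earth Mover's Distance between two one-dimensional samples, for example the expression values of one gene in two groups of cells. It covers only the distance itself; the imputation, logistic regression, and network-adjustment steps of SigEMD are not reproduced here. The function name `emd_1d` and the toy values are assumptions made for illustration.

```python
def emd_1d(a, b):
    """Earth Mover's Distance between two 1-D empirical distributions.

    Computed as the area between the two empirical CDFs.
    Hypothetical helper for illustration, not the SigEMD implementation.
    """
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    dist = 0.0
    i = j = 0  # number of values in a and b that are <= the current point
    for left, right in zip(points, points[1:]):
        while i < len(a) and a[i] <= left:
            i += 1
        while j < len(b) and b[j] <= left:
            j += 1
        # CDF gap on [left, right) times the width of that interval
        dist += abs(i / len(a) - j / len(b)) * (right - left)
    return dist


# Toy expression values for one gene in two cell groups (made-up numbers).
group_a = [0.0, 0.0, 1.0, 2.0]
group_b = [0.0, 3.0, 3.0, 4.0]
print(emd_1d(group_a, group_b))
```

A larger distance indicates that the two expression distributions differ more; judging whether a given distance is significant would require a further step (such as a permutation test), which this sketch does not include.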
Affiliation(s)
- Tianyu Wang
- Computer Science and Engineering Department, University of Connecticut, Storrs, CT, USA.
- Sheida Nabavi
- Computer Science and Engineering Department and Institute for Systems Genomics, University of Connecticut, Storrs, CT, USA.
|
28
|
Thung KH, Yap PT, Adeli E, Lee SW, Shen D. Conversion and time-to-conversion predictions of mild cognitive impairment using low-rank affinity pursuit denoising and matrix completion. Med Image Anal 2018; 45:68-82. [PMID: 29414437 PMCID: PMC6892173 DOI: 10.1016/j.media.2018.01.002] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Received: 08/04/2016] [Revised: 12/12/2017] [Accepted: 01/12/2018] [Indexed: 10/18/2022]
Abstract
In this paper, we aim to predict conversion and time-to-conversion of mild cognitive impairment (MCI) patients using multi-modal neuroimaging data and clinical data, via cross-sectional and longitudinal studies. However, such data are often heterogeneous, high-dimensional, noisy, and incomplete. We thus propose a framework that includes sparse feature selection, low-rank affinity pursuit denoising (LRAD), and low-rank matrix completion (LRMC). Specifically, we first use sparse linear regressions to remove unrelated features. Then, considering the heterogeneity of the MCI data, which can be assumed to be a union of multiple subspaces, we propose to use a low-rank subspace method (i.e., LRAD) to denoise the data. Finally, we employ the LRMC algorithm with three data fitting terms and one inequality constraint for joint conversion and time-to-conversion predictions. Our framework aims to answer an important yet rarely explored question in AD studies, i.e., when will MCI convert to AD? This is different from survival analysis, which provides the probabilities of conversion at different time points that are mainly used for global analysis, while our time-to-conversion prediction is for each individual subject. Evaluations using the ADNI dataset indicate that our method outperforms conventional LRMC and other state-of-the-art methods. Our method achieves a maximal pMCI classification accuracy of 84% and a time prediction correlation of 0.665.
Affiliation(s)
- Kim-Han Thung
- Department of Radiology and BRIC, University of North Carolina, Chapel Hill 27599, USA.
- Pew-Thian Yap
- Department of Radiology and BRIC, University of North Carolina, Chapel Hill 27599, USA.
- Ehsan Adeli
- Department of Radiology and BRIC, University of North Carolina, Chapel Hill 27599, USA.
- Seong-Whan Lee
- Department of Brain and Cognitive Engineering, Korea University, Seoul 02841, Republic of Korea.
- Dinggang Shen
- Department of Radiology and BRIC, University of North Carolina, Chapel Hill 27599, USA; Department of Brain and Cognitive Engineering, Korea University, Seoul 02841, Republic of Korea.
|
29
|
Singla NK, Meske DS, Desjardins PJ. Exploring the Interplay between Rescue Drugs, Data Imputation, and Study Outcomes: Conceptual Review and Qualitative Analysis of an Acute Pain Data Set. Pain Ther 2017; 6:165-175. [PMID: 28676997 PMCID: PMC5693805 DOI: 10.1007/s40122-017-0074-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 04/06/2017] [Indexed: 11/01/2022]
Abstract
In placebo-controlled acute surgical pain studies, provisions must be made for study subjects to receive adequate analgesic therapy. As such, most protocols allow study subjects to receive a pre-specified regimen of open-label analgesic drugs (rescue drugs) as needed. The selection of an appropriate rescue regimen is a critical experimental design choice. We hypothesized that a rescue regimen that is too liberal could lead to all study arms receiving similar levels of pain relief (thereby confounding experimental results), while a regimen that is too stringent could lead to a high subject dropout rate (giving rise to a preponderance of missing data). Despite the importance of the rescue regimen as a study design feature, there exist no published review articles or meta-analyses focusing on the impact of rescue therapy on experimental outcomes. Therefore, when selecting a rescue regimen, researchers must rely on clinical factors (what analgesics patients usually receive in similar surgical scenarios) and/or anecdotal evidence. In the following article, we attempt to bridge this gap by reviewing and discussing the experimental impacts of rescue therapy on a common acute surgical pain population: first metatarsal bunionectomy. The function of this analysis is to (1) create a framework for discussion and future exploration of rescue as a methodological study design feature, (2) discuss the interplay between data imputation techniques and rescue drugs, and (3) inform the readership regarding the impact of data imputation techniques on the validity of study conclusions. Our findings indicate that liberal rescue may degrade assay sensitivity, while stringent rescue may lead to unacceptably high dropout rates.
Affiliation(s)
- Neil K Singla
- Lotus Clinical Research, Huntington Hospital, Department of Anesthesiology, Pasadena, CA, USA.
|
30
|
Regnerus M. Is structural stigma's effect on the mortality of sexual minorities robust? A failure to replicate the results of a published study. Soc Sci Med 2016; 188:157-165. [PMID: 27889281 DOI: 10.1016/j.socscimed.2016.11.018] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 04/11/2016] [Revised: 11/01/2016] [Accepted: 11/11/2016] [Indexed: 10/20/2022]
Abstract
Background: The study of stigma's influence on health has surged in recent years. Hatzenbuehler et al.'s (2014) study of structural stigma's effect on mortality revealed an average of 12 years' shorter life expectancy for sexual minorities who resided in communities thought to exhibit high levels of anti-gay prejudice, using data from the 1988-2002 administrations of the US General Social Survey linked to mortality outcome data in the 2008 National Death Index. Methods: In the original study, the key predictor variable (structural stigma) led to results suggesting the profound negative influence of structural stigma on the mortality of sexual minorities. Attempts to replicate the study, in order to explore alternative hypotheses, repeatedly failed to generate the original study's key finding on structural stigma. Efforts to discern the source of the disparity in results revealed complications in the multiple imputation process for missing values of the components of structural stigma. This prompted efforts at replication using 10 different imputation approaches. Results: Efforts to replicate Hatzenbuehler et al.'s (2014) key finding on structural stigma's notable influence on the premature mortality of sexual minorities, including a more refined imputation strategy than described in the original study, failed. No data imputation approach yielded parameters that supported the original study's conclusions. Alternative hypotheses, which originally motivated the present study, revealed little new information. Conclusion: Ten different approaches to multiple imputation of missing data yielded none in which the effect of structural stigma on the mortality of sexual minorities was statistically significant. Minimally, the original study's structural stigma variable (and hence its key result) is so sensitive to subjective measurement decisions as to be rendered unreliable.
Affiliation(s)
- Mark Regnerus
- Department of Sociology, University of Texas at Austin, 305 E 23rd St, A1700, Austin, TX 78712-1086, USA; Austin Institute for the Study of Family and Culture, 2021 Guadalupe St., Suite 260, Austin, TX 78705, USA.
|
31
|
Liu J, Khattak AJ, Richards SH, Nambisan S. What are the differences in driver injury outcomes at highway-rail grade crossings? Untangling the role of pre-crash behaviors. Accid Anal Prev 2015; 85:157-169. [PMID: 26432991 DOI: 10.1016/j.aap.2015.09.004] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 03/06/2015] [Revised: 07/23/2015] [Accepted: 09/08/2015] [Indexed: 06/05/2023]
Abstract
Crashes at highway-rail grade crossings can result in severe injuries and fatalities to vehicle occupants. Using a crash database from the Federal Railroad Administration (N=15,639 for 2004-2013), this study explores differences in safety outcomes from crashes between passive controls (Crossbucks and STOP signs) and active controls (flashing lights, gates, audible warnings and highway signals). To address missing data, an imputation model is developed, creating a complete dataset for estimation. Path analysis is used to quantify the direct and indirect associations of passive and active controls with pre-crash behaviors and crash outcomes in terms of injury severity. The framework untangles direct and indirect associations of controls by estimating two models, one for pre-crash driving behaviors (e.g., driving around active controls) and another for injury severity. The results show that while the presence of gates is not directly associated with injury severity, the indirect effect through stopping behavior is statistically significant (95% confidence level) and substantial. Drivers are more likely to stop at gates that also have flashing lights and audible warnings, and stopping at gates is associated with lower injury severity. This indirect association lowers the chances of injury by 16%, compared with crashes at crossings without gates. Similar relationships between other controls and injury severity are explored. Generally, crashes occurring at active controls are less severe than crashes at passive controls. The results of this study can be used to modify Crash Modification Factors (CMFs) to account for crash injury severity. The study contributes to enhancing the understanding of safety by incorporating pre-crash behaviors in a broader framework that quantifies correlates of crash injury severity at active and passive crossings.
Affiliation(s)
- Jun Liu
- Department of Civil and Environmental Engineering, The University of Tennessee, 311 John Tickle Building, Knoxville, TN 37996, United States.
- Asad J Khattak
- Department of Civil and Environmental Engineering, The University of Tennessee, 322 John Tickle Building, Knoxville, TN 37996, United States.
- Stephen H Richards
- Center for Transportation Research, The University of Tennessee, 309 Conference Center Building, Knoxville, TN 37996, United States.
- Shashi Nambisan
- Department of Civil and Environmental Engineering, The University of Tennessee, 320 John Tickle Building, Knoxville, TN 37996, United States.
|
32
|
Thung KH, Wee CY, Yap PT, Shen D. Neurodegenerative disease diagnosis using incomplete multi-modality data via matrix shrinkage and completion. Neuroimage 2014; 91:386-400. [PMID: 24480301 PMCID: PMC4096013 DOI: 10.1016/j.neuroimage.2014.01.033] [Citation(s) in RCA: 50] [Impact Index Per Article: 5.0] [Received: 07/14/2013] [Revised: 01/13/2014] [Accepted: 01/18/2014] [Indexed: 12/17/2022]
Abstract
In this work, we are interested in predicting the diagnostic statuses of potentially neurodegenerated patients using feature values derived from multi-modality neuroimaging data and biological data, which might be incomplete. Collecting the feature values into a matrix, with each row containing a feature vector of a sample, we propose a framework to predict the associated multiple target outputs (e.g., diagnosis label and clinical scores) from this feature matrix by performing matrix shrinkage followed by matrix completion. Specifically, we first combine the feature and target output matrices into a large matrix and then partition this large incomplete matrix into smaller submatrices, each consisting of samples with complete feature values (corresponding to a certain combination of modalities) and target outputs. Treating each target output as the outcome of a prediction task, we apply a 2-step multi-task learning algorithm to select the most discriminative features and samples in each submatrix. Features and samples that are not selected in any of the submatrices are discarded, resulting in a shrunk version of the original large matrix. The missing feature values and unknown target outputs of the shrunk matrix are then completed simultaneously. Experimental results using the ADNI dataset indicate that our proposed framework achieves higher classification accuracy at a greater speed when compared with conventional imputation-based classification methods and also yields competitive performance when compared with state-of-the-art methods.
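The completion step described in this abstract can be illustrated with a generic low-rank fill-in. The sketch below uses iterative truncated-SVD imputation, a simple stand-in for low-rank matrix completion; it is not the authors' algorithm and omits the shrinkage (feature and sample selection) step. The function name `svd_impute`, the chosen rank, the iteration count, and the toy matrix are all assumptions made for illustration, and NumPy is assumed to be available.

```python
import numpy as np


def svd_impute(X, rank=1, n_iter=100):
    """Fill NaN entries of X with values from a rank-`rank` approximation.

    Generic iterative truncated-SVD imputation (illustrative only).
    Observed entries are never modified.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from column means of the observed entries.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Keep observed values; update only the missing positions.
        filled = np.where(missing, low_rank, X)
    return filled


# Toy matrix: rows are samples, the last column plays the role of a target
# output, and one target value is unknown (NaN).
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [3.0, 6.0, np.nan]])
completed = svd_impute(X, rank=1)
print(completed)
```

Stacking features and targets in one matrix, as the abstract describes, lets a single completion pass estimate missing features and unknown targets together; the rank would need to be tuned on real data.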
Affiliation(s)
- Kim-Han Thung
- Biomedical Research Imaging Center (BRIC) and Department of Radiology, University of North Carolina at Chapel Hill, USA.
- Chong-Yaw Wee
- Biomedical Research Imaging Center (BRIC) and Department of Radiology, University of North Carolina at Chapel Hill, USA.
- Pew-Thian Yap
- Biomedical Research Imaging Center (BRIC) and Department of Radiology, University of North Carolina at Chapel Hill, USA.
- Dinggang Shen
- Biomedical Research Imaging Center (BRIC) and Department of Radiology, University of North Carolina at Chapel Hill, USA; Department of Brain and Cognitive Engineering, Korea University, Seoul, Korea.
|
33
|
Zoffoli HJO, Varella CAA, do Amaral-Sobrinho NMB, Zonta E, Tolón-Becerra A. Method of median semi-variance for the analysis of left-censored data: comparison with other techniques using environmental data. Chemosphere 2013; 93:1701-1709. [PMID: 23830887 DOI: 10.1016/j.chemosphere.2013.05.041] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 12/24/2012] [Revised: 05/12/2013] [Accepted: 05/20/2013] [Indexed: 05/28/2023]
Abstract
In environmental monitoring, variables with analytically non-detected values are commonly encountered. For the statistical evaluation of these data, most of the methods that produce a less biased performance require specific computer programs. In this paper, a statistical method based on the median semi-variance (SemiV) is proposed to estimate the position and spread statistics in a dataset with single left-censoring. The performances of the SemiV method and 12 other statistical methods are evaluated using real and complete datasets. The performances of all the methods are influenced by the percentage of censored data. In general, the simple substitution and deletion methods showed biased performance, with exceptions for L/2, Inter and L/√2 methods that can be used with caution under specific conditions. In general, the SemiV method and other parametric methods showed similar performances and were less biased than other methods. The SemiV method is a simple and accurate procedure that can be used in the analysis of datasets with less than 50% of left-censored data.
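For readers unfamiliar with the simple substitution rules this abstract mentions, the sketch below replaces non-detected values with L/2 or L/√2, where L is the detection limit, and then computes the mean. The SemiV method itself is not implemented, since the abstract does not give its formula. The function name `substitute_nondetects`, the use of `None` to mark non-detects, and the toy concentrations are assumptions made for illustration.

```python
import math
import statistics


def substitute_nondetects(values, limit, rule="L/2"):
    """Replace non-detects (None) by a fixed fraction of the detection limit.

    Supported rules: "L/2" and "L/sqrt2". Illustrative helper only.
    """
    if rule == "L/2":
        fill = limit / 2
    elif rule == "L/sqrt2":
        fill = limit / math.sqrt(2)
    else:
        raise ValueError(f"unknown rule: {rule}")
    return [fill if v is None else v for v in values]


# Toy concentrations with a detection limit of 2.0; None marks a non-detect.
sample = [None, None, 4.0, 6.0]
mean_half = statistics.mean(substitute_nondetects(sample, 2.0, "L/2"))
mean_sqrt2 = statistics.mean(substitute_nondetects(sample, 2.0, "L/sqrt2"))
print(mean_half, mean_sqrt2)
```

The two rules give different means for the same data, which shows how the choice of substitute value can bias summary statistics as the censored fraction grows.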
|
34
|
Tanrikulu Y, Kondru R, Schneider G, So WV, Bitter HM. Missing Value Estimation for Compound-Target Activity Data. Mol Inform 2010; 29:678-84. [PMID: 27464011 DOI: 10.1002/minf.201000073] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 07/06/2010] [Accepted: 09/03/2010] [Indexed: 01/24/2023]
Abstract
Relationships between drug targets and associated diseases have traditionally been investigated by means of sequence similarity, comparative protein modeling, and pathway analysis. Recently, a complementary paradigm has emerged to link targets and drugs via biological responses within activity data and to visualize findings in networks. It has been indicated that one of the obstacles to the identification of novel interactions is the sparsity of available data. In this article, we provide a survey of estimation methods that address the challenge of data sparsity. Each method is described in terms of its advantages and limitations, and an exemplary application to compound-target activity data is demonstrated. With such imputation methods in hand, the opportunity to combine efforts in molecular informatics can be realized, yielding novel insights into ligand-target space.
Affiliation(s)
- Yusuf Tanrikulu
- Pharma Research & Early Development Informatics, Hoffmann-La Roche Inc., 340 Kingsland Street, Nutley, NJ 07110, USA; phone/fax: +1-973-235-6834/-8531.
- Rama Kondru
- Discovery Chemistry, Hoffmann-La Roche Inc., 340 Kingsland Street, Nutley, NJ 07110, USA.
- Gisbert Schneider
- ETH Zürich, Computer-Assisted Drug Design, Wolfgang-Pauli Str. 10, 8093 Zürich, Switzerland.
- W Venus So
- Pharma Research & Early Development Informatics, Hoffmann-La Roche Inc., 340 Kingsland Street, Nutley, NJ 07110, USA; phone/fax: +1-973-235-6834/-8531.
- Hans-Marcus Bitter
- Translational Research Sciences, Hoffmann-La Roche Inc., 340 Kingsland Street, Nutley, NJ 07110, USA.
|