51
Jang JH, Kim TY, Lim HS, Yoon D. Unsupervised feature learning for electrocardiogram data using the convolutional variational autoencoder. PLoS One 2021; 16:e0260612. [PMID: 34852002 PMCID: PMC8635334 DOI: 10.1371/journal.pone.0260612] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 05/01/2021] [Accepted: 11/13/2021] [Indexed: 11/18/2022]
Abstract
Most existing electrocardiogram (ECG) feature extraction methods rely on rule-based approaches. It is difficult to manually define all ECG features. We propose an unsupervised feature learning method using a convolutional variational autoencoder (CVAE) that can extract ECG features with unlabeled data. We used 596,000 ECG samples from 1,278 patients archived in biosignal databases from intensive care units to train the CVAE. Three external datasets were used for feature validation using two approaches. First, we explored the features without an additional training process. Clustering, latent space exploration, and anomaly detection were conducted. We confirmed that CVAE features reflected the various types of ECG rhythms. Second, we applied CVAE features to new tasks as input data and CVAE weights to weight initialization for different models for transfer learning for the classification of 12 types of arrhythmias. The f1-score for arrhythmia classification with extreme gradient boosting was 0.86 using CVAE features only. The f1-score of the model in which weights were initialized with the CVAE encoder was 5% better than that obtained with random initialization. Unsupervised feature learning with CVAE can extract the characteristics of various types of ECGs and can be an alternative to the feature extraction method for ECGs.
Affiliation(s)
- Jong-Hwan Jang
  - Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Yongin, Gyeonggi-do, Republic of Korea
- Hong-Seok Lim
  - Department of Cardiology, Ajou University School of Medicine, Suwon, Gyeonggi-do, Republic of Korea
- Dukyong Yoon
  - Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Yongin, Gyeonggi-do, Republic of Korea
  - Center for Digital Health, Yongin Severance Hospital, Yonsei University Health System, Yongin, Republic of Korea
52
Chi EC. Discovering Geometry in Data Arrays. Comput Sci Eng 2021; 23:42-51. [PMID: 35784398 PMCID: PMC9248489 DOI: 10.1109/mcse.2021.3120039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2024]
Abstract
Modern technologies produce a deluge of complicated data. In neuroscience, for example, minimally invasive experimental methods can take recordings of large populations of neurons at high resolution under a multitude of conditions. Such data arrays possess non-trivial interdependencies along each of their axes. Insights into these data arrays may lay the foundations of advanced treatments for nervous system disorders. The potential impacts of such data, however, will not be fully realized unless the techniques for analyzing them keep pace. Specifically, there is an urgent, growing need for methods for estimating the low-dimensional structure and geometry in big and noisy data arrays. This article reviews a framework for identifying complicated underlying patterns in such data and also recounts the key role that the Department of Energy Computational Sciences Graduate Fellowship played in setting the stage for this work to be done by the author.
Affiliation(s)
- Eric C Chi
  - Department of Statistics, Rice University
53
Sorino P, Campanella A, Bonfiglio C, Mirizzi A, Franco I, Bianco A, Caruso MG, Misciagna G, Aballay LR, Buongiorno C, Liuzzi R, Cisternino AM, Notarnicola M, Chiloiro M, Fallucchi F, Pascoschi G, Osella AR. Development and validation of a neural network for NAFLD diagnosis. Sci Rep 2021; 11:20240. [PMID: 34642390 PMCID: PMC8511336 DOI: 10.1038/s41598-021-99400-y] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 04/16/2021] [Accepted: 09/24/2021] [Indexed: 12/18/2022]
Abstract
Non-Alcoholic Fatty Liver Disease (NAFLD) affects about 20–30% of the adult population in developed countries and is an increasingly important cause of hepatocellular carcinoma. Liver ultrasound (US) is widely used as a noninvasive method to diagnose NAFLD. However, the intensive use of US is not cost-effective and increases the burden on the healthcare system. Electronic medical records facilitate large-scale epidemiological studies, but existing NAFLD scores often require clinical and anthropometric parameters that may not be captured in those databases. Our goal was to develop and validate a simple Neural Network (NN)-based web app that could be used to predict NAFLD, particularly its absence. The study included 2970 subjects; training and testing of the neural network using a train–test-split approach was done on 2869 of them. From another population consisting of 2301 subjects, a further 100 subjects were randomly extracted to test the web app. A search was made to find the best parameters for the NN, and this NN was then exported for incorporation into a local web app. The accuracy, area under the ROC curve, confusion matrix, positive (PPV) and negative (NPV) predictive values, precision, recall and f1-score were verified. After that, explainability (XAI) was analyzed to understand the diagnostic reasoning of the NN. Finally, in the local web app, the specificity and sensitivity values were checked. The NN achieved an accuracy of 77.0% during testing, with an area under the ROC curve of 0.82. In the web app, the NN achieved good results, with a specificity of 1.00 and a sensitivity of 0.73. The described approach can be used to support NAFLD diagnosis, reducing healthcare costs. The NN-based web app is easy to apply and the required parameters are easily found in healthcare databases.
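The metrics named in this abstract (sensitivity, specificity, PPV, NPV, accuracy, f1-score) all derive from the four cells of a confusion matrix. The following is a minimal sketch of those definitions, not the authors' code; the class name, function names, and example counts are hypothetical and are not the study's data.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts from a 2x2 confusion matrix (positive = disease present)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def _ratio(numerator, denominator):
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(c):
    """Compute the usual screening metrics from confusion-matrix counts."""
    sensitivity = _ratio(c.tp, c.tp + c.fn)  # recall, true-positive rate
    specificity = _ratio(c.tn, c.tn + c.fp)  # true-negative rate
    ppv = _ratio(c.tp, c.tp + c.fp)          # precision
    npv = _ratio(c.tn, c.tn + c.fn)
    accuracy = _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)
    f1 = _ratio(2 * ppv * sensitivity, ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
example = ConfusionCounts(tp=30, fp=10, tn=50, fn=10)
metrics = diagnostic_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A specificity of 1.00, as reported for the web app, corresponds to `fp == 0` in this notation: no subject without NAFLD was flagged as positive.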
Affiliation(s)
- Paolo Sorino
  - Laboratory of Epidemiology and Biostatistics, National Institute of Gastroenterology, "S de Bellis" Research Hospital, Via Turi 27, 70013, Castellana Grotte, BA, Italy
- Angelo Campanella
  - Laboratory of Epidemiology and Biostatistics, National Institute of Gastroenterology, "S de Bellis" Research Hospital, Via Turi 27, 70013, Castellana Grotte, BA, Italy
- Caterina Bonfiglio
  - Laboratory of Epidemiology and Biostatistics, National Institute of Gastroenterology, "S de Bellis" Research Hospital, Via Turi 27, 70013, Castellana Grotte, BA, Italy
- Antonella Mirizzi
  - Laboratory of Epidemiology and Biostatistics, National Institute of Gastroenterology, "S de Bellis" Research Hospital, Via Turi 27, 70013, Castellana Grotte, BA, Italy
- Isabella Franco
  - Laboratory of Epidemiology and Biostatistics, National Institute of Gastroenterology, "S de Bellis" Research Hospital, Via Turi 27, 70013, Castellana Grotte, BA, Italy
- Antonella Bianco
  - Laboratory of Epidemiology and Biostatistics, National Institute of Gastroenterology, "S de Bellis" Research Hospital, Via Turi 27, 70013, Castellana Grotte, BA, Italy
- Maria Gabriella Caruso
  - Laboratory of Nutritional Biochemistry, National Institute of Gastroenterology, "S de Bellis" Research Hospital, Via Turi 27, 70013, Castellana Grotte, BA, Italy
- Giovanni Misciagna
  - Scientific and Ethical Committee, Polyclinic Hospital, University of Bari, Piazza Giulio Cesare, 11, 70124, Bari, BA, Italy
- Laura R Aballay
  - Human Nutrition Research Center (CenINH), School of Nutrition, Faculty of Medical Sciences, Universidad Nacional de Córdoba, Córdoba, Argentina
- Claudia Buongiorno
  - Laboratory of Epidemiology and Biostatistics, National Institute of Gastroenterology, "S de Bellis" Research Hospital, Via Turi 27, 70013, Castellana Grotte, BA, Italy
- Rosalba Liuzzi
  - Laboratory of Epidemiology and Biostatistics, National Institute of Gastroenterology, "S de Bellis" Research Hospital, Via Turi 27, 70013, Castellana Grotte, BA, Italy
- Anna Maria Cisternino
  - Clinical Nutrition Outpatient Clinic, National Institute of Gastroenterology, "S de Bellis" Research Hospital, Via Turi 27, 70013, Castellana Grotte, BA, Italy
- Maria Notarnicola
  - Laboratory of Nutritional Biochemistry, National Institute of Gastroenterology, "S de Bellis" Research Hospital, Via Turi 27, 70013, Castellana Grotte, BA, Italy
- Marisa Chiloiro
  - San Giacomo Hospital, Largo S. Veneziani, 21, 70043, Monopoli, BA, Italy
- Francesca Fallucchi
  - Department of Engineering Sciences, Guglielmo Marconi University, Via Plinio 44, 00193, Rome, Italy
- Giovanni Pascoschi
  - Department of Electrical and Information Engineering, Polytechnic of Bari, Via Re David, 200, 70125, Bari, BA, Italy
- Alberto Rubén Osella
  - Laboratory of Epidemiology and Biostatistics, National Institute of Gastroenterology, "S de Bellis" Research Hospital, Via Turi 27, 70013, Castellana Grotte, BA, Italy
54
Class distribution-aware adaptive margins and cluster embedding for classification of fruit and vegetables at supermarket self-checkouts. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.07.040] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 11/21/2022]
55
Bryan de la Peña J, Kunder N, Lou TF, Chase R, Stanowick A, Barragan-Iglesias P, Pancrazio JJ, Campbell ZT. A Role for Translational Regulation by S6 Kinase and a Downstream Target in Inflammatory Pain. Br J Pharmacol 2021; 178:4675-4690. [PMID: 34355805 DOI: 10.1111/bph.15646] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/09/2021] [Revised: 07/23/2021] [Accepted: 07/26/2021] [Indexed: 11/30/2022]
Abstract
BACKGROUND AND PURPOSE: Translational controls pervade neurobiology. Nociceptors play an integral role in the detection and propagation of pain signals. Nociceptors can undergo persistent changes in their intrinsic excitability. Pharmacologic disruption of nascent protein synthesis diminishes acute and chronic forms of pain-associated behaviors. Yet, the targets of translational controls that facilitate plasticity in nociceptors are unclear. EXPERIMENTAL APPROACH: We used ribosome profiling to probe the translational landscape in DRG neurons after treatment with the inflammatory mediators NGF and IL-6. We validated the expression dynamics of c-Fos using immunoblotting and immunohistochemistry. Given that inflammation is known to stimulate mTOR signaling, we reasoned that downstream factors (e.g., ribosomal protein S6 kinase 1, S6K1) might control c-Fos levels. We utilized small-molecule inhibitors of S6K1 (DG2) or c-Fos (T-5224) to probe their effects on nociceptor activity in vitro using multi-electrode arrays (MEAs) and pain behavior in vivo using a hyperalgesic priming model. KEY RESULTS: We demonstrate that c-Fos is expressed in sensory neurons. Inflammatory mediators that promote pain in both humans and rodents promote c-Fos translation. We demonstrate that the mTOR effector S6K1 is essential for c-Fos biosynthesis. Inhibition of S6K1 or c-Fos with small molecules diminishes mechanical and thermal hypersensitivity in response to inflammatory cues. Additionally, both inhibitors reduce evoked nociceptor activity. CONCLUSION: Our data reveal a novel role of S6K1 in modulating rapid response to inflammatory mediators, with c-Fos being one key downstream target. Targeting the S6 kinase pathway or c-Fos is an exciting new avenue for pain-modulating compounds.
Affiliation(s)
- June Bryan de la Peña
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, USA
- Nikesh Kunder
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, USA
- Tzu-Fang Lou
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, USA
- Rebecca Chase
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, USA
- Alexander Stanowick
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, USA
- Paulino Barragan-Iglesias
  - School of Behavioral and Brain Sciences, University of Texas at Dallas, Richardson, TX, USA
  - Department of Physiology and Pharmacology, Center for Basic Sciences, Autonomous University of Aguascalientes, Aguascalientes, Mexico
- Joseph J Pancrazio
  - Department of Bioengineering, University of Texas at Dallas, Richardson, TX, USA
  - Center for Advanced Pain Studies, University of Texas at Dallas, Richardson, TX, USA
- Zachary T Campbell
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, USA
  - Department of Bioengineering, University of Texas at Dallas, Richardson, TX, USA
  - Center for Advanced Pain Studies, University of Texas at Dallas, Richardson, TX, USA
56
Visualization of vibrational spectroscopy for agro-food samples using t-Distributed Stochastic Neighbor Embedding. Food Control 2021. [DOI: 10.1016/j.foodcont.2020.107812] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 01/31/2023]
57
Zhao Y, Fang ZY, Lin CX, Deng C, Xu YP, Li HD. RFCell: A Gene Selection Approach for scRNA-seq Clustering Based on Permutation and Random Forest. Front Genet 2021; 12:665843. [PMID: 34386033 PMCID: PMC8354212 DOI: 10.3389/fgene.2021.665843] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/09/2021] [Accepted: 04/01/2021] [Indexed: 11/13/2022]
Abstract
In recent years, the application of single cell RNA-seq (scRNA-seq) has become more and more popular in fields such as biology and medical research. Analyzing scRNA-seq data can discover complex cell populations and infer single-cell trajectories in cell development. Clustering is one of the most important methods to analyze scRNA-seq data. In this paper, we focus on improving scRNA-seq clustering through gene selection, which also reduces the dimensionality of scRNA-seq data. Studies have shown that gene selection for scRNA-seq data can improve clustering accuracy. Therefore, it is important to select genes with cell type specificity. Gene selection not only helps to reduce the dimensionality of scRNA-seq data, but also can improve cell type identification in combination with clustering methods. Here, we propose RFCell, a supervised gene selection method, which is based on permutation and random forest classification. We first use RFCell and three existing gene selection methods to select gene sets on 10 scRNA-seq data sets. Then, three classical clustering algorithms are used to cluster the cells based on the genes selected by each method. We found that the gene selection performance of RFCell was better than that of the other gene selection methods.
Affiliation(s)
- Yuan Zhao
  - Hunan Provincial Key Laboratory on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
- Zhao-Yu Fang
  - School of Mathematics and Statistics, Central South University, Changsha, China
- Cui-Xiang Lin
  - Hunan Provincial Key Laboratory on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
- Chao Deng
  - Hunan Provincial Key Laboratory on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
- Yun-Pei Xu
  - Hunan Provincial Key Laboratory on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
- Hong-Dong Li
  - Hunan Provincial Key Laboratory on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
58
Single-Trial Kernel-Based Functional Connectivity for Enhanced Feature Extraction in Motor-Related Tasks. SENSORS 2021; 21:s21082750. [PMID: 33924672 PMCID: PMC8069819 DOI: 10.3390/s21082750] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 03/15/2021] [Revised: 04/01/2021] [Accepted: 04/08/2021] [Indexed: 02/06/2023]
Abstract
Motor learning is associated with functional brain plasticity, involving specific functional connectivity changes in the neural networks. However, the degree of learning new motor skills varies among individuals, which is mainly due to the between-subject variability in brain structure and function captured by electroencephalographic (EEG) recordings. Here, we propose a kernel-based functional connectivity measure to deal with inter/intra-subject variability in motor-related tasks. To this end, from spatio-temporal-frequency patterns, we extract the functional connectivity between EEG channels through their Gaussian kernel cross-spectral distribution. Further, we optimize the spectral combination weights within a sparse-based ℓ2-norm feature selection framework matching the motor-related labels that perform the dimensionality reduction of the extracted connectivity features. From the validation results in three databases with motor imagery and motor execution tasks, we conclude that the single-trial Gaussian functional connectivity measure provides very competitive classifier performance values, being less affected by feature extraction parameters, like the sliding time window, and avoiding the use of prior linear spatial filtering. We also provide interpretability for the clustered functional connectivity patterns and hypothesize that the proposed kernel-based metric is promising for evaluating motor skills.
59
Wang P, Zhang G, Li Y, Oad A, Huang G. Stochastic Neighbor Embedding Algorithm and its Application in Molecular Biological Data. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200414093636] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/22/2022]
Abstract
With the advent of the era of big data, the volume and dimensionality of data are steadily increasing. It is critical to reduce dimensions or visualize data and then uncover the hidden patterns or mechanisms underlying the data. Stochastic Neighbor Embedding (SNE) has been developed for data visualization over the last ten years. Due to its efficiency in the visualization of data, SNE has been applied to a wide range of fields. We briefly review the SNE algorithm and its variants, summarizing its application in visualizing single-cell sequencing data, single nucleotide polymorphisms, and mass spectrometry imaging data. We also discuss the strengths and weaknesses of SNE, with a special emphasis on how to set parameters to improve the quality of visualization, and finally indicate potential future developments of SNE.
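For orientation, the quantities that SNE-type methods optimize can be written down compactly. The abstract itself gives no formulas; the block below states the standard definitions for the t-SNE variant as they commonly appear in the literature, and the symbols ($x_i$ for input points, $y_i$ for embedded points, $\sigma_i$ for the per-point bandwidth, $n$ for the number of points) are notation assumed here rather than taken from the entry.

```latex
% Gaussian conditional similarity in the input space, then symmetrized
p_{j\mid i} = \frac{\exp\!\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}
                   {\sum_{k\neq i}\exp\!\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2n}

% Student-t similarity in the low-dimensional embedding (t-SNE)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}
              {\sum_{k\neq l}\left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}}

% Cost minimized over the embedding coordinates y_i
C = \mathrm{KL}(P\,\Vert\,Q) = \sum_{i\neq j} p_{ij}\,\log\frac{p_{ij}}{q_{ij}}

% Perplexity, the user-set parameter that determines each \sigma_i
\mathrm{Perp}(P_i) = 2^{H(P_i)},
\qquad
H(P_i) = -\sum_{j} p_{j\mid i}\,\log_2 p_{j\mid i}
```

The perplexity is the main parameter a user sets: each $\sigma_i$ is tuned so that the conditional distribution $P_i$ reaches the requested perplexity, which acts as a smooth measure of the effective number of neighbors.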
Affiliation(s)
- Pan Wang
  - Provincial Key Laboratory of Informational Service for Rural Area of Southwestern Hunan, Shaoyang University, Shaoyang 422000, China
- Guiyang Zhang
  - Provincial Key Laboratory of Informational Service for Rural Area of Southwestern Hunan, Shaoyang University, Shaoyang 422000, China
- You Li
  - Provincial Key Laboratory of Informational Service for Rural Area of Southwestern Hunan, Shaoyang University, Shaoyang 422000, China
- Ammar Oad
  - Provincial Key Laboratory of Informational Service for Rural Area of Southwestern Hunan, Shaoyang University, Shaoyang 422000, China
- Guohua Huang
  - Provincial Key Laboratory of Informational Service for Rural Area of Southwestern Hunan, Shaoyang University, Shaoyang 422000, China
60
Automatic Image-Based Event Detection for Large-N Seismic Arrays Using a Convolutional Neural Network. REMOTE SENSING 2021. [DOI: 10.3390/rs13030389] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/16/2022]
Abstract
Passive seismic experiments have been proposed as a cost-effective and non-invasive alternative to controlled-source seismology, allowing body–wave reflections based on seismic interferometry principles to be retrieved. However, from the huge volume of the recorded ambient noise, only selected time periods (noise panels) are contributing constructively to the retrieval of reflections. We address the issue of automatic scanning of ambient noise data recorded by a large-N array in search of body–wave energy (body–wave events) utilizing a convolutional neural network (CNN). It consists of computing first both amplitude and frequency attribute values at each receiver station for all divided portions of the recorded signal (noise panels). The created 2-D attribute maps are then converted to images and used to extract spatial and temporal patterns associated with the body–wave energy present in the data to build binary CNN-based classifiers. The ensemble of two multi-headed CNN models trained separately on the frequency and amplitude attribute maps demonstrates better generalization ability than each of its participating networks. We also compare the prediction performance of our deep learning (DL) framework with a conventional machine learning (ML) algorithm called XGBoost. The DL-based solution applied to 240 h of ambient seismic noise data recorded by the Kylylahti array in Finland demonstrates high detection accuracy and the superiority over the ML-based one. The ensemble of CNN-based models managed to find almost three times more verified body–wave events in the full unlabelled dataset than it was provided at the training stage. Moreover, the high-level abstraction features extracted at the deeper convolution layers can be used to perform unsupervised clustering of the classified panels with respect to their visual characteristics.
61
Natural Language Processing Based Method for Clustering and Analysis of Aviation Safety Narratives. AEROSPACE 2020. [DOI: 10.3390/aerospace7100143] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 11/16/2022]
Abstract
The complexity of commercial aviation operations has grown substantially in recent years, together with a diversification of techniques for collecting and analyzing flight data. As a result, data-driven frameworks for enhancing flight safety have grown in popularity. Data-driven techniques offer efficient and repeatable exploration of patterns and anomalies in large datasets. Text-based flight safety data presents a unique challenge in its subjectivity, and relies on natural language processing tools to extract underlying trends from narratives. In this paper, a methodology is presented for the analysis of aviation safety narratives based on text-based accounts of in-flight events and categorical metadata parameters which accompany them. An extensive pre-processing routine is presented, including a comparison between numeric models of textual representation for the purposes of document classification. A framework for categorizing and visualizing narratives is presented through a combination of k-means clustering and 2-D mapping with t-Distributed Stochastic Neighbor Embedding (t-SNE). A cluster post-processing routine is developed for identifying driving factors in each cluster and building a hierarchical structure of cluster and sub-cluster labels. The Aviation Safety Reporting System (ASRS), which includes over a million de-identified voluntarily submitted reports describing aviation safety incidents for commercial flights, is analyzed as a case study for the methodology. The method results in the identification of 10 major clusters and a total of 31 sub-clusters. The identified groupings are post-processed through metadata-based statistical analysis of the learned clusters. The developed method shows promise in uncovering trends from clusters that are not evident in existing anomaly labels in the data and offers a new tool for obtaining insights from text-based safety data that complement existing approaches.
62
Chen L, Guo Q, Liu Z, Zhang S, Zhang H. Enhanced synchronization-inspired clustering for high-dimensional data. COMPLEX INTELL SYST 2020. [DOI: 10.1007/s40747-020-00191-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/25/2022]
Abstract
The synchronization-inspired clustering algorithm (Sync) is a novel and outstanding clustering algorithm, which can accurately cluster datasets with any shape, density and distribution. However, the high-dimensional dataset with high dimensionality, high noise, and high redundancy brings some new challenges for the synchronization-inspired clustering algorithm, resulting in a significant increase in clustering time and a decrease in clustering accuracy. To address these challenges, an enhanced synchronization-inspired clustering algorithm, namely SyncHigh, is developed in this paper to quickly and accurately cluster the high-dimensional datasets. First, a PCA-based (Principal Component Analysis) dimension purification strategy is designed to find the principal components in all attributes. Second, a density-based data merge strategy is constructed to reduce the number of objects participating in the synchronization-inspired clustering algorithm, thereby speeding up clustering time. Third, the Kuramoto Model is enhanced to deal with mass differences between objects caused by the density-based data merge strategy. Finally, extensive experimental results on synthetic and real-world datasets show the effectiveness and efficiency of our SyncHigh algorithm.
63
Chatzimparmpas A, Martins RM, Kerren A. t-viSNE: Interactive Assessment and Interpretation of t-SNE Projections. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 2020; 26:2696-2714. [PMID: 32305922 DOI: 10.1109/tvcg.2020.2986996] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Indexed: 06/11/2023]
Abstract
t-Distributed Stochastic Neighbor Embedding (t-SNE) for the visualization of multidimensional data has proven to be a popular approach, with successful applications in a wide range of domains. Despite their usefulness, t-SNE projections can be hard to interpret or even misleading, which hurts the trustworthiness of the results. Understanding the details of t-SNE itself and the reasons behind specific patterns in its output may be a daunting task, especially for non-experts in dimensionality reduction. In this article, we present t-viSNE, an interactive tool for the visual exploration of t-SNE projections that enables analysts to inspect different aspects of their accuracy and meaning, such as the effects of hyper-parameters, distance and neighborhood preservation, densities and costs of specific neighborhoods, and the correlations between dimensions and visual patterns. We propose a coherent, accessible, and well-integrated collection of different views for the visualization of t-SNE projections. The applicability and usability of t-viSNE are demonstrated through hypothetical usage scenarios with real data sets. Finally, we present the results of a user study where the tool's effectiveness was evaluated. By bringing to light information that would normally be lost after running t-SNE, we hope to support analysts in using t-SNE and making its results better understandable.
64
Peña-Solórzano CA, Albrecht DW, Bassed RB, Gillam J, Harris PC, Dimmock MR. Semi-supervised labelling of the femur in a whole-body post-mortem CT database using deep learning. Comput Biol Med 2020; 122:103797. [PMID: 32658723 DOI: 10.1016/j.compbiomed.2020.103797] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 02/29/2020] [Revised: 04/29/2020] [Accepted: 04/29/2020] [Indexed: 01/16/2023]
Abstract
A deep learning pipeline was developed and used to localize and classify a variety of implants in the femur contained in whole-body post-mortem computed tomography (PMCT) scans. The results provide a proof-of-principle approach for labelling content not described in medical/autopsy reports. The pipeline, which incorporated residual networks and an autoencoder, was trained and tested using n = 450 full-body PMCT scans. For the localization component, Dice scores of 0.99, 0.96, and 0.98 and mean absolute errors of 3.2, 7.1, and 4.2 mm were obtained in the axial, coronal, and sagittal views, respectively. A regression analysis found the orientation of the implant to the scanner axis and also the relative positioning of extremities to be statistically significant factors. For the classification component, test cases were properly labelled as nail (N+), hip replacement (H+), knee replacement (K+) or without-implant (I-) with an accuracy >97%. The recall for I- and H+ cases was 1.00, but fell to 0.82 and 0.65 for cases with K+ and N+. This semi-automatic approach provides a generalized structure for image-based labelling of features, without requiring time-consuming segmentation.
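The localization results above are reported as Dice scores and mean absolute errors. As a reminder of what those two numbers measure, here is a minimal sketch on toy data; the function names and the example masks and positions are hypothetical, and the study's own evaluation code is not reproduced here.

```python
def dice_score(pred, truth):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        return 1.0  # convention: two empty masks agree perfectly
    return 2 * intersection / total


def mean_absolute_error(pred, truth):
    """Average absolute difference between predicted and true positions."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)


# Toy 1-D masks: each has three foreground voxels, two of which overlap.
pred_mask = [0, 1, 1, 1, 0, 0]
true_mask = [0, 0, 1, 1, 1, 0]
print(f"Dice: {dice_score(pred_mask, true_mask):.3f}")

# Toy landmark positions in millimetres.
pred_mm = [10.0, 12.5, 7.0]
true_mm = [11.0, 12.0, 9.0]
print(f"MAE: {mean_absolute_error(pred_mm, true_mm):.3f} mm")
```

A Dice score near 1 means the predicted and reference regions overlap almost entirely, while the mean absolute error expresses the remaining positional offset in physical units.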
Affiliation(s)
- C A Peña-Solórzano
  - Department of Medical Imaging and Radiation Sciences, Monash University, Wellington Rd, Clayton, Melbourne, VIC, 3800, Australia
- D W Albrecht
  - Clayton School of Information Technology, Monash University, Wellington Rd, Clayton, Melbourne, VIC, 3800, Australia
- R B Bassed
  - Victorian Institute of Forensic Medicine, 57-83 Kavanagh St., Southbank, Melbourne, VIC, 3006, Australia
  - Department of Forensic Medicine, Monash University, Wellington Rd, Clayton, Melbourne, VIC, 3800, Australia
- J Gillam
  - Land Division, Defence Science and Technology Group, Fishermans Bend, Melbourne, VIC, 3207, Australia
- P C Harris
  - The Royal Children's Hospital Melbourne, 50 Flemington Road, Parkville, Melbourne, VIC, 3052, Australia
  - Department of Orthopaedic Surgery, Western Health, Footscray Hospital, Gordon St, Footscray, Melbourne, VIC, 3011, Australia
- M R Dimmock
  - Department of Medical Imaging and Radiation Sciences, Monash University, Wellington Rd, Clayton, Melbourne, VIC, 3800, Australia
65
Aliverti E, Tilson JL, Filer DL, Babcock B, Colaneri A, Ocasio J, Gershon TR, Wilhelmsen KC, Dunson DB. Projected t-SNE for batch correction. Bioinformatics 2020; 36:3522-3527. [PMID: 32176244 PMCID: PMC7267829 DOI: 10.1093/bioinformatics/btaa189] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 11/16/2019] [Revised: 03/02/2020] [Accepted: 03/12/2020] [Indexed: 12/31/2022]
Abstract
MOTIVATION: Low-dimensional representations of high-dimensional data are routinely employed in biomedical research to visualize, interpret and communicate results from different pipelines. In this article, we propose a novel procedure to directly estimate t-SNE embeddings that are not driven by batch effects. Without correction, interesting structure in the data can be obscured by batch effects. The proposed algorithm can therefore significantly aid visualization of high-dimensional data. RESULTS: The proposed methods are based on linear algebra and constrained optimization, leading to efficient algorithms and fast computation in many high-dimensional settings. Results on artificial single-cell transcription profiling data show that the proposed procedure successfully removes multiple batch effects from t-SNE embeddings, while retaining fundamental information on cell types. When applied to single-cell gene expression data to investigate mouse medulloblastoma, the proposed method successfully removes batches related with mice identifiers and the date of the experiment, while preserving clusters of oligodendrocytes, astrocytes, and endothelial cells and microglia, which are expected to lie in the stroma within or adjacent to the tumours. AVAILABILITY AND IMPLEMENTATION: Source code implementing the proposed approach is available as an R package at https://github.com/emanuelealiverti/BC_tSNE, including a tutorial to reproduce the simulation studies. CONTACT: aliverti@stat.unipd.it.
Affiliation(s)
- Emanuele Aliverti
- Department of Statistical Sciences, University of Padova, Padova 35121, Italy
- Dayne L Filer
- RENCI, University of North Carolina, Chapel Hill, NC 27517, USA
- Department of Genetics
- Timothy R Gershon
- Department of Neurology
- UNC Neuroscience Center
- Carolina Institute for Developmental Disabilities
- Lineberger Comprehensive Cancer Center, University of North Carolina School of Medicine, Chapel Hill, NC 27599, USA
- Kirk C Wilhelmsen
- RENCI, University of North Carolina, Chapel Hill, NC 27517, USA
- Department of Genetics
- Department of Neurology
- David B Dunson
- Department of Statistical Science, Duke University, Durham, NC 27708, USA
|
66
|
Linderman GC, Mishne G, Jaffe A, Kluger Y, Steinerberger S. Randomized near-neighbor graphs, giant components and applications in data science. J Appl Probab 2020; 57:458-476. [PMID: 32913373 PMCID: PMC7480951 DOI: 10.1017/jpr.2020.21] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/06/2022]
Abstract
If we pick $n$ random points uniformly in $[0,1]^d$ and connect each point to its $c_d \log n$ nearest neighbors, where $d \geq 2$ is the dimension and $c_d$ is a constant depending on the dimension, then it is well known that the graph is connected with high probability. We prove that it suffices to connect every point to $c_{d,1} \log\log n$ points chosen randomly among its $c_{d,2} \log n$ nearest neighbors to ensure a giant component of size $n - o(n)$ with high probability. This construction yields a much sparser random graph with $\sim n \log\log n$ instead of $\sim n \log n$ edges that has comparable connectivity properties. This result has nontrivial implications for problems in data science where an affinity matrix is constructed: instead of connecting each point to its $k$ nearest neighbors, one can often pick $k' \ll k$ random points out of the $k$ nearest neighbors and only connect to those without sacrificing quality of results. This approach can simplify and accelerate computation; we illustrate this with experimental results in spectral clustering of large-scale datasets.
Affiliation(s)
- George C Linderman
- Postal address: Applied Mathematics, Yale University, New Haven, CT 06511
- Gal Mishne
- Postal address: Applied Mathematics, Yale University, New Haven, CT 06511
- Ariel Jaffe
- Postal address: Applied Mathematics, Yale University, New Haven, CT 06511
- Yuval Kluger
- Dept. of Pathology & Applied Mathematics, Yale University, New Haven, CT 06511
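The construction in the abstract of this entry (connect each point only to a few random picks among its nearest neighbours) can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' implementation: the function names `sparse_knn_graph` and `largest_component_size`, the brute-force neighbour search, the sample size, and the constant 2 standing in for $c_{d,1}$ and $c_{d,2}$ are all assumptions made for the example.

```python
import math
import random


def sparse_knn_graph(points, k, k_sub, rng):
    """Connect each point to k_sub random picks among its k nearest neighbours."""
    n = len(points)
    edges = set()
    for i in range(n):
        # Brute-force ranking of all other points by distance to point i.
        by_distance = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        # Keep only k_sub of the k nearest neighbours, chosen at random.
        for j in rng.sample(by_distance[:k], k_sub):
            edges.add((min(i, j), max(i, j)))  # store each undirected edge once
    return edges


def largest_component_size(n, edges):
    """Size of the largest connected component, computed with union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    sizes = {}
    for v in range(n):
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values())


rng = random.Random(0)
n, d = 400, 2  # illustrative sample size and dimension
points = [tuple(rng.random() for _ in range(d)) for _ in range(n)]
k = math.ceil(2 * math.log(n))                # stands in for c_{d,2} log n
k_sub = math.ceil(2 * math.log(math.log(n)))  # stands in for c_{d,1} log log n
edges = sparse_knn_graph(points, k, k_sub, rng)
giant = largest_component_size(n, edges)
print(f"k={k}, k'={k_sub}, edges={len(edges)}, largest component={giant}/{n}")
```

The printed component size gives a quick empirical check of how much connectivity survives the subsampling for a given choice of constants; the theorem itself concerns the asymptotic regime, so a small run like this is only suggestive.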
|
67
|
Škvorc U, Eftimov T, Korošec P. Understanding the problem space in single-objective numerical optimization using exploratory landscape analysis. Appl Soft Comput 2020. [DOI: 10.1016/j.asoc.2020.106138] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Indexed: 11/27/2022]
|
68
|
Zhang Y, Kim MS, Reichenberger ER, Stear B, Taylor DM. Scedar: A scalable Python package for single-cell RNA-seq exploratory data analysis. PLoS Comput Biol 2020; 16:e1007794. [PMID: 32339163 PMCID: PMC7217489 DOI: 10.1371/journal.pcbi.1007794] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 04/08/2019] [Revised: 05/12/2020] [Accepted: 03/17/2020] [Indexed: 11/25/2022] Open
Abstract
In single-cell RNA-seq (scRNA-seq) experiments, the number of individual cells has increased exponentially, and the sequencing depth of each cell has decreased significantly. As a result, analyzing scRNA-seq data requires extensive considerations of program efficiency and method selection. In order to reduce the complexity of scRNA-seq data analysis, we present scedar, a scalable Python package for scRNA-seq exploratory data analysis. The package provides a convenient and reliable interface for performing visualization, imputation of gene dropouts, detection of rare transcriptomic profiles, and clustering on large-scale scRNA-seq datasets. The analytical methods are efficient, and they also do not assume that the data follow certain statistical distributions. The package is extensible and modular, which would facilitate the further development of functionalities for future requirements with the open-source development community. The scedar package is distributed under the terms of the MIT license at https://pypi.org/project/scedar.
Affiliation(s)
- Yuanchao Zhang
- Department of Biomedical and Health Informatics, The Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Department of Genetics, Rutgers University, Piscataway, New Jersey, United States of America
- Man S. Kim
- Department of Biomedical and Health Informatics, The Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Erin R. Reichenberger
- Department of Biomedical and Health Informatics, The Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Ben Stear
- Department of Biomedical and Health Informatics, The Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Deanne M. Taylor
- Department of Biomedical and Health Informatics, The Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, United States of America
- Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
|
69
|
Linderman GC, Steinerberger S. Numerical integration on graphs: where to sample and how to weigh. Math Comput 2020; 89:1933-1952. [PMID: 33927452 PMCID: PMC8081285 DOI: 10.1090/mcom/3515] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
Let $G = (V, E, w)$ be a finite, connected graph with weighted edges. We are interested in the problem of finding a subset $W \subset V$ of vertices and weights $a_w$ such that $\frac{1}{|V|} \sum_{v \in V} f(v) \sim \sum_{w \in W} a_w f(w)$ for functions $f : V \to \mathbb{R}$ that are 'smooth' with respect to the geometry of the graph; here $\sim$ indicates that we want the right-hand side to be as close to the left-hand side as possible. The main applications are problems where $f$ is known to vary smoothly over the underlying graph but is expensive to evaluate on even a single vertex. We prove an inequality showing that the integration problem can be rewritten as a geometric problem ('the optimal packing of heat balls'). We discuss how one would construct approximate solutions of the heat ball packing problem; numerical examples demonstrate the efficiency of the method.
Affiliation(s)
- George C Linderman
- Program in Applied Mathematics, Yale University, New Haven, CT 06511, USA
|
70
|
Chi EC, Gaines BR, Sun WW, Zhou H, Yang J. Provable Convex Co-clustering of Tensors. J Mach Learn Res 2020; 21:214. [PMID: 33312074 PMCID: PMC7731944] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
Cluster analysis is a fundamental tool for pattern discovery of complex heterogeneous data. Prevalent clustering methods mainly focus on vector or matrix-variate data and are not applicable to general-order tensors, which arise frequently in modern scientific and business applications. Moreover, there is a gap between statistical guarantees and computational efficiency for existing tensor clustering solutions due to the nature of their non-convex formulations. In this work, we bridge this gap by developing a provable convex formulation of tensor co-clustering. Our convex co-clustering (CoCo) estimator enjoys stability guarantees and its computational and storage costs are polynomial in the size of the data. We further establish a non-asymptotic error bound for the CoCo estimator, which reveals a surprising "blessing of dimensionality" phenomenon that does not exist in vector or matrix-variate cluster analysis. Our theoretical findings are supported by extensive simulated studies. Finally, we apply the CoCo estimator to the cluster analysis of advertisement click tensor data from a major online company. Our clustering results provide meaningful business insights to improve advertising effectiveness.
Affiliation(s)
- Eric C Chi
- Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
- Brian R Gaines
- Advanced Analytics R&D, SAS Institute Inc., Cary, NC 27513, USA
- Will Wei Sun
- Krannert School of Management, Purdue University, West Lafayette, IN 47907, USA
- Hua Zhou
- Department of Biostatistics, University of California, Los Angeles, CA 90095, USA
- Jian Yang
- Advertising Sciences, Yahoo Research, Sunnyvale, CA 94089, USA
|
71
|
Abstract
Single-cell transcriptomics yields ever-growing data sets containing RNA expression levels for thousands of genes from up to millions of cells. Common data analysis pipelines include a dimensionality reduction step for visualising the data in two dimensions, most frequently performed using t-distributed stochastic neighbour embedding (t-SNE). It excels at revealing local structure in high-dimensional data, but naive applications often suffer from severe shortcomings, e.g. the global structure of the data is not represented accurately. Here we describe how to circumvent such pitfalls, and develop a protocol for creating more faithful t-SNE visualisations. It includes PCA initialisation, a high learning rate, and multi-scale similarity kernels; for very large data sets, we additionally use exaggeration and downsampling-based initialisation. We use published single-cell RNA-seq data sets to demonstrate that this protocol yields superior results compared to the naive application of t-SNE.
Affiliation(s)
- Dmitry Kobak
- Institute for Ophthalmic Research, University of Tübingen, Tübingen, Germany.
- Philipp Berens
- Institute for Ophthalmic Research, University of Tübingen, Tübingen, Germany.
- Bernstein Center for Computational Neuroscience, University of Tübingen, Tübingen, Germany.
- Center for Integrative Neuroscience, University of Tübingen, Tübingen, Germany.
- Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, Germany.
|