1
Poulakis K, Westman E. Clustering and disease subtyping in Neuroscience, toward better methodological adaptations. Front Comput Neurosci 2023; 17:1243092. [PMID: 37927546 PMCID: PMC10620518 DOI: 10.3389/fncom.2023.1243092] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/20/2023] [Accepted: 10/04/2023] [Indexed: 11/07/2023] [Open Access]
Affiliation(s)
- Konstantinos Poulakis
  - Division of Clinical Geriatrics, Department of Neurobiology, Care Sciences and Society, Karolinska Institutet, Stockholm, Sweden
- Eric Westman
  - Division of Clinical Geriatrics, Department of Neurobiology, Care Sciences and Society, Karolinska Institutet, Stockholm, Sweden
  - Department of Neuroimaging, Centre for Neuroimaging Sciences, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, United Kingdom
2
Hosseinzadeh M, Yoo J, Ali S, Lansky J, Mildeova S, Yousefpoor MS, Ahmed OH, Rahmani AM, Tightiz L. A cluster-based trusted routing method using fire hawk optimizer (FHO) in wireless sensor networks (WSNs). Sci Rep 2023; 13:13046. [PMID: 37567984 PMCID: PMC10421948 DOI: 10.1038/s41598-023-40273-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/08/2023] [Accepted: 08/08/2023] [Indexed: 08/13/2023] [Open Access]
Abstract
Today, wireless sensor networks (WSNs) are growing rapidly and bring considerable comfort to human life. Because WSNs are used in areas such as health care and the battlefield, security is an important concern in the data transfer procedure to prevent data manipulation. Trust management is an effective scheme to solve these problems by building trust relationships between sensor nodes. In this paper, a cluster-based trusted routing technique using the fire hawk optimizer, called CTRF, is presented to improve network security while considering the limited energy of nodes in WSNs. It includes a weighted trust mechanism (WTM) designed based on the interactive behavior between sensor nodes. The main feature of this trust mechanism is that it applies exponential coefficients to the trust parameters, namely weighted reception rate, weighted redundancy rate, and energy state, so that the trust level of sensor nodes is exponentially reduced or increased based on their hostile or friendly behaviors. Moreover, the proposed approach creates a fire hawk optimizer-based clustering mechanism to select cluster heads from a candidate set, which includes sensor nodes whose remaining energy and trust levels are greater than the average remaining energy and the average trust level of all network nodes, respectively. In this clustering method, a new cost function is proposed based on four objectives: cluster head location, cluster head energy, distance from the cluster head to the base station, and cluster size. Finally, CTRF decides on inter-cluster routing paths through a trusted routing algorithm and uses these routes to transmit data from cluster heads to the base station. In the route construction process, CTRF considers parameters such as the energy of the route, the quality of the route, the reliability of the route, and the number of hops.
CTRF runs on the network simulator version 2 (NS2), and its performance is compared with other secure routing approaches with regard to energy, throughput, packet loss rate, latency, detection ratio, and accuracy. The evaluation shows that CTRF outperforms the other methods.
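The candidate-set rule described in this abstract (keep only nodes whose remaining energy and trust level both exceed the network averages) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Node` class, its field names, the value scales, and the sample data are all assumptions made for the example, and the fire hawk optimizer that later picks cluster heads from this set is not shown.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Node:
    """Hypothetical sensor node record (field names are assumptions)."""
    node_id: int
    energy: float  # remaining energy, arbitrary units
    trust: float   # trust level, assumed to lie in [0, 1]


def candidate_cluster_heads(nodes):
    """Return nodes whose energy AND trust both exceed the network averages."""
    if not nodes:
        return []
    avg_energy = mean(n.energy for n in nodes)
    avg_trust = mean(n.trust for n in nodes)
    return [n for n in nodes if n.energy > avg_energy and n.trust > avg_trust]


# Made-up example network of four nodes.
nodes = [
    Node(1, energy=0.9, trust=0.8),
    Node(2, energy=0.4, trust=0.9),  # trusted but low on energy
    Node(3, energy=0.7, trust=0.3),  # low trust
    Node(4, energy=0.8, trust=0.7),
]
candidates = candidate_cluster_heads(nodes)
print([n.node_id for n in candidates])
```

Using strict inequalities follows the abstract's wording ("greater than the average"); a node sitting exactly on either average is excluded.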
Affiliation(s)
- Mehdi Hosseinzadeh
  - Institute of Research and Development, Duy Tan University, Da Nang, Vietnam
  - School of Medicine and Pharmacy, Duy Tan University, Da Nang, Vietnam
- Joon Yoo
  - School of Computing, Gachon University, 1342 Seongnamdaero, Seongnam, 13120, Korea
- Saqib Ali
  - Department of Information Systems, College of Economics and Political Science, Sultan Qaboos University, Al Khoudh, Muscat, Oman
- Jan Lansky
  - Department of Computer Science and Mathematics, Faculty of Economic Studies, University of Finance and Administration, Prague, Czech Republic
- Stanislava Mildeova
  - Department of Computer Science and Mathematics, Faculty of Economic Studies, University of Finance and Administration, Prague, Czech Republic
- Omed Hassan Ahmed
  - Department of Information Technology, University of Human Development, Sulaymaniyah, Iraq
- Amir Masoud Rahmani
  - Future Technology Research Center, National Yunlin University of Science and Technology, Yunlin, Taiwan
- Lilia Tightiz
  - School of Computing, Gachon University, 1342 Seongnamdaero, Seongnam, 13120, Korea
3
Bacteria phototaxis optimizer. Neural Comput Appl 2023. [DOI: 10.1007/s00521-023-08391-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/17/2023]
4
Yue Y, Cao L, Lu D, Hu Z, Xu M, Wang S, Li B, Ding H. Review and empirical analysis of sparrow search algorithm. Artif Intell Rev 2023. [DOI: 10.1007/s10462-023-10435-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/05/2023]
5
Thrun MC. Identification of Explainable Structures in Data with a Human-in-the-Loop. Künstliche Intelligenz 2022. [DOI: 10.1007/s13218-022-00782-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
Abstract
Explainable AIs (XAIs) often do not provide relevant or understandable explanations for a domain-specific human-in-the-loop (HIL). In addition, internally used metrics have biases that might not match existing structures in the data. The habilitation thesis presents an alternative solution approach by deriving explanations from high-dimensional structures in the data rather than from predetermined classifications. Typically, the detection of such density- or distance-based structures in data has so far entailed the challenges of choosing appropriate algorithms and their parameters, which adds a considerable amount of complex decision-making options for the HIL. Central steps of the solution approach are a parameter-free methodology for the estimation and visualization of probability density functions (PDFs), followed by a hypothesis for selecting an appropriate distance metric independent of the data context in combination with projection-based clustering (PBC). PBC allows for subsequent interactive identification of separable structures in the data. Hence, the HIL does not need deep knowledge of the underlying algorithms to identify structures in data. The complete data-driven XAI approach involving the HIL is based on a decision tree guided by distance-based structures in data (DSD). This data-driven XAI shows initial success in the application to multivariate time series and non-sequential high-dimensional data. It generates meaningful and relevant explanations that are evaluated by Grice’s maxims.
6
A field-based computing approach to sensing-driven clustering in robot swarms. Swarm Intelligence 2022. [DOI: 10.1007/s11721-022-00215-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/14/2022]
Abstract
Swarm intelligence leverages collective behaviours emerging from interaction and activity of several “simple” agents to solve problems in various environments. One problem of interest in large swarms featuring a variety of sub-goals is swarm clustering, where the individuals of a swarm are assigned or choose to belong to zero or more groups, also called clusters. In this work, we address the sensing-based swarm clustering problem, where clusters are defined based on both the values sensed from the environment and the spatial distribution of the values and the agents. Moreover, we address it in a setting characterised by decentralisation of computation and interaction, and dynamicity of values and mobility of agents. For the solution, we propose to use the field-based computing paradigm, where computation and interaction are expressed in terms of a functional manipulation of fields, distributed and evolving data structures mapping each individual of the system to values over time. We devise a solution to sensing-based swarm clustering leveraging multiple concurrent field computations with limited domain and evaluate the approach experimentally by means of simulations, showing that the programmed swarms form clusters that well reflect the underlying environmental phenomena dynamics.
7
Databionic Swarm Intelligence to Screen Wastewater Recycling Quality with Factorial and Hyper-Parameter Non-Linear Orthogonal Mini-Datasets. Water 2022. [DOI: 10.3390/w14131990] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2023]
Abstract
Electrodialysis (ED) may be designed to enhance wastewater recycling efficiency for crop irrigation in areas where water distribution is otherwise inaccessible. ED process controls are difficult to manage because the ED cells need to be custom-built to meet local requirements, and the wastewater influx often has heterogeneous ionic properties. Besides the underlying complex chemical phenomena, recycling screening is a challenge to engineering because the number of experimental trials must be kept low in order to be timely and cost-effective. A new data-centric approach is presented that screens three water quality indices against four ED-process-controlling factors for a wastewater recycling application in agricultural development. The implemented unsupervised solver must: (1) be fine-tuned for optimal deployment and (2) screen the ED trials for effect potency. The databionic swarm intelligence classifier is employed to cluster the L9(3^4) OA mini-dataset of: (1) the removed Na+ content, (2) the sodium adsorption ratio (SAR) and (3) the soluble Na+ percentage. From an information viewpoint, the proviso for the factor profiler is that it should be apt to detect strength and curvature effects against not-computable uncertainty. The strength hierarchy was analyzed for the four ED-process-controlling factors: (1) the dilute flow, (2) the cathode flow, (3) the anode flow and (4) the voltage rate. The new approach matches two sequences for similarities, according to: (1) the classified cluster identification string and (2) the pre-defined OA factorial setting string. Internal cluster validity is checked by the Dunn and Davies–Bouldin Indices, after completing a hyper-parameter L8(4^1 2^2) OA screening. The three selected hyper-parameters (distance measure, structure type and position type) created negligible variability. The dilute flow was found to regulate the overall ED-based separation performance.
The results agree with other recent statistical/algorithmic studies through external validation. In conclusion, statistical/algorithmic freeware (R-packages) may be effective in resolving quality multi-indexed screening tasks of intricate non-linear mini-OA-datasets.
8
Xu M, Liu D, Zhang Y. Design of Interactive Teaching System of Physical Training Based on Artificial Intelligence. Journal of Information & Knowledge Management 2022. [DOI: 10.1142/s0219649222400214] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]
Abstract
With the continuous change and innovation of teaching methods in colleges and universities, students' curriculum systems are constantly being enriched and developed, and the requirements for teaching management and teaching systems are rising accordingly. Physical education is usually taught outdoors, and some schools have not established a complete teaching system. Therefore, an interactive teaching system for physical training based on artificial intelligence is designed. First, the hardware of the intelligent sports training interactive system is designed in three parts: constructing the overall control circuit of the interactive teaching system and determining the corresponding circuit address decoding, improving the audio control circuit, and connecting the associated video interactive drive. Then, the software of the system is designed by creating an intelligent training function module, designing a training database, and realising effective training and teaching of intelligent sports. Finally, the system is tested to verify its effect and to improve it further, making it safer and more accurate, improving the efficiency of the interactive sports training system, and enhancing the integrity of the teaching process.
Affiliation(s)
- Min Xu
  - Sports Teaching Department, Shanghai University of Finance and Economics, Shanghai 200433, P. R. China
- DongAo Liu
  - Sports Teaching Department, Shanghai University of Finance and Economics, Shanghai 200433, P. R. China
- Yan Zhang
  - Sports Teaching Department, Shanghai University of Finance and Economics, Shanghai 200433, P. R. China
9
New bag-of-feature for histopathology image classification using reinforced cat swarm algorithm and weighted Gaussian mixture modelling. Complex Intell Syst 2022. [DOI: 10.1007/s40747-022-00726-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Abstract
Progress in digital histopathology for computer-aided diagnosis has led to advances in automated histopathological image classification systems. However, heterogeneity and complexity in the structural background make classification a challenging process. Therefore, this paper introduces a robust and reliable new bag-of-feature framework. The optimal visual words are obtained by applying the proposed reinforcement cat swarm optimization algorithm. Moreover, the frequency of occurrence of each visual word is depicted through a histogram using a new weighted Gaussian mixture modelling method. The reinforcement cat swarm optimization algorithm is evaluated on the IEEE CEC 2017 benchmark function problems and compared with other state-of-the-art algorithms. Moreover, statistical test analysis is done on the mean and best fitness values acquired from the benchmark functions. The proposed classification model effectively identifies and classifies the different categories of histopathological images. Furthermore, a comparative experimental analysis of the proposed reinforcement cat swarm optimization-based bag-of-feature is performed on standard quality metrics. The observations indicate that the reinforcement cat swarm optimization-based bag-of-feature outperforms the other methods and provides promising results.
10
Bisandu DB, Moulitsas I, Filippone S. Social ski driver conditional autoregressive-based deep learning classifier for flight delay prediction. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-06898-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 10/19/2022]
Abstract
The importance of robust flight delay prediction has recently increased in the air transportation industry. This industry seeks alternative methods and technologies for more robust flight delay prediction because of its significance for all stakeholders. The most affected are airlines that suffer from monetary and passenger loyalty losses. Several studies have attempted to analyse and solve flight delay prediction problems using machine learning methods. This research proposes a novel alternative method, namely social ski driver conditional autoregressive-based (SSDCA-based) deep learning. Our proposed method combines the Social Ski Driver algorithm with Conditional Autoregressive Value at Risk by Regression Quantiles. We consider the most relevant instances from the training dataset, which are the delayed flights. We applied data transformation to stabilise the data variance using Yeo-Johnson. We then perform the training and testing of our data using deep recurrent neural network (DRNN) and SSDCA-based algorithms. The SSDCA-based optimisation algorithm helped us choose the right network architecture with better accuracy and less error than the existing literature. The results of our proposed SSDCA-based method and existing benchmark methods were compared. The efficiency and computational time of our proposed method are compared against the existing benchmark methods. The SSDCA-based DRNN provides a more accurate flight delay prediction with 0.9361 and 0.9252 accuracy rates on dataset-1 and dataset-2, respectively. To show the reliability of our method, we compared it with other meta-heuristic approaches. The result is that the SSDCA-based DRNN outperformed all existing benchmark methods tested in our experiment.
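The variance-stabilising step mentioned in this abstract can be illustrated with the standard Yeo-Johnson power transform. The sketch below is not the authors' code: the function name `yeo_johnson`, the fixed value of lambda, and the sample delays are assumptions for illustration, and in practice lambda is usually estimated from the data (for example by maximum likelihood) rather than chosen by hand.

```python
import math


def yeo_johnson(y: float, lam: float) -> float:
    """Apply the Yeo-Johnson power transform to a single value."""
    if y >= 0:
        if abs(lam) < 1e-12:           # lambda == 0: logarithmic branch
            return math.log1p(y)
        return ((y + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:         # lambda == 2: logarithmic branch for negatives
        return -math.log1p(-y)
    return -(((-y + 1.0) ** (2.0 - lam)) - 1.0) / (2.0 - lam)


# Made-up delay values in minutes (negative = early arrival).
delays = [-5.0, 0.0, 12.0, 95.0]
transformed = [yeo_johnson(d, 0.5) for d in delays]
print(transformed)
```

Unlike the Box-Cox transform, this definition accepts zero and negative inputs, which matters for delay data that can include early arrivals. With lambda equal to 1 the transform is the identity.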
11
Exploiting Distance-Based Structures in Data Using an Explainable AI for Stock Picking. Information 2022. [DOI: 10.3390/info13020051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] [Open Access]
Abstract
In principle, the fundamental data of companies may be used to select stocks with a high probability of either increasing or decreasing price. Many of the commonly known rules or used explanations for such a stock-picking process are too vague to be applied in concrete cases, and at the same time, it is challenging to analyze high-dimensional data with a low number of cases in order to derive data-driven and usable explanations. This work proposes an explainable AI (XAI) approach on the quarterly available fundamental data of companies traded on the German stock market. In the XAI, distance-based structures in data (DSD) that guide decision tree induction are identified. The leaves of the appropriately selected decision tree contain subsets of stocks and provide viable explanations that can be rated by a human. The prediction of the future price trends of specific stocks is made possible using the explanations and a rating. In each quarter, stock picking by DSD-XAI is based on understanding the explanations and has a higher success rate than arbitrary stock picking, a hybrid AI system, and a recent unsupervised decision tree called eUD3.5.
12
A learning automata-based hybrid MPA and JS algorithm for numerical optimization problems and its application on data clustering. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2021.107682] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 12/23/2022]
13
Distance-based clustering challenges for unbiased benchmarking studies. Sci Rep 2021; 11:18988. [PMID: 34556686 PMCID: PMC8460803 DOI: 10.1038/s41598-021-98126-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 03/05/2021] [Accepted: 09/02/2021] [Indexed: 02/08/2023] [Open Access]
Abstract
Benchmark datasets with predefined cluster structures and high-dimensional biomedical datasets outline the challenges of cluster analysis: clustering algorithms are limited in their clustering ability in the presence of clusters defining distance-based structures resulting in a biased clustering solution. Data sets might not have cluster structures. Clustering yields arbitrary labels and often depends on the trial, leading to varying results. Moreover, recent research indicated that all partition comparison measures can yield the same results for different clustering solutions. Consequently, algorithm selection and parameter optimization by unsupervised quality measures (QM) are always biased and misleading. Only if the predefined structures happen to meet the particular clustering criterion and QM, can the clusters be recovered. Results are presented based on 41 open-source algorithms which are particularly useful in biomedical scenarios. Furthermore, comparative analysis with mirrored density plots provides a significantly more detailed benchmark than that with the typically used box plots or violin plots.
14
Abstract
The forecasting of univariate time series poses challenges in industrial applications if the seasonality varies. Typically, a non-varying seasonality of a time series is treated with a model based on Fourier theory or the aggregation of forecasts from multiple resolution levels. If the seasonality changes with time, various wavelet approaches for univariate forecasting are proposed with promising potential but without accessible software or a systematic evaluation of different wavelet models compared to state-of-the-art methods. In contrast, the advantage of the specific multiresolution forecasting proposed here is the convenience of a swiftly accessible implementation in R and Python combined with coefficient selection through evolutionary optimization which is evaluated in four different applications: scheduling of a call center, planning electricity demand, and predicting stocks and prices. The systematic benchmarking is based on out-of-sample forecasts resulting from multiple cross-validations with the error measure MASE and SMAPE for which the error distribution of each method and dataset is estimated and visualized with the mirrored density plot. The multiresolution forecasting performs equal to or better than twelve comparable state-of-the-art methods but does not require users to set parameters contrary to prior wavelet forecasting frameworks. This makes the method suitable for industrial applications.
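The two error measures named in this abstract, MASE and SMAPE, can be written down in a few lines. This is a minimal sketch under stated assumptions, not the paper's evaluation code: SMAPE exists in several variants, and the one below (mean of 2|a - f| / (|a| + |f|), in percent) is only one common choice; MASE is scaled by the in-sample error of a naive forecast with lag `season`. The function names and sample numbers are illustrative.

```python
def smape(actual, forecast):
    """Symmetric MAPE in percent (one common variant, range 0 to 200)."""
    terms = []
    for a, f in zip(actual, forecast):
        denom = abs(a) + abs(f)
        # Treat the 0/0 case (both values zero) as a perfect forecast.
        terms.append(0.0 if denom == 0 else 2.0 * abs(a - f) / denom)
    return 100.0 * sum(terms) / len(terms)


def mase(train, actual, forecast, season=1):
    """Mean absolute scaled error relative to an in-sample naive forecast."""
    naive_errors = [abs(train[i] - train[i - season])
                    for i in range(season, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("naive in-sample error is zero; MASE is undefined")
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


# Made-up series: four training points, two out-of-sample points.
train = [10.0, 12.0, 14.0, 16.0]
actual = [18.0, 20.0]
forecast = [17.0, 22.0]
print(mase(train, actual, forecast), smape(actual, forecast))
```

A MASE below 1 means the forecast beats the naive in-sample benchmark on average, which makes the measure comparable across series with different scales.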
15
Thrun MC, Pape F, Ultsch A. Conventional displays of structures in data compared with interactive projection-based clustering (IPBC). International Journal of Data Science and Analytics 2021. [DOI: 10.1007/s41060-021-00264-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
Abstract
Clustering is an important task in knowledge discovery with the goal to identify structures of similar data points in a dataset. Here, the focus lies on methods that use a human-in-the-loop, i.e., incorporate user decisions into the clustering process through 2D and 3D displays of the structures in the data. Some of these interactive approaches fall into the category of visual analytics and emphasize the power of such displays to identify the structures interactively in various types of datasets or to verify the results of clustering algorithms. This work presents a new method called interactive projection-based clustering (IPBC). IPBC is an open-source and parameter-free method using a human-in-the-loop for an interactive 2.5D display and identification of structures in data based on the user’s choice of a dimensionality reduction method. The IPBC approach is systematically compared with accessible visual analytics methods for the display and identification of cluster structures using twelve clustering benchmark datasets and one additional natural dataset. Qualitative comparison of 2D, 2.5D and 3D displays of structures and empirical evaluation of the identified cluster structures show that IPBC outperforms comparable methods. Additionally, IPBC assists in identifying structures previously unknown to domain experts in an application.
16
Thrun MC. The Exploitation of Distance Distributions for Clustering. International Journal of Computational Intelligence and Applications 2021. [DOI: 10.1142/s1469026821500164] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/18/2022]
Abstract
Although distance measures are used in many machine learning algorithms, the literature on the context-independent selection and evaluation of distance measures is limited in the sense that prior knowledge is used. In cluster analysis, current studies evaluate the choice of distance measure after applying unsupervised methods based on error probabilities, implicitly setting the goal of reproducing predefined partitions in data. Such studies use clusters of data that are often based on the context of the data as well as the custom goal of the specific study. Depending on the data context, different properties for distance distributions are judged to be relevant for appropriate distance selection. However, if cluster analysis is based on the task of finding similar partitions of data, then the intrapartition distances should be smaller than the interpartition distances. By systematically investigating this specification using distribution analysis through the mirrored-density (MD plot), it is shown that multimodal distance distributions are preferable in cluster analysis. As a consequence, it is advantageous to model distance distributions with Gaussian mixtures prior to the evaluation phase of unsupervised methods. Experiments are performed on several artificial datasets and natural datasets for the task of clustering.
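The requirement stated in this abstract, that intrapartition distances should be smaller than interpartition distances, can be checked directly on a labelled dataset. The sketch below is an illustration only: it assumes Euclidean distance and a toy dataset, and the helper name `split_distances` is hypothetical. The paper itself studies the full distance distribution (via the MD plot and Gaussian mixtures), which this snippet does not reproduce.

```python
from itertools import combinations
from math import dist


def split_distances(points, labels):
    """Split all pairwise Euclidean distances into intra- and inter-partition lists."""
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (intra if lp == lq else inter).append(dist(p, q))
    return intra, inter


# Toy data: two well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels = [0, 0, 1, 1]
intra, inter = split_distances(points, labels)
print(max(intra), min(inter))
```

When the largest intrapartition distance lies below the smallest interpartition distance, as in this toy case, the pooled distance distribution has two separated modes, which is the kind of multimodality the abstract argues is preferable for cluster analysis.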
Affiliation(s)
- Michael C. Thrun
  - Databionics Research Group, Philipps-University of Marburg, D-35032 Marburg, Germany
  - Department of Hematology, Oncology and Immunology, Philipps-University Marburg, Germany
17
Explainable AI Framework for Multivariate Hydrochemical Time Series. Machine Learning and Knowledge Extraction 2021. [DOI: 10.3390/make3010009] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 01/09/2023]
Abstract
The understanding of water quality and its underlying processes is important for the protection of aquatic environments. With the rare opportunity of access to a domain expert, an explainable AI (XAI) framework is proposed that is applicable to multivariate time series. The XAI provides explanations that are interpretable by domain experts. In three steps, it combines a data-driven choice of a distance measure with supervised decision trees guided by projection-based clustering. The multivariate time series consists of water quality measurements, including nitrate, electrical conductivity, and twelve other environmental parameters. The relationships between water quality and the environmental parameters are investigated by identifying similar days within a cluster and dissimilar days between clusters. The framework, called DDS-XAI, does not depend on prior knowledge about data structure, and its explanations tend to be contrastive. The relationships in the data can be visualized by a topographic map representing high-dimensional structures. Two state-of-the-art XAIs called eUD3.5 and iterative mistake minimization (IMM) were unable to provide meaningful and relevant explanations from the three multivariate time series data. The DDS-XAI framework can be swiftly applied to new data. Open-source code in R for all steps of the XAI framework is provided, and the steps are structured in an application-oriented way.
18
Thrun MC, Gehlert T, Ultsch A. Analyzing the fine structure of distributions. PLoS One 2020; 15:e0238835. [PMID: 33052923 PMCID: PMC7556505 DOI: 10.1371/journal.pone.0238835] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 07/17/2019] [Accepted: 08/25/2020] [Indexed: 11/30/2022] [Open Access]
Abstract
One aim of data mining is the identification of interesting structures in data. For better analytical results, the basic properties of an empirical distribution, such as skewness and eventual clipping, i.e. hard limits in value ranges, need to be assessed. Of particular interest is the question of whether the data originate from one process or contain subsets related to different states of the data producing process. Data visualization tools should deliver a clear picture of the univariate probability density distribution (PDF) for each feature. Visualization tools for PDFs typically use kernel density estimates and include both the classical histogram and modern tools like ridgeline plots, bean plots and violin plots. If density estimation parameters remain in a default setting, conventional methods pose several problems when visualizing the PDF of uniform, multimodal, skewed distributions and distributions with clipped data. For that reason, a new visualization tool called the mirrored density plot (MD plot), which is specifically designed to discover interesting structures in continuous features, is proposed. The MD plot does not require adjusting any parameters of density estimation, which may make the use of this plot compelling, particularly to non-experts. The visualization tools in question are evaluated against statistical tests with regard to typical challenges of explorative distribution analysis. The results of the evaluation are presented using bimodal Gaussian, skewed distributions and several features with already published PDFs. In an exploratory data analysis of 12 features describing quarterly financial statements, when statistical testing poses a great difficulty, only the MD plots can identify the structure of their PDFs. In sum, the MD plot outperforms the above-mentioned methods.
Affiliation(s)
- Michael C. Thrun
  - Databionics AG, Dept. of Mathematics and Computer Science, Philipps-University of Marburg, Marburg, Germany
  - Dept. of Hematology, Oncology and Immunology, Philipps-University Marburg, Germany
- Tino Gehlert
  - Alumni of Faculty of Mathematics, Chemnitz University of Technology, Chemnitz, Germany
- Alfred Ultsch
  - Databionics AG, Dept. of Mathematics and Computer Science, Philipps-University of Marburg, Marburg, Germany