1
Liu X, Kopelman NM, Rosenberg NA. Clumppling: cluster matching and permutation program with integer linear programming. Bioinformatics 2024; 40:btad751. [PMID: 38096585 PMCID: PMC10766593 DOI: 10.1093/bioinformatics/btad751] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/22/2023] [Revised: 11/20/2023] [Accepted: 12/13/2023] [Indexed: 01/06/2024] Open
Abstract
MOTIVATION In the mixed-membership unsupervised clustering analyses commonly used in population genetics, multiple replicate data analyses can differ in their clustering solutions. Combinatorial algorithms assist in aligning clustering outputs from multiple replicates so that clustering solutions can be interpreted and combined across replicates. Although several algorithms have been introduced, challenges exist in achieving optimal alignments and performing alignments in reasonable computation time. RESULTS We present Clumppling, a method for aligning replicate solutions in mixed-membership unsupervised clustering. The method uses integer linear programming for finding optimal alignments, embedding the cluster alignment problem in standard combinatorial optimization frameworks. In example analyses, we find that it achieves solutions with preferred values of a desired objective function relative to those achieved by Pong and that it proceeds with less computation time than Clumpak. It is also the first method to permit alignments across replicates with multiple arbitrary values of the number of clusters K. AVAILABILITY AND IMPLEMENTATION Clumppling is available at https://github.com/PopGenClustering/Clumppling.
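The alignment problem described in this abstract can be illustrated at toy scale. The sketch below is not Clumppling's integer linear programming formulation: it swaps in an exhaustive search over column permutations, which is only feasible for small K. The function name `align_clusters`, the squared-difference cost, and the example membership matrices are assumptions made for illustration.

```python
from itertools import permutations


def align_clusters(q_ref, q_rep):
    """Find the column permutation of q_rep that best matches q_ref.

    q_ref and q_rep are lists of rows; each row holds K membership
    coefficients for one individual. The cost of a permutation is the
    summed squared difference between matched columns (an assumed
    objective, chosen for simplicity).
    """
    k = len(q_ref[0])
    best_perm, best_cost = None, float("inf")
    # Brute force: try every relabeling of the K clusters.
    for perm in permutations(range(k)):
        cost = sum(
            (row_ref[j] - row_rep[perm[j]]) ** 2
            for row_ref, row_rep in zip(q_ref, q_rep)
            for j in range(k)
        )
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost


# Hypothetical replicate: the same solution with cluster labels shuffled.
q_ref = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]
q_rep = [[row[2], row[0], row[1]] for row in q_ref]

perm, cost = align_clusters(q_ref, q_rep)
# Reorder the replicate's columns so they line up with the reference.
aligned = [[row[p] for p in perm] for row in q_rep]
```

Exhaustive search costs K! evaluations per pair of replicates, which is why methods for this problem turn to combinatorial optimization frameworks once K grows.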
Affiliation(s)
- Xiran Liu
  - Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA 94305, United States
- Naama M Kopelman
  - Faculty of Sciences, Holon Institute of Technology, Holon 58109, Israel
- Noah A Rosenberg
  - Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA 94305, United States
  - Department of Biology, Stanford University, Stanford, CA 94305, United States
2
Younesy H, Pober J, Möller T, Karimi MM. ModEx: a general purpose computer model exploration system. Frontiers in Bioinformatics 2023; 3:1153800. [PMID: 37304402 PMCID: PMC10249055 DOI: 10.3389/fbinf.2023.1153800] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2023] [Accepted: 05/09/2023] [Indexed: 06/13/2023] Open
Abstract
We present a general purpose visual analysis system that can be used for exploring parameters of a variety of computer models. Our proposed system offers key components of a visual parameter analysis framework including parameter sampling, deriving output summaries, and an exploration interface. It also provides an API for rapid development of parameter space exploration solutions as well as the flexibility to support custom workflows for different application domains. We evaluate the effectiveness of our system by demonstrating it in three domains: data mining, machine learning and specific application in bioinformatics.
Affiliation(s)
- Hamid Younesy
  - School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
- Torsten Möller
  - Research Network Data Science and Faculty of Computer Science, University of Vienna, Vienna, Austria
- Mohammad M. Karimi
  - Comprehensive Cancer Centre, School of Cancer and Pharmaceutical Sciences, Faculty of Life Sciences and Medicine, King's College London, London, United Kingdom
3
Das S, Saket B, Kwon BC, Endert A. Geono-Cluster: Interactive Visual Cluster Analysis for Biologists. IEEE Transactions on Visualization and Computer Graphics 2021; 27:4401-4412. [PMID: 32746262 DOI: 10.1109/tvcg.2020.3002166] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/11/2023]
Abstract
Biologists often perform clustering analysis to derive meaningful patterns, relationships, and structures from data instances and attributes. Though clustering plays a pivotal role in biologists' data exploration, it takes non-trivial effort for biologists to find the best grouping in their data using existing tools. Visual cluster analysis is currently performed either programmatically or through menus and dialogues in many tools, which require parameter adjustments over several steps of trial-and-error. In this article, we introduce Geono-Cluster, a novel visual analysis tool designed to support cluster analysis for biologists who do not have formal data science training. Geono-Cluster enables biologists to apply their domain expertise to clustering results by visually demonstrating what their expected clustering outputs should look like with a small sample of data instances. The system then predicts users' intentions and generates potential clustering results. Our study follows the design study protocol to derive biologists' tasks and requirements, design the system, and evaluate the system with experts on their own dataset. Results of our study with six biologists provide initial evidence that Geono-Cluster enables biologists to create, refine, and evaluate clustering results to effectively analyze their data and gain data-driven insights. At the end, we discuss lessons learned and implications of our study.
4
Karatzas E, Gkonta M, Hotova J, Baltoumas FA, Kontou PI, Bobotsis CJ, Bagos PG, Pavlopoulos GA. VICTOR: A visual analytics web application for comparing cluster sets. Comput Biol Med 2021; 135:104557. [PMID: 34139436 DOI: 10.1016/j.compbiomed.2021.104557] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/22/2021] [Revised: 06/04/2021] [Accepted: 06/04/2021] [Indexed: 01/21/2023]
Abstract
Clustering is the process of grouping different data objects based on similar properties. Clustering has applications in various case studies from several fields such as graph theory, image analysis, pattern recognition, statistics and others. Nowadays, there are numerous algorithms and tools able to generate clustering results. However, different algorithms or parameterizations may produce quite dissimilar cluster sets. In this way, the user is often forced to manually filter and compare these results in order to decide which of them generate the ideal clusters. To automate this process, in this study, we present VICTOR, the first fully interactive and dependency-free visual analytics web application which allows the visual comparison of the results of various clustering algorithms. VICTOR can handle multiple cluster set results simultaneously and compare them using ten different metrics. Clustering results can be filtered and compared to each other with the use of data tables or interactive heatmaps, bar plots, correlation networks, sankey and circos plots. We demonstrate VICTOR's functionality using three examples. In the first case, we compare five different network clustering algorithms on a Yeast protein-protein interaction dataset whereas in the second example, we test four different parameters of the MCL clustering algorithm on the same dataset. Finally, as a third example, we compare four different meta-analyses with hierarchically clustered differentially expressed genes found to be involved in myocardial infarction. VICTOR is available at http://victor.pavlopouloslab.info or http://bib.fleming.gr:3838/VICTOR.
Affiliation(s)
- Evangelos Karatzas
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, Greece
- Maria Gkonta
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, Greece; Department of Biology, University of Athens, Greece
- Joana Hotova
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, Greece; Department of Biology, University of Athens, Greece
- Fotis A Baltoumas
  - Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, Greece
- Panagiota I Kontou
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
- Pantelis G Bagos
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
5
Pister A, Buono P, Fekete JD, Plaisant C, Valdivia P. Integrating Prior Knowledge in Mixed-Initiative Social Network Clustering. IEEE Transactions on Visualization and Computer Graphics 2021; 27:1775-1785. [PMID: 33095715 DOI: 10.1109/tvcg.2020.3030347] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/11/2023]
Abstract
We propose a new approach, called PK-clustering, to help social scientists create meaningful clusters in social networks. Many clustering algorithms exist but most social scientists find them difficult to understand, and tools do not provide any guidance to choose algorithms, or to evaluate results taking into account the prior knowledge of the scientists. Our work introduces a new clustering approach and a visual analytics user interface that address this issue. It is based on a process that 1) captures the prior knowledge of the scientists as a set of incomplete clusters, 2) runs multiple clustering algorithms (similarly to clustering ensemble methods), 3) visualizes the results of all the algorithms ranked and summarized by how well each algorithm matches the prior knowledge, 4) evaluates the consensus between user-selected algorithms and 5) allows users to review details and iteratively update the acquired knowledge. We describe our approach using an initial functional prototype, then provide two examples of use and early feedback from social scientists. We believe our clustering approach offers a novel constructive method to iteratively build knowledge while avoiding being overly influenced by the results of often randomly selected black-box clustering algorithms.
6
Krueger R, Beyer J, Jang WD, Kim NW, Sokolov A, Sorger PK, Pfister H. Facetto: Combining Unsupervised and Supervised Learning for Hierarchical Phenotype Analysis in Multi-Channel Image Data. IEEE Transactions on Visualization and Computer Graphics 2020; 26:227-237. [PMID: 31514138 PMCID: PMC7045445 DOI: 10.1109/tvcg.2019.2934547] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Indexed: 05/20/2023]
Abstract
Facetto is a scalable visual analytics application that is used to discover single-cell phenotypes in high-dimensional multi-channel microscopy images of human tumors and tissues. Such images represent the cutting edge of digital histology and promise to revolutionize how diseases such as cancer are studied, diagnosed, and treated. Highly multiplexed tissue images are complex, comprising 10^9 or more pixels, 60-plus channels, and millions of individual cells. This makes manual analysis challenging and error-prone. Existing automated approaches are also inadequate, in large part, because they are unable to effectively exploit the deep knowledge of human tissue biology available to anatomic pathologists. To overcome these challenges, Facetto enables a semi-automated analysis of cell types and states. It integrates unsupervised and supervised learning into the image and feature exploration process and offers tools for analytical provenance. Experts can cluster the data to discover new types of cancer and immune cells and use clustering results to train a convolutional neural network that classifies new cells accordingly. Likewise, the output of classifiers can be clustered to discover aggregate patterns and phenotype subsets. We also introduce a new hierarchical approach to keep track of analysis steps and data subsets created by users; this assists in the identification of cell types. Users can build phenotype trees and interact with the resulting hierarchical structures of both high-dimensional feature and image spaces. We report on use-cases in which domain scientists explore various large-scale fluorescence imaging datasets. We demonstrate how Facetto assists users in steering the clustering and classification process, inspecting analysis results, and gaining new scientific insights into cancer biology.
7
Allendes Osorio RS, Tripathi LP, Mizuguchi K. CLINE: a web-tool for the comparison of biological dendrogram structures. BMC Bioinformatics 2019; 20:528. [PMID: 31660851 PMCID: PMC6819642 DOI: 10.1186/s12859-019-3149-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2019] [Accepted: 10/04/2019] [Indexed: 11/18/2022] Open
Abstract
Background When visually comparing the results of hierarchical clustering, the differences in the arrangements of components are of special interest. However, in a biological setting, identifying such differences becomes less straightforward, as the changes in the dendrogram structure caused by permuting biological replicates do not necessarily imply a different biological interpretation. Here, we introduce a visualization tool to help identify biologically similar topologies across different clustering results, even in the presence of replicates. Results Here we introduce CLINE, an open-access web application that allows users to visualize and compare multiple dendrogram structures by visually displaying the links between areas of similarity across multiple structures. Through the use of a single page and a simple user interface, the user is able to load and remove structures from the visualization, change some aspects of their display and set the parameters used to match cluster topology across consecutive pairs of dendrograms. Conclusions We have implemented a web-tool that allows the users to visualize different dendrogram structures, showing not only the structures themselves, but also linking areas of similarity across multiple structures. The software is freely available at http://mizuguchilab.org/tools/cline/. Also, the source code, documentation and installation instructions are available on GitHub at https://github.com/RodolfoAllendes/cline/.
8
Onel M, Beykal B, Ferguson K, Chiu WA, McDonald TJ, Zhou L, House JS, Wright FA, Sheen DA, Rusyn I, Pistikopoulos EN. Grouping of complex substances using analytical chemistry data: A framework for quantitative evaluation and visualization. PLoS One 2019; 14:e0223517. [PMID: 31600275 PMCID: PMC6786635 DOI: 10.1371/journal.pone.0223517] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 05/09/2019] [Accepted: 09/23/2019] [Indexed: 02/01/2023] Open
Abstract
A detailed characterization of the chemical composition of complex substances, such as products of petroleum refining and environmental mixtures, is greatly needed in exposure assessment and manufacturing. The inherent complexity and variability in the composition of complex substances obfuscate the choices for their detailed analytical characterization. Yet, in lieu of exact chemical composition of complex substances, evaluation of the degree of similarity is a sensible path toward decision-making in environmental health regulations. Grouping of similar complex substances is a challenge that can be addressed via advanced analytical methods and streamlined data analysis and visualization techniques. Here, we propose a framework with unsupervised and supervised analyses to optimally group complex substances based on their analytical features. We test two data sets of complex oil-derived substances. The first data set is from gas chromatography-mass spectrometry (GC-MS) analysis of 20 Standard Reference Materials representing crude oils and oil refining products. The second data set consists of 15 samples of various gas oils analyzed using three analytical techniques: GC-MS, GC×GC-flame ionization detection (FID), and ion mobility spectrometry-mass spectrometry (IM-MS). We use hierarchical clustering using Pearson correlation as a similarity metric for the unsupervised analysis and build classification models using the Random Forest algorithm for the supervised analysis. We present a quantitative comparative assessment of clustering results via Fowlkes-Mallows index, and classification results via model accuracies in predicting the group of an unknown complex substance. We demonstrate the effect of (i) different grouping methodologies, (ii) data set size, and (iii) dimensionality reduction on the grouping quality, and (iv) different analytical techniques on the characterization of the complex substances. 
While the complexity and variability in chemical composition are an inherent feature of complex substances, we demonstrate how the choices of the data analysis and visualization methods can impact the communication of their characteristics to delineate sufficient similarity.
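The Fowlkes-Mallows index mentioned in this abstract compares two groupings of the same items by counting item pairs. The following is a minimal pure-Python sketch of the standard pair-counting definition, not the authors' code; the function name `fowlkes_mallows` and the example labelings are hypothetical.

```python
from itertools import combinations
from math import sqrt


def fowlkes_mallows(labels_a, labels_b):
    """Pair-counting Fowlkes-Mallows index between two labelings.

    FM = TP / sqrt((TP + FP) * (TP + FN)), where TP counts pairs grouped
    together in both labelings, and FP / FN count pairs grouped together
    in only one of them. Returns 0.0 when no pair is shared.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            tp += 1
        elif same_a:
            fn += 1  # together in A only
        elif same_b:
            fp += 1  # together in B only
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + fp) * (tp + fn))


# Identical groupings with different label names score 1.0.
same = fowlkes_mallows([0, 0, 1, 1], [1, 1, 0, 0])
# Partially agreeing groupings score between 0 and 1.
partial = fowlkes_mallows([0, 0, 0, 1], [0, 0, 1, 1])
```

The index depends only on which items are grouped together, so it is unaffected by how clusters are numbered.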
Affiliation(s)
- Melis Onel
  - Artie McFerrin Department of Chemical Engineering, Texas A&M University, College Station, TX, United States of America
  - Texas A&M Energy Institute, Texas A&M University, College Station, TX, United States of America
- Burcu Beykal
  - Artie McFerrin Department of Chemical Engineering, Texas A&M University, College Station, TX, United States of America
  - Texas A&M Energy Institute, Texas A&M University, College Station, TX, United States of America
- Kyle Ferguson
  - Department of Veterinary Integrative Biosciences, Texas A&M University, College Station, TX, United States of America
- Weihsueh A. Chiu
  - Department of Veterinary Integrative Biosciences, Texas A&M University, College Station, TX, United States of America
- Thomas J. McDonald
  - Department of Environmental and Occupational Health, Texas A&M University, College Station, TX, United States of America
- Lan Zhou
  - Department of Statistics, Texas A&M University, College Station, TX, United States of America
- John S. House
  - Bioinformatics Research Center, North Carolina State University, Raleigh, NC, United States of America
- Fred A. Wright
  - Bioinformatics Research Center, North Carolina State University, Raleigh, NC, United States of America
  - Departments of Statistics and Biological Sciences, North Carolina State University, Raleigh, NC, United States of America
- David A. Sheen
  - Chemical Sciences Division, National Institute of Standards and Technology, Gaithersburg, MD, United States of America
- Ivan Rusyn
  - Department of Veterinary Integrative Biosciences, Texas A&M University, College Station, TX, United States of America
- Efstratios N. Pistikopoulos
  - Artie McFerrin Department of Chemical Engineering, Texas A&M University, College Station, TX, United States of America
  - Texas A&M Energy Institute, Texas A&M University, College Station, TX, United States of America
9
Cavallo M, Demiralp C. Clustrophile 2: Guided Visual Clustering Analysis. IEEE Transactions on Visualization and Computer Graphics 2018; 25:267-276. [PMID: 30130194 DOI: 10.1109/tvcg.2018.2864477] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 06/08/2023]
Abstract
Data clustering is a common unsupervised learning method frequently used in exploratory data analysis. However, identifying relevant structures in unlabeled, high-dimensional data is nontrivial, requiring iterative experimentation with clustering parameters as well as data features and instances. The number of possible clusterings for a typical dataset is vast, and navigating in this vast space is also challenging. The absence of ground-truth labels makes it impossible to define an optimal solution, thus requiring user judgment to establish what can be considered a satisfiable clustering result. Data scientists need adequate interactive tools to effectively explore and navigate the large clustering space so as to improve the effectiveness of exploratory clustering analysis. We introduce Clustrophile 2, a new interactive tool for guided clustering analysis. Clustrophile 2 guides users in clustering-based exploratory analysis, adapts user feedback to improve user guidance, facilitates the interpretation of clusters, and helps quickly reason about differences between clusterings. To this end, Clustrophile 2 contributes a novel feature, the Clustering Tour, to help users choose clustering parameters and assess the quality of different clustering results in relation to current analysis goals and user expectations. We evaluate Clustrophile 2 through a user study with 12 data scientists, who used our tool to explore and interpret sub-cohorts in a dataset of Parkinson's disease patients. Results suggest that Clustrophile 2 improves the speed and effectiveness of exploratory clustering analysis for both experts and non-experts.
10
Manjunath M, Zhang Y, Kim Y, Yeo SH, Sobh O, Russell N, Followell C, Bushell C, Ravaioli U, Song JS. ClusterEnG: an interactive educational web resource for clustering and visualizing high-dimensional data. PeerJ Comput Sci 2018; 4:e155. [PMID: 30906871 PMCID: PMC6429934 DOI: 10.7717/peerj-cs.155] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 03/05/2018] [Accepted: 05/01/2018] [Indexed: 06/09/2023]
Abstract
SUMMARY Clustering is one of the most common techniques used in data analysis to discover hidden structures by grouping together data points that are similar in some measure into clusters. Although there are many programs available for performing clustering, a single web resource that provides both state-of-the-art clustering methods and interactive visualizations is lacking. ClusterEnG (acronym for Clustering Engine for Genomics) provides an interface for clustering big data and interactive visualizations including 3D views, cluster selection and zoom features. ClusterEnG also aims at educating the user about the similarities and differences between various clustering algorithms and provides clustering tutorials that demonstrate potential pitfalls of each algorithm. The web resource will be particularly useful to scientists who are not conversant with computing but want to understand the structure of their data in an intuitive manner. AVAILABILITY ClusterEnG is part of a bigger project called KnowEnG (Knowledge Engine for Genomics) and is available at http://education.knoweng.org/clustereng. CONTACT songi@illinois.edu.
Affiliation(s)
- Mohith Manjunath
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Champaign, IL, United States of America
- Yi Zhang
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Champaign, IL, United States of America
  - Department of Bioengineering, University of Illinois at Urbana-Champaign, Champaign, IL, United States of America
- Yeonsung Kim
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Champaign, IL, United States of America
- Steve H. Yeo
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Champaign, IL, United States of America
- Omar Sobh
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Champaign, IL, United States of America
- Nathan Russell
  - Illinois Applied Research Institute, University of Illinois at Urbana-Champaign, Champaign, IL, United States of America
- Christian Followell
  - Illinois Applied Research Institute, University of Illinois at Urbana-Champaign, Champaign, IL, United States of America
- Colleen Bushell
  - Illinois Applied Research Institute, University of Illinois at Urbana-Champaign, Champaign, IL, United States of America
- Umberto Ravaioli
  - Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Champaign, IL, United States of America
- Jun S. Song
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Champaign, IL, United States of America
  - Department of Physics, University of Illinois at Urbana-Champaign, Champaign, IL, United States of America
11
Kwon BC, Eysenbach B, Verma J, Ng K, De Filippi C, Stewart WF, Perer A. Clustervision: Visual Supervision of Unsupervised Clustering. IEEE Transactions on Visualization and Computer Graphics 2018; 24:142-151. [PMID: 28866567 DOI: 10.1109/tvcg.2017.2745085] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Indexed: 06/07/2023]
Abstract
Clustering, the process of grouping together similar items into distinct partitions, is a common type of unsupervised machine learning that can be useful for summarizing and aggregating complex multi-dimensional data. However, data can be clustered in many ways, and there exists a large body of algorithms designed to reveal different patterns. While having access to a wide variety of algorithms is helpful, in practice, it is quite difficult for data scientists to choose and parameterize algorithms to get the clustering results relevant for their dataset and analytical tasks. To alleviate this problem, we built Clustervision, a visual analytics tool that helps ensure data scientists find the right clustering among the large number of techniques and parameters available. Our system clusters data using a variety of clustering techniques and parameters and then ranks clustering results utilizing five quality metrics. In addition, users can guide the system to produce more relevant results by providing task-relevant constraints on the data. Our visual user interface allows users to find high quality clustering results, explore the clusters using several coordinated visualization techniques, and select the cluster result that best suits their task. We demonstrate this novel approach using a case study with a team of researchers in the medical domain and showcase that our system empowers users to choose an effective representation of their complex data.
12
Li X, Wong KC. Multiobjective Patient Stratification Using Evolutionary Multiobjective Optimization. IEEE J Biomed Health Inform 2017; 22:1619-1629. [PMID: 29990162 DOI: 10.1109/jbhi.2017.2769711] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 11/09/2022]
Abstract
One of the main challenges in modern medicine is to stratify patients for personalized care. Many different clustering methods have been proposed to solve the problem in both quantitative and biologically meaningful manners. However, existing clustering algorithms suffer from numerous restrictions such as experimental noise, high dimensionality, and poor interpretability. To overcome those limitations altogether, we propose and formulate a multiobjective framework based on evolutionary multiobjective optimization to balance the feature relevance and redundancy for patient stratification. To demonstrate the effectiveness of our proposed algorithms, we benchmark our algorithms across 55 synthetic datasets based on a real human transcription regulation network model, 35 real cancer gene expression datasets, and two case studies. Experimental results suggest that the proposed algorithms perform better than recent state-of-the-art methods. In addition, time complexity analysis, convergence analysis, and parameter analysis are conducted to demonstrate the robustness of the proposed methods from different perspectives. Finally, the t-Distributed Stochastic Neighbor Embedding (t-SNE) is applied to project the selected feature subsets onto two or three dimensions to visualize the high-dimensional patient stratification data.
13
Kern M, Lex A, Gehlenborg N, Johnson CR. Interactive visual exploration and refinement of cluster assignments. BMC Bioinformatics 2017; 18:406. [PMID: 28899361 PMCID: PMC5596943 DOI: 10.1186/s12859-017-1813-7] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 04/07/2017] [Accepted: 08/29/2017] [Indexed: 11/10/2022] Open
Abstract
Background With ever-increasing amounts of data produced in biology research, scientists are in need of efficient data analysis methods. Cluster analysis, combined with visualization of the results, is one such method that can be used to make sense of large data volumes. At the same time, cluster analysis is known to be imperfect and depends on the choice of algorithms, parameters, and distance measures. Most clustering algorithms don’t properly account for ambiguity in the source data, as records are often assigned to discrete clusters, even if an assignment is unclear. While there are metrics and visualization techniques that allow analysts to compare clusterings or to judge cluster quality, there is no comprehensive method that allows analysts to evaluate, compare, and refine cluster assignments based on the source data, derived scores, and contextual data. Results In this paper, we introduce a method that explicitly visualizes the quality of cluster assignments, allows comparisons of clustering results and enables analysts to manually curate and refine cluster assignments. Our methods are applicable to matrix data clustered with partitional, hierarchical, and fuzzy clustering algorithms. Furthermore, we enable analysts to explore clustering results in context of other data, for example, to observe whether a clustering of genomic data results in a meaningful differentiation in phenotypes. Conclusions Our methods are integrated into Caleydo StratomeX, a popular, web-based, disease subtype analysis tool. We show in a usage scenario that our approach can reveal ambiguities in cluster assignments and produce improved clusterings that better differentiate genotypes and phenotypes.
Affiliation(s)
- Michael Kern
  - Scientific Computing and Imaging Institute, University of Utah, 72 South Central Campus Drive, Salt Lake City, 84112, USA
  - Department of Informatics, Technical University of Munich, Garching bei München, 85747, Germany
- Alexander Lex
  - Scientific Computing and Imaging Institute, University of Utah, 72 South Central Campus Drive, Salt Lake City, 84112, USA
- Nils Gehlenborg
  - Department of Biomedical Informatics, Harvard Medical School, Boston, 02115, USA
- Chris R Johnson
  - Scientific Computing and Imaging Institute, University of Utah, 72 South Central Campus Drive, Salt Lake City, 84112, USA
14
Waldin N, Le Muzic M, Waldner M, Gröller E, Goodsell D, Ludovic A, Viola I. Chameleon: Dynamic Color Mapping for Multi-Scale Structural Biology Models. Eurographics Workshop on Visual Computing for Biomedicine 2017; 2016. [PMID: 28361008 DOI: 10.2312/vcbm.20161266] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/15/2022]
Abstract
Visualization of structural biology data uses color to categorize or separate dense structures into particular semantic units. In multiscale models of viruses or bacteria, there are atoms on the finest level of detail, then amino-acids, secondary structures, macromolecules, up to the compartment level and, in all these levels, elements can be visually distinguished by color. However, currently only single scale coloring schemes are utilized that show information for one particular scale only. We present a novel technology which adaptively, based on the current scale level, adjusts the color scheme to depict or distinguish the currently best visible structural information. We treat the color as a visual resource that is distributed given a particular demand. The changes of the color scheme are seamlessly interpolated between the color scheme from the previous views into a given new one. With such dynamic multi-scale color mapping we ensure that the viewer is able to distinguish structural detail that is shown on any given scale. This technique has been tested by users with an expertise in structural biology and has been overall well received.
15
Highlights from the 5th Symposium on Biological Data Visualization: Part 1. BMC Bioinformatics 2015; 16 Suppl 11:S1. [PMID: 26330192 PMCID: PMC4547146 DOI: 10.1186/1471-2105-16-s11-s1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022] Open