1
Stafford IS, Ashton JJ, Mossotto E, Cheng G, Mark Beattie R, Ennis S. Supervised Machine Learning Classifies Inflammatory Bowel Disease Patients by Subtype Using Whole Exome Sequencing Data. J Crohns Colitis 2023; 17:1672-1680. [PMID: 37205778 PMCID: PMC10637043 DOI: 10.1093/ecco-jcc/jjad084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/20/2022] [Indexed: 05/21/2023]
Abstract
BACKGROUND Inflammatory bowel disease [IBD] is a chronic inflammatory disorder with two main subtypes: Crohn's disease [CD] and ulcerative colitis [UC]. Prompt subtype diagnosis enables the correct treatment to be administered. Using genomic data, we aimed to assess machine learning [ML] to classify patients according to IBD subtype. METHODS Whole exome sequencing [WES] from paediatric/adult IBD patients was processed using an in-house bioinformatics pipeline. These data were condensed into the per-gene, per-individual genomic burden score, GenePy. Data were split into training and testing datasets [80/20]. Feature selection with a linear support vector classifier, and hyperparameter tuning with Bayesian Optimisation, were performed [training data]. The supervised ML method random forest was utilised to classify patients as CD or UC, using three panels: 1] all available genes; 2] autoimmune genes; 3] 'IBD' genes. ML results were assessed using area under the receiver operating characteristics curve [AUROC], sensitivity, and specificity on the testing dataset. RESULTS A total of 906 patients were included in analysis [600 CD, 306 UC]. Training data included 488 patients, balanced according to the minority class of UC. The autoimmune gene panel generated the best performing ML model [AUROC = 0.68], outperforming an IBD gene panel [AUROC = 0.61]. NOD2 was the top gene for discriminating CD and UC, regardless of the gene panel used. Lack of variation in genes with high GenePy scores in CD patients was the best classifier of a diagnosis of UC. DISCUSSION We demonstrate promising classification of patients by subtype using random forest and WES data. Focusing on specific subgroups of patients, with larger datasets, may result in better classification.
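The evaluation metrics named in this abstract (AUROC, sensitivity, specificity) can be illustrated with a minimal sketch. This is not the authors' pipeline: the labels, scores, the 0.5 threshold, and the function names `auroc` and `sensitivity_specificity` are all assumptions made for illustration, with 1 standing for CD and 0 for UC.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical held-out predictions (1 = CD, 0 = UC); not data from the study.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]

print(auroc(labels, scores))
print(sensitivity_specificity(labels, scores))
```

The pairwise form is quadratic in the number of samples; it is used here only because it makes the definition of AUROC explicit.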
Affiliation(s)
- Imogen S Stafford
  - Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK
  - NIHR Southampton Biomedical Research, University Hospital Southampton, Southampton, UK
  - Institute for Life Sciences, University of Southampton, Southampton, UK
- James J Ashton
  - Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK
  - Department of Paediatric Gastroenterology, Southampton Children’s Hospital, Southampton, UK
- Enrico Mossotto
  - Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK
- Guo Cheng
  - Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK
  - NIHR Southampton Biomedical Research, University Hospital Southampton, Southampton, UK
- Robert Mark Beattie
  - Department of Paediatric Gastroenterology, Southampton Children’s Hospital, Southampton, UK
- Sarah Ennis
  - Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK
2
Teichman G, Cohen D, Ganon O, Dunsky N, Shani S, Gingold H, Rechavi O. RNAlysis: analyze your RNA sequencing data without writing a single line of code. BMC Biol 2023; 21:74. [PMID: 37024838 PMCID: PMC10080885 DOI: 10.1186/s12915-023-01574-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2022] [Accepted: 03/17/2023] [Indexed: 04/08/2023]
Abstract
BACKGROUND Among the major challenges in next-generation sequencing experiments are exploratory data analysis, interpreting trends, identifying potential targets/candidates, and visualizing the results clearly and intuitively. These hurdles are further heightened for researchers who are not experienced in writing computer code since most available analysis tools require programming skills. Even for proficient computational biologists, an efficient and replicable system is warranted to generate standardized results. RESULTS We have developed RNAlysis, a modular Python-based analysis software for RNA sequencing data. RNAlysis allows users to build customized analysis pipelines suiting their specific research questions, going all the way from raw FASTQ files (adapter trimming, alignment, and feature counting), through exploratory data analysis and data visualization, clustering analysis, and gene set enrichment analysis. RNAlysis provides a friendly graphical user interface, allowing researchers to analyze data without writing code. We demonstrate the use of RNAlysis by analyzing RNA sequencing data from different studies using C. elegans nematodes. We note that the software applies equally to data obtained from any organism with an existing reference genome. CONCLUSIONS RNAlysis is suitable for investigating various biological questions, allowing researchers to more accurately and reproducibly run comprehensive bioinformatic analyses. It functions as a gateway into RNA sequencing analysis for less computer-savvy researchers, but can also help experienced bioinformaticians make their analyses more robust and efficient, as it offers diverse tools, scalability, automation, and standardization between analyses.
Affiliation(s)
- Guy Teichman
  - Department of Neurobiology, Wise Faculty of Life Sciences and Sagol School of Neuroscience, Tel Aviv University, Tel Aviv, Israel
- Dror Cohen
  - Department of Neurobiology, Wise Faculty of Life Sciences and Sagol School of Neuroscience, Tel Aviv University, Tel Aviv, Israel
- Or Ganon
  - Department of Biology, Technion - Israel Institute of Technology, Haifa, Israel
- Netta Dunsky
  - Sagol Brain Institute, Sourasky Medical Center, Neurological Institute, Tel Aviv and Sagol School of Neuroscience, Tel Aviv University, Tel Aviv, Israel
- Shachar Shani
  - Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
- Hila Gingold
  - Department of Neurobiology, Wise Faculty of Life Sciences and Sagol School of Neuroscience, Tel Aviv University, Tel Aviv, Israel
- Oded Rechavi
  - Department of Neurobiology, Wise Faculty of Life Sciences and Sagol School of Neuroscience, Tel Aviv University, Tel Aviv, Israel
3
Robin V, Bodein A, Scott-Boyer MP, Leclercq M, Périn O, Droit A. Overview of methods for characterization and visualization of a protein-protein interaction network in a multi-omics integration context. Front Mol Biosci 2022; 9:962799. [PMID: 36158572 PMCID: PMC9494275 DOI: 10.3389/fmolb.2022.962799] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2022] [Accepted: 08/16/2022] [Indexed: 11/26/2022]
Abstract
At the heart of the cellular machinery through the regulation of cellular functions, protein-protein interactions (PPIs) have a significant role. PPIs can be analyzed with network approaches. Construction of a PPI network requires prediction of the interactions. All PPIs form a network. Different biases such as lack of data, recurrence of information, and false interactions make the network unstable. Integrated strategies allow solving these different challenges. These approaches have shown encouraging results for the understanding of molecular mechanisms, drug action mechanisms, and identification of target genes. In order to give more importance to an interaction, it is evaluated by different confidence scores. These scores allow the filtration of the network and thus facilitate the representation of the network, essential steps to the identification and understanding of molecular mechanisms. In this review, we will discuss the main computational methods for predicting PPI, including ones confirming an interaction as well as the integration of PPIs into a network, and we will discuss visualization of these complex data.
Affiliation(s)
- Vivian Robin
  - Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
- Antoine Bodein
  - Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
- Marie-Pier Scott-Boyer
  - Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
- Mickaël Leclercq
  - Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
- Olivier Périn
  - Digital Sciences Department, L'Oréal Advanced Research, Aulnay-sous-bois, France
- Arnaud Droit
  - Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
4
Bar N, Nikparvar B, Jayavelu ND, Roessler FK. Constrained Fourier estimation of short-term time-series gene expression data reduces noise and improves clustering and gene regulatory network predictions. BMC Bioinformatics 2022; 23:330. [PMID: 35945515 PMCID: PMC9364503 DOI: 10.1186/s12859-022-04839-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2021] [Accepted: 07/12/2022] [Indexed: 01/15/2023]
Abstract
BACKGROUND Biological data suffers from noise that is inherent in the measurements. This is particularly true for time-series gene expression measurements. Nevertheless, in order to explore cellular dynamics, scientists employ such noisy measurements in predictive and clustering tools. However, noisy data can obscure the genes' temporal patterns, and applying predictive and clustering tools to noisy data may yield inconsistent, and potentially incorrect, results. RESULTS To reduce the noise of short-term (< 48 h) time-series expression data, we relied on the three basic temporal patterns of gene expression: waves, impulses and sustained responses. We constrained the estimation of the true signals to these patterns by estimating the parameters of first and second-order Fourier functions and using the nonlinear least-squares trust-region optimization technique. Our approach lowered the noise in at least 85% of synthetic time-series expression data, significantly more than the spline method ([Formula: see text]). When the data contained a higher signal-to-noise ratio, our method allowed downstream network component analyses to calculate consistent and accurate predictions, particularly when the noise variance was high. Conversely, these tools led to erroneous results from untreated noisy data. Our results suggest that at least 5-7 time points are required to efficiently de-noise logarithmically scaled time-series expression data. Investing in sampling additional time points provides little benefit to clustering and prediction accuracy. CONCLUSIONS Our constrained Fourier de-noising method helps to cluster noisy gene expression and interpret dynamic gene networks more accurately. The benefit of noise reduction is large and can constitute the difference between a successful application and a failing one.
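For orientation, first- and second-order Fourier functions are commonly written as below. The abstract does not give the exact parameterization, so the symbols $a_k$, $b_k$, $\omega$ and the plain sum-of-squares objective are assumptions of this sketch, not the authors' stated formulation.

```latex
\begin{align}
f_1(t) &= a_0 + a_1\cos(\omega t) + b_1\sin(\omega t) \\
f_2(t) &= f_1(t) + a_2\cos(2\omega t) + b_2\sin(2\omega t) \\
\hat{\theta} &= \arg\min_{\theta} \sum_{i=1}^{n} \bigl(y_i - f_k(t_i;\theta)\bigr)^2,
\qquad k \in \{1, 2\}
\end{align}
```

Here $y_i$ is the measured expression at time point $t_i$ and $\theta$ collects the coefficients and $\omega$. Because $\omega$ enters nonlinearly, the fit needs a nonlinear least-squares solver, which matches the trust-region technique the abstract mentions.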
Affiliation(s)
- Nadav Bar
  - Department of Chemical Engineering, Norwegian University of Science and Technology (NTNU), Sem Sælandsvei 4, Trondheim, NO-7491 Norway
- Bahareh Nikparvar
  - Department of Chemical Engineering, Norwegian University of Science and Technology (NTNU), Sem Sælandsvei 4, Trondheim, NO-7491 Norway
- Naresh Doni Jayavelu
  - Division of Medical Genetics, Department of Medicine, University of Washington Seattle, Seattle, WA 98195-7720 USA
- Fabienne Krystin Roessler
  - Department of Chemical Engineering, Norwegian University of Science and Technology (NTNU), Sem Sælandsvei 4, Trondheim, NO-7491 Norway
5
Koyuncu E. Centroidal Clustering of Noisy Observations by Using rth Power Distortion Measures. IEEE Trans Neural Netw Learn Syst 2022; PP:1430-1438. [PMID: 35731771 DOI: 10.1109/tnnls.2022.3183294] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
We consider the problem of clustering a dataset through multiple noisy observations of its members. The goal is to obtain a clustering that is as faithful to the clustering of the original dataset as possible. We propose a centroidal approach whose distortion measure is the sum of rth powers of the distances between the cluster center and the noisy observations. For r = 2, our scheme boils down to the well-known approach of clustering the average of noisy samples. First, we provide a mathematical analysis of our clustering scheme. In particular, we find formulas for the average distortion and the spatial distribution of the cluster centers in the asymptotic regime where the number of centers is large. We then provide an algorithm to numerically optimize the cluster centers in the finite regime. We extend our method to automatically assign weights to noisy observations. Finally, we show that for various practical noise models, with a suitable choice of r, our algorithms can outperform several other existing techniques over various datasets.
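The effect of the exponent r on a cluster center can be seen in a one-dimensional, single-cluster sketch. This is not the paper's algorithm: the function name `rth_power_center`, the restriction to r >= 1 (where the distortion is convex), and the use of ternary search are assumptions chosen to keep the example short.

```python
def rth_power_center(observations, r, iterations=200):
    """Return the scalar c that minimizes sum(|x - c|**r) over the observations.

    For r >= 1 the distortion is convex in c, so a ternary search over
    [min(observations), max(observations)] converges to a minimizer.
    """
    if r < 1:
        raise ValueError("this sketch only handles r >= 1")

    def distortion(c):
        return sum(abs(x - c) ** r for x in observations)

    lo, hi = min(observations), max(observations)
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if distortion(m1) < distortion(m2):
            hi = m2  # a minimizer lies in [lo, m2]
        else:
            lo = m1  # a minimizer lies in [m1, hi]
    return (lo + hi) / 2


# Hypothetical noisy observations of one cluster member.
noisy = [1.0, 2.0, 3.0, 10.0]
center_r2 = rth_power_center(noisy, r=2)            # close to the mean
center_r1 = rth_power_center([1.0, 2.0, 10.0], r=1)  # close to the median
print(center_r2, center_r1)
```

With r = 2 the minimizer coincides with the sample mean, which is the special case the abstract mentions; r = 1 gives the median, which is less sensitive to the outlying observation.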
6
DTIP-TC2A: An analytical framework for drug-target interactions prediction methods. Comput Biol Chem 2022; 99:107707. [DOI: 10.1016/j.compbiolchem.2022.107707] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2021] [Revised: 05/01/2022] [Accepted: 05/26/2022] [Indexed: 11/18/2022]
7
Sadeghi SS, Keyvanpour MR. Computational Drug Repurposing: Classification of the Research Opportunities and Challenges. Curr Comput Aided Drug Des 2021; 16:354-364. [PMID: 31198115 DOI: 10.2174/1573409915666190613113822] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 12/21/2018] [Revised: 02/13/2019] [Accepted: 05/18/2019] [Indexed: 11/22/2022]
Abstract
BACKGROUND Drug repurposing has grown significantly in recent years. Research and innovation in drug repurposing are extremely popular due to its practical and explicit advantages. However, its adoption into practice is slow because researchers and industries have to face various challenges. OBJECTIVE In this field, there is a lack of a comprehensive platform for systematically identifying and removing development limitations. This paper deals with a comprehensive classification of challenges in drug repurposing. METHODS Initially, a classification of various existing repurposing models is propounded. Next, the benefits of drug repurposing are summarized. Further, a categorization for computational drug repurposing shortcomings is presented. Finally, the methods are evaluated based on their strength in addressing the drawbacks. RESULTS This work can offer a desirable platform for comparing the computational repurposing methods by measuring the methods in light of these challenges. CONCLUSION A proper comparison could provide guidance for a genuine understanding of methods. Accordingly, this comprehension of the methods will help researchers eliminate the barriers, thereby developing and improving methods. Furthermore, in this study, we conclude why, despite all its benefits, drug repurposing is not being done more.
8
Peterson EJR, Abidi AA, Arrieta-Ortiz ML, Aguilar B, Yurkovich JT, Kaur A, Pan M, Srinivas V, Shmulevich I, Baliga NS. Intricate Genetic Programs Controlling Dormancy in Mycobacterium tuberculosis. Cell Rep 2021; 31:107577. [PMID: 32348771 PMCID: PMC7605849 DOI: 10.1016/j.celrep.2020.107577] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 05/14/2019] [Revised: 12/18/2019] [Accepted: 04/06/2020] [Indexed: 11/24/2022]
Abstract
Mycobacterium tuberculosis (MTB) displays the remarkable ability to transition in and out of dormancy, a hallmark of the pathogen’s capacity to evade the immune system and exploit susceptible individuals. Uncovering the gene regulatory programs that underlie the phenotypic shifts in MTB during disease latency and reactivation has posed a challenge. We develop an experimental system to precisely control dissolved oxygen levels in MTB cultures in order to capture the transcriptional events that unfold as MTB transitions into and out of hypoxia-induced dormancy. Using a comprehensive genome-wide transcription factor binding map and insights from network topology analysis, we identify regulatory circuits that deterministically drive sequential transitions across six transcriptionally and functionally distinct states encompassing more than three-fifths of the MTB genome. The architecture of the genetic programs explains the transcriptional dynamics underlying synchronous entry of cells into a dormant state that is primed to infect the host upon encountering favorable conditions. Mycobacterium tuberculosis (MTB) persists within the host by counteracting disparate stressors including hypoxia. Peterson et al. report a transcriptional program that coordinates sequential state transitions to drive MTB in and out of hypoxia-induced dormancy. Among varied properties, this program encodes advanced preparedness to infect the host in favorable conditions.
Affiliation(s)
- Abrar A Abidi
  - Institute for Systems Biology, Seattle, WA 98109, USA
- Boris Aguilar
  - Institute for Systems Biology, Seattle, WA 98109, USA
- Amardeep Kaur
  - Institute for Systems Biology, Seattle, WA 98109, USA
- Min Pan
  - Institute for Systems Biology, Seattle, WA 98109, USA
- Nitin S Baliga
  - Institute for Systems Biology, Seattle, WA 98109, USA; Molecular and Cellular Biology Program, Departments of Microbiology and Biology, University of Washington, Seattle, WA; Lawrence Berkeley National Laboratories, Berkeley, CA.
9
Stacey RG, Skinnider MA, Foster LJ. On the Robustness of Graph-Based Clustering to Random Network Alterations. Mol Cell Proteomics 2020; 20:100002. [PMID: 33592499 PMCID: PMC7896145 DOI: 10.1074/mcp.ra120.002275] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 09/04/2020] [Revised: 10/30/2020] [Accepted: 11/04/2020] [Indexed: 11/23/2022]
Abstract
Biological functions emerge from complex and dynamic networks of protein-protein interactions. Because these protein-protein interaction networks, or interactomes, represent pairwise connections within a hierarchically organized system, it is often useful to identify higher-order associations embedded within them, such as multimember protein complexes. Graph-based clustering techniques are widely used to accomplish this goal, and dozens of field-specific and general clustering algorithms exist. However, interactomes can be prone to errors, especially when inferred from high-throughput biochemical assays. Therefore, robustness to network-level noise is an important criterion. Here, we tested the robustness of a range of graph-based clustering algorithms in the presence of noise, including algorithms common across domains and those specific to protein networks. Strikingly, we found that all of the clustering algorithms tested here markedly amplified network-level noise. Randomly rewiring only 1% of network edges yielded more than a 50% change in clustering results. Moreover, we found the impact of network noise on individual clusters was not uniform: some clusters were consistently robust to injected noise, whereas others were not. Therefore we developed the clust.perturb R package and Shiny web application to measure the reproducibility of clusters by randomly perturbing the network. We show that clust.perturb results are predictive of real-world cluster stability: poorly reproducible clusters as identified by clust.perturb are significantly less likely to be reclustered across experiments. We conclude that graph-based clustering amplifies noise in protein interaction networks, but quantifying the robustness of a cluster to network noise can separate stable protein complexes from spurious associations.
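The perturbation idea described in this abstract (rewire a fraction of edges, recluster, and score how well each original cluster is recovered) can be sketched in plain Python. This does not reproduce the clust.perturb R package or its API: the function names, the toy graph, and the use of connected components as a stand-in for a real graph-clustering algorithm are all assumptions of this sketch.

```python
import random


def rewire(edges, fraction, nodes, rng):
    """Replace a random fraction of edges (at least one) with new random node pairs."""
    current = {tuple(sorted(e)) for e in edges}
    n_swap = max(1, round(fraction * len(current)))
    for old in rng.sample(sorted(current), n_swap):
        current.remove(old)
    target = len(current) + n_swap
    while len(current) < target:
        a, b = rng.sample(nodes, 2)
        current.add(tuple(sorted((a, b))))  # duplicates are ignored by the set
    return current


def connected_components(nodes, edges):
    """Stand-in clustering: each connected component is one cluster."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in component:
                component.add(v)
                stack.extend(adjacency[v] - component)
        seen |= component
        clusters.append(frozenset(component))
    return clusters


def reproducibility(nodes, edges, fraction=0.1, n_iter=100, seed=0):
    """Mean best-match Jaccard index of each original cluster across noisy reruns."""
    rng = random.Random(seed)
    base = connected_components(nodes, edges)
    totals = {c: 0.0 for c in base}
    for _ in range(n_iter):
        noisy = connected_components(nodes, rewire(edges, fraction, nodes, rng))
        for c in base:
            totals[c] += max(len(c & o) / len(c | o) for o in noisy)
    return {c: total / n_iter for c, total in totals.items()}


# Hypothetical toy network: two triangles with no edges between them.
nodes = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("D", "E"), ("E", "F"), ("D", "F")]
cluster_scores = reproducibility(nodes, edges, fraction=0.2, n_iter=50, seed=1)
for cluster, score in cluster_scores.items():
    print(sorted(cluster), round(score, 3))
```

A score near 1 means the cluster survives rewiring largely intact; lower scores flag clusters that depend on a few edges. Swapping `connected_components` for a real clustering routine would be the natural next step.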
Affiliation(s)
- R Greg Stacey
  - Michael Smith Laboratories, University of British Columbia, Vancouver, Canada
- Michael A Skinnider
  - Michael Smith Laboratories, University of British Columbia, Vancouver, Canada
- Leonard J Foster
  - Michael Smith Laboratories, University of British Columbia, Vancouver, Canada; Department of Biochemistry, University of British Columbia, Vancouver, Canada
10
Wong YKE, Lam KW, Ho KY, Yu CSA, Cho CSW, Tsang HF, Chu MKM, Ng PWL, Tai CSW, Chan LWC, Wong EYL, Wong SCC. The applications of big data in molecular diagnostics. Expert Rev Mol Diagn 2019; 19:905-917. [PMID: 31422710 DOI: 10.1080/14737159.2019.1657834] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 12/30/2022]
Affiliation(s)
- Yin Kwan Evelyn Wong
  - Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hong Kong Special Administrative Region
- Ka Wai Lam
  - Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hong Kong Special Administrative Region
- Ka Yi Ho
  - Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hong Kong Special Administrative Region
- Chi Shing William Cho
  - Department of Clinical Oncology, Queen Elizabeth Hospital, Hong Kong Special Administrative Region
- Hin Fung Tsang
  - Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hong Kong Special Administrative Region
- Man Kee Maggie Chu
  - Department of Life Science, The Hong Kong University of Science and Technology, Hong Kong Special Administrative Region
- Po Wah Lawrence Ng
  - Department of Pathology, Queen Elizabeth Hospital, Hong Kong Special Administrative Region
- Chi Shing William Tai
  - Department of Applied Biology and Chemical Technology, Hong Kong Polytechnic University, Hong Kong Special Administrative Region
- Lawrence Wing Chi Chan
  - Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hong Kong Special Administrative Region
- Elaine Yue Ling Wong
  - Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hong Kong Special Administrative Region
- Sze Chuen Cesar Wong
  - Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hong Kong Special Administrative Region
11
Clustering data with the presence of attribute noise: a study of noise completely at random and ensemble of multiple k-means clusterings. Int J Mach Learn Cyb 2019. [DOI: 10.1007/s13042-019-00989-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 11/25/2022]
12
de Ridder M, Klein K, Kim J. A review and outlook on visual analytics for uncertainties in functional magnetic resonance imaging. Brain Inform 2018; 5:5. [PMID: 29968092 PMCID: PMC6170942 DOI: 10.1186/s40708-018-0083-0] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 03/05/2018] [Accepted: 06/18/2018] [Indexed: 11/10/2022]
Abstract
Analysis of functional magnetic resonance imaging (fMRI) plays a pivotal role in uncovering an understanding of the brain. fMRI data contain both spatial volume and temporal signal information, which provide a depiction of brain activity. The analysis pipeline, however, is hampered by numerous uncertainties in many of the steps; often seen as one of the last hurdles for the domain. In this review, we categorise fMRI research into three pipeline phases: (i) image acquisition and processing; (ii) image analysis; and (iii) visualisation and human interpretation, to explore the uncertainties that arise in each phase, including the compound effects due to the inter-dependence of steps. Attempts at mitigating uncertainties rely on providing interactive visual analytics that aid users in understanding the effects of the uncertainties and adjusting their analyses. This impetus for visual analytics comes in light of considerable research investigating uncertainty throughout the pipeline. However, to the best of our knowledge, there is yet to be a comprehensive review on the importance and utility of uncertainty visual analytics (UVA) in addressing fMRI concerns, which we term fMRI-UVA. Such techniques have been broadly implemented in related biomedical fields, and its potential for fMRI has recently been explored; however, these attempts are limited in their scope and utility, primarily focussing on addressing small parts of single pipeline phases. Our comprehensive review of the fMRI uncertainties from the perspective of visual analytics addresses the three identified phases in the pipeline. We also discuss the two interrelated approaches for future research opportunities for fMRI-UVA.
Affiliation(s)
- Michael de Ridder
  - Biomedical and Multimedia Information Technology Research Group, University of Sydney, Sydney, Australia
- Karsten Klein
  - Department of Computer and Information Science, Universität Konstanz, Konstanz, Germany
- Jinman Kim
  - Biomedical and Multimedia Information Technology Research Group, University of Sydney, Sydney, Australia
13
Abstract
Clustering is an unsupervised learning method, which groups data points based on similarity, and is used to reveal the underlying structure of data. This computational approach is essential to understanding and visualizing the complex data that are acquired in high-throughput multidimensional biological experiments. Clustering enables researchers to make biological inferences for further experiments. Although a powerful technique, inappropriate application can lead biological researchers to waste resources and time in experimental follow-up. We review common pitfalls identified from the published molecular biology literature and present methods to avoid them. Commonly encountered pitfalls relate to the high-dimensional nature of biological data from high-throughput experiments, the failure to consider more than one clustering method for a given problem, and the difficulty in determining whether clustering has produced meaningful results. We present concrete examples of problems and solutions (clustering results) in the form of toy problems and real biological data for these issues. We also discuss ensemble clustering as an easy-to-implement method that enables the exploration of multiple clustering solutions and improves robustness of clustering solutions. Increased awareness of common clustering pitfalls will help researchers avoid overinterpreting or misinterpreting the results and missing valuable insights when clustering biological data.
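Ensemble clustering, mentioned at the end of this abstract, is often introduced through a co-association matrix: the fraction of clustering runs in which two points land in the same cluster. The sketch below is one simple variant, not necessarily the authors' procedure; the function names, the 0.5 threshold, and the three example labelings are assumptions.

```python
def co_association(labelings):
    """Fraction of labelings in which each pair of points shares a cluster."""
    n = len(labelings[0])
    counts = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        if len(labels) != n:
            raise ValueError("all labelings must cover the same points")
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1.0
    return [[value / len(labelings) for value in row] for row in counts]


def consensus_clusters(matrix, threshold=0.5):
    """Link points whose co-association exceeds the threshold and label the groups."""
    n = len(matrix)
    assigned = [None] * n
    next_id = 0
    for start in range(n):
        if assigned[start] is not None:
            continue
        assigned[start] = next_id
        stack = [start]
        while stack:
            v = stack.pop()
            for w in range(n):
                if assigned[w] is None and matrix[v][w] > threshold:
                    assigned[w] = next_id
                    stack.append(w)
        next_id += 1
    return assigned


# Hypothetical labels for five points from three clustering runs.
# Label values are arbitrary per run; only co-membership matters.
labelings = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
matrix = co_association(labelings)
consensus = consensus_clusters(matrix)
print(consensus)
```

Because only co-membership is counted, the runs do not need matching label names or even the same number of clusters, which is what makes this approach easy to combine across different clustering methods.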
Affiliation(s)
- Tom Ronan
  - Department of Biomedical Engineering, Center for Biological Systems Engineering, Washington University in St. Louis, St. Louis, MO 63130, USA
- Zhijie Qi
  - Department of Biomedical Engineering, Center for Biological Systems Engineering, Washington University in St. Louis, St. Louis, MO 63130, USA
- Kristen M Naegle
  - Department of Biomedical Engineering, Center for Biological Systems Engineering, Washington University in St. Louis, St. Louis, MO 63130, USA
14
Reeb PD, Bramardi SJ, Steibel JP. Assessing Dissimilarity Measures for Sample-Based Hierarchical Clustering of RNA Sequencing Data Using Plasmode Datasets. PLoS One 2015; 10:e0132310. [PMID: 26162080 PMCID: PMC4498680 DOI: 10.1371/journal.pone.0132310] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 01/09/2015] [Accepted: 06/11/2015] [Indexed: 01/03/2023]
Abstract
Sample- and gene-based hierarchical cluster analyses have been widely adopted as tools for exploring gene expression data in high-throughput experiments. Gene expression values (read counts) generated by RNA sequencing technology (RNA-seq) are discrete variables with special statistical properties, such as over-dispersion and right-skewness. Additionally, read counts are subject to technological artifacts such as differences in sequencing depth. This poses a challenge for finding distance measures suitable for hierarchical clustering. Normalization and transformation procedures have been proposed to favor the use of Euclidean and correlation-based distances. Additionally, novel model-based dissimilarities that account for RNA-seq data characteristics have also been proposed. Adequacy of dissimilarity measures has been assessed using parametric simulations or exemplar datasets that may limit the scope of the conclusions. Here, we propose the simulation of realistic conditions through creation of plasmode datasets, to assess the adequacy of dissimilarity measures for sample-based hierarchical clustering of RNA-seq data. Consistent results were obtained using plasmode datasets based on RNA-seq experiments conducted under widely different conditions. Dissimilarity measures based on Euclidean distance that only considered data normalization or data standardization did not reliably represent the expected hierarchical structure. Conversely, using either a Poisson-based dissimilarity, a rank correlation-based dissimilarity, or an appropriate data transformation resulted in dendrograms that resemble the expected hierarchical structure. Plasmode datasets can be generated for a wide range of scenarios upon which dissimilarity measures can be evaluated for sample-based hierarchical clustering analysis. We showed different ways of generating such plasmodes and applied them to the problem of selecting a suitable dissimilarity measure. We report several measures that are satisfactory, and the choice of a particular measure may rely on its availability in the software pipeline of preference.
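A rank correlation-based dissimilarity of the kind this abstract refers to can be sketched as one minus the Spearman correlation between two samples' read counts. The abstract does not state the exact formula used in the paper, so the 1 - rho form, the function names, and the toy counts below are assumptions.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_dissimilarity(sample_a, sample_b):
    """1 - Spearman rho: 0 for identical gene rankings, 2 for reversed rankings."""
    return 1.0 - pearson(average_ranks(sample_a), average_ranks(sample_b))


# Hypothetical read counts for five genes in three samples.
# sample_b has roughly 2.5x the sequencing depth of sample_a but the same gene ordering.
sample_a = [0, 5, 120, 33, 7]
sample_b = [1, 12, 300, 80, 15]
sample_c = [300, 80, 1, 12, 15]

print(spearman_dissimilarity(sample_a, sample_b))
print(spearman_dissimilarity(sample_a, sample_c))
```

Because ranks ignore the magnitude of the counts, a pure difference in sequencing depth leaves the dissimilarity unchanged, which is one reason rank-based measures suit read-count data. The resulting pairwise values could then be passed to any hierarchical clustering routine.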
Affiliation(s)
- Pablo D. Reeb
- Department of Fisheries and Wildlife, Michigan State University, East Lansing, Michigan, United States of America
- Department of Statistics, Universidad Nacional del Comahue, Cinco Saltos, Rio Negro, Argentina
| | - Sergio J. Bramardi
- Department of Statistics, Universidad Nacional del Comahue, Cinco Saltos, Rio Negro, Argentina
- College of Agricultural and Forest Sciences, Universidad Nacional de La Plata, La Plata, Buenos Aires, Argentina
| | - Juan P. Steibel
- Department of Fisheries and Wildlife, Michigan State University, East Lansing, Michigan, United States of America
- Department of Animal Science, Michigan State University, East Lansing, Michigan, United States of America
| |
|
15
|
Dinov ID, Petrosyan P, Liu Z, Eggert P, Hobel S, Vespa P, Woo Moon S, Van Horn JD, Franco J, Toga AW. High-throughput neuroimaging-genetics computational infrastructure. Front Neuroinform 2014; 8:41. [PMID: 24795619 PMCID: PMC4005931 DOI: 10.3389/fninf.2014.00041] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Received: 01/07/2014] [Accepted: 03/27/2014] [Indexed: 01/01/2023] Open
Abstract
Many contemporary neuroscientific investigations face significant challenges in terms of data management, computational processing, data mining, and results interpretation. These four pillars define the core infrastructure necessary to plan, organize, orchestrate, validate, and disseminate novel scientific methods, computational resources, and translational healthcare findings. Data management includes protocols for data acquisition, archival, query, transfer, retrieval, and aggregation. Computational processing involves the necessary software, hardware, and networking infrastructure required to handle large amounts of heterogeneous neuroimaging, genetics, clinical, and phenotypic data and meta-data. Data mining refers to the process of automatically extracting data features, characteristics and associations, which are not readily visible by human exploration of the raw dataset. Result interpretation includes scientific visualization, community validation of findings and reproducible findings. In this manuscript we describe the novel high-throughput neuroimaging-genetics computational infrastructure available at the Institute for Neuroimaging and Informatics (INI) and the Laboratory of Neuro Imaging (LONI) at the University of Southern California (USC). INI and LONI include ultra-high-field and standard-field MRI brain scanners along with an imaging-genetics database for storing the complete provenance of the raw and derived data and meta-data. In addition, the institute provides a large number of software tools for image and shape analysis, mathematical modeling, genomic sequence processing, and scientific visualization. A unique feature of this architecture is the Pipeline environment, which integrates the data management, processing, transfer, and visualization. Through its client-server architecture, the Pipeline environment provides a graphical user interface for designing, executing, monitoring, validating, and disseminating complex protocols that utilize diverse suites of software tools and web-services. These pipeline workflows are represented as portable XML objects which transfer the execution instructions and user specifications from the client user machine to remote pipeline servers for distributed computing. Using Alzheimer's and Parkinson's data, we provide several examples of translational applications using this infrastructure.
Affiliation(s)
- Ivo D. Dinov
- Laboratory of Neuro Imaging, Institute for Neuroimaging and Informatics, University of Southern California, Los Angeles, CA, USA
- Biomedical Informatics Research Network, Information Sciences Institute, University of Southern California, Los Angeles, CA, USA
- Statistics Online Computational Resource, University of Michigan, UMSN, Ann Arbor, MI, USA
| | - Petros Petrosyan
- Laboratory of Neuro Imaging, Institute for Neuroimaging and Informatics, University of Southern California, Los Angeles, CA, USA
| | - Zhizhong Liu
- Laboratory of Neuro Imaging, Institute for Neuroimaging and Informatics, University of Southern California, Los Angeles, CA, USA
| | - Paul Eggert
- Laboratory of Neuro Imaging, Institute for Neuroimaging and Informatics, University of Southern California, Los Angeles, CA, USA
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
| | - Sam Hobel
- Laboratory of Neuro Imaging, Institute for Neuroimaging and Informatics, University of Southern California, Los Angeles, CA, USA
| | - Paul Vespa
- Brain Injury Research Center, Department of Neurosurgery, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
| | - Seok Woo Moon
- Department of Neuropsychiatry, Konkuk University School of Medicine, Seoul, Korea
| | - John D. Van Horn
- Laboratory of Neuro Imaging, Institute for Neuroimaging and Informatics, University of Southern California, Los Angeles, CA, USA
| | - Joseph Franco
- Laboratory of Neuro Imaging, Institute for Neuroimaging and Informatics, University of Southern California, Los Angeles, CA, USA
| | - Arthur W. Toga
- Laboratory of Neuro Imaging, Institute for Neuroimaging and Informatics, University of Southern California, Los Angeles, CA, USA
- Biomedical Informatics Research Network, Information Sciences Institute, University of Southern California, Los Angeles, CA, USA
| |
|
16
|
Ren X, Wang Y, Zhang XS, Jin Q. iPcc: a novel feature extraction method for accurate disease class discovery and prediction. Nucleic Acids Res 2013; 41:e143. [PMID: 23761440 PMCID: PMC3737526 DOI: 10.1093/nar/gkt343] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Indexed: 01/13/2023] Open
Abstract
Gene expression profiling has gradually become a routine procedure for disease diagnosis and classification. In the past decade, many computational methods have been proposed, resulting in great improvements on various levels, including feature selection and algorithms for classification and clustering. In this study, we present iPcc, a novel method from the feature extraction perspective to further propel gene expression profiling technologies from bench to bedside. We define ‘correlation feature space’ for samples based on the gene expression profiles by iterative employment of Pearson’s correlation coefficient. Numerical experiments on both simulated and real gene expression data sets demonstrate that iPcc can greatly highlight the latent patterns underlying noisy gene expression data and thus greatly improve the robustness and accuracy of the algorithms currently available for disease diagnosis and classification based on gene expression profiles.
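The idea of building a "correlation feature space" by applying Pearson's correlation coefficient iteratively can be sketched as follows. This is only one plausible reading of the abstract, not the published iPcc algorithm: the function names, the fixed iteration count, and the toy expression values are assumptions introduced for illustration.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def correlation_features(rows):
    """Describe each sample by its correlations with every sample."""
    return [[pearson(a, b) for b in rows] for a in rows]


def iterate_correlation(rows, n_iter=2):
    """Re-apply the correlation mapping n_iter times (n_iter is an assumption)."""
    for _ in range(n_iter):
        rows = correlation_features(rows)
    return rows


# Hypothetical expression profiles: samples 0-1 and samples 2-3 form two groups.
expression = [
    [5.0, 1.0, 3.0, 0.5],
    [4.5, 1.2, 2.8, 0.7],
    [0.4, 3.9, 1.0, 5.2],
    [0.6, 4.1, 1.3, 4.8],
]
features = iterate_correlation(expression, n_iter=2)
for row in features:
    print([round(v, 3) for v in row])
```

In this toy case each sample ends up represented by a vector with one entry per sample, and samples from the same group have nearly identical vectors, which hints at how repeated correlation might sharpen group structure in noisy profiles.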
Affiliation(s)
- Xianwen Ren
- MOH Key Laboratory of Systems Biology of Pathogens, Institute of Pathogen Biology, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100730, China
| | | | | | | |
|
17
|
A Noise Removal Algorithm for Time Series Microarray Data. PROGRESS IN ARTIFICIAL INTELLIGENCE 2013. [DOI: 10.1007/978-3-642-40669-0_14] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
|