1
|
Zhao Z, Zucknick M, Aittokallio T. EnrichIntersect: an R package for custom set enrichment analysis and interactive visualization of intersecting sets. Bioinformatics Advances 2022; 2:vbac073. [PMID: 36699400 PMCID: PMC9710586 DOI: 10.1093/bioadv/vbac073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2022] [Accepted: 09/26/2022] [Indexed: 02/01/2023]
Abstract
Summary: Enrichment analysis has been widely used to study whether predefined sets of genes or other molecular features are over-represented in a ranked list associated with a disease or other phenotype. However, computational tools that perform enrichment analysis and visualization are usually limited to predefined sets available from public databases. To make such analyses more flexible, we introduce an R package, EnrichIntersect, which enables enrichment analyses among any ranked features and user-defined custom sets. For interactive visualization of multiple covariates, such as genes or other features, which are associated with multiple phenotypes and multiple sample groups, such as drug responses in various cancer types, EnrichIntersect illustrates all associations at a glance, hence explicitly indicating intersecting covariates between multiple phenotypic variables and between multiple sample groups. Availability and implementation: The EnrichIntersect R package is available at https://CRAN.R-project.org/package=EnrichIntersect via an open-source MIT license. A package installation process is described on CRAN at https://cran.r-project.org/. A user-manual description of features and function calls can be found in the vignette of our package on CRAN.
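As a rough illustration of enrichment analysis between a ranked feature list and a user-defined custom set, the following Python sketch computes an unweighted, Kolmogorov-Smirnov-style running-sum score. It is not EnrichIntersect's implementation or API; the function name `enrichment_score` and the toy feature names are assumptions made purely for illustration.

```python
def enrichment_score(ranked_features, feature_set):
    """Unweighted running-sum enrichment score (illustrative sketch only).

    Walk down the ranked list, stepping up at set members and down at
    non-members, and return the running-sum value with the largest
    absolute deviation from zero.
    """
    members = set(feature_set)
    n = len(ranked_features)
    n_hit = sum(1 for f in ranked_features if f in members)
    if n_hit == 0 or n_hit == n:
        raise ValueError("the set must contain some, but not all, ranked features")

    hit_step = 1.0 / n_hit          # increment for each set member
    miss_step = 1.0 / (n - n_hit)   # decrement for each non-member

    running = 0.0
    best = 0.0
    for feature in ranked_features:
        running += hit_step if feature in members else -miss_step
        if abs(running) > abs(best):
            best = running
    return best


if __name__ == "__main__":
    # Hypothetical ranked features and a hypothetical custom set.
    ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
    print(enrichment_score(ranked, {"g1", "g2"}))  # set concentrated at the top
    print(enrichment_score(ranked, {"g5", "g6"}))  # set concentrated at the bottom
```

A positive score indicates that the set's members cluster near the top of the ranking and a negative score that they cluster near the bottom. Assessing significance (for example by permutation) is deliberately left out of this sketch.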
Affiliation(s)
- Zhi Zhao
  - To whom correspondence should be addressed.
- Manuela Zucknick
  - Department of Biostatistics, Oslo Centre for Biostatistics and Epidemiology (OCBE), Faculty of Medicine, University of Oslo, Oslo N-0372, Norway
- Tero Aittokallio
  - Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital, Oslo N-0310, Norway; Department of Biostatistics, Oslo Centre for Biostatistics and Epidemiology (OCBE), Faculty of Medicine, University of Oslo, Oslo N-0372, Norway; Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki FI-00014, Finland
|
2
|
Text-based experiment retrieval in genomic databases. J Inf Sci 2022. [DOI: 10.1177/01655515221118670] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
With the growing volume of genomic data in public repositories, efficient search methodologies have become a basic need for reaching the relevant genomic data. However, this need cannot be fulfilled by current repositories because they offer only a limited search option, namely lexical matching of textual descriptions or metadata of the experiments. This technique is insufficient to obtain the information needed to detect similarities between experiments within a large data collection. To address this limitation, in this study we develop a text-based experiment retrieval framework that uses both lexical and semantic similarity approaches to find similarities between experiments, and we compare their retrieval performance. This study is the first attempt to use text-driven semantic analysis approaches for developing a retrieval framework for experiments. An empirical study was conducted on a large collection of textual descriptions of Arabidopsis microarray experiments from the Gene Expression Omnibus database. In the proposed model, Jaccard similarity was used as the lexical similarity approach; Latent Semantic Analysis, Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation were used as semantic similarity approaches to detect similarities between the textual descriptions of the experiments. According to the experimental results, relevant experiments are retrieved more successfully by the text-driven semantic similarity approaches than by the lexical similarity approach.
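The lexical baseline mentioned in this abstract, Jaccard similarity between textual descriptions, can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' pipeline: the tokenization rule, the function names `tokens`, `jaccard` and `rank_by_jaccard`, and the example descriptions are all assumptions.

```python
import re


def tokens(text):
    """Lower-case a description and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between the token sets of two texts."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0  # convention chosen here for two empty descriptions
    return len(ta & tb) / len(ta | tb)


def rank_by_jaccard(query, descriptions):
    """Return (score, index) pairs sorted from most to least similar to the query."""
    scored = [(jaccard(query, d), i) for i, d in enumerate(descriptions)]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    # Hypothetical experiment descriptions.
    repository = ["heat shock response", "cold stress in roots"]
    for score, index in rank_by_jaccard("cold stress", repository):
        print(f"{score:.2f}  {repository[index]}")
```

Because the score depends only on shared tokens, two descriptions of related experiments that use different vocabulary score zero, which is the limitation that motivates the semantic approaches named in the abstract.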
|
3
|
Chen S, Andrienko N, Andrienko G, Adilova L, Barlet J, Kindermann J, Nguyen PH, Thonnard O, Turkay C. LDA Ensembles for Interactive Exploration and Categorization of Behaviors. IEEE Transactions on Visualization and Computer Graphics 2020; 26:2775-2792. [PMID: 30869622 DOI: 10.1109/tvcg.2019.2904069] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 06/09/2023]
Abstract
We define behavior as a set of actions performed by some actor during a period of time. We consider the problem of analyzing a large collection of behaviors by multiple actors, more specifically, identifying typical behaviors and spotting anomalous behaviors. We propose an approach leveraging topic modeling techniques - LDA (Latent Dirichlet Allocation) Ensembles - to represent categories of typical behaviors by topics that are obtained through topic modeling a behavior collection. When such methods are applied to text in natural languages, the quality of the extracted topics is usually judged based on the semantic relatedness of the terms pertinent to the topics. This criterion, however, is not necessarily applicable to topics extracted from non-textual data, such as action sets, since relationships between actions may not be obvious. We have developed a suite of visual and interactive techniques supporting the construction of an appropriate combination of topics based on other criteria, such as distinctiveness and coverage of the behavior set. Two case studies on analyzing operation behaviors in the security management system and visiting behaviors in an amusement park, and the expert evaluation of the first case study demonstrate the effectiveness of our approach.
|
4
|
Yang G, Ma A, Qin ZS, Chen L. Application of topic models to a compendium of ChIP-Seq datasets uncovers recurrent transcriptional regulatory modules. Bioinformatics 2020; 36:2352-2358. [PMID: 31899481 DOI: 10.1093/bioinformatics/btz975] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/03/2019] [Revised: 10/29/2019] [Accepted: 12/30/2019] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: The availability of thousands of genome-wide coupling chromatin immunoprecipitation (ChIP)-Seq datasets across hundreds of transcription factors (TFs) and cell lines provides an unprecedented opportunity to jointly analyze large-scale TF-binding in vivo, making possible the discovery of potential interaction and cooperation among different TFs. Interacting and cooperating TFs can potentially form a transcriptional regulatory module (TRM) (e.g. co-binding TFs), which helps decipher the combinatorial regulatory mechanisms. RESULTS: We develop a computational method, tfLDA, to apply state-of-the-art topic models to multiple ChIP-Seq datasets to decipher the combinatorial binding events of multiple TFs. tfLDA is able to learn high-order combinatorial binding patterns of TFs from multiple ChIP-Seq profiles, and to interpret and visualize the combinatorial patterns. We apply tfLDA to two cell lines with a rich collection of TFs and identify combinatorial binding patterns that show well-known TRMs and related TF co-binding events. AVAILABILITY AND IMPLEMENTATION: A software R package tfLDA is freely available at https://github.com/lichen-lab/tfLDA. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Guodong Yang
  - Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA; Department of Cardiovascular Medicine, First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, Shaanxi, 710061, P. R. China
- Aiqun Ma
  - Department of Cardiovascular Medicine, First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, Shaanxi, 710061, P. R. China
- Zhaohui S Qin
  - Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA
- Li Chen
  - Department of Medicine, Indiana University School of Medicine, Indianapolis, IN 46202, USA; Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
|
5
|
Lekschas F, Gehlenborg N. SATORI: a system for ontology-guided visual exploration of biomedical data repositories. Bioinformatics 2018; 34:1200-1207. [PMID: 29186292 PMCID: PMC6031061 DOI: 10.1093/bioinformatics/btx739] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 05/31/2017] [Accepted: 11/22/2017] [Indexed: 01/14/2023] Open
Abstract
Motivation: The ever-increasing number of biomedical datasets provides tremendous opportunities for re-use, but current data repositories provide limited means of exploration apart from text-based search. Ontological metadata annotations provide context by semantically relating datasets. Visualizing this rich network of relationships can improve the explorability of large data repositories and help researchers find datasets of interest. Results: We developed SATORI, an integrative search and visual exploration interface for the exploration of biomedical data repositories. The design is informed by a requirements analysis through a series of semi-structured interviews. We evaluated the implementation of SATORI in a field study on a real-world data collection. SATORI enables researchers to seamlessly search, browse and semantically query data repositories via two visualizations that are highly interconnected with a powerful search interface. Availability and implementation: SATORI is an open-source web application, which is freely available at http://satori.refinery-platform.org and integrated into the Refinery Platform. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Fritz Lekschas
  - Harvard John A. Paulson School of Engineering and Applied Sciences, Cambridge, MA 02138, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
- Nils Gehlenborg
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
|
6
|
Heinonen M, Milliat F, Benadjaoud MA, François A, Buard V, Tarlet G, d’Alché-Buc F, Guipaud O. Temporal clustering analysis of endothelial cell gene expression following exposure to a conventional radiotherapy dose fraction using Gaussian process clustering. PLoS One 2018; 13:e0204960. [PMID: 30281653 PMCID: PMC6169916 DOI: 10.1371/journal.pone.0204960] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 06/25/2018] [Accepted: 09/15/2018] [Indexed: 12/31/2022] Open
Abstract
The vascular endothelium is considered a key cell compartment for the response to ionizing radiation of normal tissues and tumors, and a promising target to improve the differential effect of radiotherapy in the future. Following radiation exposure, the global endothelial cell response covers a wide range of gene, miRNA, protein and metabolite expression modifications. Changes occur at the transcriptional, translational and post-translational levels and impact cell phenotype as well as the microenvironment by the production and secretion of soluble factors such as reactive oxygen species, chemokines, cytokines and growth factors. These radiation-induced dynamic modifications of molecular networks may control the endothelial cell phenotype and govern recruitment of immune cells, stressing the importance of clearly understanding the mechanisms which underlie these temporal processes. A wide variety of time series data is commonly used in bioinformatics studies, including gene expression, protein concentrations and metabolomics data. How best to cluster such data remains an open problem. Here, we introduce kernels between Gaussian processes modeling time series, and subsequently introduce a spectral clustering algorithm. We apply the methods to the study of human primary endothelial cells (HUVECs) exposed to a radiotherapy dose fraction (2 Gy). Time windows of differential expressions of 301 genes involved in key cellular processes such as angiogenesis, inflammation, apoptosis, immune response and protein kinase were determined from 12 hours to 3 weeks post-irradiation. Then, 43 temporal clusters corresponding to profiles of similar expressions, including 49 genes out of 301 initially measured, were generated according to the proposed method. Forty-seven transcription factors (TFs) responsible for the expression of clusters of genes were predicted from sequence regulatory elements using the MotifMap system. Their temporal profiles of occurrences were established and clustered. Dynamic network interactions and molecular pathways of TFs and differential genes were finally explored, revealing key node genes and putative important cellular processes involved in tissue infiltration by immune cells following exposure to a radiotherapy dose fraction.
Affiliation(s)
- Markus Heinonen
  - Department of Information and Computer Science, Aalto University, Aalto, Finland
- Fabien Milliat
  - Institute for Radiological Protection and Nuclear Safety (IRSN), PSE-SANTE, SERAMED, LRMed, Fontenay-aux-Roses, France
- Mohamed Amine Benadjaoud
  - Institute for Radiological Protection and Nuclear Safety (IRSN), PSE-SANTE, SERAMED, Fontenay-aux-Roses, France
- Agnès François
  - Institute for Radiological Protection and Nuclear Safety (IRSN), PSE-SANTE, SERAMED, LRMed, Fontenay-aux-Roses, France
- Valérie Buard
  - Institute for Radiological Protection and Nuclear Safety (IRSN), PSE-SANTE, SERAMED, LRMed, Fontenay-aux-Roses, France
- Georges Tarlet
  - Institute for Radiological Protection and Nuclear Safety (IRSN), PSE-SANTE, SERAMED, LRMed, Fontenay-aux-Roses, France
- Olivier Guipaud
  - Institute for Radiological Protection and Nuclear Safety (IRSN), PSE-SANTE, SERAMED, LRMed, Fontenay-aux-Roses, France
|
7
|
Rauber PE, Falcão AX, Telea AC. Projections as visual aids for classification system design. Information Visualization 2018; 17:282-305. [PMID: 30263012 PMCID: PMC6131729 DOI: 10.1177/1473871617713337] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 06/08/2023]
Abstract
Dimensionality reduction is a compelling alternative for high-dimensional data visualization. This method provides insight into high-dimensional feature spaces by mapping relationships between observations (high-dimensional vectors) to low (two or three) dimensional spaces. These low-dimensional representations support tasks such as outlier and group detection based on direct visualization. Supervised learning, a subfield of machine learning, is also concerned with observations. A key task in supervised learning consists of assigning class labels to observations based on generalization from previous experience. Effective development of such classification systems depends on many choices, including feature descriptors, learning algorithms, and hyperparameters. These choices are not trivial, and there is no simple recipe to improve classification systems that perform poorly. In this context, we first propose the use of visual representations based on dimensionality reduction (projections) for predictive feedback on classification efficacy. Second, we propose a projection-based visual analytics methodology, and supportive tooling, that can be used to improve classification systems through feature selection. We evaluate our proposal through experiments involving four datasets and three representative learning algorithms.
Affiliation(s)
- Paulo E Rauber
  - Department of Mathematics and Computing Science, University of Groningen, Groningen, The Netherlands
  - University of Campinas, Campinas, Brazil
- Alexandru C Telea
  - Department of Mathematics and Computing Science, University of Groningen, Groningen, The Netherlands
|
8
|
Kohonen P, Parkkinen JA, Willighagen EL, Ceder R, Wennerberg K, Kaski S, Grafström RC. A transcriptomics data-driven gene space accurately predicts liver cytopathology and drug-induced liver injury. Nat Commun 2017; 8:15932. [PMID: 28671182 PMCID: PMC5500850 DOI: 10.1038/ncomms15932] [Citation(s) in RCA: 71] [Impact Index Per Article: 10.1] [Received: 09/22/2016] [Accepted: 05/15/2017] [Indexed: 01/17/2023] Open
Abstract
Predicting unanticipated harmful effects of chemicals and drug molecules is a difficult and costly task. Here we utilize a 'big data compacting and data fusion' concept to capture diverse adverse outcomes on cellular and organismal levels. The approach generates from a transcriptomics data set a 'predictive toxicogenomics space' (PTGS) tool composed of 1,331 genes distributed over 14 overlapping cytotoxicity-related gene space components. Involving ∼2.5 × 10^8 data points and 1,300 compounds to construct and validate the PTGS, the tool serves to: explain dose-dependent cytotoxicity effects, provide a virtual cytotoxicity probability estimate intrinsic to omics data, predict chemically-induced pathological states in liver resulting from repeated dosing of rats, and furthermore, predict human drug-induced liver injury (DILI) from hepatocyte experiments. Analysing 68 DILI-annotated drugs, the PTGS tool outperforms and complements existing tests, leading to a hitherto unseen level of DILI prediction accuracy.
Affiliation(s)
- Pekka Kohonen
  - Institute of Environmental Medicine, Karolinska Institutet, Nobels väg 13, Box 210, SE-17177 Stockholm, Sweden
- Juuso A Parkkinen
  - Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Konemiehentie 2, P.O. Box 15400, 00076 Aalto, Finland
- Egon L Willighagen
  - Institute of Environmental Medicine, Karolinska Institutet, Nobels väg 13, Box 210, SE-17177 Stockholm, Sweden; Department of Bioinformatics-BiGCaT, Maastricht University, Universiteitssingel 50, P.O. Box 616, UNS 50 Box19, NL-6200 MD Maastricht, The Netherlands
- Rebecca Ceder
  - Institute of Environmental Medicine, Karolinska Institutet, Nobels väg 13, Box 210, SE-17177 Stockholm, Sweden
- Krister Wennerberg
  - Institute for Molecular Medicine Finland, FIMM, University of Helsinki, Tukholmankatu 8, P.O. Box 20, FI-00014 Helsinki, Finland
- Samuel Kaski
  - Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Konemiehentie 2, P.O. Box 15400, 00076 Aalto, Finland; Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Gustaf Hällströmin katu 2b, P.O. Box 68, FI-00014 Helsinki, Finland
- Roland C Grafström
  - Institute of Environmental Medicine, Karolinska Institutet, Nobels väg 13, Box 210, SE-17177 Stockholm, Sweden
|
9
|
Abstract
Background: Deciphering taxonomical structures based on high-dimensional sequencing data is still challenging in metagenomics studies. Moreover, the common workflow in this field fails to identify microbial communities and their effect on a specific disease status. Even the relationships and interactions between different bacteria in a microbial community remain unknown. Results: MetaTopics can efficiently extract the latent microbial communities which reflect the intrinsic relations or interactions among several major microbes. Furthermore, a quantitative measurement, the Quetelet Index, is defined to estimate the influence of a latent sub-community on a certain disease status for given samples. An analysis of our in-house oral metagenomics data and public gut microbe data is presented to demonstrate the application and usefulness of MetaTopics. To present a user-friendly R package, we have built a dedicated website, https://github.com/bm2-lab/MetaTopics, which includes free downloads, detailed tutorials and illustration examples. Conclusions: MetaTopics is the first interactive R package to integrate the state-of-the-art topic model derived from the statistical learning community to analyze and visualize metagenomics taxonomy data. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-3257-2) contains supplementary material, which is available to authorized users.
|
10
|
Söderholm S, Fu Y, Gaelings L, Belanov S, Yetukuri L, Berlinkov M, Cheltsov AV, Anders S, Aittokallio T, Nyman TA, Matikainen S, Kainov DE. Multi-Omics Studies towards Novel Modulators of Influenza A Virus-Host Interaction. Viruses 2016; 8:v8100269. [PMID: 27690086 PMCID: PMC5086605 DOI: 10.3390/v8100269] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 09/13/2016] [Revised: 09/13/2016] [Accepted: 09/22/2016] [Indexed: 12/20/2022] Open
Abstract
Human influenza A viruses (IAVs) cause global pandemics and epidemics. These viruses evolve rapidly, making current treatment options ineffective. To identify novel modulators of IAV–host interactions, we re-analyzed our recent transcriptomics, metabolomics, proteomics, phosphoproteomics, and genomics/virtual ligand screening data. We identified 713 potential modulators targeting 199 cellular and two viral proteins. Anti-influenza activity for 48 of them has been reported previously, whereas the antiviral efficacy of the remaining 665 is unknown. Studying anti-influenza efficacy and immuno/neuro-modulating properties of these compounds and their combinations as well as potential viral and host resistance to them may lead to the discovery of novel modulators of IAV–host interactions, which might be more effective than the currently available anti-influenza therapeutics.
Affiliation(s)
- Sandra Söderholm
  - Institute of Biotechnology, University of Helsinki, Helsinki 00014, Finland.
  - Finnish Institute of Occupational Health, Helsinki 00250, Finland.
- Yu Fu
  - Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki 00014, Finland.
- Lana Gaelings
  - Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki 00014, Finland.
- Sergey Belanov
  - Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki 00014, Finland.
- Laxman Yetukuri
  - Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki 00014, Finland.
- Mikhail Berlinkov
  - Institute of Mathematics and Computer Science, Ural Federal University, Yekaterinburg 620083, Russia.
- Anton V Cheltsov
  - Q-Mol L.L.C. in Silico Pharmaceuticals, San Diego, CA 92037, USA.
- Simon Anders
  - Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki 00014, Finland.
- Tero Aittokallio
  - Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki 00014, Finland.
  - Department of Mathematics and Statistics, University of Turku, Turku 20014, Finland.
- Sampsa Matikainen
  - Finnish Institute of Occupational Health, Helsinki 00250, Finland.
  - Department of Rheumatology, Helsinki University Hospital, University of Helsinki, Helsinki 00015, Finland.
- Denis E Kainov
  - Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki 00014, Finland.
|
11
|
Liu L, Tang L, Dong W, Yao S, Zhou W. An overview of topic modeling and its current applications in bioinformatics. SpringerPlus 2016; 5:1608. [PMID: 27652181 PMCID: PMC5028368 DOI: 10.1186/s40064-016-3252-8] [Citation(s) in RCA: 94] [Impact Index Per Article: 11.8] [Received: 03/24/2016] [Accepted: 09/08/2016] [Indexed: 11/10/2022]
Abstract
BACKGROUND: With the rapid accumulation of biological datasets, machine learning methods designed to automate data analysis are urgently needed. In recent years, so-called topic models that originated from the field of natural language processing have been receiving much attention in bioinformatics because of their interpretability. Our aim was to review the application and development of topic models for bioinformatics. DESCRIPTION: This paper starts with the description of a topic model, with a focus on the understanding of topic modeling. A general outline is provided on how to build an application in a topic model and how to develop a topic model. Meanwhile, the literature on application of topic models to biological data was searched and analyzed in depth. According to the types of models and the analogy between the concept of document-topic-word and a biological object (as well as the tasks of a topic model), we categorized the related studies and provided an outlook on the use of topic models for the development of bioinformatics applications. CONCLUSION: Topic modeling is a useful method (in contrast to the traditional means of data reduction in bioinformatics) and enhances researchers' ability to interpret biological information. Nevertheless, due to the lack of topic models optimized for specific biological data, the studies on topic modeling in biological data still have a long and challenging road ahead. We believe that topic models are a promising method for various applications in bioinformatics research.
Affiliation(s)
- Lin Liu
  - School of Information, Yunnan University, Kunming, 650091 Yunnan China
  - School of Information (Key Laboratory of Educational Informatization for Nationalities Ministry of Education), Yunnan Normal University, Kunming, 650092 Yunnan China
- Lin Tang
  - Key Laboratory of Educational Informatization for Nationalities Ministry of Education, Yunnan Normal University, Kunming, 650092 Yunnan China
- Wen Dong
  - School of Information, Yunnan University, Kunming, 650091 Yunnan China
- Shaowen Yao
  - National Pilot School of Software, Yunnan University, Kunming, 650091 Yunnan China
- Wei Zhou
  - National Pilot School of Software, Yunnan University, Kunming, 650091 Yunnan China
|
12
|
González J, Muñoz A, Martos G. Asymmetric latent semantic indexing for gene expression experiments visualization. J Bioinform Comput Biol 2016; 14:1650023. [PMID: 27427382 DOI: 10.1142/s0219720016500232] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
We propose a new method to visualize gene expression experiments inspired by the latent semantic indexing technique originally proposed in the textual analysis context. By using the correspondences word-gene and document-experiment, we define an asymmetric similarity measure of association for genes that accounts for potential hierarchies in the data, which is key to obtaining meaningful gene mappings. We use the polar decomposition to obtain the sources of asymmetry of the similarity matrix, which are later combined with previous knowledge. Genetic classes of genes are identified by means of a mixture model applied in the genes' latent space. We describe the steps of the procedure and show its utility on the Human Cancer dataset.
Affiliation(s)
- Javier González
  - Department of Computer Science, Sheffield Institute for Translational Neuroscience, University of Sheffield, Glossop Road S10 2HQ, Sheffield, UK
- Alberto Muñoz
  - Department of Statistics, University Carlos III of Madrid, Spain. C/Madrid, 126-28903, Getafe (Madrid), Spain
- Gabriel Martos
  - Department of Statistics, University Carlos III of Madrid, Spain. C/Madrid, 126-28903, Getafe (Madrid), Spain
|
13
|
Şener DD, Oğul H. Retrieving relevant time-course experiments: a study on Arabidopsis microarrays. IET Syst Biol 2016; 10:87-93. [PMID: 27187987 DOI: 10.1049/iet-syb.2015.0042] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/19/2022] Open
Abstract
Understanding time-course regulation of genes in response to a stimulus is a major concern in current systems biology. The problem is usually approached by computational methods to model the gene behaviour or its networked interactions with the others by a set of latent parameters. The model parameters can be estimated through a meta-analysis of available data obtained from other relevant experiments. The key question here is how to find the relevant experiments which are potentially useful in analysing current data. In this study, the authors address this problem in the context of time-course gene expression experiments from an information retrieval perspective. To this end, they introduce a computational framework that takes a time-course experiment as a query and reports a list of relevant experiments retrieved from a given repository. These retrieved experiments can then be used to associate the environmental factors of query experiment with the findings previously reported. The model is tested using a set of time-course Arabidopsis microarrays. The experimental results show that relevant experiments can be successfully retrieved based on content similarity.
Affiliation(s)
- Duygu Dede Şener
  - Department of Computer Engineering, Başkent University, Baglica Campus TR-06810, Ankara, Turkey
- Hasan Oğul
  - Department of Computer Engineering, Başkent University, Baglica Campus TR-06810, Ankara, Turkey
|
14
|
Blomstedt P, Dutta R, Seth S, Brazma A, Kaski S. Modelling-based experiment retrieval: a case study with gene expression clustering. Bioinformatics 2016; 32:1388-94. [PMID: 26740526 DOI: 10.1093/bioinformatics/btv762] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 05/15/2015] [Accepted: 12/28/2015] [Indexed: 12/18/2022] Open
Abstract
MOTIVATION: Public and private repositories of experimental data are growing to sizes that require dedicated methods for finding relevant data. To improve on the state of the art of keyword searches from annotations, methods for content-based retrieval have been proposed. In the context of gene expression experiments, most methods retrieve gene expression profiles, requiring each experiment to be expressed as a single profile, typically of case versus control. A more general, recently suggested alternative is to retrieve experiments whose models are good for modelling the query dataset. However, for very noisy and high-dimensional query data, this retrieval criterion turns out to be very noisy as well. RESULTS: We propose doing retrieval using a denoised model of the query dataset, instead of the original noisy dataset itself. To this end, we introduce a general probabilistic framework, where each experiment is modelled separately and the retrieval is done by finding related models. For retrieval of gene expression experiments, we use a probabilistic model called product partition model, which induces a clustering of genes that show similar expression patterns across a number of samples. The suggested metric for retrieval using clusterings is the normalized information distance. Empirical results finally suggest that inference for the full probabilistic model can be approximated with good performance using computationally faster heuristic clustering approaches (e.g. k-means). The method is highly scalable and straightforward to apply to construct a general-purpose gene expression experiment retrieval method. AVAILABILITY AND IMPLEMENTATION: The method can be implemented using standard clustering algorithms and normalized information distance, available in many statistical software packages. CONTACT: paul.blomstedt@aalto.fi or samuel.kaski@aalto.fi SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
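To make the retrieval metric concrete, the Python sketch below compares two clusterings of the same genes using one common definition of the normalized information distance, NID(U, V) = 1 - I(U; V) / max(H(U), H(V)). The abstract does not spell out the exact formula, so this definition, the function names, and the toy label vectors are assumptions for illustration, not the authors' code.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a cluster-label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information (in nats) between two label assignments of the same items."""
    n = len(u)
    count_u, count_v = Counter(u), Counter(v)
    joint = Counter(zip(u, v))
    return sum(
        (c / n) * log(c * n / (count_u[a] * count_v[b]))
        for (a, b), c in joint.items()
    )


def normalized_information_distance(u, v):
    """Assumed definition: 1 - I(U;V) / max(H(U), H(V)), which lies in [0, 1]."""
    if len(u) != len(v) or not u:
        raise ValueError("clusterings must label the same non-empty set of items")
    denom = max(entropy(u), entropy(v))
    if denom == 0.0:
        return 0.0  # both clusterings put every item in a single cluster
    return 1.0 - mutual_information(u, v) / denom


if __name__ == "__main__":
    # Hypothetical cluster labels for the same four genes in three experiments.
    query = [0, 0, 1, 1]
    same_up_to_relabeling = [1, 1, 0, 0]
    unrelated = [0, 1, 0, 1]
    print(normalized_information_distance(query, same_up_to_relabeling))
    print(normalized_information_distance(query, unrelated))
```

Under this definition the distance is invariant to relabeling of clusters, so stored experiments can be ranked by their distance to the query's clustering without aligning cluster identifiers first.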
Affiliation(s)
- Paul Blomstedt
- Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Espoo, Finland
| | - Ritabrata Dutta
- Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Espoo, Finland
| | - Sohan Seth
- Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Espoo, Finland
| | - Alvis Brazma
- European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, UK
| | - Samuel Kaski
- Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Espoo, Finland
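The abstract of this entry names the normalized information distance as the metric for comparing clusterings. A minimal sketch of one common form of that distance, 1 - I(U;V) / max(H(U), H(V)), is given here for orientation. The function names are illustrative, and whether the authors use exactly this normalization is an assumption, since the abstract does not state the formula.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a cluster labelling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labellings of the same items."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * log(c * n / (pa[x] * pb[y])) for (x, y), c in pab.items())


def normalized_information_distance(a, b):
    """Assumed form: 1 - I(a;b) / max(H(a), H(b)); 0 means identical partitions."""
    if len(a) != len(b):
        raise ValueError("labellings must cover the same items")
    h = max(entropy(a), entropy(b))
    if h == 0.0:
        return 0.0  # both labellings put every item in a single cluster
    return 1.0 - mutual_information(a, b) / h


# Hypothetical gene-cluster labels from a query model and a stored model.
query = [0, 0, 1, 1]
stored = [1, 1, 0, 0]
print(normalized_information_distance(query, stored))
```

In a retrieval setting of the kind the abstract describes, stored experiments would be ranked by increasing distance to the clustering of the query dataset.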
|
15
|
miSEA: microRNA set enrichment analysis. Biosystems 2015; 134:37-42. [DOI: 10.1016/j.biosystems.2015.05.004] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 03/16/2015] [Revised: 05/11/2015] [Accepted: 05/12/2015] [Indexed: 11/20/2022]
|
16
|
Açıcı K, Terzi YK, Oğul H. Retrieving relevant experiments: The case of microRNA microarrays. Biosystems 2015; 134:71-8. [PMID: 26116091 DOI: 10.1016/j.biosystems.2015.06.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 02/21/2015] [Revised: 06/15/2015] [Accepted: 06/17/2015] [Indexed: 01/06/2023]
Abstract
Content-based retrieval of biological experiments in large public repositories is a recent challenge in computational biology and bioinformatics. The task is, in general, to search in a database using a query-by-example without any experimental meta-data annotation. Here, we consider a more specific problem that seeks a solution for retrieving relevant microRNA experiments from microarray repositories. A computational framework is proposed with this objective. The framework adapts a normal-uniform mixture model for identifying differentially expressed microRNAs in microarray profiling experiments. A rank-based thresholding scheme is offered to binarize real-valued experiment fingerprints based on differential expression. An effective similarity metric is introduced to compare categorical fingerprints, which in turn infers the relevance between two experiments. Two different views of experimental relevance are evaluated, one for disease association and another for embryonic germ layer, to discern the retrieval ability of the proposed model. To the best of our knowledge, the experiment retrieval task is investigated for the first time in the context of microRNA microarrays.
Affiliation(s)
- Koray Açıcı
- Department of Computer Engineering, Başkent University, Ankara, Turkey
| | | | - Hasan Oğul
- Department of Computer Engineering, Başkent University, Ankara, Turkey.
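The abstract of this entry describes binarizing real-valued experiment fingerprints with a rank-based threshold and then comparing the resulting fingerprints. A rough sketch of that idea follows. The top-k rule and the Jaccard index are stand-ins chosen for illustration; the abstract does not specify the authors' thresholding rule or similarity metric, and all names here are hypothetical.

```python
def binarize_top_k(scores, k):
    """Mark the k highest-scoring features as 1 and all others as 0."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(order[:k])
    return [1 if i in top else 0 for i in range(len(scores))]


def jaccard(a, b):
    """Jaccard similarity of two binary fingerprints of equal length."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


# Hypothetical differential-expression scores for four microRNAs.
query = binarize_top_k([0.1, 0.9, 0.5, 0.7], k=2)
other = binarize_top_k([0.2, 0.8, 0.9, 0.1], k=2)
print(query, other, jaccard(query, other))
```

A rank-based cutoff makes fingerprints comparable across experiments whose raw score scales differ, which is presumably why the abstract prefers it to a fixed score threshold.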
|
17
|
Uziela K, Honkela A. Probe Region Expression Estimation for RNA-Seq Data for Improved Microarray Comparability. PLoS One 2015; 10:e0126545. [PMID: 25966034 PMCID: PMC4429080 DOI: 10.1371/journal.pone.0126545] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 09/12/2014] [Accepted: 04/03/2015] [Indexed: 01/25/2023] Open
Abstract
Rapidly growing public gene expression databases contain a wealth of data for building an unprecedentedly detailed picture of human biology and disease. This data comes from many diverse measurement platforms that make integrating it all difficult. Although RNA-sequencing (RNA-seq) is attracting the most attention, at present, the rate of new microarray studies submitted to public databases far exceeds the rate of new RNA-seq studies. There is clearly a need for methods that make it easier to combine data from different technologies. In this paper, we propose a new method for processing RNA-seq data that yields gene expression estimates that are much more similar to corresponding estimates from microarray data, hence greatly improving cross-platform comparability. The method we call PREBS is based on estimating the expression from RNA-seq reads overlapping the microarray probe regions, and processing these estimates with standard microarray summarisation algorithms. Using paired microarray and RNA-seq samples from TCGA LAML data set we show that PREBS expression estimates derived from RNA-seq are more similar to microarray-based expression estimates than those from other RNA-seq processing methods. In an experiment to retrieve paired microarray samples from a database using an RNA-seq query sample, gene signatures defined based on PREBS expression estimates were found to be much more accurate than those from other methods. PREBS also allows new ways of using RNA-seq data, such as expression estimation for microarray probe sets. An implementation of the proposed method is available in the Bioconductor package “prebs.”
Affiliation(s)
- Karolis Uziela
- Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, 17121 Solna, Sweden
| | - Antti Honkela
- Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
|
18
|
Faisal A, Peltonen J, Georgii E, Rung J, Kaski S. Toward computational cumulative biology by combining models of biological datasets. PLoS One 2014; 9:e113053. [PMID: 25427176 PMCID: PMC4245117 DOI: 10.1371/journal.pone.0113053] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 07/04/2014] [Accepted: 10/17/2014] [Indexed: 11/21/2022] Open
Abstract
A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine designed for relating a researcher's experimental dataset to earlier work in the field. The search is (i) data-driven to enable new findings, going beyond the state of the art of keyword searches in annotations, (ii) modeling-driven, to include both biological knowledge and insights learned from data, and (iii) scalable, as it is accomplished without building one unified grand model of all data. Assuming each dataset has been modeled beforehand, by the researchers or automatically by database managers, we apply a rapidly computable and optimizable combination model to decompose a new dataset into contributions from earlier relevant models. By using the data-driven decomposition, we identify a network of interrelated datasets from a large annotated human gene expression atlas. While tissue type and disease were major driving forces for determining relevant datasets, the found relationships were richer, and the model-based search was more accurate than the keyword search; moreover, it recovered biologically meaningful relationships that are not straightforwardly visible from annotations—for instance, between cells in different developmental stages such as thymocytes and T-cells. Data-driven links and citations matched to a large extent; the data-driven links even uncovered corrections to the publication data, as two of the most linked datasets were not highly cited and turned out to have wrong publication entries in the database.
Affiliation(s)
- Ali Faisal
- Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland
| | - Jaakko Peltonen
- Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland
| | - Elisabeth Georgii
- Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland
| | - Johan Rung
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, United Kingdom
| | - Samuel Kaski
- Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland
- Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
|
19
|
Jalanka-Tuovinen J, Salojärvi J, Salonen A, Immonen O, Garsed K, Kelly FM, Zaitoun A, Palva A, Spiller RC, de Vos WM. Faecal microbiota composition and host-microbe cross-talk following gastroenteritis and in postinfectious irritable bowel syndrome. Gut 2014; 63:1737-45. [PMID: 24310267 DOI: 10.1136/gutjnl-2013-305994] [Citation(s) in RCA: 233] [Impact Index Per Article: 23.3] [Indexed: 12/17/2022]
Abstract
BACKGROUND About 10% of patients with IBS report the start of the syndrome after infectious enteritis. The clinical features of postinfectious IBS (PI-IBS) resemble those of diarrhoea-predominant IBS (IBS-D). While altered faecal microbiota has been identified in other IBS subtypes, composition of the microbiota in patients with PI-IBS remains uncharacterised. OBJECTIVE To characterise the microbial composition of patients with PI-IBS, and to examine the associations between the faecal microbiota and a patient's clinical features. DESIGN Using a phylogenetic microarray and selected qPCR assays, we analysed differences in the faecal microbiota of 57 subjects from five study groups: patients with diagnosed PI-IBS, patients who 6 months after gastroenteritis had either persisting bowel dysfunction or no IBS symptoms, benchmarked against patients with IBS-D and healthy controls. In addition, the associations between the faecal microbiota and health were investigated by correlating the microbial profiles to immunological markers, quality of life indicators and host gene expression in rectal biopsies. RESULTS Microbiota analysis revealed a bacterial profile of 27 genus-like groups, providing an Index of Microbial Dysbiosis (IMD), which significantly separated patient groups and controls. Within this profile, several members of Bacteroidetes phylum were increased 12-fold in patients, while healthy controls had 35-fold more uncultured Clostridia. We showed correlations between the IMD and expression of several host gene pathways, including amino acid synthesis, cell junction integrity and inflammatory response, suggesting an impaired epithelial barrier function in IBS. CONCLUSIONS The faecal microbiota of patients with PI-IBS differs from that of healthy controls and resembles that of patients with IBS-D, suggesting a common pathophysiology. 
Moreover, our analysis suggests a variety of host-microbe associations that may underlie intestinal symptoms, initiated by gastroenteritis.
Affiliation(s)
- Jonna Jalanka-Tuovinen
- Department of Veterinary Biosciences, Microbiology, University of Helsinki, Helsinki, Finland
| | - Jarkko Salojärvi
- Department of Veterinary Biosciences, Microbiology, University of Helsinki, Helsinki, Finland
| | - Anne Salonen
- Department of Veterinary Biosciences, Microbiology, University of Helsinki, Helsinki, Finland
| | - Outi Immonen
- Department of Bacteriology and Immunology, University of Helsinki, Helsinki, Finland
| | - Klara Garsed
- Department of Veterinary Biosciences, Microbiology, University of Helsinki, Helsinki, Finland
| | - Fiona M Kelly
- NIHR Biomedical Research Unit, Nottingham Digestive Diseases Centre, University Hospital, Nottingham, UK
| | - Abed Zaitoun
- GSK Research and Development Ltd, GlaxoSmithKline, Stevenage, UK
| | - Airi Palva
- NIHR Biomedical Research Unit, Nottingham Digestive Diseases Centre, University Hospital, Nottingham, UK
| | - Robin C Spiller
- Department of Veterinary Biosciences, Microbiology, University of Helsinki, Helsinki, Finland
| | - Willem M de Vos
- Department of Veterinary Biosciences, Microbiology, University of Helsinki, Helsinki, Finland; Department of Bacteriology and Immunology, University of Helsinki, Helsinki, Finland; Laboratory of Microbiology, Wageningen University, Wageningen, The Netherlands
|
20
|
|
21
|
Seth S, Välimäki N, Kaski S, Honkela A. Exploration and retrieval of whole-metagenome sequencing samples. Bioinformatics 2014; 30:2471-9. [PMID: 24845653 PMCID: PMC4230234 DOI: 10.1093/bioinformatics/btu340] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Over the recent years, the field of whole-metagenome shotgun sequencing has witnessed significant growth owing to the high-throughput sequencing technologies that allow sequencing genomic samples cheaper, faster and with better coverage than before. This technical advancement has initiated the trend of sequencing multiple samples in different conditions or environments to explore the similarities and dissimilarities of the microbial communities. Examples include the human microbiome project and various studies of the human intestinal tract. With the availability of ever larger databases of such measurements, finding samples similar to a given query sample is becoming a central operation. RESULTS In this article, we develop a content-based exploration and retrieval method for whole-metagenome sequencing samples. We apply a distributed string mining framework to efficiently extract all informative sequence k-mers from a pool of metagenomic samples and use them to measure the dissimilarity between two samples. We evaluate the performance of the proposed approach on two human gut metagenome datasets as well as human microbiome project metagenomic samples. We observe significant enrichment for diseased gut samples in results of queries with another diseased sample and high accuracy in discriminating between different body sites even though the method is unsupervised. AVAILABILITY AND IMPLEMENTATION A software implementation of the DSM framework is available at https://github.com/HIITMetagenomics/dsm-framework. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sohan Seth
- Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland, Genome-Scale Biology Program and Department of Medical Genetics, University of Helsinki, Helsinki, Finland, and Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
| | - Niko Välimäki
- Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland, Genome-Scale Biology Program and Department of Medical Genetics, University of Helsinki, Helsinki, Finland, and Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
| | - Samuel Kaski
- Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland, Genome-Scale Biology Program and Department of Medical Genetics, University of Helsinki, Helsinki, Finland, and Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
| | - Antti Honkela
- Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland, Genome-Scale Biology Program and Department of Medical Genetics, University of Helsinki, Helsinki, Finland, and Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
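The abstract of this entry describes measuring dissimilarity between metagenomic samples from their sequence k-mers. The sketch below shows the general idea with a fixed k and a Bray-Curtis dissimilarity over relative k-mer frequencies. Both choices are assumptions made for illustration: the paper selects informative k-mers through a distributed string mining framework, which is not reproduced here, and the function names are hypothetical.

```python
from collections import Counter


def kmer_profile(reads, k):
    """Relative frequency of every length-k substring across a sample's reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()} if total else {}


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two k-mer frequency profiles (0 to 1)."""
    keys = set(p) | set(q)
    num = sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)
    den = sum(p.get(x, 0.0) + q.get(x, 0.0) for x in keys)
    return num / den if den else 0.0


# Toy reads standing in for two sequencing samples.
sample_a = kmer_profile(["ACGTACGT", "CGTACG"], k=3)
sample_b = kmer_profile(["ACGTTTTT", "TTTTAC"], k=3)
print(bray_curtis(sample_a, sample_b))
```

Retrieval then amounts to sorting database samples by their dissimilarity to the query sample; no labels are needed, which matches the unsupervised setting the abstract reports.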
|
22
|
Wang V, Xi L, Enayetallah A, Fauman E, Ziemek D. GeneTopics - interpretation of gene sets via literature-driven topic models. BMC Systems Biology 2013; 7 Suppl 5:S10. [PMID: 24564875 PMCID: PMC4029197 DOI: 10.1186/1752-0509-7-s5-s10] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 12/17/2022]
Abstract
Background Annotation of a set of genes is often accomplished through comparison to a library of labelled gene sets such as biological processes or canonical pathways. However, this approach might fail if the employed libraries are not up to date with the latest research, don't capture relevant biological themes or are curated at a different level of granularity than is required to appropriately analyze the input gene set. At the same time, the vast biomedical literature offers an unstructured repository of the latest research findings that can be tapped to provide thematic sub-groupings for any input gene set. Methods Our proposed method relies on a gene-specific text corpus and extracts commonalities between documents in an unsupervised manner using a topic model approach. We automatically determine the number of topics summarizing the corpus and calculate a gene relevancy score for each topic allowing us to eliminate non-specific topics. As a result we obtain a set of literature topics in which each topic is associated with a subset of the input genes providing directly interpretable keywords and corresponding documents for literature research. Results We validate our method based on labelled gene sets from the KEGG metabolic pathway collection and the genetic association database (GAD) and show that the approach is able to detect topics consistent with the labelled annotation. Furthermore, we discuss the results on three different types of experimentally derived gene sets, (1) differentially expressed genes from a cardiac hypertrophy experiment in mice, (2) altered transcript abundance in human pancreatic beta cells, and (3) genes implicated by GWA studies to be associated with metabolite levels in a healthy population. In all three cases, we are able to replicate findings from the original papers in a quick and semi-automated manner. 
Conclusions Our approach provides a novel way of automatically generating meaningful annotations for gene sets that are directly tied to relevant articles in the literature. Extending a general topic model method, the approach introduced here establishes a workflow for the interpretation of gene sets generated from diverse experimental scenarios that can complement the classical approach of comparison to reference gene sets.
|
23
|
Georgii E, Salojärvi J, Brosché M, Kangasjärvi J, Kaski S. Targeted retrieval of gene expression measurements using regulatory models. Bioinformatics 2012; 28:2349-56. [DOI: 10.1093/bioinformatics/bts361] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Indexed: 01/21/2023] Open
|
24
|
Khan SA, Faisal A, Mpindi JP, Parkkinen JA, Kalliokoski T, Poso A, Kallioniemi OP, Wennerberg K, Kaski S. Comprehensive data-driven analysis of the impact of chemoinformatic structure on the genome-wide biological response profiles of cancer cells to 1159 drugs. BMC Bioinformatics 2012; 13:112. [PMID: 22646858 PMCID: PMC3532323 DOI: 10.1186/1471-2105-13-112] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Received: 08/31/2011] [Accepted: 04/09/2012] [Indexed: 11/16/2022] Open
Abstract
Background Detailed and systematic understanding of the biological effects of millions of available compounds on living cells is a significant challenge. As most compounds impact multiple targets and pathways, traditional methods for analyzing structure-function relationships are not comprehensive enough. Therefore more advanced integrative models are needed for predicting biological effects elicited by specific chemical features. As a step towards creating such computational links we developed a data-driven chemical systems biology approach to comprehensively study the relationship of 76 structural 3D-descriptors (VolSurf, chemical space) of 1159 drugs with the microarray gene expression responses (biological space) they elicited in three cancer cell lines. The analysis covering 11350 genes was based on data from the Connectivity Map. We decomposed the biological response profiles into components, each linked to a characteristic chemical descriptor profile. Results Integrated analysis of both the chemical and biological space was more informative than either dataset alone in predicting drug similarity as measured by shared protein targets. We identified ten major components that link distinct VolSurf chemical features across multiple compounds to specific cellular responses. For example, component 2 (hydrophobic properties) strongly linked to DNA damage response, while component 3 (hydrogen bonding) was associated with metabolic stress. Individual structural and biological features were often linked to one cell line only, such as leukemia cells (HL-60) specifically responding to cardiac glycosides. Conclusions In summary, our approach identified several novel links between specific chemical structure properties and distinct biological responses in cells incubated with these drugs. Importantly, the analysis focused on chemical-biological properties that emerge across multiple drugs. 
The decoding of such systematic relationships is necessary to build better models of drug effects, including unanticipated types of molecular properties having strong biological effects.
Affiliation(s)
- Suleiman A Khan
- Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, PO Box 15400, Espoo, 00076, Finland.
|
25
|
Corander J, Aittokallio T, Ripatti S, Kaski S. The rocky road to personalized medicine: computational and statistical challenges. Per Med 2012; 9:109-114. [DOI: 10.2217/pme.12.1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 02/04/2023]
Affiliation(s)
- Jukka Corander
- Department of Mathematics & Statistics, University of Helsinki, PO Box 68, 00014 Helsinki, Finland and Department of Mathematics, Åbo Akademi University, 20500 Åbo, Finland
| | - Tero Aittokallio
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, 00014 Helsinki, Finland and Department of Mathematics, University of Turku, 20014 Turku, Finland
| | - Samuli Ripatti
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, 00014 Helsinki, Finland and Public Health Genomics Unit, National Institute for Health & Welfare, Helsinki, Finland and Wellcome Trust Sanger Institute, Hinxton, UK
| | - Samuel Kaski
- Helsinki Institute for Information Technology, Aalto University, 00076 Aalto, Finland and Helsinki Institute for Information Technology, University of Helsinki, Finland
|
26
|
Caldas J, Gehlenborg N, Kettunen E, Faisal A, Rönty M, Nicholson AG, Knuutila S, Brazma A, Kaski S. Data-driven information retrieval in heterogeneous collections of transcriptomics data links SIM2s to malignant pleural mesothelioma. Bioinformatics 2011; 28:246-53. [PMID: 22106335 PMCID: PMC3259436 DOI: 10.1093/bioinformatics/btr634] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Indexed: 01/21/2023]
Abstract
Motivation: Genome-wide measurement of transcript levels is a ubiquitous tool in biomedical research. As experimental data continues to be deposited in public databases, it is becoming important to develop search engines that enable the retrieval of relevant studies given a query study. While retrieval systems based on meta-data already exist, data-driven approaches that retrieve studies based on similarities in the expression data itself have a greater potential of uncovering novel biological insights. Results: We propose an information retrieval method based on differential expression. Our method deals with arbitrary experimental designs and performs competitively with alternative approaches, while making the search results interpretable in terms of differential expression patterns. We show that our model yields meaningful connections between biological conditions from different studies. Finally, we validate a previously unknown connection between malignant pleural mesothelioma and SIM2s suggested by our method, via real-time polymerase chain reaction in an independent set of mesothelioma samples. Availability: Supplementary data and source code are available from http://www.ebi.ac.uk/fg/research/rex. Contact: samuel.kaski@aalto.fi Supplementary Information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- José Caldas
- Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Helsinki, Finland
|
27
|
Caldas J, Kaski S. Hierarchical generative biclustering for microRNA expression analysis. J Comput Biol 2011; 18:251-61. [PMID: 21385032 DOI: 10.1089/cmb.2010.0256] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Indexed: 11/12/2022] Open
Abstract
Clustering methods are a useful and common first step in gene expression studies, but the results may be hard to interpret. We bring in explicitly an indicator of which genes tie each cluster, changing the setup to biclustering. Furthermore, we make the indicators hierarchical, resulting in a hierarchy of progressively more specific biclusters. A non-parametric Bayesian formulation makes the model rigorous yet flexible and computations feasible. The model can additionally be used in information retrieval for relating relevant samples. We show that the model outperforms four other biclustering procedures on a large miRNA data set. We also demonstrate the model's added interpretability and information retrieval capability in a case study. Software is publicly available at http://research.ics.tkk.fi/mi/software/treebic/.
Affiliation(s)
- José Caldas
- Aalto University School of Science and Technology, Department of Information and Computer Science, Helsinki Institute for Information Technology, Aalto, Finland
|
28
|
Kilpinen SK, Ojala KA, Kallioniemi OP. Alignment of gene expression profiles from test samples against a reference database: New method for context-specific interpretation of microarray data. BioData Min 2011; 4:5. [PMID: 21453538 PMCID: PMC3080808 DOI: 10.1186/1756-0381-4-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 07/15/2010] [Accepted: 03/31/2011] [Indexed: 02/07/2023] Open
Abstract
Background Gene expression microarray data have been organized and made available as public databases, but the utilization of such highly heterogeneous reference datasets in the interpretation of data from individual test samples is not as developed as e.g. in the field of nucleotide sequence comparisons. We have created a rapid and powerful approach for the alignment of microarray gene expression profiles (AGEP) from test samples with those contained in a large annotated public reference database and demonstrate here how this can facilitate interpretation of microarray data from individual samples. Methods AGEP is based on the calculation of kernel density distributions for the levels of expression of each gene in each reference tissue type and provides a quantitation of the similarity between the test sample and the reference tissue types as well as the identity of the typical and atypical genes in each comparison. As a reference database, we used 1654 samples from 44 normal tissues (extracted from the Genesapiens database). Results Using leave-one-out validation, AGEP correctly defined the tissue of origin for 1521 (93.6%) of all the 1654 samples in the original database. Independent validation of 195 external normal tissue samples resulted in 87% accuracy for the exact tissue type and 97% accuracy with related tissue types. AGEP analysis of 10 Duchenne muscular dystrophy (DMD) samples provided quantitative description of the key pathogenetic events, such as the extent of inflammation, in individual samples and pinpointed tissue-specific genes whose expression changed (SAMD4A) in DMD. AGEP analysis of microarray data from adipocytic differentiation of mesenchymal stem cells and from normal myeloid cell types and leukemias provided quantitative characterization of the transcriptomic changes during normal and abnormal cell differentiation. 
Conclusions The AGEP method is a widely applicable method for the rapid comprehensive interpretation of microarray data, as proven here by the definition of tissue- and disease-specific changes in gene expression as well as during cellular differentiation. The capability to quantitatively compare data from individual samples against a large-scale annotated reference database represents a widely applicable paradigm for the analysis of all types of high-throughput data. AGEP enables systematic and quantitative comparison of gene expression data from test samples against a comprehensive collection of different cell/tissue types previously studied by the entire research community.
Affiliation(s)
- Sami K Kilpinen
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Tukholmankatu 8, Helsinki, Finland.
|
29
|
Engreitz JM, Morgan AA, Dudley JT, Chen R, Thathoo R, Altman RB, Butte AJ. Content-based microarray search using differential expression profiles. BMC Bioinformatics 2010; 11:603. [PMID: 21172034 PMCID: PMC3022631 DOI: 10.1186/1471-2105-11-603] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.6] [Received: 06/30/2010] [Accepted: 12/21/2010] [Indexed: 12/20/2022] Open
Abstract
Background With the expansion of public repositories such as the Gene Expression Omnibus (GEO), we are rapidly cataloging cellular transcriptional responses to diverse experimental conditions. Methods that query these repositories based on gene expression content, rather than textual annotations, may enable more effective experiment retrieval as well as the discovery of novel associations between drugs, diseases, and other perturbations. Results We develop methods to retrieve gene expression experiments that differentially express the same transcriptional programs as a query experiment. Avoiding thresholds, we generate differential expression profiles that include a score for each gene measured in an experiment. We use existing and novel dimension reduction and correlation measures to rank relevant experiments in an entirely data-driven manner, allowing emergent features of the data to drive the results. A combination of matrix decomposition and p-weighted Pearson correlation proves the most suitable for comparing differential expression profiles. We apply this method to index all GEO DataSets, and demonstrate the utility of our approach by identifying pathways and conditions relevant to transcription factors Nanog and FoxO3. Conclusions Content-based gene expression search generates relevant hypotheses for biological inquiry. Experiments across platforms, tissue types, and protocols inform the analysis of new datasets.
Affiliation(s)
- Jesse M Engreitz
- Department of Bioengineering, Stanford University School of Medicine, CA, USA
|
30
|
Freudenberg JM, Sivaganesan S, Phatak M, Shinde K, Medvedovic M. Generalized random set framework for functional enrichment analysis using primary genomics datasets. Bioinformatics 2010; 27:70-7. [PMID: 20971985 DOI: 10.1093/bioinformatics/btq593] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 01/01/2023]
Abstract
MOTIVATION Functional enrichment analysis using primary genomics datasets is an emerging approach to complement established methods for functional enrichment based on predefined lists of functionally related genes. Currently used methods depend on creating lists of 'significant' and 'non-significant' genes based on ad hoc significance cutoffs. This can lead to loss of statistical power and can introduce biases affecting the interpretation of experimental results. RESULTS We developed and validated a new statistical framework, generalized random set (GRS) analysis, for comparing the genomic signatures in two datasets without the need for gene categorization. In our tests, GRS produced correct measures of statistical significance, and it showed dramatic improvement in the statistical power over other methods currently used in this setting. We also developed a procedure for identifying genes driving the concordance of the genomics profiles and demonstrated a dramatic improvement in functional coherence of genes identified in such analysis. AVAILABILITY GRS can be downloaded as part of the R package CLEAN from http://ClusterAnalysis.org/. An online implementation is available at http://GenomicsPortals.org/.
Affiliation(s)
- Johannes M Freudenberg
- Department of Environmental Health, University of Cincinnati College of Medicine, Cincinnati, OH 45267, USA
|
31
|
Abeel T, de Ridder J, Peixoto L. Highlights from the 5th International Society for Computational Biology Student Council Symposium at the 17th Annual International Conference on Intelligent Systems for Molecular Biology and the 8th European Conference on Computational Biology. BMC Bioinformatics 2009; 10 Suppl 13:I1. [PMID: 19840405 PMCID: PMC2764124 DOI: 10.1186/1471-2105-10-s13-i1] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Indexed: 11/24/2022] Open
Affiliation(s)
- Thomas Abeel
- Department of Plant Systems Biology, VIB, Ghent University, Gent, Belgium.
|
32
|
Caldas J, Gehlenborg N, Faisal A, Brazma A, Kaski S. Probabilistic retrieval and visualization of biologically relevant microarray experiments. BMC Bioinformatics 2009. [PMCID: PMC2764132 DOI: 10.1186/1471-2105-10-s13-p1] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/26/2022] Open
|