1
Zhao Z, Zobolas J, Zucknick M, Aittokallio T. Tutorial on survival modeling with applications to omics data. Bioinformatics 2024; 40:btae132. [PMID: 38445722 PMCID: PMC10973942 DOI: 10.1093/bioinformatics/btae132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2023] [Revised: 02/22/2024] [Accepted: 03/04/2024] [Indexed: 03/07/2024] Open
Abstract
MOTIVATION: Identification of genomic, molecular and clinical markers prognostic of patient survival is important for developing personalized disease prevention, diagnostic and treatment approaches. Modern omics technologies have made it possible to investigate the prognostic impact of markers at multiple molecular levels, including genomics, epigenomics, transcriptomics, proteomics and metabolomics, and how these potential risk factors complement clinical characterization of patient outcomes for survival prognosis. However, the massive sizes of the omics datasets, along with their correlation structures, pose challenges for studying relationships between the molecular information and patients' survival outcomes.
RESULTS: We present a general workflow for survival analysis that is applicable to high-dimensional omics data as inputs when identifying survival-associated features and validating survival models. In particular, we focus on the commonly used Cox-type penalized regressions and hierarchical Bayesian models for feature selection in survival analysis, which are especially useful for high-dimensional data, but the framework is applicable more generally.
AVAILABILITY AND IMPLEMENTATION: A step-by-step R tutorial using The Cancer Genome Atlas survival and omics data for the execution and evaluation of survival models has been made available at https://ocbe-uio.github.io/survomics.
Affiliation(s)
- Zhi Zhao
  - Oslo Centre for Biostatistics and Epidemiology (OCBE), Department of Biostatistics, Faculty of Medicine, University of Oslo, Oslo 0372, Norway
  - Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital, Oslo 0310, Norway
- John Zobolas
  - Oslo Centre for Biostatistics and Epidemiology (OCBE), Department of Biostatistics, Faculty of Medicine, University of Oslo, Oslo 0372, Norway
  - Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital, Oslo 0310, Norway
- Manuela Zucknick
  - Oslo Centre for Biostatistics and Epidemiology (OCBE), Department of Biostatistics, Faculty of Medicine, University of Oslo, Oslo 0372, Norway
  - Oslo Centre for Biostatistics and Epidemiology (OCBE), Research Support Services, Oslo University Hospital, Oslo 0372, Norway
- Tero Aittokallio
  - Oslo Centre for Biostatistics and Epidemiology (OCBE), Department of Biostatistics, Faculty of Medicine, University of Oslo, Oslo 0372, Norway
  - Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital, Oslo 0310, Norway
  - Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki FI-00014, Finland
2
Goh WWB, Hui HWH, Wong L. How missing value imputation is confounded with batch effects and what you can do about it. Drug Discov Today 2023; 28:103661. [PMID: 37301250 DOI: 10.1016/j.drudis.2023.103661] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2023] [Revised: 05/31/2023] [Accepted: 06/05/2023] [Indexed: 06/12/2023]
Abstract
In data-processing pipelines, upstream steps can influence downstream processes because of their sequential nature. Among these data-processing steps, batch effect (BE) correction (BEC) and missing value imputation (MVI) are crucial for ensuring data suitability for advanced modeling and reducing the likelihood of false discoveries. Although BEC-MVI interactions are not well studied, they are ultimately interdependent. Batch sensitization can improve the quality of MVI. Conversely, accounting for missingness also improves proper BE estimation in BEC. Here, we discuss how BEC and MVI are interconnected and interdependent. We show how batch sensitization can improve any MVI and bring attention to the idea of BE-associated missing values (BEAMs). Finally, we discuss how batch-class imbalance problems can be mitigated by borrowing ideas from machine learning.
Affiliation(s)
- Wilson Wen Bin Goh
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
  - School of Biological Sciences, Nanyang Technological University, Singapore
  - Center for Biomedical Informatics, Nanyang Technological University, Singapore
- Harvard Wai Hann Hui
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
  - School of Biological Sciences, Nanyang Technological University, Singapore
- Limsoon Wong
  - Department of Computer Science, National University of Singapore, Singapore
  - Department of Pathology, National University of Singapore, Singapore
3
General Trends of the Camelidae Antibody VHHs Domain Dynamics. Int J Mol Sci 2023; 24:ijms24054511. [PMID: 36901942 PMCID: PMC10003728 DOI: 10.3390/ijms24054511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2023] [Revised: 02/22/2023] [Accepted: 02/23/2023] [Indexed: 03/03/2023] Open
Abstract
Conformational flexibility plays an essential role in the functional and structural stability of antibodies: it facilitates antigen-antibody interactions and determines their strength. Camelidae express an interesting subtype of single-chain antibody, named Heavy Chain only Antibody. These antibodies have only one N-terminal Variable domain (VHH) per chain, composed of Frameworks (FRs) and Complementarity Determining Regions (CDRs) like their VH and VL counterparts in IgG. Even when expressed independently, VHH domains display excellent solubility and (thermo)stability, which helps them to retain their impressive interaction capabilities. Sequence and structural features of VHH domains contributing to these abilities have already been studied in comparison with classical antibodies. To obtain the broadest view and understand the changes in dynamics of these macromolecules, large-scale molecular dynamics simulations of a large number of non-redundant VHH structures have been performed for the first time. This analysis reveals the most prevalent movements in these domains and four main classes of VHH dynamics. Diverse local changes of various intensities were observed in CDRs. Similarly, different types of constraints were observed in CDRs, while FRs close to CDRs were sometimes primarily impacted. This study sheds light on the changes in flexibility in different regions of VHHs that may impact their in silico design.
4
Pati SK, Gupta MK, Shai R, Banerjee A, Ghosh A. Missing value estimation of microarray data using Sim-GAN. Knowl Inf Syst 2022. [DOI: 10.1007/s10115-022-01718-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/16/2022]
5
Guo G, Niu R, Qian G, Song H, Lu T. Trimmed scores regression for k-means clustering data with high-missing ratio. Commun Stat Simul Comput 2022. [DOI: 10.1080/03610918.2022.2091779] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Affiliation(s)
- Guangbao Guo
  - School of Mathematics and Statistics, Shandong University of Technology, Zibo, China
- Ruiling Niu
  - School of Mathematics and Statistics, Shandong University of Technology, Zibo, China
- Guoqi Qian
  - School of Mathematics and Statistics, The University of Melbourne, Australia
- Haoyue Song
  - School of Mathematics and Statistics, Shandong University of Technology, Zibo, China
- Tao Lu
  - School of Mathematics and Statistics, Shandong University of Technology, Zibo, China
6
Soemartojo SM, Siswantining T, Fernando Y, Sarwinda D, Al-Ash HS, Syarofina S, Saputra N. Iterative bicluster-based Bayesian principal component analysis and least squares for missing-value imputation in microarray and RNA-sequencing data. Math Biosci Eng 2022; 19:8741-8759. [PMID: 35942733 DOI: 10.3934/mbe.2022405] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Microarray and RNA-sequencing (RNA-seq) techniques each produce gene expression data that can be expressed as a matrix that often contains missing values. Thus, a process of missing-value imputation that uses coherence information of the dataset is necessary. Existing imputation methods, such as iterative bicluster-based least squares (bi-iLS), use biclustering to estimate the missing values because genes are only similar under correlative experimental conditions. They also use the row average to obtain a temporary complete matrix, which is considered a flaw: the row average uses only the information of an individual row and therefore cannot reflect the real structure of the dataset. Therefore, we propose the use of Bayesian principal component analysis (BPCA) to obtain the temporary complete matrix instead of using the row average in bi-iLS. This alteration produces a new missing-value imputation method called iterative bicluster-based Bayesian principal component analysis and least squares (bi-BPCA-iLS). Several experiments have been conducted on two-dimensional independent gene expression datasets: a microarray dataset (the cell-cycle expression dataset of the yeast Saccharomyces cerevisiae) and an RNA-seq dataset (gene expression data from Schizosaccharomyces pombe). On the microarray dataset, our proposed bi-BPCA-iLS method showed a significant overall improvement in the normalized root mean square error (NRMSE) of 10.6% over local least squares (LLS) and 0.6% over bi-iLS. On the RNA-seq dataset, bi-BPCA-iLS showed an overall improvement in NRMSE of 8.2% over LLS and 3.1% over bi-iLS. The additional computational time of bi-BPCA-iLS is not significant compared to bi-iLS.
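The abstract above contrasts a row-average "temporary complete matrix" with better initial estimates and scores methods by NRMSE. As a rough illustration only, the sketch below fills missing cells with the row average and computes one common NRMSE variant (root mean squared error over the masked cells, divided by the standard deviation of the true values in those cells). The exact NRMSE definition used in the paper is not given here, so this normalization is an assumption, and the function names `row_mean_impute` and `nrmse` and the toy matrix are hypothetical.

```python
import math
from statistics import pvariance


def row_mean_impute(matrix):
    """Fill each missing cell (None) with the mean of the observed cells in its row."""
    completed = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        if not observed:
            raise ValueError("cannot impute a row with no observed values")
        row_mean = sum(observed) / len(observed)
        completed.append([row_mean if v is None else v for v in row])
    return completed


def nrmse(truth, imputed, masked_cells):
    """Assumed NRMSE variant: sqrt(mean squared error / variance of true values),
    evaluated only on the cells that were masked out."""
    true_vals = [truth[i][j] for i, j in masked_cells]
    sq_err = [(imputed[i][j] - truth[i][j]) ** 2 for i, j in masked_cells]
    mse = sum(sq_err) / len(sq_err)
    return math.sqrt(mse / pvariance(true_vals))


# Toy example: hide two known cells, impute them, and score the result.
truth = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
masked_cells = [(0, 2), (1, 0)]
with_missing = [[1.0, 2.0, None], [None, 6.0, 8.0]]

completed = row_mean_impute(with_missing)
score = nrmse(truth, completed, masked_cells)
print(completed, score)
```

Because each filled value depends only on its own row, any structure shared across genes or conditions is ignored, which is the limitation the abstract attributes to the row average.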
Affiliation(s)
- Saskya Mary Soemartojo
  - Department of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Indonesia, Indonesia
- Titin Siswantining
  - Department of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Indonesia, Indonesia
- Yoel Fernando
  - Department of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Indonesia, Indonesia
- Devvi Sarwinda
  - Department of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Indonesia, Indonesia
- Herley Shaori Al-Ash
  - Department of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Indonesia, Indonesia
- Sarah Syarofina
  - Department of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Indonesia, Indonesia
- Noval Saputra
  - Department of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Indonesia, Indonesia
7
Kumar N, Hoque MA, Sugimoto M. Kernel weighted least square approach for imputing missing values of metabolomics data. Sci Rep 2021; 11:11108. [PMID: 34045614 PMCID: PMC8159923 DOI: 10.1038/s41598-021-90654-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 01/04/2021] [Accepted: 05/13/2021] [Indexed: 01/26/2023] Open
Abstract
Mass spectrometry is a modern and sophisticated high-throughput analytical technique that enables large-scale metabolomic analyses. It yields a high-dimensional, large-scale matrix (samples × metabolites) of quantified data that often contains missing cells as well as outliers, which arise for several reasons, including technical and biological sources. Although several missing data imputation techniques are described in the literature, the conventional techniques address only the missing value problem and do not handle outliers, so outliers in the dataset decrease the accuracy of the imputation. We developed a new missing data imputation technique, based on a kernel weight function, that addresses both missing values and outliers. We evaluated the performance of the proposed method and of other conventional and recently developed imputation techniques using both artificially generated data and experimentally measured data, in both the absence and presence of different rates of outliers. Performance on both artificial data and real metabolomics data indicates the superiority of our proposed kernel weight-based missing data imputation technique over the existing alternatives. For user convenience, an R package of the proposed kernel weight-based missing value imputation technique was developed, which is available at https://github.com/NishithPaul/tWLSA.
Affiliation(s)
- Nishith Kumar
  - Department of Statistics, Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, Bangladesh
- Md Aminul Hoque
  - Department of Statistics, University of Rajshahi, Rajshahi, Bangladesh
- Masahiro Sugimoto
  - Health Promotion and Preemptive Medicine, Research and Development Center for Minimally Invasive Therapies, Tokyo Medical University, Shinjuku, Tokyo, 160-8402, Japan
  - Institute for Advanced Biosciences, Keio University, Tsuruoka, 997-0052, Japan
8
Yang L, Qin Y, Jian C. Screening for Core Genes Related to Pathogenesis of Alzheimer's Disease. Front Cell Dev Biol 2021; 9:668738. [PMID: 33968940 PMCID: PMC8101499 DOI: 10.3389/fcell.2021.668738] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/17/2021] [Accepted: 04/01/2021] [Indexed: 12/18/2022] Open
Abstract
Alzheimer’s disease (AD), a nervous system disease, lacks effective therapies at present. RNA expression is the basic way to regulate life activities, and identifying related characteristics in AD patients may aid the exploration of AD pathogenesis and treatment. This study developed a classifier that could accurately classify AD patients and healthy people, and then obtained 3 core genes that may be related to the pathogenesis of AD. To this end, RNA expression data of the middle temporal gyrus of AD patients were first downloaded from the GEO database; missing values were filled in with the k-Nearest Neighbor (KNN) algorithm, and the data were then normalized using the limma package. Afterwards, the 500 genes with the highest feature importance were obtained through Max-Relevance and Min-Redundancy (mRMR) analysis, and based on these genes, a series of AD classifiers were constructed with the Support Vector Machine (SVM), Random Forest (RF), and KNN algorithms. The KNN classifier with the highest Matthews correlation coefficient (MCC) in incremental feature selection (IFS) analysis, composed of 14 genes, was identified as the best AD classifier. According to this analysis, the 14 genes played a pivotal role in the determination of AD and may be core genes associated with the pathogenesis of AD. Finally, protein-protein interaction (PPI) network and Random Walk with Restart (RWR) analyses were applied to obtain core gene-associated genes, and key pathways related to AD were further analyzed. Overall, this study contributed to a deeper understanding of AD pathogenesis and provided theoretical guidance for related research and experiments.
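The classifier above is selected by its Matthews correlation coefficient. A minimal sketch of that metric for binary labels follows, using the standard confusion-matrix formula; the function name `mcc` and the toy labels are hypothetical and are not taken from the study.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded as 0/1."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the coefficient is reported as 0 when a margin is empty.
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy labels: 1 = AD, 0 = healthy control (illustrative values only).
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
print(mcc(y_true, y_pred))
```

In an incremental feature selection loop, a score like this would be computed for each candidate gene subset and the subset with the highest value retained.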
Affiliation(s)
- Longxiu Yang
  - Department of Neurology, The First Affiliated Hospital of Guangxi Medical University, Nanning, China
- Yuan Qin
  - Department of Neurology, The First Affiliated Hospital of Guangxi Medical University, Nanning, China
- Chongdong Jian
  - Department of Neurology, The Affiliated Hospital of Youjiang Medical University for Nationalities, Baise, China
9
Mancuso CA, Canfield JL, Singla D, Krishnan A. A flexible, interpretable, and accurate approach for imputing the expression of unmeasured genes. Nucleic Acids Res 2020; 48:e125. [PMID: 33074331 PMCID: PMC7708069 DOI: 10.1093/nar/gkaa881] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 04/09/2020] [Revised: 08/24/2020] [Accepted: 09/28/2020] [Indexed: 12/15/2022] Open
Abstract
While there are >2 million publicly-available human microarray gene-expression profiles, these profiles were measured using a variety of platforms that each cover a pre-defined, limited set of genes. Therefore, key to reanalyzing and integrating this massive data collection are methods that can computationally reconstitute the complete transcriptome in partially-measured microarray samples by imputing the expression of unmeasured genes. Current state-of-the-art imputation methods are tailored to samples from a specific platform and rely on gene-gene relationships regardless of the biological context of the target sample. We show that sparse regression models that capture sample-sample relationships (termed SampleLASSO), built on-the-fly for each new target sample to be imputed, outperform models based on fixed gene relationships. Extensive evaluation involving three machine learning algorithms (LASSO, k-nearest-neighbors, and deep-neural-networks), two gene subsets (GPL96–570 and LINCS), and multiple imputation tasks (within and across microarray/RNA-seq datasets) establishes that SampleLASSO is the most accurate model. Additionally, we demonstrate the biological interpretability of this method by showing that, for imputing a target sample from a certain tissue, SampleLASSO automatically leverages training samples from the same tissue. Thus, SampleLASSO is a simple, yet powerful and flexible approach for harmonizing large-scale gene-expression data.
Affiliation(s)
- Christopher A Mancuso
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
- Jacob L Canfield
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
- Deepak Singla
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
  - Indian Institute of Technology, Delhi, India
- Arjun Krishnan
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
10
Loreck K, Mitrenga S, Heinze R, Ehricht R, Engemann C, Lueken C, Ploetz M, Greiner M, Meemken D. Use of meat juice and blood serum with a miniaturised protein microarray assay to develop a multi-parameter IgG screening test with high sample throughput potential for slaughtering pigs. BMC Vet Res 2020; 16:106. [PMID: 32252773 PMCID: PMC7137480 DOI: 10.1186/s12917-020-02308-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 09/26/2019] [Accepted: 03/10/2020] [Indexed: 01/23/2023] Open
Abstract
BACKGROUND: Serological screening of pig herds at the abattoir is considered a potential tool to improve meat inspection procedures and herd health management. Therefore, we previously reported the feasibility of a miniaturised protein microarray as a new serological IgG screening test for zoonotic agents and production diseases in pigs. The present study investigates whether the protein microarray-based assay is applicable for high sample throughput using either blood serum or meat juice.
MATERIAL AND METHODS: Microarrays with 12 different antigens were produced by Abbott (formerly Alere Technologies GmbH), Jena, Germany, in a previously offered 'ArrayTube' platform and in an 'ArrayStrip' platform for large-scale use. A test protocol for the use of meat juice on both microarray platforms was developed. Agreement between serum and meat juice was analysed with 88 paired samples from three German abattoirs. Serum was diluted 1:50 and meat juice 1:2. ELISA results for all tested antigens from a preceding study were used as the reference test to perform Receiver Operating Characteristic analysis for both test specimens on both microarray platforms.
RESULTS: High area under curve values (AUC > 0.7) were calculated for the analysis of T. gondii (0.87), Y. enterocolitica (0.97), Mycoplasma hyopneumoniae (0.84) and Actinobacillus pleuropneumoniae (0.71) with serum as the test specimen and for T. gondii (0.99), Y. enterocolitica (0.94), PRRSV (0.88), A. pleuropneumoniae (0.78) and Salmonella spp. (0.72) with meat juice as the test specimen on the ArrayStrip platform. Cohen's kappa values of 0.92 for T. gondii and 0.82 for Y. enterocolitica were obtained for the comparison between serum and meat juice. When applying the new method in two further laboratories, kappa values between 0.63 and 0.94 were achieved between the laboratories for these two pathogens.
CONCLUSION: Further development of a miniaturised pig-specific IgG protein microarray assay showed that meat juice can be used on microarray platforms. Two out of twelve tested antigens (T. gondii, Y. enterocolitica) showed high test accuracy on the ArrayTube and the ArrayStrip platform with both sample materials.
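Agreement between serum and meat juice in this abstract is reported as Cohen's kappa. The sketch below shows how such a value could be computed from paired positive/negative calls, using the standard definition (observed agreement corrected for chance agreement). The function name `cohens_kappa` and the ten paired calls are hypothetical and do not come from the study's 88 samples.

```python
from collections import Counter


def cohens_kappa(calls_a, calls_b):
    """Cohen's kappa for two sets of categorical calls on the same samples."""
    if len(calls_a) != len(calls_b) or not calls_a:
        raise ValueError("need two non-empty call lists of equal length")
    n = len(calls_a)
    # Observed agreement: fraction of samples with identical calls.
    p_observed = sum(a == b for a, b in zip(calls_a, calls_b)) / n
    # Chance agreement: product of marginal frequencies, summed over categories.
    counts_a, counts_b = Counter(calls_a), Counter(calls_b)
    categories = set(counts_a) | set(counts_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in categories) / n ** 2
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_chance) / (1 - p_chance)


# Illustrative paired calls (1 = antibody-positive, 0 = negative).
serum = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
meat_juice = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]
print(cohens_kappa(serum, meat_juice))
```

With nine of ten calls matching and balanced marginals, chance agreement is 0.5, so raw agreement of 0.9 corresponds to a kappa of about 0.8.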
Affiliation(s)
- Katharina Loreck
  - Institute for Food Quality and Food Safety, University of Veterinary Medicine Hannover, Foundation, Bischofsholer Damm 15, D-30173, Hannover, Germany
- Sylvia Mitrenga
  - Institute for Food Quality and Food Safety, University of Veterinary Medicine Hannover, Foundation, Bischofsholer Damm 15, D-30173, Hannover, Germany
- Regina Heinze
  - Abbott (Alere Technologies GmbH), Löbstedter Straße 103-105, D-07749, Jena, Germany
- Ralf Ehricht
  - Department for Optical Molecular Diagnostics and Systems Technology, Leibniz-Institute of Photonic Technology (IPHT), Albert-Einstein-Straße 9, D-07745, Jena, Germany
  - InfectoGnostics Research Campus, Centre for Applied Research, Philosophenweg 7, D-07743, Jena, Germany
  - Institute of Physical Chemistry, Friedrich Schiller University Jena, Helmholtzweg 4, D-07737, Jena, Germany
- Claudia Engemann
  - Indical Bioscience GmbH, Deutscher Platz 5b, D-04103, Leipzig, Germany
- Caroline Lueken
  - LUFA Nord-West, Institut für Tiergesundheit, Ammerländer Heerstraße 123, D-26129, Oldenburg, Germany
- Madeleine Ploetz
  - Institute for Food Quality and Food Safety, University of Veterinary Medicine Hannover, Foundation, Bischofsholer Damm 15, D-30173, Hannover, Germany
- Matthias Greiner
  - Institute for Food Quality and Food Safety, University of Veterinary Medicine Hannover, Foundation, Bischofsholer Damm 15, D-30173, Hannover, Germany
  - Department of Exposure, German Federal Institute for Risk Assessment (BfR), Max-Dohrn-Straße 8-10, D-10589, Berlin, Germany
- Diana Meemken
  - Institute of Food Safety and Food Hygiene, Section Meat Hygiene, Freie Universität Berlin, Königsweg 67, D-14163, Berlin, Germany
11
Loreck K, Mitrenga S, Meemken D, Heinze R, Reissig A, Mueller E, Ehricht R, Engemann C, Greiner M. Development of a miniaturized protein microarray as a new serological IgG screening test for zoonotic agents and production diseases in pigs. PLoS One 2019; 14:e0217290. [PMID: 31116794 PMCID: PMC6530865 DOI: 10.1371/journal.pone.0217290] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 01/22/2019] [Accepted: 05/08/2019] [Indexed: 12/17/2022] Open
Abstract
In order to monitor the occurrence of zoonotic agents in pig herds as well as to improve herd health management, the development of new cost-effective diagnostic methods for pigs is necessary. In this study, a protein microarray-based assay for the simultaneous detection of immunoglobulin G (IgG) antibodies against different zoonotic agents and pathogens causing production diseases in pigs was developed. Therefore, antigens of ten different important swine pathogens (Toxoplasma gondii, Yersinia enterocolitica, Salmonella spp., Trichinella spp., Mycobacterium avium, Hepatitis E virus, Mycoplasma hyopneumoniae, Actinobacillus pleuropneumoniae, the porcine reproductive and respiratory syndrome virus, Influenza A virus) were spotted and covalently immobilized as 'antigen-spots' on microarray chips in order to test pig serum for the occurrence of antibodies. Pig serum was sampled at three German abattoirs and ELISA tests for the different pathogens were conducted with the purpose of creating a panel of reference samples for microarray analysis. To evaluate the accuracy of the antigens on the microarray, receiver operating characteristic (ROC) curve analysis using the ELISA test results as reference was performed for the different antigens. High area under curve values were achieved for the antigens of two zoonotic agents: Toxoplasma gondii (0.91), Yersinia enterocolitica (0.97) and for three production diseases: Actinobacillus pleuropneumoniae (0.77), Mycoplasma hyopneumoniae (0.94) and the porcine reproductive and respiratory syndrome virus (0.87). With the help of the newly developed microarray assay, collecting data on the occurrence of antibodies against zoonotic agents and production diseases in pig herds could be minimized to one measurement, resulting in an efficient screening test.
Affiliation(s)
- Katharina Loreck
  - Institute for Food Quality and Food Safety, University of Veterinary Medicine Hannover, Foundation, Hannover, Germany
- Sylvia Mitrenga
  - Institute for Food Quality and Food Safety, University of Veterinary Medicine Hannover, Foundation, Hannover, Germany
- Diana Meemken
  - Institute of Food Safety and Food Hygiene, Freie Universität Berlin, Berlin, Germany
- Annett Reissig
  - Leibniz-Institute of Photonic Technology (IPHT), Department for Optical Molecular Diagnostics and Systems Technology, Jena, Germany
  - InfectoGnostics Research Campus, Centre for Applied Research, Jena, Germany
- Elke Mueller
  - Leibniz-Institute of Photonic Technology (IPHT), Department for Optical Molecular Diagnostics and Systems Technology, Jena, Germany
  - InfectoGnostics Research Campus, Centre for Applied Research, Jena, Germany
- Ralf Ehricht
  - Leibniz-Institute of Photonic Technology (IPHT), Department for Optical Molecular Diagnostics and Systems Technology, Jena, Germany
  - InfectoGnostics Research Campus, Centre for Applied Research, Jena, Germany
- Matthias Greiner
  - Institute for Food Quality and Food Safety, University of Veterinary Medicine Hannover, Foundation, Hannover, Germany
  - Department of Exposure, German Federal Institute for Risk Assessment (BfR), Berlin, Germany
12
Abstract
Cluster analysis has been widely applied by researchers from several scientific fields over the last decades. Advances in knowledge of biological phenomena have revived a great interest in cluster analysis, due in part to the large amount of microarray data. Apart from requiring user-defined parameters, traditional clustering algorithms show clear limitations in handling microarray data owing to its inherent characteristics: high dimensionality with low sample size, high redundancy, and noise. This has motivated the study of clustering algorithms tailored to the task of analyzing microarray data, which continue to be developed and adapted. The present chapter is devoted to reviewing clustering methods with different cluster analysis approaches in the challenging context of microarray data. Furthermore, the validation of the clustering results is briefly discussed by means of validity indexes used to assess the goodness of the number of clusters and the induced cluster assignments.
Affiliation(s)
- Juana-María Vivo
  - Department of Statistics and Operations Research, University of Murcia, Murcia, Spain
13
O'Brien JJ, Gunawardena HP, Paulo JA, Chen X, Ibrahim JG, Gygi SP, Qaqish BF. The effects of nonignorable missing data on label-free mass spectrometry proteomics experiments. Ann Appl Stat 2018; 12:2075-2095. [PMID: 30473739 PMCID: PMC6249692 DOI: 10.1214/18-aoas1144] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Indexed: 12/31/2022]
Abstract
An idealized version of a label-free discovery mass spectrometry proteomics experiment would provide absolute abundance measurements for a whole proteome, across varying conditions. Unfortunately, this ideal is not realized. Measurements are made on peptides requiring an inferential step to obtain protein level estimates. The inference is complicated by experimental factors that necessitate relative abundance estimation and result in widespread non-ignorable missing data. Relative abundance on the log scale takes the form of parameter contrasts. In a complete-case analysis, contrast estimates may be biased by missing data and a substantial amount of useful information will often go unused. To avoid problems with missing data, many analysts have turned to single imputation solutions. Unfortunately, these methods often create further difficulties by hiding inestimable contrasts, preventing the recovery of interblock information and failing to account for imputation uncertainty. To mitigate many of the problems caused by missing values, we propose the use of a Bayesian selection model. Our model is tested on simulated data, real data with simulated missing values, and on a ground truth dilution experiment where all of the true relative changes are known. The analysis suggests that our model, compared with various imputation strategies and complete-case analyses, can increase accuracy and provide substantial improvements to interval coverage.
Affiliation(s)
- Jonathon J O'Brien
  - Department of Cell Biology, Harvard Medical School, 240 Longwood Ave, Boston, MA, 02115, USA
  - Department of Biostatistics, University of North Carolina at Chapel Hill, 135 Dauer Drive, 3101 McGavran-Greenberg Hall, CB 7420, Chapel Hill, NC 27599, USA
  - Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, 120 Mason Farm Rd, Campus Box 7260, Chapel Hill, NC 27599, USA
- Harsha P Gunawardena
  - Department of Cell Biology, Harvard Medical School, 240 Longwood Ave, Boston, MA, 02115, USA
  - Department of Biostatistics, University of North Carolina at Chapel Hill, 135 Dauer Drive, 3101 McGavran-Greenberg Hall, CB 7420, Chapel Hill, NC 27599, USA
  - Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, 120 Mason Farm Rd, Campus Box 7260, Chapel Hill, NC 27599, USA
- Joao A Paulo
  - Department of Cell Biology, Harvard Medical School, 240 Longwood Ave, Boston, MA, 02115, USA
  - Department of Biostatistics, University of North Carolina at Chapel Hill, 135 Dauer Drive, 3101 McGavran-Greenberg Hall, CB 7420, Chapel Hill, NC 27599, USA
  - Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, 120 Mason Farm Rd, Campus Box 7260, Chapel Hill, NC 27599, USA
- Xian Chen
  - Department of Cell Biology, Harvard Medical School, 240 Longwood Ave, Boston, MA, 02115, USA
  - Department of Biostatistics, University of North Carolina at Chapel Hill, 135 Dauer Drive, 3101 McGavran-Greenberg Hall, CB 7420, Chapel Hill, NC 27599, USA
  - Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, 120 Mason Farm Rd, Campus Box 7260, Chapel Hill, NC 27599, USA
- Joseph G Ibrahim
  - Department of Cell Biology, Harvard Medical School, 240 Longwood Ave, Boston, MA, 02115, USA
  - Department of Biostatistics, University of North Carolina at Chapel Hill, 135 Dauer Drive, 3101 McGavran-Greenberg Hall, CB 7420, Chapel Hill, NC 27599, USA
  - Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, 120 Mason Farm Rd, Campus Box 7260, Chapel Hill, NC 27599, USA
- Steven P Gygi
  - Department of Cell Biology, Harvard Medical School, 240 Longwood Ave, Boston, MA, 02115, USA
  - Department of Biostatistics, University of North Carolina at Chapel Hill, 135 Dauer Drive, 3101 McGavran-Greenberg Hall, CB 7420, Chapel Hill, NC 27599, USA
  - Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, 120 Mason Farm Rd, Campus Box 7260, Chapel Hill, NC 27599, USA
- Bahjat F Qaqish
  - Department of Cell Biology, Harvard Medical School, 240 Longwood Ave, Boston, MA, 02115, USA
  - Department of Biostatistics, University of North Carolina at Chapel Hill, 135 Dauer Drive, 3101 McGavran-Greenberg Hall, CB 7420, Chapel Hill, NC 27599, USA
  - Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, 120 Mason Farm Rd, Campus Box 7260, Chapel Hill, NC 27599, USA
| |
Collapse
|
14
|
Kim JS, Kim DS, Lee KC, Lee JS, King GM, Kang S. Microbial community structure and functional potential of lava-formed Gotjawal soils in Jeju, Korea. PLoS One 2018; 13:e0204761. [PMID: 30312313 PMCID: PMC6193574 DOI: 10.1371/journal.pone.0204761] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 05/03/2018] [Accepted: 09/13/2018] [Indexed: 11/19/2022] Open
Abstract
The Gotjawal areas of Jeju Island, Korea, comprise unmanaged forests growing on volcanic soils. They support unique assemblages of vascular plants from both northern and southern hemispheres, but are threatened by human disturbance. The health and ecosystem function of these assemblages likely depend in part on the diversity and community structure of soil microbial communities, about which little is known. To assess the diversity of Gotjawal soil microbial communities, twenty samples were collected in November 2010 from four representative Gotjawal forests. While soil properties and microbial communities measured by 16S rRNA gene sequence data were marginally distinct among sites by PERMANOVA (p = 0.017–0.191), GeoChip data showed significant differences among sites (p < 0.006). Overall gene composition and the composition of three functional gene categories had similar structures and similar associations with environmental factors. Among these communities, phosphorus cycling genes exhibited the most distinct patterns. 16S rRNA gene sequence data resulted in a mean of 777 operational taxonomic units (OTUs), which included the following major phyla: Proteobacteria (27.9%), Actinobacteria (17.7%), Verrucomicrobia (14.3%), Acidobacteria (9.6%), Planctomycetes (9.8%), Bacteroidetes (8.9%), and Chloroflexi (2.2%). Indicator species analysis (ISA) identified the following taxa with high indicator value: uncultured Chlamydiaceae, Caulobacter, uncultured Sinobacteraceae, Paenibacillus, Arenimonas, Clostridium sensu stricto, uncultured Burkholderiales incertae sedis, and Nocardioides in Aewol (AW); Aquicella, uncultured Planctomycetia, and Aciditerrimonas in Gujwa-Seongsan (GS); uncultured Acidobacteria Gp1 and Hamadaea in Hankyeong-Andeok (HA); and Bosea, Haliea, and Telmatocola in Jocheon-Hamdeok (JH) Gotjawal. Collectively, these results demonstrated the uniqueness of microbial communities within each Gotjawal region, likely reflecting different patterns of soil, plant assemblages and microclimates.
Affiliation(s)
- Jong-Shik Kim
- Gyeongbuk Institute for Marine Bioindustry, Uljin, Republic of Korea
- * E-mail: (JSK); (SK)
- Dae-Shin Kim
- World Heritage and Mt. Hallasan Research Institute, Jeju Special Self-Governing Province, Republic of Korea
- Keun Chul Lee
- Korean Collection for Type Cultures, Korea Research Institute of Bioscience and Biotechnology, Jeongup, Republic of Korea
- Jung-Sook Lee
- Korean Collection for Type Cultures, Korea Research Institute of Bioscience and Biotechnology, Jeongup, Republic of Korea
- Gary M. King
- Biological Sciences, Louisiana State University, Baton Rouge, LA, United States of America
- Sanghoon Kang
- Department of Biology, Baylor University, Waco, TX, United States of America
- * E-mail: (JSK); (SK)
15
Taylor SL, Ruhaak LR, Kelly K, Weiss RH, Kim K. Effects of imputation on correlation: implications for analysis of mass spectrometry data from multiple biological matrices. Brief Bioinform 2017; 18:312-320. [PMID: 26896791 DOI: 10.1093/bib/bbw010] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 10/08/2015] [Indexed: 11/14/2022] Open
Abstract
With expanded access to, and decreased costs of, mass spectrometry, investigators are collecting and analyzing multiple biological matrices from the same subject such as serum, plasma, tissue and urine to enhance biomarker discoveries, understanding of disease processes and identification of therapeutic targets. Commonly, each biological matrix is analyzed separately, but multivariate methods such as MANOVAs that combine information from multiple biological matrices are potentially more powerful. However, mass spectrometric data typically contain large amounts of missing values, and imputation is often used to create complete data sets for analysis. The effects of imputation on multiple biological matrix analyses have not been studied. We investigated the effects of seven imputation methods (half minimum substitution, mean substitution, k-nearest neighbors, local least squares regression, Bayesian principal components analysis, singular value decomposition and random forest), on the within-subject correlation of compounds between biological matrices and its consequences on MANOVA results. Through analysis of three real omics data sets and simulation studies, we found the amount of missing data and imputation method to substantially change the between-matrix correlation structure. The magnitude of the correlations was generally reduced in imputed data sets, and this effect increased with the amount of missing data. Significant results from MANOVA testing also were substantially affected. In particular, the number of false positives increased with the level of missing data for all imputation methods. No one imputation method was universally the best, but the simple substitution methods (Half Minimum and Mean) consistently performed poorly.
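The attenuation of between-matrix correlation described in this abstract can be reproduced on synthetic data. The sketch below is an illustration only, not the authors' analysis: the variable names `serum` and `urine`, the sample size, the true correlation of 0.8, the 30% missing rate, and the assumption that values go missing completely at random are all choices made for the example (real mass spectrometry missingness often depends on intensity).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic log-scale intensities for one compound measured in two matrices
# (assumed: mean 10, unit variance, true correlation 0.8).
cov = [[1.0, 0.8], [0.8, 1.0]]
z = rng.multivariate_normal([10.0, 10.0], cov, size=n)
serum, urine = z[:, 0], z[:, 1]

# Hide 30% of the urine values completely at random (an assumption).
missing = rng.random(n) < 0.3
observed = urine[~missing]

def substitute(fill_value):
    """Return a copy of `urine` with every missing entry set to `fill_value`."""
    filled = urine.copy()
    filled[missing] = fill_value
    return filled

r_full = np.corrcoef(serum, urine)[0, 1]
r_mean = np.corrcoef(serum, substitute(observed.mean()))[0, 1]
r_halfmin = np.corrcoef(serum, substitute(observed.min() / 2))[0, 1]

print(f"complete data:             {r_full:.3f}")
print(f"mean substitution:         {r_mean:.3f}")
print(f"half-minimum substitution: {r_halfmin:.3f}")
```

Both substitutions fill every gap with a single constant, so the imputed entries carry no information about the paired serum value, which is why the correlation shrinks as the missing fraction grows.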
Affiliation(s)
- Sandra L Taylor
- Division of Biostatistics, Department of Public Health Sciences, University of California School of Medicine, CA, USA
- L Renee Ruhaak
- Department of Chemistry, University of California, CA, USA
- Karen Kelly
- Division of Hematology and Oncology, University of California Davis Comprehensive Cancer Center, Sacramento, California, USA
- Robert H Weiss
- Division of Nephrology, Department of Internal Medicine, University of California, CA, USA
- Kyoungmi Kim
- Division of Biostatistics, Department of Public Health Sciences, University of California, California, USA
16
17
Wu WS, Jhou MJ. MVIAeval: a web tool for comprehensively evaluating the performance of a new missing value imputation algorithm. BMC Bioinformatics 2017; 18:31. [PMID: 28086746 PMCID: PMC5237319 DOI: 10.1186/s12859-016-1429-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 05/15/2016] [Accepted: 12/15/2016] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Missing value imputation is important for microarray data analyses because missing values can significantly degrade the performance of downstream analyses. Although many microarray missing value imputation algorithms have been developed, an objective and comprehensive performance comparison framework is still lacking. To solve this problem, we previously proposed a framework that performs a comprehensive performance comparison of existing algorithms and can also evaluate the performance of a new algorithm. However, constructing this framework is not an easy task for interested researchers. To save researchers' time and effort, here we present an easy-to-use web tool named MVIAeval (Missing Value Imputation Algorithm evaluator) which implements our performance comparison framework. RESULTS MVIAeval provides a user-friendly interface allowing users to upload the R code of their new algorithm and select (i) the test datasets among 20 benchmark microarray (time series and non-time series) datasets, (ii) the compared algorithms among 12 existing algorithms, (iii) the performance indices from three existing ones, (iv) the comprehensive performance scores from two possible choices, and (v) the number of simulation runs. The comprehensive performance comparison results are then generated and shown as both figures and tables. CONCLUSIONS MVIAeval is a useful tool for researchers to easily conduct a comprehensive and objective performance evaluation of their newly developed missing value imputation algorithm for microarray data or any data that can be represented in matrix form (e.g. NGS data or proteomics data). Thus, MVIAeval will greatly expedite progress in research on missing value imputation algorithms.
Affiliation(s)
- Wei-Sheng Wu
- Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan
- Meng-Jhun Jhou
- Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan
18
Chen Y, Wang A, Ding H, Que X, Li Y, An N, Jiang L. A global learning with local preservation method for microarray data imputation. Comput Biol Med 2016; 77:76-89. [PMID: 27522236 DOI: 10.1016/j.compbiomed.2016.08.005] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 04/13/2016] [Revised: 08/04/2016] [Accepted: 08/04/2016] [Indexed: 12/28/2022]
Abstract
Microarray data suffer from missing values for various reasons, including insufficient resolution, image noise, and experimental errors. Because missing values can hinder downstream analysis steps that require complete data as input, it is crucial to be able to estimate the missing values. In this study, we propose a Global Learning with Local Preservation method (GL2P) for imputation of missing values in microarray data. GL2P consists of two components: a local similarity measurement module and a global weighted imputation module. The former uses a local structure preservation scheme to exploit as much information as possible from the observable data, and the latter is responsible for estimating the missing values of a target gene by considering all of its neighbors rather than a subset of them. Furthermore, GL2P imputes the missing values in ascending order according to the rate of missing data for each target gene to fully utilize previously estimated values. To validate the proposed method, we conducted extensive experiments on six benchmarked microarray datasets. We compared GL2P with eight state-of-the-art imputation methods in terms of four performance metrics. The experimental results indicate that GL2P outperforms its competitors in terms of imputation accuracy and better preserves the structure of differentially expressed genes. In addition, GL2P is less sensitive to the number of neighbors than other local learning-based imputation methods.
Affiliation(s)
- Ye Chen
- School of Computer and Information, Hefei University of Technology, Hefei 230009, China
- Aiguo Wang
- School of Computer and Information, Hefei University of Technology, Hefei 230009, China; School of Software, Hefei University of Technology, Hefei 230009, China
- Huitong Ding
- School of Computer and Information, Hefei University of Technology, Hefei 230009, China
- Xia Que
- School of Computer and Information, Hefei University of Technology, Hefei 230009, China
- Yabo Li
- College of Life Sciences, Lanzhou University, Lanzhou 730000, China
- Ning An
- School of Computer and Information, Hefei University of Technology, Hefei 230009, China
- Lili Jiang
- Department of Computing Science, Umeå University, Umeå 90187, Sweden
19
Trends in IT Innovation to Build a Next Generation Bioinformatics Solution to Manage and Analyse Biological Big Data Produced by NGS Technologies. BioMed Research International 2015; 2015:904541. [PMID: 26125026 PMCID: PMC4466500 DOI: 10.1155/2015/904541] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 12/31/2014] [Revised: 04/01/2015] [Accepted: 04/01/2015] [Indexed: 02/07/2023]
Abstract
Sequencing the human genome began in 1994, and 10 years of work were necessary in order to provide a nearly complete sequence. Nowadays, NGS technologies allow sequencing of a whole human genome in a few days. This deluge of data challenges scientists in many ways, as they are faced with data management issues and with analysis and visualization drawbacks due to the limitations of current bioinformatics tools. In this paper, we describe how the NGS Big Data revolution changes the way data are managed and analysed. We present how biologists are confronted with an abundance of methods, tools, and data formats. To overcome these problems, we focus on Big Data Information Technology innovations from web and business intelligence. We underline the interest of NoSQL databases, which are much more efficient than relational databases. Since Big Data leads to a loss of interactivity with data during analysis due to high processing time, we describe solutions from Business Intelligence that allow one to regain interactivity whatever the volume of data. We illustrate this point with a focus on the Amadea platform. Finally, we discuss visualization challenges posed by Big Data and present the latest innovations with JavaScript graphic libraries.
20
de Souto MCP, Jaskowiak PA, Costa IG. Impact of missing data imputation methods on gene expression clustering and classification. BMC Bioinformatics 2015; 16:64. [PMID: 25888091 PMCID: PMC4350881 DOI: 10.1186/s12859-015-0494-3] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Received: 09/24/2014] [Accepted: 02/09/2015] [Indexed: 12/20/2022] Open
Abstract
Background Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have been putting a great deal of effort into presenting systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes. Results and conclusions We performed a broad analysis of the impact of five well-known missing value imputation methods on three clustering and four classification methods, in the context of 12 cancer gene expression datasets. We employed a statistical framework, for the first time in this field, to assess whether different imputation methods improve the performance of the clustering/classification methods. Our results suggest that the imputation methods evaluated have a minor impact on the classification and downstream clustering analyses. Simple methods such as replacing the missing values by mean or the median values performed as well as more complex strategies. The datasets analyzed in this study are available at http://costalab.org/Imputation/.
Affiliation(s)
- Pablo A Jaskowiak
- Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos - SP, Brazil
- Ivan G Costa
- Center of Informatics, Federal University of Pernambuco, Recife - PE, Brazil; IZKF Computational Biology Research Group, Institute for Biomedical Engineering, RWTH Aachen University Medical School, Aachen, Germany
21
Žitnik M, Zupan B. Data Imputation in Epistatic MAPs by Network-Guided Matrix Completion. J Comput Biol 2015; 22:595-608. [PMID: 25658751 DOI: 10.1089/cmb.2014.0158] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 01/05/2023] Open
Abstract
Epistatic miniarray profile (E-MAP) is a popular large-scale genetic interaction discovery platform. E-MAPs benefit from quantitative output, which makes it possible to detect subtle interactions with greater precision. However, due to the limits of biotechnology, E-MAP studies fail to measure genetic interactions for up to 40% of gene pairs in an assay. Missing measurements can be recovered by computational techniques for data imputation, in this way completing the interaction profiles and enabling downstream analysis algorithms that could otherwise be sensitive to missing data values. We introduce a new interaction data imputation method called network-guided matrix completion (NG-MC). The core part of NG-MC is low-rank probabilistic matrix completion that incorporates prior knowledge presented as a collection of gene networks. NG-MC assumes that interactions are transitive, such that latent gene interaction profiles inferred by NG-MC depend on the profiles of their direct neighbors in gene networks. As the NG-MC inference algorithm progresses, it propagates latent interaction profiles through each of the networks and updates gene network weights toward improved prediction. In a study with four different E-MAP data assays and considered protein-protein interaction and gene ontology similarity networks, NG-MC significantly surpassed existing alternative techniques. Inclusion of information from gene networks also allowed NG-MC to predict interactions for genes that were not included in original E-MAP assays, a task that could not be considered by current imputation approaches.
Affiliation(s)
- Marinka Žitnik
- Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
- Blaž Zupan
- Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas
22
Fa R, Nandi AK. Noise Resistant Generalized Parametric Validity Index of Clustering for Gene Expression Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2014; 11:741-752. [PMID: 26356344 DOI: 10.1109/tcbb.2014.2312006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 06/05/2023]
Abstract
Validity indices have been investigated for decades. However, since there is no study of noise-resistance performance of these indices in the literature, there is no guideline for determining the best clustering in noisy data sets, especially microarray data sets. In this paper, we propose a generalized parametric validity (GPV) index which employs two tunable parameters α and β to control the proportions of objects being considered to calculate the dissimilarities. The greatest advantage of the proposed GPV index is its noise-resistance ability, which results from the flexibility of tuning the parameters. Several rules are set to guide the selection of parameter values. To illustrate the noise-resistance performance of the proposed index, we evaluate the GPV index for assessing five clustering algorithms in two gene expression data simulation models with different noise levels and compare the ability of determining the number of clusters with eight existing indices. We also test the GPV in three groups of real gene expression data sets. The experimental results suggest that the proposed GPV index has superior noise-resistance ability and provides fairly accurate judgements.
23
Chiu CC, Chan SY, Wang CC, Wu WS. Missing value imputation for microarray data: a comprehensive comparison study and a web tool. BMC Systems Biology 2013; 7 Suppl 6:S12. [PMID: 24565220 PMCID: PMC4028811 DOI: 10.1186/1752-0509-7-s6-s12] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Indexed: 12/18/2022]
Abstract
BACKGROUND Microarray data usually contain missing values for various reasons. However, most downstream analyses for microarray data require complete datasets. Therefore, accurate algorithms for missing value estimation are needed to improve the performance of microarray data analyses. Although many algorithms have been developed, the choice of the optimal algorithm is still debated. Studies comparing the performance of different algorithms are still not comprehensive, especially in the number of benchmark datasets used, the number of algorithms compared, the rounds of simulation conducted, and the performance measures used. RESULTS In this paper, we performed a comprehensive comparison by using (I) thirteen datasets, (II) nine algorithms, (III) 110 independent runs of simulation, and (IV) three types of measures to evaluate the performance of each imputation algorithm fairly. First, the effects of different types of microarray datasets on the performance of each imputation algorithm were evaluated. Second, we discussed whether datasets from different species have a different impact on the performance of different algorithms. To assess the performance of each algorithm fairly, all evaluations were performed using three types of measures. Our results indicate that the performance of an imputation algorithm mainly depends on the type of dataset but not on the species the samples come from. In addition to the statistical measure, two other measures with biological meanings are useful to reflect the impact of missing value imputation on the downstream data analyses. Our study suggests that local-least-squares-based methods are good choices to handle missing values for most of the microarray datasets. CONCLUSIONS In this work, we carried out a comprehensive comparison of the algorithms for microarray missing value imputation. Based on such a comprehensive comparison, researchers could choose the optimal algorithm for their datasets easily. Moreover, new imputation algorithms could be compared with the existing algorithms using this comparison strategy as a standard protocol. In addition, to assist researchers in dealing with missing values easily, we built a web-based and easy-to-use imputation tool, MissVIA (http://cosbi.ee.ncku.edu.tw/MissVIA), which supports many imputation algorithms. Once users upload a real microarray dataset and choose the imputation algorithms, MissVIA will determine the optimal algorithm for the users' data through a series of simulations, and then the imputed results can be downloaded for the downstream data analyses.
Affiliation(s)
- Chia-Chun Chiu
- Department of Electrical Engineering, National Cheng Kung University, No.1 University Road, 701 Tainan, Taiwan (R. O. C
- Shih-Yao Chan
- Department of Electrical Engineering, National Cheng Kung University, No.1 University Road, 701 Tainan, Taiwan (R. O. C
- Chung-Ching Wang
- Department of Electrical Engineering, National Cheng Kung University, No.1 University Road, 701 Tainan, Taiwan (R. O. C
- Wei-Sheng Wu
- Department of Electrical Engineering, National Cheng Kung University, No.1 University Road, 701 Tainan, Taiwan (R. O. C
24
Oh S, Kang DD, Brock GN, Tseng GC. Biological impact of missing-value imputation on downstream analyses of gene expression profiles. Bioinformatics 2011; 27:78-86. [PMID: 21045072 PMCID: PMC3008641 DOI: 10.1093/bioinformatics/btq613] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.2] [Received: 06/07/2010] [Revised: 10/28/2010] [Accepted: 10/28/2010] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Microarray experiments frequently produce multiple missing values (MVs) due to flaws such as dust, scratches, insufficient resolution or hybridization errors on the chips. Unfortunately, many downstream algorithms require a complete data matrix. The motivation of this work is to determine the impact of MV imputation on downstream analysis, and whether ranking of imputation methods by imputation accuracy correlates well with the biological impact of the imputation. METHODS Using eight datasets for differential expression (DE) and classification analysis and eight datasets for gene clustering, we demonstrate the biological impact of missing-value imputation on statistical downstream analyses, including three commonly employed DE methods, four classifiers and three gene-clustering methods. Correlation between the rankings of imputation methods based on three root-mean squared error (RMSE) measures and the rankings based on the downstream analysis methods was used to investigate which RMSE measure was most consistent with the biological impact measures, and which downstream analysis methods were the most sensitive to the choice of imputation procedure. RESULTS DE was the most sensitive to the choice of imputation procedure, while classification was the least sensitive and clustering was intermediate between the two. The logged RMSE (LRMSE) measure had the highest correlation with the imputation rankings based on the DE results, indicating that the LRMSE is the best representative surrogate among the three RMSE-based measures. Bayesian principal component analysis and least squares adaptive appeared to be the best performing methods in the empirical downstream evaluation.
Affiliation(s)
- Sunghee Oh
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA
25
Liew AWC, Law NF, Yan H. Missing value imputation for gene expression data: computational techniques to recover missing data from available information. Brief Bioinform 2010; 12:498-513. [PMID: 21156727 DOI: 10.1093/bib/bbq080] [Citation(s) in RCA: 92] [Impact Index Per Article: 6.6] [Indexed: 12/19/2022] Open
Affiliation(s)
- Alan Wee-Chung Liew
- School of Information and Communication Technology, Gold Coast Campus, Griffith University, QLD4222, Australia
26
Ryan C, Greene D, Cagney G, Cunningham P. Missing value imputation for epistatic MAPs. BMC Bioinformatics 2010; 11:197. [PMID: 20406472 PMCID: PMC2873538 DOI: 10.1186/1471-2105-11-197] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Received: 10/01/2009] [Accepted: 04/20/2010] [Indexed: 01/07/2023] Open
Abstract
Background Epistatic miniarray profiling (E-MAPs) is a high-throughput approach capable of quantifying aggravating or alleviating genetic interactions between gene pairs. The datasets resulting from E-MAP experiments typically take the form of a symmetric pairwise matrix of interaction scores. These datasets have a significant number of missing values - up to 35% - that can reduce the effectiveness of some data analysis techniques and prevent the use of others. An effective method for imputing interactions would therefore increase the types of possible analysis, as well as increase the potential to identify novel functional interactions between gene pairs. Several methods have been developed to handle missing values in microarray data, but it is unclear how applicable these methods are to E-MAP data because of their pairwise nature and the significantly larger number of missing values. Here we evaluate four alternative imputation strategies, three local (Nearest neighbor-based) and one global (PCA-based), that have been modified to work with symmetric pairwise data. Results We identify different categories for the missing data based on their underlying cause, and show that values from the largest category can be imputed effectively. We compare local and global imputation approaches across a variety of distinct E-MAP datasets, showing that both are competitive and preferable to filling in with zeros. In addition we show that these methods are effective in an E-MAP from a different species, suggesting that pairwise imputation techniques will be increasingly useful as analogous epistasis mapping techniques are developed in different species. We show that strongly alleviating interactions are significantly more difficult to predict than strongly aggravating interactions. Finally we show that imputed interactions, generated using nearest neighbor methods, are enriched for annotations in the same manner as measured interactions. 
Therefore our method potentially expands the number of mapped epistatic interactions. In addition we make implementations of our algorithms available for use by other researchers. Conclusions We address the problem of missing value imputation for E-MAPs, and suggest the use of symmetric nearest neighbor based approaches as they offer consistently accurate imputations across multiple datasets in a tractable manner.
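As a rough illustration of what a nearest-neighbor scheme adapted to symmetric pairwise data might look like, the sketch below estimates a missing score S[i, j] from both directions (genes similar to i, and genes similar to j) and averages the two. This is an assumed reading of the idea, not the authors' published algorithm; the function names, the root-mean-square profile distance, the unweighted average, and k = 3 are all hypothetical choices.

```python
import numpy as np

def profile_distance(a, b):
    """Root-mean-square difference over positions observed in both profiles."""
    shared = ~np.isnan(a) & ~np.isnan(b)
    if not shared.any():
        return np.inf
    return np.sqrt(np.mean((a[shared] - b[shared]) ** 2))

def one_sided_estimate(scores, i, j, k):
    """Average S[g, j] over the k genes g whose profiles are closest to gene i."""
    distances = {}
    for g in range(scores.shape[0]):
        if g in (i, j) or np.isnan(scores[g, j]):
            continue  # neighbor must be a third gene with a measured score for j
        d = profile_distance(scores[i], scores[g])
        if np.isfinite(d):
            distances[g] = d
    nearest = sorted(distances, key=distances.get)[:k]
    if not nearest:
        return np.nan
    return float(np.mean([scores[g, j] for g in nearest]))

def impute_symmetric_knn(scores, k=3):
    """Fill missing off-diagonal entries, keeping the matrix symmetric."""
    filled = scores.copy()
    n = scores.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if np.isnan(scores[i, j]):
                estimates = [one_sided_estimate(scores, i, j, k),
                             one_sided_estimate(scores, j, i, k)]
                estimates = [e for e in estimates if not np.isnan(e)]
                if estimates:
                    filled[i, j] = filled[j, i] = float(np.mean(estimates))
    return filled

# Synthetic symmetric score matrix with one unmeasured gene pair.
rng = np.random.default_rng(1)
raw = rng.normal(size=(6, 6))
scores = (raw + raw.T) / 2
np.fill_diagonal(scores, np.nan)        # self-interactions are undefined
scores[0, 1] = scores[1, 0] = np.nan    # the pair to impute
filled = impute_symmetric_knn(scores, k=3)
print(filled[0, 1], filled[1, 0])
```

Estimates are always computed from the original matrix rather than from previously imputed entries, which keeps the result independent of the order in which pairs are visited.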
Affiliation(s)
- Colm Ryan
- School of Computer Science and Informatics, University College Dublin, Dublin, Ireland
27
Impact of missing value imputation on classification for DNA microarray gene expression data--a model-based study. EURASIP Journal on Bioinformatics & Systems Biology 2010; 2009:504069. [PMID: 20224634 DOI: 10.1155/2009/504069] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 09/18/2009] [Revised: 10/30/2009] [Accepted: 11/25/2009] [Indexed: 11/18/2022]
Abstract
Many missing-value (MV) imputation methods have been developed for microarray data, but only a few studies have investigated the relationship between MV imputation and classification accuracy. Furthermore, these studies are problematic in fundamental steps such as MV generation and classifier error estimation. In this work, we carry out a model-based study that addresses some of the issues in previous studies. Six popular imputation algorithms, two feature selection methods, and three classification rules are considered. The results suggest that it is beneficial to apply MV imputation when the noise level is high, variance is small, or gene-cluster correlation is strong, under small to moderate MV rates. In these cases, if data quality metrics are available, then it may be helpful to consider the data point with poor quality as missing and apply one of the most robust imputation algorithms to estimate the true signal based on the available high-quality data points. However, at large MV rates, we conclude that imputation methods are not recommended. Regarding the MV rate, our results indicate the presence of a peaking phenomenon: performance of imputation methods actually improves initially as the MV rate increases, but after an optimum point, performance quickly deteriorates with increasing MV rates.
|
28
|
Multiple imputations applied to the DREAM3 phosphoproteomics challenge: a winning strategy. PLoS One 2010; 5:e8012. [PMID: 20090915 PMCID: PMC2807461 DOI: 10.1371/journal.pone.0008012] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2009] [Accepted: 10/15/2009] [Indexed: 11/19/2022] Open
Abstract
DREAM is an initiative that allows researchers to assess how well their methods or approaches can describe and predict networks of interacting molecules [1]. Each year, recently acquired datasets are released to predictors ahead of publication. Researchers typically have about three months to predict the masked data or network of interactions, using any predictive method. Predictions are assessed prior to an annual conference where the best predictions are unveiled and discussed. Here we present the strategy we used to make a winning prediction for the DREAM3 phosphoproteomics challenge. We used Amelia II, a multiple imputation software method developed by Gary King, James Honaker and Matthew Blackwell [2] in the context of social sciences, to predict the 476 out of 4624 measurements that had been masked for the challenge. To choose the best possible multiple imputation parameters to apply for the challenge, we evaluated how transforming the data and varying the imputation parameters affected the ability to predict additionally masked data. We discuss the accuracy of our findings and show that multiple imputation applied to this dataset is a powerful method to accurately estimate the missing data. We postulate that multiple imputation methods might become an integral part of experimental design as a means to achieve cost savings or to increase the quantity of samples that could be handled for a given cost.
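The parameter-selection step described here (hide some values that are actually known, impute, and score the imputation on the hidden cells) can be sketched as follows. This is only an illustration of the masking idea: the two mean-based imputers are simple stand-ins and are not Amelia II, and the names `masked_rmse`, `row_mean_impute` and `column_mean_impute` are hypothetical.

```python
import math
import random


def _global_mean(data):
    known = [v for row in data for v in row if v is not None]
    return sum(known) / len(known)


def row_mean_impute(data):
    """Stand-in imputer: replace None by the mean of the observed row values."""
    fallback = _global_mean(data)
    filled = []
    for row in data:
        known = [v for v in row if v is not None]
        mean = sum(known) / len(known) if known else fallback
        filled.append([mean if v is None else v for v in row])
    return filled


def column_mean_impute(data):
    """Stand-in imputer: replace None by the mean of the observed column values."""
    fallback = _global_mean(data)
    means = []
    for col in zip(*data):
        known = [v for v in col if v is not None]
        means.append(sum(known) / len(known) if known else fallback)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in data]


def masked_rmse(data, imputer, fraction=0.2, seed=0):
    """Hide a fraction of the known cells, impute, and return RMSE on them."""
    rng = random.Random(seed)
    known = [(i, j) for i, row in enumerate(data)
             for j, v in enumerate(row) if v is not None]
    hidden = rng.sample(known, max(1, int(fraction * len(known))))
    masked = [list(row) for row in data]
    for i, j in hidden:
        masked[i][j] = None  # additionally masked, the true value is kept in `data`
    filled = imputer(masked)
    squared = sum((filled[i][j] - data[i][j]) ** 2 for i, j in hidden)
    return math.sqrt(squared / len(hidden))
```

Calling `masked_rmse` with the same `seed` for every candidate keeps the hidden cells identical across candidates, so the resulting errors are directly comparable and the candidate with the lowest error can be chosen.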
|
29
|
Celton M, Malpertuy A, Lelandais G, de Brevern AG. Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments. BMC Genomics 2010; 11:15. [PMID: 20056002 PMCID: PMC2827407 DOI: 10.1186/1471-2164-11-15] [Citation(s) in RCA: 66] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2009] [Accepted: 01/07/2010] [Indexed: 11/17/2022] Open
Abstract
Background Microarray technologies produce large amounts of data. In a previous study, we showed the value of the k-Nearest Neighbour approach for restoring missing gene expression values, and its positive impact on gene clustering by a hierarchical algorithm. Since then, numerous replacement methods have been proposed to impute missing values (MVs) for microarray data. In this study, we evaluated twelve different usable methods and their influence on the quality of gene clustering. We used several datasets, from both kinetic and non-kinetic experiments in yeast and human. Results We underline the excellent efficiency of the approaches proposed and implemented by Bo and co-workers, especially the one based on expectation maximization (EM_array). These improvements were also observed for the imputation of extreme values, the values most difficult to predict. We showed that the imputed MVs still have important effects on the stability of the gene clusters. The improvement in the clustering obtained by hierarchical clustering remains limited and is not sufficient to completely restore the correct gene associations. However, a common tendency can be found between the quality of the imputation method and the gene cluster stability. Even if the comparison between clustering algorithms is a complex task, we observed that the k-means approach is more efficient at conserving gene associations. Conclusions More than 6,000,000 independent simulations assessed the quality of 12 imputation methods on five very different biological datasets. Important improvements have thus been made since our last study. The EM_array approach constitutes one efficient method for restoring missing gene expression values, with a lower estimation error level. Nonetheless, the presence of MVs even at a low rate is a major factor of gene cluster instability. Our study highlights the need for a systematic assessment of imputation methods and hence for dedicated benchmarks.
A noticeable point is the specific influence of some biological datasets.
Affiliation(s)
- Magalie Celton
- INSERM UMR-S 726, Equipe de Bioinformatique Génomique et Moléculaire, DSIMB, Université Paris Diderot-Paris 7, 2 place Jussieu, Paris, France
|
30
|
Aittokallio T. Dealing with missing values in large-scale studies: microarray data imputation and beyond. Brief Bioinform 2009; 11:253-64. [DOI: 10.1093/bib/bbp059] [Citation(s) in RCA: 109] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
|
31
|
Hiissa J, Elo LL, Huhtinen K, Perheentupa A, Poutanen M, Aittokallio T. Resampling reveals sample-level differential expression in clinical genome-wide studies. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2009; 13:381-96. [PMID: 19663710 DOI: 10.1089/omi.2009.0027] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
Genome-scale molecular profiling of clinical sample material often results in heterogeneous datasets beyond the capability of standard statistical procedures. Statistical tests for differential expression, in particular, rely upon the assumption that the sample groups being compared are relatively homogeneous. Such an assumption rarely holds in clinical materials, which leads to detection of secondary findings (false positives) or loss of significant targets (false negatives). Here, we introduce a resampling-based procedure, named ReScore, which aggregates individual changes across all the samples while preserving their clinical classes, and thereby provides multiple sets of markers that can effectively characterize distinct sample subsets. When applied to a public leukemia microarray study, the procedure could accurately reveal hidden subgroup structures associated with underlying genotypic abnormalities. The procedure improved both the sensitivity and specificity of the findings, and helped us to identify several disease subtype-specific genes that had remained undetected in conventional analyses. In our endometriosis study, we were able to accurately distinguish between various sources of systematic variation, linked, for example, to tissue-specificity and disease-related factors, many of which would have been missed with standard approaches. The generic procedure should also benefit other global profiling experiments such as those based on mass spectrometry-based proteomic assays.
Affiliation(s)
- Jukka Hiissa
- Biomathematics Research Group, Department of Mathematics, University of Turku, Turku, Finland
|
32
|
Lelandais G, Tanty V, Geneix C, Etchebest C, Jacq C, Devaux F. Genome adaptation to chemical stress: clues from comparative transcriptomics in Saccharomyces cerevisiae and Candida glabrata. Genome Biol 2008; 9:R164. [PMID: 19025642 PMCID: PMC2614496 DOI: 10.1186/gb-2008-9-11-r164] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/06/2008] [Accepted: 11/24/2008] [Indexed: 12/21/2022] Open
Abstract
Comparative transcriptomics of Saccharomyces cerevisiae and Candida glabrata revealed a remarkable conservation of response to drug-induced stress, despite underlying differences in the regulatory networks. Background Recent technical and methodological advances have placed microbial models at the forefront of evolutionary and environmental genomics. To better understand the logic of genetic network evolution, we combined comparative transcriptomics, a differential clustering algorithm and promoter analyses in a study of the evolution of transcriptional networks responding to an antifungal agent in two yeast species: the free-living model organism Saccharomyces cerevisiae and the human pathogen Candida glabrata. Results We found that although the gene expression patterns characterizing the response to drugs were remarkably conserved between the two species, part of the underlying regulatory networks differed. In particular, the roles of the oxidative stress response transcription factors ScYap1p (in S. cerevisiae) and Cgap1p (in C. glabrata) had diverged. The sets of genes whose benomyl response depends on these factors are significantly different. Also, the DNA motifs targeted by ScYap1p and Cgap1p are differently represented in the promoters of these genes, suggesting that the DNA binding properties of the two proteins are slightly different. Experimental assays of ScYap1p and Cgap1p activities in vivo were in accordance with this last observation. Conclusions Based on these results and recently published data, we suggest that the robustness of environmental stress responses among related species contrasts with the rapid evolution of regulatory sequences, and depends on both the coevolution of transcription factor binding properties and the versatility of regulatory associations within transcriptional networks.
Affiliation(s)
- Gaëlle Lelandais
- Equipe de Bioinformatique Génomique et Moléculaire, INSERM UMR S726, Université Paris 7, INTS, 6 rue Alexandre Cabanel, 75015 Paris, France.
|
33
|
Zhang X, Song X, Wang H, Zhang H. Sequential local least squares imputation estimating missing value of microarray data. Comput Biol Med 2008; 38:1112-20. [PMID: 18828999 DOI: 10.1016/j.compbiomed.2008.08.006] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2007] [Revised: 05/17/2008] [Accepted: 08/15/2008] [Indexed: 11/19/2022]
Abstract
Missing values in microarray data can significantly affect subsequent analysis, thus it is important to estimate these missing values accurately. In this paper, a sequential local least squares imputation (SLLSimpute) method is proposed to solve this problem. It estimates missing values sequentially from the gene containing the fewest missing values and partially utilizes these estimated values. In addition, an automatic parameter selection algorithm, which can generate an appropriate number of neighboring genes for each target gene, is presented for parameter estimation. Experimental results confirmed that SLLSimpute method exhibited better estimation ability compared with other currently used imputation methods.
Affiliation(s)
- Xiaobai Zhang
- Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, PR China
|
34
|
|
35
|
Tuikkala J, Elo LL, Nevalainen OS, Aittokallio T. Missing value imputation improves clustering and interpretation of gene expression microarray data. BMC Bioinformatics 2008; 9:202. [PMID: 18423022 PMCID: PMC2386492 DOI: 10.1186/1471-2105-9-202] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2007] [Accepted: 04/18/2008] [Indexed: 11/22/2022] Open
Abstract
Background Missing values frequently pose problems in gene expression microarray experiments as they can hinder downstream analysis of the datasets. While several missing value imputation approaches are available to the microarray users and new ones are constantly being developed, there is no general consensus on how to choose between the different methods since their performance seems to vary drastically depending on the dataset being used. Results We show that this discrepancy can mostly be attributed to the way in which imputation methods have traditionally been developed and evaluated. By comparing a number of advanced imputation methods on recent microarray datasets, we show that even when there are marked differences in the measurement-level imputation accuracies across the datasets, these differences become negligible when the methods are evaluated in terms of how well they can reproduce the original gene clusters or their biological interpretations. Regardless of the evaluation approach, however, imputation always gave better results than ignoring missing data points or replacing them with zeros or average values, emphasizing the continued importance of using more advanced imputation methods. Conclusion The results demonstrate that, while missing values are still severely complicating microarray data analysis, their impact on the discovery of biologically meaningful gene groups can – up to a certain degree – be reduced by using readily available and relatively fast imputation methods, such as the Bayesian Principal Components Algorithm (BPCA).
Affiliation(s)
- Johannes Tuikkala
- Department of Information Technology and TUCS, University of Turku, FI-20014 Turku, Finland.
|
36
|
Brock GN, Shaffer JR, Blakesley RE, Lotz MJ, Tseng GC. Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes. BMC Bioinformatics 2008; 9:12. [PMID: 18186917 PMCID: PMC2253514 DOI: 10.1186/1471-2105-9-12] [Citation(s) in RCA: 77] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2007] [Accepted: 01/10/2008] [Indexed: 01/10/2023] Open
Abstract
Background Gene expression data frequently contain missing values; however, most downstream analyses for microarray experiments require complete data. In the literature many methods have been proposed to estimate missing values via information of the correlation patterns within the gene expression matrix. Each method has its own advantages, but the specific conditions for which each method is preferred remain largely unclear. In this report we describe an extensive evaluation of eight current imputation methods on multiple types of microarray experiments, including time series, multiple exposures, and multiple exposures × time series data. We then introduce two complementary selection schemes for determining the most appropriate imputation method for any given data set. Results We found that the optimal imputation algorithms (LSA, LLS, and BPCA) are all highly competitive with each other, and that no method is uniformly superior in all the data sets we examined. The success of each method can also depend on the underlying "complexity" of the expression data, where we take complexity to indicate the difficulty in mapping the gene expression matrix to a lower-dimensional subspace. We developed an entropy measure to quantify the complexity of expression matrices and found that, by incorporating this information, the entropy-based selection (EBS) scheme is useful for selecting an appropriate imputation algorithm. We further propose a simulation-based self-training selection (STS) scheme. This technique has been used previously for microarray data imputation, but for different purposes. The scheme selects the optimal or near-optimal method with high accuracy but at an increased computational cost. Conclusion Our findings provide insight into the problem of which imputation method is optimal for a given data set. Three top-performing methods (LSA, LLS and BPCA) are competitive with each other.
Global-based imputation methods (PLS, SVD, BPCA) performed better on microarray data with lower complexity, while neighbour-based methods (KNN, OLS, LSA, LLS) performed better in data with higher complexity. We also found that the EBS and STS schemes serve as complementary and effective tools for selecting the optimal imputation algorithm.
Affiliation(s)
- Guy N Brock
- Department of Bioinformatics and Biostatistics, School of Public Health and Information Sciences, University of Louisville, Louisville, KY 40292, USA.
|
37
|
Varshavsky R, Gottlieb A, Horn D, Linial M. Unsupervised feature selection under perturbations: meeting the challenges of biological data. Bioinformatics 2007; 23:3343-9. [PMID: 17989091 DOI: 10.1093/bioinformatics/btm528] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
MOTIVATION Feature selection methods aim to reduce the complexity of data and to uncover the most relevant biological variables. In reality, information in biological datasets is often incomplete as a result of untrustworthy samples and missing values. The reliability of selection methods may therefore be questioned. METHOD Information loss is incorporated into a perturbation scheme, testing which features are stable under it. This method is applied to data analysis by unsupervised feature filtering (UFF). The latter has been shown to be a very successful method in analysis of gene-expression data. RESULTS We find that the UFF quality degrades smoothly with information loss. It remains successful even under substantial damage. Our method allows for selection of a best imputation method on a dataset treated by UFF. More importantly, scoring features according to their stability under information loss is shown to be correlated with biological importance in cancer studies. This scoring may lead to novel biological insights.
Affiliation(s)
- Roy Varshavsky
- School of Computer Science and Engineering, The Hebrew University of Jerusalem 91904, Israel.
|
38
|
Brás LP, Menezes JC. Improving cluster-based missing value estimation of DNA microarray data. Biomolecular Engineering 2007; 24:273-82. [PMID: 17493870 DOI: 10.1016/j.bioeng.2007.04.003] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2006] [Revised: 02/21/2007] [Accepted: 04/12/2007] [Indexed: 10/23/2022]
Abstract
We present a modification of the weighted K-nearest neighbours imputation method (KNNimpute) for missing values (MVs) estimation in microarray data based on the reuse of estimated data. The method was called iterative KNN imputation (IKNNimpute) as the estimation is performed iteratively using the recently estimated values. The estimation efficiency of IKNNimpute was assessed under different conditions (data type, fraction and structure of missing data) by the normalized root mean squared error (NRMSE) and the correlation coefficients between estimated and true values, and compared with that of other cluster-based estimation methods (KNNimpute and sequential KNN). We further investigated the influence of imputation on the detection of differentially expressed genes using SAM by examining the differentially expressed genes that are lost after MV estimation. The performance measures give consistent results, indicating that the iterative procedure of IKNNimpute can enhance the prediction ability of cluster-based methods in the presence of high missing rates, in non-time series experiments and in data sets comprising both time series and non-time series data, because the information of the genes having MVs is used more efficiently and the iterative procedure allows refining the MV estimates. More importantly, IKNN has a smaller detrimental effect on the detection of differentially expressed genes.
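The iterative reuse of estimates and the NRMSE criterion mentioned in this abstract can be sketched roughly as below. This is a loose illustration, not the published IKNNimpute code: the row-mean initialisation, inverse-distance weights, fixed iteration count, and the normalisation of RMSE by the population standard deviation of the true values are all assumptions, and the function names are hypothetical.

```python
import math
import statistics


def iterative_knn_impute(data, k=2, iterations=3):
    """Weighted KNN imputation that re-uses its own estimates on each pass.

    Assumes every row (gene) has at least one observed value.
    """
    missing = [(i, j) for i, row in enumerate(data)
               for j, v in enumerate(row) if v is None]

    # Initial guess: row mean of the observed values.
    filled = []
    for row in data:
        known = [v for v in row if v is not None]
        mean = sum(known) / len(known)
        filled.append([mean if v is None else v for v in row])

    for _ in range(iterations):
        updated = [list(row) for row in filled]
        for i, j in missing:
            neighbours = []
            for other in range(len(filled)):
                if other == i:
                    continue
                # Euclidean distance on the current (partly estimated) matrix,
                # skipping the column being imputed.
                d = math.sqrt(sum((filled[i][c] - filled[other][c]) ** 2
                                  for c in range(len(filled[i])) if c != j))
                neighbours.append((d, other))
            neighbours.sort()
            # Inverse-distance weights; the small constant avoids division by zero.
            weights = [(1.0 / (d + 1e-9), other) for d, other in neighbours[:k]]
            total = sum(w for w, _ in weights)
            updated[i][j] = sum(w * filled[other][j] for w, other in weights) / total
        filled = updated
    return filled


def nrmse(estimates, truths):
    """RMSE divided by the standard deviation of the true values (one common choice)."""
    mse = sum((e - t) ** 2 for e, t in zip(estimates, truths)) / len(truths)
    return math.sqrt(mse) / statistics.pstdev(truths)
```

Because each pass computes distances on the matrix produced by the previous pass, rows that themselves contain estimated values can still serve as neighbours, which is the property the abstract attributes to the iterative procedure.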
Affiliation(s)
- Lígia P Brás
- Centre for Chemical & Biological Engineering, Department of Chemical and Biological Engineering, IST, Technical University of Lisbon, Av. Rovisco Pais, P-1049-001 Lisbon, Portugal
|
39
|
Wong DSV, Wong FK, Wood GR. A multi-stage approach to clustering and imputation of gene expression profiles. Bioinformatics 2007; 23:998-1005. [PMID: 17308340 DOI: 10.1093/bioinformatics/btm053] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open
Abstract
MOTIVATION Microarray experiments have revolutionized the study of gene expression with their ability to generate large amounts of data. This article describes an alternative to existing approaches to clustering of gene expression profiles; the key idea is to cluster in stages using a hierarchy of distance measures. This method is motivated by the way in which the human mind sorts and so groups many items. The distance measures arise from the orthogonal breakup of Euclidean distance, giving us a set of independent measures of different attributes of the gene expression profile. Interpretation of these distances is closely related to the statistical design of the microarray experiment. This clustering method not only accommodates missing data but also leads to an associated imputation method. RESULTS The performance of the clustering and imputation methods was tested on a simulated dataset, a yeast cell cycle dataset and a central nervous system development dataset. Based on the Rand and adjusted Rand indices, the clustering method is more consistent with the biological classification of the data than commonly used clustering methods. The imputation method, at varying levels of missingness, outperforms most imputation methods, based on root mean squared error (RMSE). AVAILABILITY Code in R is available on request from the authors.
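The Rand and adjusted Rand indices used here to compare a clustering with a biological classification can be computed from pair counts. The sketch below uses the standard pair-counting formulas; the function name `rand_indices` and the convention of returning 1.0 when the adjusted index is undefined are assumptions for illustration.

```python
from collections import Counter
from math import comb


def rand_indices(labels_a, labels_b):
    """Return (Rand index, adjusted Rand index) for two labelings of the same items."""
    n = len(labels_a)
    pairs = comb(n, 2)

    # Number of item pairs placed together in both, in A only, and in B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    # Agreements = pairs together in both + pairs separated in both.
    rand = (pairs + 2 * together_both - together_a - together_b) / pairs

    expected = together_a * together_b / pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        adjusted = 1.0  # degenerate case, e.g. both labelings put everything in one group
    else:
        adjusted = (together_both - expected) / (maximum - expected)
    return rand, adjusted
```

The adjusted index subtracts the agreement expected by chance, so unrelated labelings score near zero (and can go negative), whereas the plain Rand index does not correct for chance.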
Affiliation(s)
- Dorothy S V Wong
- Department of Statistics, Macquarie University, NSW 2109, Australia.
|
40
|
Meunier B, Dumas E, Piec I, Béchet D, Hébraud M, Hocquette JF. Assessment of Hierarchical Clustering Methodologies for Proteomic Data Mining. J Proteome Res 2006; 6:358-66. [PMID: 17203979 DOI: 10.1021/pr060343h] [Citation(s) in RCA: 119] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
Hierarchical clustering methodology is a powerful data mining approach for a first exploration of proteomic data. It enables samples or proteins to be grouped blindly according to their expression profiles. Nevertheless, the clustering results depend on parameters such as data preprocessing, between-profile similarity measurement, and the dendrogram construction procedure. We assessed several clustering strategies by calculating the F-measure, a widely used quality metric. The combination, on logged matrix, of Pearson correlation and Ward's methods for data aggregation is among the best clustering strategies, at least with the data sets we studied. This study was carried out using PermutMatrix, a freely available software derived from transcriptomics.
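As a rough illustration of scoring a clustering with an F-measure, the sketch below uses one widely used definition: for each reference class, take the best F1 score over all clusters, then average over classes weighted by class size. Whether this matches the exact variant used in the study is an assumption, and the function name `clustering_f_measure` is hypothetical.

```python
from collections import Counter


def clustering_f_measure(classes, clusters):
    """Class-size-weighted best-match F-measure between reference classes and clusters."""
    n = len(classes)
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    overlap = Counter(zip(classes, clusters))

    total = 0.0
    for cls, cls_size in class_sizes.items():
        best = 0.0
        for clu, clu_size in cluster_sizes.items():
            shared = overlap[(cls, clu)]
            if shared == 0:
                continue
            precision = shared / clu_size  # fraction of the cluster that is in the class
            recall = shared / cls_size     # fraction of the class captured by the cluster
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (cls_size / n) * best
    return total
```

A score of 1.0 means every class is reproduced exactly by some cluster; lower values indicate classes that are split across clusters or merged with other classes.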
Affiliation(s)
- Bruno Meunier
- UR 1213, Unité de Recherches sur les Herbivores, Equipe Croissance et Métabolisme du Muscle, INRA de Clermont-Ferrand/Theix, F-63122 Saint-Genès Champanelle, France.
|
41
|
Raab RM. Incorporating genome-scale tools for studying energy homeostasis. Nutr Metab (Lond) 2006; 3:40. [PMID: 17081308 PMCID: PMC1636640 DOI: 10.1186/1743-7075-3-40] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2006] [Accepted: 11/03/2006] [Indexed: 11/16/2022] Open
Abstract
Mammals have evolved complex regulatory systems that enable them to maintain energy homeostasis despite constant environmental challenges that limit the availability of energy inputs and their composition. Biological control relies upon intricate systems composed of multiple organs and specialized cell types that regulate energy uptake, storage, and expenditure. Because these systems simultaneously perform diverse functions and are highly integrated, they are extremely difficult to understand in terms of their individual component contributions to energy homeostasis. In order to provide improved treatments and clinical options, it is important to identify the principal genetic and molecular components, as well as the systemic features of regulation. To begin, many of these features can be discovered by integrating experimental technologies with advanced methods of analysis. This review focuses on the analysis of transcriptional data derived from microarrays and how it can complement other experimental techniques to study energy homeostasis.
|
42
|
Hu J, Li H, Waterman MS, Zhou XJ. Integrative missing value estimation for microarray data. BMC Bioinformatics 2006; 7:449. [PMID: 17038176 PMCID: PMC1622759 DOI: 10.1186/1471-2105-7-449] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2006] [Accepted: 10/12/2006] [Indexed: 11/10/2022] Open
Abstract
Background Missing value estimation is an important preprocessing step in microarray analysis. Although several methods have been developed to solve this problem, their performance is unsatisfactory for datasets with high rates of missing data, high measurement noise, or limited numbers of samples. In fact, more than 80% of the time-series datasets in Stanford Microarray Database contain fewer than eight samples. Results We present the integrative Missing Value Estimation method (iMISS), which incorporates information from multiple reference microarray datasets to improve missing value estimation. For each gene with missing data, we derive a consistent neighbor-gene list by taking reference data sets into consideration. To determine whether the given reference data sets are sufficiently informative for integration, we use a submatrix imputation approach. Our experiments showed that iMISS can significantly and consistently improve the accuracy of the state-of-the-art Local Least Square (LLS) imputation algorithm, by up to 15% in our benchmark tests. Conclusion We demonstrated that the order-statistics-based integrative imputation algorithms can achieve significant improvements over state-of-the-art missing value estimation approaches such as LLS and are especially good for imputing microarray datasets with a limited number of samples, high rates of missing data, or very noisy measurements. With the rapid accumulation of microarray datasets, the performance of our approach can be further improved by incorporating larger and more appropriate reference datasets.
Affiliation(s)
- Jianjun Hu
- Molecular and Computational Biology Section, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Haifeng Li
- Molecular and Computational Biology Section, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Michael S Waterman
- Molecular and Computational Biology Section, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Xianghong Jasmine Zhou
- Molecular and Computational Biology Section, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
|
43
|
Wang D, Lv Y, Guo Z, Li X, Li Y, Zhu J, Yang D, Xu J, Wang C, Rao S, Yang B. Effects of replacing the unreliable cDNA microarray measurements on the disease classification based on gene expression profiles and functional modules. Bioinformatics 2006; 22:2883-9. [PMID: 16809389 DOI: 10.1093/bioinformatics/btl339] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023] Open
Abstract
MOTIVATION Microarray datasets frequently contain a large number of missing values (MVs), which need to be estimated and replaced for subsequent data mining. The focus of the paper is to study the effects of different MV treatments for cDNA microarray data on disease classification analysis. RESULTS By analyzing five datasets, we demonstrate that among three kinds of classifiers evaluated in this study, support vector machine (SVM) classifiers are robust to varied MV imputation methods [e.g. replacing MVs by zero, K nearest-neighbor (KNN) imputation algorithm, local least square imputation and Bayesian principal component analysis], while the classification and regression tree classifiers are sensitive in terms of classification accuracy. The KNN classifiers built on differentially expressed genes (DEGs) are robust to the varied MV treatments, but the performance of the KNN classifiers based on all measured genes can deteriorate significantly when imputing MVs for genes with larger missing rate (MR) (e.g. MR > 5%). Generally, while replacing MVs by zero performs relatively poorly, the other imputation algorithms differ little in their effect on the classification performance of the SVM or KNN classifiers. We further demonstrate the power and feasibility of our recently proposed functional expression profile (FEP) approach as a means to handle microarray data with MVs. The FEPs, which are derived from the functional modules that are enriched with sets of DEGs and thus can be consistently identified under varied MV treatments, achieve precise disease classification with better biological interpretation. We conclude that the choice of MV treatments should be determined in the context of the later approaches used for disease classification. The suggested exclusion criterion of ignoring the genes with larger MR (e.g. >5%), while justifiable for some classifiers such as KNN classifiers, might not be considered as a general rule for all classifiers.
Affiliation(s)
- Dong Wang
- Department of Bioinformatics and Bio-pharmaceutical Key Laboratory of Heilongjiang Province and State, Harbin Medical University, Harbin 150086, China
|
44
|
Clarke JD, Zhu T. Microarray analysis of the transcriptome as a stepping stone towards understanding biological systems: practical considerations and perspectives. The Plant Journal: for Cell and Molecular Biology 2006; 45:630-50. [PMID: 16441353 DOI: 10.1111/j.1365-313x.2006.02668.x] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.8] [Indexed: 05/06/2023]
Abstract
DNA microarrays have been used to characterize plant transcriptomes to answer various biological questions. While many studies have provided significant insights, there has been great debate about the general reliability of the technology and data analysis. When compared to well-established transcript analysis technologies, such as RNA blot analysis or quantitative reverse transcription-PCR, discrepancies have frequently been observed. The reasons for these discrepancies often relate to the technical and experimental systems. This review-tutorial addresses common problems in microarray analysis and describes: (i) methods to maximize extraction of valuable biological information from the vast amount of microarray data and (ii) approaches to balance resource availability with high scientific standards and technological innovation with peer acceptability.
Affiliation(s)
- Joseph D Clarke
- Syngenta Biotechnology Inc., 3054 Cornwallis Road, Research Triangle Park, NC 27709-2257, USA
|
45
|
Tuikkala J, Elo L, Nevalainen OS, Aittokallio T. Improving missing value estimation in microarray data with gene ontology. Bioinformatics 2005; 22:566-72. [PMID: 16377613 DOI: 10.1093/bioinformatics/btk019] [Citation(s) in RCA: 81] [Impact Index Per Article: 4.3] [Indexed: 11/13/2022]
Abstract
MOTIVATION Gene expression microarray experiments produce datasets with frequent missing expression values. Accurate estimation of missing values is an important prerequisite for efficient data analysis as many statistical and machine learning techniques either require a complete dataset or their results are significantly dependent on the quality of such estimates. A limitation of the existing estimation methods for microarray data is that they use no external information but the estimation is based solely on the expression data. We hypothesized that utilizing a priori information on functional similarities available from public databases facilitates the missing value estimation. RESULTS We investigated whether semantic similarity originating from gene ontology (GO) annotations could improve the selection of relevant genes for missing value estimation. The relative contribution of each information source was automatically estimated from the data using an adaptive weight selection procedure. Our experimental results in yeast cDNA microarray datasets indicated that by considering GO information in the k-nearest neighbor algorithm we can enhance its performance considerably, especially when the number of experimental conditions is small and the percentage of missing values is high. The increase of performance was less evident with a more sophisticated estimation method. We conclude that even a small proportion of annotated genes can provide improvements in data quality significant for the eventual interpretation of the microarray experiments. AVAILABILITY Java and Matlab codes are available on request from the authors. SUPPLEMENTARY MATERIAL Available online at http://users.utu.fi/jotatu/GOImpute.html.
Affiliation(s)
- Johannes Tuikkala
- Department of Information Technology, University of Turku, Lemminkäisenkatu 14A, FIN-20520, Finland.
|
46
|
Scheel I, Aldrin M, Glad IK, Sørum R, Lyng H, Frigessi A. The influence of missing value imputation on detection of differentially expressed genes from microarray data. Bioinformatics 2005; 21:4272-9. [PMID: 16216830 DOI: 10.1093/bioinformatics/bti708] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.8] [Indexed: 01/08/2023]
Abstract
MOTIVATION Missing values are problematic for the analysis of microarray data. Imputation methods have been compared in terms of the similarity between imputed and true values in simulation experiments, not in terms of their influence on the final analysis. The focus has been on values missing at random, although entries can also be missing not at random. RESULTS We investigate the influence of imputation on the detection of differentially expressed genes from cDNA microarray data. We apply ANOVA for microarrays and SAM and look at the differentially expressed genes that are lost because of imputation. We show that this new measure provides useful information that the traditional root mean squared error cannot capture. We also show that the type of missingness matters: imputing 5% missing not at random has the same effect as imputing 10-30% missing at random. We propose a new method for imputation (LinImp), fitting a simple linear model for each channel separately, and compare it with the widely used KNNimpute method. For 10% missing at random, KNNimpute leads to twice as many lost differentially expressed genes as LinImp. AVAILABILITY The R package for LinImp is available at http://folk.uio.no/idasch/imp.
Affiliation(s)
- Ida Scheel
- Department of Mathematics, University of Oslo, PO Box 1053, Blindern, NO-0316 Oslo, Norway.
|
47
|
Imai K, Kawai M, Tada M, Nagase T, Ohara O, Koga H. Temporal change in mKIAA gene expression during the early stage of retinoic acid-induced neurite outgrowth. Gene 2005; 364:114-22. [PMID: 16169686 DOI: 10.1016/j.gene.2005.05.037] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 02/02/2005] [Revised: 04/28/2005] [Accepted: 05/30/2005] [Indexed: 10/25/2022]
Abstract
mKIAA genes are mouse counterparts of human KIAA genes, which were isolated in our cDNA project and were functionally unknown at the time they were sequenced. Because KIAA/mKIAA genes were isolated mainly from cDNA libraries derived from brain tissues, they are thought to be important for the organization and function of the brain. To investigate the participation of mKIAA genes in neuronal phenomena, we analyzed retinoic acid-induced neurite outgrowth using an mKIAA oligonucleotide microarray. Focusing on the early stage of this outgrowth phenomenon, we analyzed temporal gene expression changes 1-24 h after treatment with retinoic acid and found several change patterns in 38 mKIAA genes. Among them, six were upregulated at 3 h and subsequently returned to the steady state. Supposing that these genes had important roles, we performed semi-quantitative RT-PCR analysis and confirmed the existence of temporal expression patterns in two genes (mKIAA0182 and mKIAA1039). Further computational analysis of the 38 genes enabled us to find the cellular pathway associated with 6 of them with high confidence. These results indicate that some mKIAA genes are apparently relevant to retinoic acid-induced neurite outgrowth.
Affiliation(s)
- Kazuhide Imai
- Chiba Industry Advancement Center, 2-6 Nakase, Mihama-ku, Chiba 261-7126, Japan
|