1
Winnicki MJ, Brown CA, Porter HL, Giles CB, Wren JD. BioVDB: biological vector database for high-throughput gene expression meta-analysis. Front Artif Intell 2024; 7:1366273. [PMID: 38525301 PMCID: PMC10957786 DOI: 10.3389/frai.2024.1366273] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2024] [Accepted: 02/26/2024] [Indexed: 03/26/2024] Open
Abstract
High-throughput sequencing has created an exponential increase in the amount of gene expression data, much of which is freely, publicly available in repositories such as NCBI's Gene Expression Omnibus (GEO). Querying this data for patterns such as similarity and distance, however, becomes increasingly challenging as the total amount of data increases. Furthermore, vectorization of the data is commonly required in Artificial Intelligence and Machine Learning (AI/ML) approaches. We present BioVDB, a vector database for storage and analysis of gene expression data, which enhances the potential for integrating biological studies with AI/ML tools. We used a previously developed approach called Automatic Label Extraction (ALE) to extract sample labels from metadata, including age, sex, and tissue/cell-line. BioVDB stores 438,562 samples from eight microarray GEO platforms. We show that it allows for efficient querying of data using similarity search, which can also be useful for identifying and inferring missing labels of samples, and for rapid similarity analysis.
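The similarity search and label inference described in this abstract can be illustrated with a minimal sketch. It assumes brute-force cosine similarity over a small in-memory matrix and a majority vote among the nearest labelled samples; the abstract does not state which metric, index, or database backend BioVDB uses, so the function names (`cosine_top_k`, `infer_label`) and the toy data are hypothetical.

```python
import numpy as np
from collections import Counter


def cosine_top_k(query, matrix, k=3):
    """Return (indices, scores) of the k rows of `matrix` most similar to `query`."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = m @ q                      # cosine similarity of every row with the query
    order = np.argsort(-scores)[:k]     # highest similarity first
    return order, scores[order]


def infer_label(query, matrix, labels, k=3):
    """Guess a missing label by majority vote among the k nearest labelled samples."""
    idx, _ = cosine_top_k(query, matrix, k)
    votes = Counter(labels[i] for i in idx if labels[i] is not None)
    return votes.most_common(1)[0][0] if votes else None


# Toy expression vectors (rows = samples, columns = genes); values are made up.
expr = np.array([
    [9.0, 1.0, 0.5, 0.2],
    [8.5, 1.2, 0.4, 0.3],
    [0.3, 0.5, 9.0, 8.0],
    [0.2, 0.4, 8.5, 9.0],
    [8.8, 0.9, 0.6, 0.1],
])
tissue = ["liver", "liver", "brain", "brain", "liver"]
unlabelled = np.array([9.1, 1.1, 0.5, 0.2])

print(infer_label(unlabelled, expr, tissue, k=3))
```

A dedicated vector database would typically replace the brute-force scan with an approximate nearest-neighbour index, which matters at the scale of hundreds of thousands of samples.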
Affiliation(s)
- Michał J. Winnicki
  - Genes and Human Disease Research Program, Oklahoma Medical Research Foundation, Oklahoma City, OK, United States
- Chase A. Brown
  - Genes and Human Disease Research Program, Oklahoma Medical Research Foundation, Oklahoma City, OK, United States
  - Oklahoma Center for Neuroscience, University of Oklahoma Health Sciences Center, Oklahoma City, OK, United States
- Hunter L. Porter
  - Genes and Human Disease Research Program, Oklahoma Medical Research Foundation, Oklahoma City, OK, United States
- Cory B. Giles
  - Genes and Human Disease Research Program, Oklahoma Medical Research Foundation, Oklahoma City, OK, United States
- Jonathan D. Wren
  - Genes and Human Disease Research Program, Oklahoma Medical Research Foundation, Oklahoma City, OK, United States
  - Oklahoma Center for Neuroscience, University of Oklahoma Health Sciences Center, Oklahoma City, OK, United States
  - Department of Biochemistry and Molecular Biology, University of Oklahoma Health Sciences Center, Oklahoma City, OK, United States
  - Oklahoma Nathan Shock Center, Oklahoma City, OK, United States
2
Hephzibah Cathryn R, Udhaya Kumar S, Younes S, Zayed H, George Priya Doss C. A review of bioinformatics tools and web servers in different microarray platforms used in cancer research. Advances in Protein Chemistry and Structural Biology 2022; 131:85-164. [PMID: 35871897 DOI: 10.1016/bs.apcsb.2022.05.002] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Indexed: 01/26/2023]
Abstract
Over the past decade, conventional lab work strategies have gradually shifted from being limited to a laboratory setting towards a bioinformatics era to help manage and process the vast amounts of data generated by omics technologies. The present work outlines the latest contributions of bioinformatics in analyzing microarray data and their application to cancer. We dissect different microarray platforms and their use in gene expression in cancer models. We highlight how computational advances empowered the microarray technology in gene expression analysis. The study on protein-protein interaction databases classified into primary, derived, meta-database, and prediction databases describes the strategies to curate and predict novel interaction networks in silico. In addition, we summarize the areas of bioinformatics where neural graph networks are currently being used, such as protein functions, protein interaction prediction, and in silico drug discovery and development. We also discuss the role of deep learning as a potential tool in the prognosis, diagnosis, and treatment of cancer. Integrating these resources efficiently, practically, and ethically is likely to be the most challenging task for the healthcare industry over the next decade; however, we believe that it is achievable in the long term.
Affiliation(s)
- R Hephzibah Cathryn
  - Laboratory of Integrative Genomics, Department of Integrative Biology, School of Biosciences and Technology, Vellore Institute of Technology, Vellore, India
- S Udhaya Kumar
  - Laboratory of Integrative Genomics, Department of Integrative Biology, School of Biosciences and Technology, Vellore Institute of Technology, Vellore, India
- Salma Younes
  - Department of Biomedical Sciences, College of Health and Sciences, Qatar University, QU Health, Doha, Qatar
- Hatem Zayed
  - Department of Biomedical Sciences, College of Health and Sciences, Qatar University, QU Health, Doha, Qatar
- C George Priya Doss
  - Laboratory of Integrative Genomics, Department of Integrative Biology, School of Biosciences and Technology, Vellore Institute of Technology, Vellore, India
3
Ahmad S, Prathipati P, Tripathi LP, Chen YA, Arya A, Murakami Y, Mizuguchi K. Integrating sequence and gene expression information predicts genome-wide DNA-binding proteins and suggests a cooperative mechanism. Nucleic Acids Res 2019; 46:54-70. [PMID: 29186632 PMCID: PMC5758906 DOI: 10.1093/nar/gkx1166] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 12/03/2016] [Accepted: 11/15/2017] [Indexed: 12/29/2022] Open
Abstract
DNA-binding proteins (DBPs) perform diverse biological functions ranging from transcription to pathogen sensing. Machine learning methods can not only identify DBPs de novo but also provide insights into their DNA-recognition dynamics. However, it remains unclear whether available methods that can accurately predict DNA-binding sites in known DBPs can also identify novel DBPs. Moreover, sequence information is blind to the cellular- and disease-specific contexts of DBP activities, whereas the under-utilized knowledge from public gene expression data offers great promise. To address these issues, we have developed novel methods for predicting DBPs by integrating sequence and gene expression-derived features and applied them to explore human, mouse and Arabidopsis proteomes. While our sequence-based models outperformed the gene expression-based ones, some proteins with weaker DBP-like sequence features were correctly predicted by gene expression-based features, suggesting that these proteins acquire a tangible DBP functionality in a conducive gene expression environment. Analysis of motif enrichment among the co-expressed genes of the top 100 candidate DBPs from hitherto unannotated genes provides further avenues to explore their functional associations.
Affiliation(s)
- Shandar Ahmad
  - School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi 110067, India
  - Laboratory of Bioinformatics, National Institutes of Biomedical Innovation, Health and Nutrition, 7-6-8 Saito-asagi, Ibaraki, Osaka 5670085, Japan
- Philip Prathipati
  - Laboratory of Bioinformatics, National Institutes of Biomedical Innovation, Health and Nutrition, 7-6-8 Saito-asagi, Ibaraki, Osaka 5670085, Japan
- Lokesh P Tripathi
  - Laboratory of Bioinformatics, National Institutes of Biomedical Innovation, Health and Nutrition, 7-6-8 Saito-asagi, Ibaraki, Osaka 5670085, Japan
- Yi-An Chen
  - Laboratory of Bioinformatics, National Institutes of Biomedical Innovation, Health and Nutrition, 7-6-8 Saito-asagi, Ibaraki, Osaka 5670085, Japan
- Ajay Arya
  - School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi 110067, India
- Yoichi Murakami
  - Laboratory of Bioinformatics, National Institutes of Biomedical Innovation, Health and Nutrition, 7-6-8 Saito-asagi, Ibaraki, Osaka 5670085, Japan
- Kenji Mizuguchi
  - Laboratory of Bioinformatics, National Institutes of Biomedical Innovation, Health and Nutrition, 7-6-8 Saito-asagi, Ibaraki, Osaka 5670085, Japan
4
Gendoo DMA, Zon M, Sandhu V, Manem VSK, Ratanasirigulchai N, Chen GM, Waldron L, Haibe-Kains B. MetaGxData: Clinically Annotated Breast, Ovarian and Pancreatic Cancer Datasets and their Use in Generating a Multi-Cancer Gene Signature. Sci Rep 2019; 9:8770. [PMID: 31217513 PMCID: PMC6584731 DOI: 10.1038/s41598-019-45165-4] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 11/19/2018] [Accepted: 05/31/2019] [Indexed: 12/13/2022] Open
Abstract
A wealth of transcriptomic and clinical data on solid tumours are under-utilized due to unharmonized data storage and format. We have developed the MetaGxData package compendium, which includes manually-curated and standardized clinical, pathological, survival, and treatment metadata across breast, ovarian, and pancreatic cancer data. MetaGxData is the largest compendium of curated transcriptomic data for these cancer types to date, spanning 86 datasets and encompassing 15,249 samples. Open access to standardized metadata across cancer types promotes use of their transcriptomic and clinical data in a variety of cross-tumour analyses, including identification of common biomarkers, and assessing the validity of prognostic signatures. Here, we demonstrate that MetaGxData is a flexible framework that facilitates meta-analyses by using it to identify common prognostic genes in ovarian and breast cancer. Furthermore, we use the data compendium to create the first gene signature that is prognostic in a meta-analysis across 3 cancer types. These findings demonstrate the potential of MetaGxData to serve as an important resource in oncology research, and provide a foundation for future development of cancer-specific compendia.
Affiliation(s)
- Deena M A Gendoo
  - Centre for Computational Biology, Institute of Cancer and Genomic Sciences, University of Birmingham, Birmingham, B15 2TT, United Kingdom
- Michael Zon
  - Princess Margaret Cancer Center, University Health Network, Toronto, M5G 2C1, Canada
  - Department of Biomedical Engineering, McMaster University, Toronto, L8S 4L8, Canada
- Vandana Sandhu
  - Princess Margaret Cancer Center, University Health Network, Toronto, M5G 2C1, Canada
- Venkata S K Manem
  - Princess Margaret Cancer Center, University Health Network, Toronto, M5G 2C1, Canada
  - Department of Medical Biophysics, University of Toronto, Toronto, M5S 3H7, Canada
  - Institut Universitaire de Cardiologie et de Pneumologie de Québec, Université Laval, Québec City, G1V 4G5, Canada
- Gregory M Chen
  - Princess Margaret Cancer Center, University Health Network, Toronto, M5G 2C1, Canada
- Levi Waldron
  - Graduate School of Public Health and Health Policy, Institute of Implementation Science in Population Health, City University of New York School, New York, 11101, USA
- Benjamin Haibe-Kains
  - Princess Margaret Cancer Center, University Health Network, Toronto, M5G 2C1, Canada
  - Department of Medical Biophysics, University of Toronto, Toronto, M5S 3H7, Canada
  - Department of Computer Science, University of Toronto, Toronto, M5T 3A1, Canada
  - Ontario Institute of Cancer Research, Toronto, M5G 0A3, Canada
  - Vector Institute, Toronto, M5G 1M1, Canada
5
Siangphoe U, Archer KJ, Mukhopadhyay ND. Classical and Bayesian random-effects meta-analysis models with sample quality weights in gene expression studies. BMC Bioinformatics 2019; 20:18. [PMID: 30626315 PMCID: PMC6327440 DOI: 10.1186/s12859-018-2491-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 08/28/2018] [Accepted: 11/12/2018] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Random-effects (RE) models are commonly applied to account for heterogeneity in effect sizes in gene expression meta-analysis. The degree of heterogeneity may differ due to inconsistencies in sample quality. High heterogeneity can arise in meta-analyses containing poor-quality samples. We applied sample-quality weights to adjust the study heterogeneity in the DerSimonian and Laird (DSL) and two-step DSL (DSLR2) RE models and the Bayesian random-effects (BRE) models with unweighted and weighted data, Gibbs and Metropolis-Hastings (MH) sampling algorithms, weighted common effect, and weighted between-study variance. We evaluated the performance of the models through simulations and illustrated application of the methods using Alzheimer's gene expression datasets.
RESULTS: Models in which sample quality adjusts the within-study variance (wP6) provided an appropriate reduction of differentially expressed (DE) genes compared to other weighted functions in classical RE models. The BRE model with a uniform(0,1) prior was appropriate for detecting DE genes as compared to the models with other prior distributions. The precision of DE gene detection in the heterogeneous data was increased with the DSLR2 wP6-weighted model compared to the DSL wP6-weighted model. Among the BRE weighted models, the wP6-weighted and unweighted-data models and both Gibbs- and MH-based models performed similarly. The wP6-weighted common-effect model performed similarly to the unweighted model in the homogeneous data, but performed worse in the heterogeneous data. The wP6-weighted data were appropriate for detecting DE genes with high precision, while the wP6-weighted between-study variance models were appropriate for detecting DE genes with high overall accuracy. Without the weight, when the number of genes in the microarray increased, the DSLR2 performed stably, while the overall accuracy of the BRE model was reduced. When applying the weighted models in the Alzheimer's gene expression data, the number of DE genes decreased in all metadata sets with the DSLR2 wP6-weighted and the wP6-weighted between-study variance models. Four hundred and forty-six DE genes identified by the wP6-weighted between-study variance model could be potentially down-regulated genes that may contribute to good classification of Alzheimer's samples.
CONCLUSIONS: The application of sample-quality weights can increase precision and accuracy of the classical RE and BRE models; however, the performance of the models varied depending on data features, levels of sample quality, and adjustment of parameter estimates.
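For readers unfamiliar with the DerSimonian and Laird model named in this abstract, the following is a minimal sketch of the classical, unweighted estimator for a single gene. It does not reproduce the paper's sample-quality weights (wP6), the two-step DSLR2 variant, or the Bayesian models; the function name and the toy effect sizes are hypothetical.

```python
import numpy as np


def dersimonian_laird(effects, variances):
    """Classical DerSimonian-Laird random-effects pooling.

    effects   : per-study effect sizes for one gene
    variances : per-study within-study variances
    Returns (pooled effect, standard error, between-study variance tau^2).
    """
    y = np.asarray(effects, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v                                   # fixed-effect (inverse-variance) weights
    mu_fixed = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - mu_fixed) ** 2)           # Cochran's Q heterogeneity statistic
    k = len(y)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)            # method-of-moments estimate, truncated at 0
    w_star = 1.0 / (v + tau2)                     # random-effects weights
    mu = np.sum(w_star * y) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    return mu, se, tau2


# Two toy studies with very different effect sizes (made-up numbers).
mu, se, tau2 = dersimonian_laird([0.0, 4.0], [1.0, 1.0])
print(mu, se, tau2)
```

When the studies agree, Q falls below its degrees of freedom, tau^2 is truncated to zero, and the estimate reduces to the fixed-effect inverse-variance mean.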
Affiliation(s)
- Uma Siangphoe
  - Office of Biostatistics, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Kellie J. Archer
  - Division of Biostatistics, College of Public Health, The Ohio State University, Columbus, Ohio, USA
- Nitai D. Mukhopadhyay
  - Department of Biostatistics, Virginia Commonwealth University, Richmond, Virginia, USA
6
LCE: an open web portal to explore gene expression and clinical associations in lung cancer. Oncogene 2018; 38:2551-2564. [PMID: 30532070 PMCID: PMC6477796 DOI: 10.1038/s41388-018-0588-2] [Citation(s) in RCA: 64] [Impact Index Per Article: 9.1] [Received: 06/26/2018] [Revised: 09/04/2018] [Accepted: 09/05/2018] [Indexed: 02/06/2023]
Abstract
We constructed a lung cancer-specific database housing expression data and clinical data from over 6700 patients in 56 studies. Expression data from 23 genome-wide platforms were carefully processed and quality controlled, whereas clinical data were standardized and rigorously curated. Empowered by this lung cancer database, we created an open access web resource—the Lung Cancer Explorer (LCE), which enables researchers and clinicians to explore these data and perform analyses. Users can perform meta-analyses on LCE to gain a quick overview of the results on tumor vs non-malignant tissue (normal) differential gene expression and expression-survival association. Individual dataset-based survival analysis, comparative analysis, and correlation analysis are also provided with flexible options to allow for customized analyses from the user.
7
Lakiotaki K, Vorniotakis N, Tsagris M, Georgakopoulos G, Tsamardinos I. BioDataome: a collection of uniformly preprocessed and automatically annotated datasets for data-driven biology. Database (Oxford) 2018; 2018:4917852. [PMID: 29688366 PMCID: PMC5836265 DOI: 10.1093/database/bay011] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 09/15/2017] [Revised: 01/12/2018] [Accepted: 01/15/2018] [Indexed: 01/12/2023]
Abstract
The biotechnology revolution generates a plethora of omics data at an exponentially growing pace. Biological data mining therefore demands automatic, 'high quality' curation efforts to organize biomedical knowledge into online databases. BioDataome is a database of uniformly preprocessed and disease-annotated omics data that aims to promote and accelerate the reuse of public data. We followed the same preprocessing pipeline for each biological mart (microarray gene expression, RNA-Seq gene expression and DNA methylation) to produce datasets ready for downstream analysis and automatically annotated them with disease-ontology terms. We also designate datasets that share common samples and automatically discover control samples in case-control studies. Currently, BioDataome includes ∼5600 datasets and ∼260 000 samples spanning ∼500 diseases, and can be easily used in large-scale experiments and meta-analysis. All datasets are publicly available for querying and downloading via the BioDataome web application. We demonstrate BioDataome's utility by presenting exploratory data analysis examples. We have also developed the BioDataome R package, found at https://github.com/mensxmachina/BioDataome/. Database URL: http://dataome.mensxmachina.org/.
Affiliation(s)
- Kleanthi Lakiotaki
  - Computer Science Department, University of Crete, Voutes Campus, 70013 Heraklion, Crete, Greece
- Nikolaos Vorniotakis
  - Computer Science Department, University of Crete, Voutes Campus, 70013 Heraklion, Crete, Greece
- Michail Tsagris
  - Computer Science Department, University of Crete, Voutes Campus, 70013 Heraklion, Crete, Greece
- Georgios Georgakopoulos
  - Computer Science Department, University of Crete, Voutes Campus, 70013 Heraklion, Crete, Greece
- Ioannis Tsamardinos
  - Computer Science Department, University of Crete, Voutes Campus, 70013 Heraklion, Crete, Greece
  - Gnosis Data Analysis PC, Palaiokapa 64, 71305 Heraklion, Crete, Greece
8
Gene selection for microarray data classification via subspace learning and manifold regularization. Med Biol Eng Comput 2017; 56:1271-1284. [PMID: 29256006 DOI: 10.1007/s11517-017-1751-6] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 05/30/2017] [Accepted: 11/03/2017] [Indexed: 10/18/2022]
Abstract
With the rapid development of DNA microarray technology, large amounts of genomic data have been generated. Classification of these microarray data is a challenging task, since gene expression data often contain thousands of genes but only a small number of samples. In this paper, an effective gene selection method is proposed to select the best subset of genes for microarray data, with the irrelevant and redundant genes removed. Compared with the original data, the selected gene subset can benefit the classification task. We formulate the gene selection task as a manifold regularized subspace learning problem. In detail, a projection matrix is used to project the original high-dimensional microarray data into a lower-dimensional subspace, with the constraint that the original genes can be well represented by the selected genes. Meanwhile, the local manifold structure of the original data is preserved by a Laplacian graph regularization term on the low-dimensional data space. The projection matrix can serve as an importance indicator of different genes. An iterative update algorithm is developed for solving the problem. Experimental results on six publicly available microarray datasets and one clinical dataset demonstrate that the proposed method performs better than other state-of-the-art methods in terms of microarray data classification.
9
Nandal UK, van Kampen AHC, Moerland PD. compendiumdb: an R package for retrieval and storage of functional genomics data. Bioinformatics 2016; 32:2856-7. [DOI: 10.1093/bioinformatics/btw335] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 05/18/2015] [Accepted: 05/23/2016] [Indexed: 01/28/2023] Open
10
Lee SY, Park CH, Yoon JH, Yun S, Kim JH. GEE: An Informatics Tool for Gene Expression Data Explore. Healthc Inform Res 2016; 22:81-8. [PMID: 27200217 PMCID: PMC4871849 DOI: 10.4258/hir.2016.22.2.81] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2016] [Revised: 03/28/2016] [Accepted: 04/08/2016] [Indexed: 11/30/2022] Open
Abstract
Objectives: Major public high-throughput functional genomic data repositories, including the Gene Expression Omnibus (GEO) and ArrayExpress, have rapidly expanded. As a result, a large number of diverse high-throughput functional genomic data retrieval systems have been developed. However, high-throughput functional genomic data retrieval remains challenging.
Methods: We developed Gene Expression data Explore (GEE), the first powerful, flexible web and mobile search application for searching whole-genome epigenetic data and microarray data in public databases, such as GEO and ArrayExpress.
Results: GEE provides an elaborate, convenient query-generation interface with capabilities not available via various high-throughput functional genomic data retrieval systems, including GEO, ArrayExpress, and Atlas. In particular, GEE provides a suitable query generator using eVOC, the Experimental Factor Ontology (EFO), which is well represented with a variety of high-throughput functional genomic data experimental conditions. In addition, GEE provides an experimental design query constructor (EDQC), which provides elaborate retrieval filter conditions when the user designs real experiments.
Conclusions: The web version of GEE is available at http://www.snubi.org/software/gee, and its app version is available from the Apple App Store.
Affiliation(s)
- Soo Youn Lee
  - Seoul National University Biomedical Informatics (SNUBI), Seoul National University College of Medicine, Seoul, Korea
- Chan Hee Park
  - Seoul National University Biomedical Informatics (SNUBI), Seoul National University College of Medicine, Seoul, Korea
- Jun Hee Yoon
  - Seoul National University Biomedical Informatics (SNUBI), Seoul National University College of Medicine, Seoul, Korea
- Sunmin Yun
  - Seoul National University Biomedical Informatics (SNUBI), Seoul National University College of Medicine, Seoul, Korea
- Ju Han Kim
  - Seoul National University Biomedical Informatics (SNUBI), Seoul National University College of Medicine, Seoul, Korea
  - Systems Biomedical Informatics-National Core Research Center (SBI-NCRC), Seoul National University College of Medicine, Seoul, Korea
11
Bagewadi S, Adhikari S, Dhrangadhariya A, Irin AK, Ebeling C, Namasivayam AA, Page M, Hofmann-Apitius M, Senger P. NeuroTransDB: highly curated and structured transcriptomic metadata for neurodegenerative diseases. Database (Oxford) 2015; 2015:bav099. [PMID: 26475471 PMCID: PMC4608514 DOI: 10.1093/database/bav099] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 04/02/2015] [Revised: 09/07/2015] [Accepted: 09/10/2015] [Indexed: 12/12/2022]
Abstract
Neurodegenerative diseases are chronic debilitating conditions, characterized by progressive loss of neurons, that represent a significant health care burden as the global elderly population continues to grow. Over the past decade, high-throughput technologies such as the Affymetrix GeneChip microarrays have provided new perspectives into the pathomechanisms underlying neurodegeneration. Public transcriptomic data repositories, namely Gene Expression Omnibus and curated ArrayExpress, enable researchers to conduct integrative meta-analysis, increasing the power to detect differentially regulated genes in disease and to explore patterns of gene dysregulation across biologically related studies. The reliability of retrospective, large-scale integrative analyses depends on an appropriate combination of related datasets, in turn requiring detailed meta-annotations capturing the experimental setup. In most cases, we observe huge variation in compliance with defined standards for submitted metadata in public databases. Much of the information needed to complete or refine meta-annotations is distributed in the associated publications. For example, tissue preparation or comorbidity information is frequently described in an article's supplementary tables. Several value-added databases have employed additional manual efforts to overcome this limitation. However, none of these databases explicate annotations that distinguish human and animal models in the neurodegeneration context. Therefore, adopting a more specific disease focus, in combination with dedicated disease ontologies, will better empower the selection of comparable studies with refined annotations to address the research question at hand. In this article, we describe the detailed development of NeuroTransDB, a manually curated database containing metadata annotations for neurodegenerative studies. The database contains more than 20 dimensions of metadata annotations within 31 mouse, 5 rat and 45 human studies, defined in collaboration with domain disease experts. We elucidate the step-by-step guidelines used to critically prioritize studies from public archives and their metadata curation and discuss the key challenges encountered. Curated metadata for Alzheimer's disease gene expression studies are available for download. Database URL: www.scai.fraunhofer.de/NeuroTransDB.html.
Affiliation(s)
- Shweta Bagewadi
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53754 Sankt Augustin, Germany
  - Rheinische Friedrich-Wilhelms-Universitaet Bonn, Bonn-Aachen International Center for Information Technology, 53113, Bonn, Germany
- Subash Adhikari
  - Department of Chemistry, South University of Science and Technology of China, No 1088, Xueyuan Road, Xili, Shenzhen, China
- Anjani Dhrangadhariya
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53754 Sankt Augustin, Germany
  - Rheinische Friedrich-Wilhelms-Universitaet Bonn, Bonn-Aachen International Center for Information Technology, 53113, Bonn, Germany
- Afroza Khanam Irin
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53754 Sankt Augustin, Germany
  - Rheinische Friedrich-Wilhelms-Universitaet Bonn, Bonn-Aachen International Center for Information Technology, 53113, Bonn, Germany
- Christian Ebeling
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53754 Sankt Augustin, Germany
- Aishwarya Alex Namasivayam
  - Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 7, avenue des Hauts-Fourneaux, L-4362 Esch-sur-Alzette, Luxembourg
- Matthew Page
  - Translational Bioinformatics, UCB Pharma, 216 Bath Rd, Slough SL1 3WE, United Kingdom
- Martin Hofmann-Apitius
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53754 Sankt Augustin, Germany
  - Rheinische Friedrich-Wilhelms-Universitaet Bonn, Bonn-Aachen International Center for Information Technology, 53113, Bonn, Germany
- Philipp Senger
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53754 Sankt Augustin, Germany
12
Omics-based identification of biomarkers for nasopharyngeal carcinoma. Disease Markers 2015; 2015:762128. [PMID: 25999660 PMCID: PMC4427004 DOI: 10.1155/2015/762128] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.0] [Received: 11/07/2014] [Accepted: 03/10/2015] [Indexed: 12/14/2022]
Abstract
Nasopharyngeal carcinoma (NPC) is a head and neck cancer that is highly prevalent in distinct geographic areas, such as Southeast Asia. The management of NPC remains burdensome, as the prognosis is poor due to the late presentation of the disease and the complex nature of NPC pathogenesis. Therefore, it is necessary to find effective molecular markers for early detection and therapeutic management of NPC. In this paper, the discovery of molecular biomarkers for NPC through emerging omics technologies, including genomics, miRNA-omics, transcriptomics, proteomics, and metabolomics, is extensively reviewed. These markers have been shown to play roles in various cellular pathways in NPC progression. Knowledge of their function will help us understand the complexity of tumor biology in more detail, leading to better strategies for early detection, outcome prediction, detection of disease recurrence, and therapeutic approaches.
13
Chen KH, Wang KJ, Tsai ML, Wang KM, Adrian AM, Cheng WC, Yang TS, Teng NC, Tan KP, Chang KS. Gene selection for cancer identification: a decision tree model empowered by particle swarm optimization algorithm. BMC Bioinformatics 2014; 15:49. [PMID: 24555567 PMCID: PMC3944936 DOI: 10.1186/1471-2105-15-49] [Citation(s) in RCA: 81] [Impact Index Per Article: 7.4] [Received: 08/07/2013] [Accepted: 02/07/2014] [Indexed: 11/21/2022] Open
Abstract
BACKGROUND: In the application of microarray data, how to select a small number of informative genes from thousands of genes that may contribute to the occurrence of cancers is an important issue. Many researchers use various computational intelligence methods to analyze gene expression data.
RESULTS: To achieve efficient gene selection from thousands of candidate genes that can contribute to identifying cancers, this study aims at developing a novel method utilizing particle swarm optimization combined with a decision tree as the classifier. This study also compares the performance of our proposed method with other well-known benchmark classification methods (support vector machine, self-organizing map, back propagation neural network, C4.5 decision tree, Naive Bayes, CART decision tree, and artificial immune recognition system) and conducts experiments on 11 gene expression cancer datasets.
CONCLUSION: Based on statistical analysis, our proposed method outperforms other popular classifiers for all test datasets, and is comparable to SVM for certain specific datasets. Further, the housekeeping genes with various expression patterns and tissue-specific genes are identified. These genes provide high discrimination power for cancer classification.
Collapse
Affiliation(s)
- Kun-Huang Chen
- Department of Industrial Management, National Taiwan University of Science and Technology, Taipei 106, Taiwan, R.O.C
| | - Kung-Jeng Wang
- Department of Industrial Management, National Taiwan University of Science and Technology, Taipei 106, Taiwan, R.O.C
| | - Min-Lung Tsai
- Department of Food Science, Yuanpei University, No. 306, Yuanpei Street, Hsinchu 300, Taiwan, R.O.C
| | - Kung-Min Wang
- Department of Surgery, Shin-Kong Wu Ho-Su Memorial Hospital, Taipei, Taiwan, R.O.C
| | - Angelia Melani Adrian
- Department of Industrial Management, National Taiwan University of Science and Technology, Taipei 106, Taiwan, R.O.C
| | - Wei-Chung Cheng
- Pediatric Neurosurgery, Department of Surgery, Cheng Hsin General Hospital, Taipei 11220, Taiwan, R.O.C
- Genomic Research Center, National Yang-Ming University, Taipei 11221, Taiwan, R.O.C
| | - Tzu-Sen Yang
- School of Dental Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C
- Taiwan Research Center for Biomedical Implants and Microsurgery Devices, Taipei Medical University Taipei 110, Taiwan, R.O.C
| | - Nai-Chia Teng
- School of Dentistry, College of Oral Medicine, Taipei Medical University, Taipei, Taiwan, R.O.C
| | - Kuo-Pin Tan
- MBA, School of Management, National Taiwan University of Science and Technology, Taipei 106, Taiwan, R.O.C
| | - Ku-Shang Chang
- Department of Food Science, Yuanpei University, No. 306, Yuanpei Street, Hsinchu 300, Taiwan, R.O.C
|
14
|
Bragazzi NL, Pechkova E, Nicolini C. Proteomics and Proteogenomics Approaches for Oral Diseases. Advances in Protein Chemistry and Structural Biology 2014; 95:125-62. [DOI: 10.1016/b978-0-12-800453-1.00004-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 02/07/2023]
|
15
|
ProfileDB: a resource for proteomics and cross-omics biomarker discovery. Biochimica et Biophysica Acta-Proteins and Proteomics 2013; 1844:960-6. [PMID: 24270047 DOI: 10.1016/j.bbapap.2013.11.007] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 06/10/2013] [Revised: 10/18/2013] [Accepted: 11/13/2013] [Indexed: 01/09/2023]
Abstract
The increasing size and complexity of high-throughput datasets pose a growing challenge for researchers. Often, very different (cross-omics) techniques with individual data analysis pipelines are employed, making a unified biomarker discovery strategy and a direct comparison of different experiments difficult and time-consuming. Here we present the comprehensive web-based application ProfileDB. The application is designed to integrate data from different high-throughput 'omics' data types (Transcriptomics, Proteomics, Metabolomics) with clinical parameters and prior knowledge on pathways and ontologies. Beyond data storage, ProfileDB provides a set of dedicated tools for study inspection and data visualization. The user can gain insights into a complex experiment with just a few mouse clicks. We demonstrate the application by presenting typical use cases for the identification of proteomics biomarkers. All presented analyses can be reproduced using the public ProfileDB web server. The ProfileDB application is available through standard browser technology (Firefox 18+, Internet Explorer Version 9+) via http://profileDB.-microdiscovery.de/ (login and password: profileDB). The installation contains several public datasets including different cross-'omics' experiments. This article is part of a Special Issue entitled: Biomarkers: A Proteomic Challenge.
|
16
|
Xia J, Fjell CD, Mayer ML, Pena OM, Wishart DS, Hancock REW. INMEX--a web-based tool for integrative meta-analysis of expression data. Nucleic Acids Res 2013; 41:W63-70. [PMID: 23766290 PMCID: PMC3692077 DOI: 10.1093/nar/gkt338] [Citation(s) in RCA: 130] [Impact Index Per Article: 10.8] [Indexed: 12/19/2022] Open
Abstract
The widespread applications of various ‘omics’ technologies in biomedical research together with the emergence of public data repositories have resulted in a plethora of data sets for almost any given physiological state or disease condition. Properly combining or integrating these data sets with similar basic hypotheses can help reduce study bias, increase statistical power and improve overall biological understanding. However, the difficulties in data management and the complexities of analytical approaches have significantly limited data integration to enable meta-analysis. Here, we introduce integrative meta-analysis of expression data (INMEX), a user-friendly web-based tool designed to support meta-analysis of multiple gene-expression data sets, as well as to enable integration of data sets from gene expression and metabolomics experiments. INMEX contains three functional modules. The data preparation module supports flexible data processing, annotation and visualization of individual data sets. The statistical analysis module allows researchers to combine multiple data sets based on P-values, effect sizes, rank orders and other features. The significant genes can be examined in functional analysis module for enriched Gene Ontology terms or Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, or expression profile visualization. INMEX has built-in support for common gene/metabolite identifiers (IDs), as well as 45 popular microarray platforms for human, mouse and rat. Complex operations are performed through a user-friendly web interface in a step-by-step manner. INMEX is freely available at http://www.inmex.ca.
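As an illustration of what "combining data sets based on P-values" can look like, here is a minimal sketch of Fisher's method, one standard way to pool independent p-values for the same gene across studies. The abstract does not say which combination rule INMEX applies, so the choice of Fisher's method and the function name `fisher_combined_p` are assumptions made for this example only.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values for one gene using Fisher's method.

    Hypothetical helper for illustration; not taken from INMEX.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    k = len(p_values)
    # Fisher's statistic is X = -2 * sum(ln p_i); under the null hypothesis
    # it follows a chi-square distribution with 2k degrees of freedom.
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2

    # For an even number of degrees of freedom (2k), the chi-square
    # survival function has a closed form:
    #   P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Two studies that each report p = 0.01 for the same gene.
combined = fisher_combined_p([0.01, 0.01])
```

With a single study the combined value reduces to the input p-value, which is a quick sanity check on the closed-form sum. Fisher's method assumes the studies are independent; effect-size and rank-based combination, also mentioned in the abstract, need different machinery.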
Affiliation(s)
- Jianguo Xia
- Department of Microbiology and Immunology, University of British Columbia, Vancouver, V6T 1Z3, Canada
|
17
|
Abstract
Our understanding of gene expression has changed dramatically over the past decade, largely catalysed by technological developments. High-throughput experiments - microarrays and next-generation sequencing - have generated large amounts of genome-wide gene expression data that are collected in public archives. Added-value databases process, analyse and annotate these data further to make them accessible to every biologist. In this Review, we discuss the utility of the gene expression data that are in the public domain and how researchers are making use of these data. Reuse of public data can be very powerful, but there are many obstacles in data preparation and analysis and in the interpretation of the results. We will discuss these challenges and provide recommendations that we believe can improve the utility of such data.
Affiliation(s)
- Johan Rung
- EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
|
18
|
Feichtinger J, McFarlane RJ, Larcombe LD. CancerMA: a web-based tool for automatic meta-analysis of public cancer microarray data. Database (Oxford) 2012; 2012:bas055. [PMID: 23241162 PMCID: PMC3522872 DOI: 10.1093/database/bas055] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Received: 09/08/2012] [Revised: 11/22/2012] [Accepted: 11/25/2012] [Indexed: 11/14/2022]
Abstract
The identification of novel candidate markers is a key challenge in the development of cancer therapies. This can be facilitated by putting accessible and automated approaches analysing the current wealth of 'omic'-scale data in the hands of researchers who are directly addressing biological questions. Data integration techniques and standardized, automated, high-throughput analyses are needed to manage the data available as well as to help narrow down the excessive number of target gene possibilities presented by modern databases and system-level resources. Here we present CancerMA, an online, integrated bioinformatic pipeline for automated identification of novel candidate cancer markers/targets; it operates by means of meta-analysing expression profiles of user-defined sets of biologically significant and related genes across a manually curated database of 80 publicly available cancer microarray datasets covering 13 cancer types. A simple-to-use web interface allows bioinformaticians and non-bioinformaticians alike to initiate new analyses as well as to view and retrieve the meta-analysis results. The functionality of CancerMA is shown by means of two validation datasets.
Affiliation(s)
- Julia Feichtinger
- North West Cancer Research Fund Institute, Bangor University, Bangor, Gwynedd LL57 2UW, UK, Institute for Genomics and Bioinformatics, Graz University of Technology, Graz, Petersgasse 14, 8010, Austria, NISCHR Cancer Genetics Biomedical Research Unit, Bangor University, Bangor, Gwynedd LL57 2UW, UK and Cranfield Health, Cranfield University, Cranfield, Bedfordshire MK43 0AL, UK
| | - Ramsay J. McFarlane
- North West Cancer Research Fund Institute, Bangor University, Bangor, Gwynedd LL57 2UW, UK, Institute for Genomics and Bioinformatics, Graz University of Technology, Graz, Petersgasse 14, 8010, Austria, NISCHR Cancer Genetics Biomedical Research Unit, Bangor University, Bangor, Gwynedd LL57 2UW, UK and Cranfield Health, Cranfield University, Cranfield, Bedfordshire MK43 0AL, UK
| | - Lee D. Larcombe
- North West Cancer Research Fund Institute, Bangor University, Bangor, Gwynedd LL57 2UW, UK, Institute for Genomics and Bioinformatics, Graz University of Technology, Graz, Petersgasse 14, 8010, Austria, NISCHR Cancer Genetics Biomedical Research Unit, Bangor University, Bangor, Gwynedd LL57 2UW, UK and Cranfield Health, Cranfield University, Cranfield, Bedfordshire MK43 0AL, UK
|
19
|
Lei C, Ruan J. A novel link prediction algorithm for reconstructing protein-protein interaction networks by topological similarity. Bioinformatics 2012; 29:355-64. [PMID: 23235927 DOI: 10.1093/bioinformatics/bts688] [Citation(s) in RCA: 75] [Impact Index Per Article: 5.8] [Indexed: 12/24/2022]
Abstract
MOTIVATION Recent advances in technology have dramatically increased the availability of protein-protein interaction (PPI) data and stimulated the development of many methods for improving the systems-level understanding of the cell. However, those efforts have been significantly hindered by the high level of noise, sparseness and highly skewed degree distribution of PPI networks. Here, we present a novel algorithm to reduce the noise present in PPI networks. The key idea of our algorithm is that two proteins sharing some higher-order topological similarities, measured by a novel random walk-based procedure, are likely interacting with each other and may belong to the same protein complex. RESULTS Applying our algorithm to a yeast PPI network, we found that the edges in the reconstructed network have higher biological relevance than in the original network, assessed by multiple types of information, including gene ontology, gene expression, essentiality, conservation between species and known protein complexes. Comparison with existing methods shows that the network reconstructed by our method has the highest quality. Using two independent graph clustering algorithms, we found that the reconstructed network resulted in significantly improved prediction accuracy of protein complexes. Furthermore, our method is applicable to PPI networks obtained with different experimental systems, such as affinity purification, yeast two-hybrid (Y2H) and protein-fragment complementation assay (PCA), and evidence shows that the predicted edges are likely bona fide physical interactions. Finally, an application to a human PPI network increased the coverage of the network by at least 100%. AVAILABILITY www.cs.utsa.edu/~jruan/RWS/.
Affiliation(s)
- Chengwei Lei
- Department of Computer Science, The University of Texas at San Antonio, San Antonio, TX 78249, USA
|
20
|
Michalopoulos I, Pavlopoulos GA, Malatras A, Karelas A, Kostadima MA, Schneider R, Kossida S. Human gene correlation analysis (HGCA): a tool for the identification of transcriptionally co-expressed genes. BMC Res Notes 2012; 5:265. [PMID: 22672625 PMCID: PMC3441226 DOI: 10.1186/1756-0500-5-265] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.8] [Received: 12/24/2011] [Accepted: 05/24/2012] [Indexed: 12/29/2022] Open
Abstract
Background Bioinformatics and high-throughput technologies such as microarray studies allow the expression levels of large numbers of genes to be measured simultaneously, thus helping us to understand the molecular mechanisms of various biological processes in a cell. Findings We calculate the Pearson Correlation Coefficient (r-value) between probe set signal values from Affymetrix Human Genome Microarray samples and cluster the human genes according to the r-value correlation matrix using the Neighbour Joining (NJ) clustering method. A hyper-geometric distribution is applied to the text annotations of the probe sets to quantify the term overrepresentations. The aim of the tool is the identification of closely correlated genes for a given gene of interest and/or the prediction of its biological function, which is based on the annotations of the respective gene cluster. Conclusion Human Gene Correlation Analysis (HGCA) is a tool to classify human genes according to their coexpression levels and to identify overrepresented annotation terms in correlated gene groups. It is available at: http://biobank-informatics.bioacademy.gr/coexpression/.
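The correlation step described in the Findings can be sketched in a few lines. The example below computes the Pearson r-value between expression vectors and ranks the genes most correlated with a query gene. It covers only that first step: the Neighbour Joining clustering and the hypergeometric term test are not reproduced, and the gene names, the toy values, and the helper `top_coexpressed` are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (>= 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return numerator / denominator


def top_coexpressed(expression, query, k=3):
    """Return the k genes most positively correlated with `query`.

    `expression` maps a gene name to its signal values across samples.
    """
    scores = [
        (gene, pearson_r(expression[query], values))
        for gene, values in expression.items()
        if gene != query
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy signal values for four hypothetical probe sets across four samples.
toy_expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.0],
    "GENE_C": [4.0, 3.0, 2.0, 1.0],
    "GENE_D": [1.0, 3.0, 2.0, 4.0],
}
neighbours = top_coexpressed(toy_expression, "GENE_A", k=2)
```

On real data the pairwise loop grows quadratically with the number of probe sets, so a vectorized correlation matrix would normally replace it; the plain-Python version is kept here only to make the formula explicit.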
Affiliation(s)
- Ioannis Michalopoulos
- Cryobiology of Stem Cells, Centre of Immunology and Transplantation, Biomedical Research Foundation, Academy of Athens, Soranou Athens, Greece.
|
21
|
Tseng GC, Ghosh D, Feingold E. Comprehensive literature review and statistical considerations for microarray meta-analysis. Nucleic Acids Res 2012; 40:3785-99. [PMID: 22262733 PMCID: PMC3351145 DOI: 10.1093/nar/gkr1265] [Citation(s) in RCA: 277] [Impact Index Per Article: 21.3] [Indexed: 01/18/2023] Open
Abstract
With the rapid advances of various high-throughput technologies, generation of ‘-omics’ data is commonplace in almost every biomedical field. Effective data management and analytical approaches are essential to fully decipher the biological knowledge contained in the tremendous amount of experimental data. Meta-analysis, a set of statistical tools for combining multiple studies of a related hypothesis, has become popular in genomic research. Here, we perform a systematic search from PubMed and manual collection to obtain 620 genomic meta-analysis papers, of which 333 microarray meta-analysis papers are summarized as the basis of this paper and the other 249 GWAS meta-analysis papers are discussed in the next companion paper. The review in the present paper focuses on various biological purposes of microarray meta-analysis, databases and software and related statistical procedures. Statistical considerations of such an analysis are further scrutinized and illustrated by a case study. Finally, several open questions are listed and discussed.
Affiliation(s)
- George C Tseng
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA.
|
22
|
Sinha AU, Merrill E, Armstrong SA, Clark TW, Das S. eXframe: reusable framework for storage, analysis and visualization of genomics experiments. BMC Bioinformatics 2011; 12:452. [PMID: 22103807 PMCID: PMC3235155 DOI: 10.1186/1471-2105-12-452] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 07/16/2011] [Accepted: 11/21/2011] [Indexed: 11/19/2022] Open
Abstract
Background Genome-wide experiments are routinely conducted to measure gene expression, DNA-protein interactions and epigenetic status. Structured metadata for these experiments is imperative for a complete understanding of experimental conditions, to enable consistent data processing and to allow retrieval, comparison, and integration of experimental results. Even though several repositories have been developed for genomics data, only a few provide annotation of samples and assays using controlled vocabularies. Moreover, many of them are tailored for a single type of technology or measurement and do not support the integration of multiple data types. Results We have developed eXframe - a reusable web-based framework for genomics experiments that provides 1) the ability to publish structured data compliant with accepted standards 2) support for multiple data types including microarrays and next generation sequencing 3) query, analysis and visualization integration tools (enabled by consistent processing of the raw data and annotation of samples) and is available as open-source software. We present two case studies where this software is currently being used to build repositories of genomics experiments - one contains data from hematopoietic stem cells and another from Parkinson's disease patients. Conclusion The web-based framework eXframe offers structured annotation of experiments as well as uniform processing and storage of molecular data from microarray and next generation sequencing platforms. The framework allows users to query and integrate information across species, technologies, measurement types and experimental conditions. Our framework is reusable and freely modifiable - other groups or institutions can deploy their own custom web-based repositories based on this software. It is interoperable with the most important data formats in this domain. We hope that other groups will not only use eXframe, but also contribute their own useful modifications.
Affiliation(s)
- Amit U Sinha
- Department of Pediatric Oncology, Dana-Farber Cancer Institute and Harvard Medical School, Boston, MA 02115, USA
|
23
|
Dozmorov MG, Wren JD. High-throughput processing and normalization of one-color microarrays for transcriptional meta-analyses. BMC Bioinformatics 2011; 12 Suppl 10:S2. [PMID: 22166002 PMCID: PMC3236842 DOI: 10.1186/1471-2105-12-s10-s2] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 01/01/2023] Open
Abstract
Background Microarray experiments are becoming increasingly common in biomedical research, as is their deposition in publicly accessible repositories, such as Gene Expression Omnibus (GEO). As such, there has been a surge of interest in using this microarray data for meta-analytic approaches, whether to increase sample size for a more powerful analysis of a specific disease (e.g. lung cancer) or to re-examine experiments for reasons different from those examined in the initial, publishing study that generated them. For the average biomedical researcher, there are a number of practical barriers to conducting such meta-analyses, such as manually aggregating, filtering and formatting the data. Methods to automatically process large repositories of microarray data into a standardized, directly comparable format will enable easier and more reliable access to microarray data to conduct meta-analyses. Methods We present a straightforward, simple method, robust against potential outliers, for automatic quality control and pre-processing of tens of thousands of single-channel microarray data files. GEO GDS files are quality checked by comparing parametric distributions and quantile normalized to enable direct comparison of expression levels for subsequent meta-analyses. Results 13,000 human 1-color experiments were processed to create a single gene expression matrix from which subsets can be extracted to conduct meta-analyses. Interestingly, we found that when conducting a global meta-analysis of gene-gene co-expression patterns across all 13,000 experiments to predict gene function, normalization yielded minimal improvement over using the raw data. Conclusions Normalization of microarray data appears to be of minimal importance for analyses based on co-expression patterns when the sample size is on the order of thousands of microarray datasets. Smaller subsets, however, are more prone to aberrations and artefacts, and effective means of automating normalization procedures not only empower meta-analytic approaches but also aid reproducibility by providing a standard way of approaching the problem. Data availability: a matrix containing normalized expression of 20,813 genes across 13,000 experiments is available for download at . Source code for GDS files pre-processing is available from the authors upon request.
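The quantile normalization step named in the Methods can be illustrated with a textbook sketch. The code below is a generic version of the technique, not the authors' pipeline (their source code is stated to be available only on request): every array is forced onto the same empirical distribution by replacing each value with the mean of all values sharing its rank. The function name and the simple tie handling (ties broken by position) are assumptions of this example.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of arrays (one list of values per array).

    All arrays must measure the same genes in the same order.
    Ties are broken by position, a simplification of the usual
    tie-averaging rule.
    """
    if not samples:
        return []
    n_genes = len(samples[0])
    if any(len(sample) != n_genes for sample in samples):
        raise ValueError("all samples must have the same number of genes")

    # Mean of the values found at each rank across all arrays.
    sorted_samples = [sorted(sample) for sample in samples]
    rank_means = [
        sum(column[rank] for column in sorted_samples) / len(samples)
        for rank in range(n_genes)
    ]

    normalized = []
    for sample in samples:
        # Gene indices ordered from lowest to highest expression.
        order = sorted(range(n_genes), key=lambda i: sample[i])
        result = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            result[gene_index] = rank_means[rank]
        normalized.append(result)
    return normalized


# Two toy arrays measuring the same three genes.
normalized = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

After normalization every array holds the same set of values, so only the within-array ordering of genes differs between samples, which is what makes expression levels directly comparable.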
Affiliation(s)
- Mikhail G Dozmorov
- Arthritis and Clinical Immunology Research Program, Oklahoma Medical Research Foundation 825 NE 13th Street, Oklahoma City, Oklahoma 73104-5005, USA.
|
24
|
Chang CW, Cheng WC, Chen CR, Shu WY, Tsai ML, Huang CL, Hsu IC. Identification of human housekeeping genes and tissue-selective genes by microarray meta-analysis. PLoS One 2011; 6:e22859. [PMID: 21818400 PMCID: PMC3144958 DOI: 10.1371/journal.pone.0022859] [Citation(s) in RCA: 99] [Impact Index Per Article: 7.1] [Received: 04/20/2011] [Accepted: 06/29/2011] [Indexed: 01/26/2023] Open
Abstract
Background Categorizing protein-encoding transcriptomes of normal tissues into housekeeping genes and tissue-selective genes is a fundamental step toward studies of genetic functions and genetic associations to tissue-specific diseases. Previous studies have been mainly based on a few data sets with limited samples in each tissue, which limited the representativeness of their identified genes and resulted in low consensus among them. Results This study compiled 1,431 samples in 43 normal human tissues from 104 microarray data sets. We developed a new method to improve gene expression assessment, and showed that more than ten samples are needed to robustly identify the protein-encoding transcriptome of a tissue. We identified 2,064 housekeeping genes and 2,293 tissue-selective genes, and analyzed gene lists by functional enrichment analysis. The housekeeping genes are mainly involved in fundamental cellular functions, and the tissue-selective genes are strikingly related to functions and diseases corresponding to tissue-origin. We also compared agreements and related functions among our housekeeping genes and those of previous studies, and pointed out some reasons for the low consensus. Conclusions The results indicate that sufficient samples have improved the identification of the protein-encoding transcriptome of a tissue. Comprehensive meta-analysis has proved the high quality of our identified housekeeping (HK) and tissue-selective (TS) genes. These results could offer a useful resource for future research on functional and genomic features of HK and TS genes.
Affiliation(s)
- Cheng-Wei Chang
- Department of Biomedical Engineering and Environmental Sciences, National Tsing Hua University, Hsinchu, Taiwan
| | - Wei-Chung Cheng
- Department of Biomedical Engineering and Environmental Sciences, National Tsing Hua University, Hsinchu, Taiwan
| | - Chaang-Ray Chen
- Department of Biomedical Engineering and Environmental Sciences, National Tsing Hua University, Hsinchu, Taiwan
| | - Wun-Yi Shu
- Institute of Statistics, National Tsing Hua University, Hsinchu, Taiwan
| | - Min-Lung Tsai
- Institute of Athletics, National Taiwan Sport University, Taichung, Taiwan
| | - Ching-Lung Huang
- Department of Biomedical Engineering and Environmental Sciences, National Tsing Hua University, Hsinchu, Taiwan
| | - Ian C. Hsu
- Department of Biomedical Engineering and Environmental Sciences, National Tsing Hua University, Hsinchu, Taiwan
|
25
|
Cheng WC, Chang CW, Chen CR, Tsai ML, Shu WY, Li CY, Hsu IC. Identification of reference genes across physiological states for qRT-PCR through microarray meta-analysis. PLoS One 2011; 6:e17347. [PMID: 21390309 PMCID: PMC3044736 DOI: 10.1371/journal.pone.0017347] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.1] [Received: 09/15/2010] [Accepted: 01/31/2011] [Indexed: 01/11/2023] Open
Abstract
Background The accuracy of quantitative real-time PCR (qRT-PCR) is highly dependent on reliable reference gene(s). Some housekeeping genes that are commonly used for normalization are widely recognized as inappropriate in many experimental conditions. This study aimed to identify reference genes for clinical studies through microarray meta-analysis of human clinical samples. Methodology/Principal Findings After uniform data preprocessing and data quality control, 4,804 Affymetrix HU-133A arrays from clinical samples were classified into four physiological states with 13 organ/tissue types. We identified a list of reference genes for each organ/tissue type that exhibited stable expression across physiological states. Furthermore, 102 genes identified as reference gene candidates in multiple organ/tissue types were selected for further analysis. These genes have been frequently identified as housekeeping genes in previous studies, and approximately 71% of them fall into the Gene Expression (GO:0010467) category in Gene Ontology. Conclusions/Significance Based on microarray meta-analysis of human clinical sample arrays, we identified sets of reference gene candidates for various organ/tissue types and then examined the functions of these genes. Additionally, we found that many of the reference genes are functionally related to transcription, RNA processing and translation. According to our results, researchers could select single or multiple reference gene(s) for normalization of qRT-PCR in clinical studies.
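To make "stable expression across physiological states" concrete, here is a minimal sketch that ranks genes by their coefficient of variation (standard deviation divided by mean), with the lowest values treated as the most stable candidates. The abstract does not state which stability measure the authors used, so the coefficient of variation, the gene names, and the toy values are all assumptions for illustration.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for mean 0")
    return statistics.stdev(values) / abs(mean)


def rank_reference_candidates(expression_by_gene):
    """Order genes from most to least stable (lowest CV first).

    `expression_by_gene` maps a gene name to its expression values
    pooled across samples from all physiological states.
    """
    scores = {
        gene: coefficient_of_variation(values)
        for gene, values in expression_by_gene.items()
    }
    return sorted(scores, key=scores.get)


# Hypothetical expression values across four samples.
toy_expression = {
    "GENE_VARIABLE": [5.0, 12.0, 8.0, 15.0],
    "GENE_STABLE": [10.0, 10.1, 9.9, 10.0],
    "GENE_MID": [10.0, 11.0, 9.0, 10.5],
}
ranking = rank_reference_candidates(toy_expression)
```

A fuller analysis would compute stability within each organ/tissue type and compare across the four physiological states separately, as the abstract describes; pooling all samples is a simplification.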
Affiliation(s)
- Wei-Chung Cheng
- Department of Biomedical Engineering and Environmental Sciences, National Tsing Hua University, Hsinchu, Taiwan
| | - Cheng-Wei Chang
- Department of Biomedical Engineering and Environmental Sciences, National Tsing Hua University, Hsinchu, Taiwan
| | - Chaang-Ray Chen
- Department of Biomedical Engineering and Environmental Sciences, National Tsing Hua University, Hsinchu, Taiwan
| | - Min-Lung Tsai
- Institute of Athletics, National Taiwan Sport University, Taichung, Taiwan
| | - Wun-Yi Shu
- Institute of Statistics, National Tsing Hua University, Hsinchu, Taiwan
| | - Chia-Yang Li
- Department of Biomedical Engineering and Environmental Sciences, National Tsing Hua University, Hsinchu, Taiwan
| | - Ian C. Hsu
- Department of Biomedical Engineering and Environmental Sciences, National Tsing Hua University, Hsinchu, Taiwan
|