1
Chen B, Gao L, Shang X. A two-way rectification method for identifying differentially expressed genes by maximizing the co-function relationship. BMC Genomics 2021; 22:471. [PMID: 34171992 PMCID: PMC8229713 DOI: 10.1186/s12864-021-07772-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2020] [Accepted: 06/04/2021] [Indexed: 11/15/2022] Open
Abstract
BACKGROUND The identification of differentially expressed genes (DEGs) is an important task in many biological studies. Widely used methods typically calculate a score for each gene by estimating the significance level of its differential expression. However, biological experiments often have only three replicates, and gene expression datasets contain considerable noise, which poses a great challenge to statistical analysis methods. Moreover, gene expression levels are not evenly distributed, so lowly expressed genes are more easily detected by fold-change-based methods, which may result in many false positives in the DEG list. Since phenotypic changes resulting from DEGs should be strongly related to several distinct cellular functions, a more robust method should be designed to increase the true positive rate of functionally related DEGs. RESULTS In this study, we propose a two-way rectification method for identifying DEGs by maximizing the co-function relationships between genes and their enriched cellular pathways. An iteration strategy is employed to sequentially narrow down the group of identified DEGs and their associated biological functions. Functional analyses reveal that the identified DEGs are well organized into functional modules, and that the enriched pathways are highly significant, with lower p-values and larger gene counts. CONCLUSIONS An integrative rectification method is proposed to identify key DEGs and their related functions simultaneously. Experimental validation demonstrates that the method has high interpretability and feasibility. It performs very well in identifying notable functionally related genes.
Affiliation(s)
- Bolin Chen
  - School of Computer Science, Northwestern Polytechnical University, 127 Youyi west road, Xi’an, 710072 China
  - Key Laboratory of Big Data Storage and Management, Ministry of Industry and Information Technology, 127 Youyi west road, Xi’an, 710072 China
  - Centre for Multidisciplinary Convergence Computing (CMCC), 127 Youyi west road, Xi’an, 710072 China
  - National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, 127 Youyi west road, Xi’an, 710072 China
- Li Gao
  - School of Software, Northwestern Polytechnical University, 127 Youyi west road, Xi’an, 710072 China
- Xuequn Shang
  - School of Computer Science, Northwestern Polytechnical University, 127 Youyi west road, Xi’an, 710072 China
  - Key Laboratory of Big Data Storage and Management, Ministry of Industry and Information Technology, 127 Youyi west road, Xi’an, 710072 China
2
Jia X, Han Q, Lu Z. Analyzing the similarity of samples and genes by MG-PCC algorithm, t-SNE-SS and t-SNE-SG maps. BMC Bioinformatics 2018; 19:512. [PMID: 30558536 PMCID: PMC6296107 DOI: 10.1186/s12859-018-2495-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 04/29/2018] [Accepted: 11/16/2018] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND Clustering and visualizing samples and genes are important methods for analyzing gene expression data sets across different samples. However, it is difficult to integrate clustering and visualization techniques when the similarities of samples and genes are defined by the Pearson correlation coefficient (PCC). RESULTS Here, for the few samples of gene expression data sets, we use the MG-PCC (mini-groups defined by PCC) algorithm to divide them into mini-groups, and use t-SNE-SSP maps to display these mini-groups. The idea of the MG-PCC algorithm is that nearest neighbors should be in the same mini-group. The t-SNE-SSP map is selected from a series of t-SNE (t-distributed Stochastic Neighbor Embedding) maps of standardized samples that differ in their perplexity parameter. Moreover, PCC clusters of the many genes are displayed by a t-SNE-SGI map, which is selected from a series of t-SNE maps of standardized genes that differ in their initialization dimensions. The t-SNE-SSP and t-SNE-SGI maps are selected by A-value, which is modeled from the areas of clustering projections; the selected map is the t-SNE map with the smallest A-value. CONCLUSIONS From the analysis of cancer gene expression data sets, we demonstrate that the MG-PCC algorithm is able to put tumor and normal samples into their respective mini-groups, and that t-SNE-SSP (or t-SNE-SGI) maps are able to display the relationships between mini-groups (or PCC clusters) clearly. Furthermore, t-SNE-SS(m) (or t-SNE-SG(n)) maps are able to construct independent tree diagrams of the nearest sample (or gene) neighbors, where each tree diagram corresponds to a mini-group of samples (or genes).
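The abstract states only the guiding idea of MG-PCC, namely that nearest neighbors under PCC should share a mini-group. The following is a minimal sketch of that idea, not the authors' published algorithm: it assumes the nearest neighbor of a sample is the other sample with the highest PCC and merges nearest-neighbor pairs with a small union-find. The function names `pearson` and `mini_groups` and the sample data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mini_groups(samples):
    """Group samples so that each one shares a group with its nearest neighbor.

    `samples` maps a sample name to its expression vector. "Nearest" means
    highest PCC, which is an assumption of this sketch.
    """
    names = list(samples)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the group representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for name in names:
        nearest = max(
            (other for other in names if other != name),
            key=lambda other: pearson(samples[name], samples[other]),
        )
        parent[find(name)] = find(nearest)  # merge the two groups

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical expression vectors for two tumor and two normal samples.
example = {
    "T1": [1, 2, 3, 4],
    "T2": [2, 4, 6, 9],
    "N1": [4, 3, 2, 1],
    "N2": [9, 6, 4, 2],
}
print(mini_groups(example))
```

With this toy input the two tumor samples and the two normal samples end up in separate mini-groups, because each sample correlates positively only with its own class.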
Affiliation(s)
- Xingang Jia
  - School of Mathematics, Southeast University, Nanjing, 210096, People's Republic of China
- Qiuhong Han
  - Department of Mathematics, Nanjing Forestry University, Nanjing, 210037, People's Republic of China
- Zuhong Lu
  - State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, 210096, People's Republic of China
3
Simon TW, Budinsky RA, Rowlands JC. A model for aryl hydrocarbon receptor-activated gene expression shows potency and efficacy changes and predicts squelching due to competition for transcription co-activators. PLoS One 2015; 10:e0127952. [PMID: 26039703 PMCID: PMC4454675 DOI: 10.1371/journal.pone.0127952] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 01/16/2015] [Accepted: 04/22/2015] [Indexed: 12/17/2022] Open
Abstract
A stochastic model of nuclear receptor-mediated transcription was developed based on activation of the aryl hydrocarbon receptor (AHR) by 2,3,7,8-tetrachlorodibenzodioxin (TCDD) and subsequent binding of the activated AHR to xenobiotic response elements (XREs) on DNA. The model was based on effects observed in cell lines commonly used as in vitro experimental systems. Following ligand binding, the AHR moves into the cell nucleus and forms a heterodimer with the aryl hydrocarbon receptor nuclear translocator (ARNT). In the model, a requirement for binding to DNA is that a generic coregulatory protein is subsequently bound to the AHR-ARNT dimer. Varying the amount of coregulator available within the nucleus altered both the potency and efficacy of TCDD for inducing transcription of CYP1A1 mRNA, a commonly used marker for activation of the AHR. Lowering the amount of available cofactor slightly increased the EC50 for the transcriptional response without changing the efficacy or maximal response. Further reduction in the amount of cofactor reduced the efficacy and produced non-monotonic dose-response curves (NMDRCs) at higher ligand concentrations. The shapes of these NMDRCs were reminiscent of the phenomenon of squelching. Resource limitations for transcriptional machinery are becoming apparent in eukaryotic cells. Within single cells, nuclear receptor-mediated gene expression appears to be a stochastic process; however, intercellular communication and other aspects of tissue coordination may represent a compensatory process to maintain an organism’s ability to respond on a phenotypic level to various stimuli within an inconstant environment.
Affiliation(s)
- Ted W. Simon
  - Ted Simon LLC, Winston, GA, United States of America
- Robert A. Budinsky
  - The Dow Chemical Company, Toxicology and Environmental Research & Consulting, Midland, MI, United States of America
- J. Craig Rowlands
  - The Dow Chemical Company, Toxicology and Environmental Research & Consulting, Midland, MI, United States of America
4
Khan HA. A novel gene expression index (GEI) with software support for comparing microarray gene signatures. Gene 2013; 512:82-8. [PMID: 23059903 DOI: 10.1016/j.gene.2012.09.101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2012] [Revised: 09/13/2012] [Accepted: 09/29/2012] [Indexed: 02/05/2023]
Abstract
This study aimed to examine the validity of commonly used statistical tests for comparing expression data from simulated and real gene signatures as well as pathway-characterized gene sets. A novel algorithm based on 10 sub-gradations (5 for up- and 5 for down-regulation) of fold-changes was designed and tested using an Excel add-in. Our findings showed the limitations of conventional statistics for comparing microarray gene expression data. However, the newly introduced Gene Expression Index (GEI) appeared to be more robust and straightforward for two-group comparison of normalized data. The software automation simplifies the task, and the results are displayed in a comprehensive format including a color-coded bar showing the intensity of cumulative gene expression.
Affiliation(s)
- Haseeb Ahmad Khan
  - Analytical and Molecular Bioscience Research Group, Department of Biochemistry, College of Science, King Saud University, Riyadh, Saudi Arabia
5
Pirim H, Ekşioğlu B, Perkins A, Yüceer Ç. Clustering of High Throughput Gene Expression Data. Computers & Operations Research 2012; 39:3046-3061. [PMID: 23144527 PMCID: PMC3491664 DOI: 10.1016/j.cor.2012.03.008] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.3] [Indexed: 05/06/2023]
Abstract
High throughput biological data need to be processed, analyzed, and interpreted to address problems in life sciences. Bioinformatics, computational biology, and systems biology deal with biological problems using computational methods. Clustering is one of the methods used to gain insight into biological processes, particularly at the genomics level. Clearly, clustering can be used in many areas of biological data analysis. However, this paper presents a review of the current clustering algorithms designed especially for analyzing gene expression data. It is also intended to introduce one of the main problems in bioinformatics - clustering gene expression data - to the operations research community.
Affiliation(s)
- Harun Pirim
  - Department of Industrial and Systems Engineering, Mississippi State University, P.O. Box 9542, Mississippi State, MS 39762
  - Corresponding author. Tel.: +1-662-325-4226
- Burak Ekşioğlu
  - Department of Industrial and Systems Engineering, Mississippi State University, P.O. Box 9542, Mississippi State, MS 39762
- Andy Perkins
  - Department of Computer Science and Engineering, Mississippi State University
- Çetin Yüceer
  - Department of Forestry, Mississippi State University
6
Hu J, Xu J. Density based pruning for identification of differentially expressed genes from microarray data. BMC Genomics 2010; 11 Suppl 2:S3. [PMID: 21047384 PMCID: PMC2975422 DOI: 10.1186/1471-2164-11-s2-s3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 01/16/2023] Open
Abstract
Motivation Identification of differentially expressed genes from microarray datasets is one of the most important analyses in microarray data mining. Popular algorithms such as the statistical t-test rank genes based on a single statistic. The false positive rate of these methods can be improved by considering other features of differentially expressed genes. Results We propose a pattern recognition strategy for identifying differentially expressed genes. Genes are mapped to a two-dimensional feature space composed of the average difference in gene expression and the average expression level. A density based pruning algorithm (DB pruning) is developed to screen out potential differentially expressed genes, which are usually located in the sparse boundary region. Biases of popular algorithms for identifying differentially expressed genes are visually characterized. Experiments on 17 datasets from the Gene Expression Omnibus (GEO) database with experimentally verified differentially expressed genes showed that DB pruning can significantly improve the prediction accuracy of popular identification algorithms such as t-test, rank product, and fold change. Conclusions Density based pruning of non-differentially expressed genes is an effective method for enhancing statistical-testing-based algorithms for identifying differentially expressed genes. It improves t-test, rank product, and fold change by 11% to 50% in the number of true differentially expressed genes identified. The source code of DB pruning is freely available on our website http://mleg.cse.sc.edu/degprune
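To make the two-dimensional feature space and the notion of a "sparse boundary region" concrete, here is a minimal sketch under stated assumptions. It is not the published DB pruning code: the neighbor-count density estimate, the `radius` and `min_neighbors` parameters, and the function names are all hypothetical choices for illustration.

```python
from math import hypot


def gene_features(control, treated):
    """Map one gene to (average difference, average expression level)."""
    mean_c = sum(control) / len(control)
    mean_t = sum(treated) / len(treated)
    return (mean_t - mean_c, (mean_t + mean_c) / 2)


def sparse_region_genes(features, radius, min_neighbors):
    """Return genes with fewer than `min_neighbors` other genes within `radius`.

    `features` maps a gene name to its 2-D feature point. Genes in dense
    regions are pruned as likely non-differentially expressed; the genes
    returned here are the remaining candidates. Both axes are assumed to be
    on comparable scales, which real data would need rescaling to achieve.
    """
    kept = []
    for gene, (dx, dy) in features.items():
        neighbors = sum(
            1
            for other, (ox, oy) in features.items()
            if other != gene and hypot(dx - ox, dy - oy) <= radius
        )
        if neighbors < min_neighbors:
            kept.append(gene)
    return sorted(kept)


# Hypothetical feature points: three genes crowd together, one lies apart.
points = {
    "g1": (0.0, 5.0),
    "g2": (0.1, 5.1),
    "g3": (-0.1, 4.9),
    "g4": (3.0, 5.0),
}
print(gene_features([1, 2, 3], [4, 5, 6]))
print(sparse_region_genes(points, radius=0.5, min_neighbors=1))
```

In this toy input only `g4` survives pruning, since it has no neighbors within the chosen radius while the other three genes sit in a dense cluster near zero average difference.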
Affiliation(s)
- Jianjun Hu
  - Department of Computer Science and Engineering, University of South Carolina, Columbia, SC 29208, USA
7
Abstract
PURPOSE OF REVIEW The desire for biomarkers for diagnosis and prognosis of diseases has never been greater. With the availability of genome data and an increased availability of proteome data, the discovery of biomarkers has become increasingly feasible. This article reviews some recent applications of the many evolving 'omic technologies to organ transplantation. RECENT FINDINGS With the advancement of many high-throughput 'omic techniques such as genomics, metabolomics, antibiomics, peptidomics, and proteomics, efforts have been made to understand potential mechanisms of specific graft injuries and develop novel biomarkers for acute rejection, chronic rejection, and operational tolerance. SUMMARY The translation of potential biomarkers from the laboratory bench to the clinical bedside is not an easy task and will require the concerted effort of the immunologists, molecular biologists, transplantation specialists, geneticists, and experts in bioinformatics. Rigorous prospective validation studies will be needed using large sets of independent patient samples. The appropriate and timely exploitation of evolving 'omic technologies will lay the cornerstone for a new age of translational research for organ transplant monitoring.
8
Zhang H, Song X, Wang H, Zhang X. MIClique: An algorithm to identify differentially coexpressed disease gene subset from microarray data. J Biomed Biotechnol 2010; 2009:642524. [PMID: 20169000 PMCID: PMC2822236 DOI: 10.1155/2009/642524] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 03/24/2009] [Accepted: 10/28/2009] [Indexed: 01/05/2023] Open
Abstract
Computational analysis of microarray data has provided an effective way to identify disease-related genes. Traditional methods for selecting disease genes from microarray data, such as statistical tests, typically focus on genes that are differentially expressed across samples and prioritize each gene individually. These traditional methods might miss differentially coexpressed (DCE) gene subsets because they ignore the interactions between genes. In this paper, the MIClique algorithm is proposed to identify DCE gene subsets based on mutual information and clique analysis. Mutual information is used to measure the coexpression relationship between each pair of genes in two different kinds of samples. Clique analysis is a commonly used method in biological networks, where a clique generally represents a biological module of similar function. By applying the MIClique algorithm to real gene expression data, some DCE gene subsets that are correlated under one experimental condition but uncorrelated under another are detected from the graphs of a colon dataset and a leukemia dataset.
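The mutual-information step can be illustrated with a short sketch. This is not the MIClique implementation: the median-based binarization, the function names, and the expression values are assumptions made for illustration, and the clique analysis stage is omitted.

```python
from collections import Counter
from math import log2


def discretize(values):
    """Binarize expression values around their median (1 = above the median)."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2
    return [1 if v > median else 0 for v in values]


def mutual_information(x, y):
    """Mutual information, in bits, between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


def coexpression_change(gene_a, gene_b):
    """MI of a gene pair in 'normal' samples minus its MI in 'disease' samples.

    Each gene is a dict holding one expression list per condition.
    """
    mi = {
        cond: mutual_information(discretize(gene_a[cond]), discretize(gene_b[cond]))
        for cond in ("normal", "disease")
    }
    return mi["normal"] - mi["disease"]


# Hypothetical pair: coexpressed in normal samples, unrelated in disease samples.
gene_a = {"normal": [1, 2, 3, 4], "disease": [1, 2, 3, 4]}
gene_b = {"normal": [2, 4, 6, 8], "disease": [5, 1, 6, 2]}
print(coexpression_change(gene_a, gene_b))
```

A large absolute difference flags a pair whose coexpression differs between conditions. In a fuller pipeline such pairs would become graph edges before any clique search.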
Affiliation(s)
- Huanping Zhang
  - Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
- Xiaofeng Song
  - Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
- Huinan Wang
  - Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
- Xiaobai Zhang
  - Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
9
Shan WJ, Tong CF, Shi JS. [Comparison of statistical methods for detecting differential expression in microarray data]. Yi Chuan = Hereditas 2009; 30:1640-6. [PMID: 19073583 DOI: 10.3724/sp.j.1005.2008.01640] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/25/2022]
Abstract
The DNA microarray is a new tool in biotechnology that allows the expression of thousands of genes in cells to be monitored simultaneously. The goal of differential gene expression analysis is to detect genes whose expression levels change significantly as a result of experimental conditions. Although various statistical methods have been suggested for confirming differential gene expression, only a few studies have compared their performance. This paper compares statistical methods for finding differentially expressed genes (DEGs) from microarray data. Using simulated and real datasets (Populus cDNA microarray data), we compared eight methods of identifying differential gene expression. The simulated datasets included four different distributions (normal, uniform, χ², and exponential). The analysis of the simulated datasets showed that the eight methods performed better on uniformly distributed microarray data than on normally distributed data, and performed poorly on data following the χ² and exponential distributions. Of these eight methods, SAM (Significance Analysis of Microarrays) and the Wilcoxon rank sum test performed well in most cases. The results on the real Populus cDNA microarray data showed that SAM, Samroc, and the regression modeling approach gave very similar results, whereas the Wilcoxon rank sum test differed from them. Among the eight methods, Samroc and the regression modeling approach were similar to each other. For both simulated and real datasets, SAM, Samroc, and the regression modeling approach performed better than the other methods.
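Of the methods compared, the Wilcoxon rank sum test is simple enough to sketch in a few lines. The example below computes only the rank-sum statistic W for one gene, using midranks for ties; the function name and the expression values are hypothetical, and converting W to a p-value is left out.

```python
def rank_sum(group_a, group_b):
    """Wilcoxon rank-sum statistic W for `group_a`, using midranks for ties.

    Both groups are lists of expression values for a single gene.
    """
    pooled = sorted(group_a + group_b)
    ranks = {}  # value -> average 1-based rank among the pooled values
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Tied values occupy ranks i+1 .. j, so their midrank is the mean of the two ends.
        ranks[pooled[i]] = (i + 1 + j) / 2
        i = j
    return sum(ranks[v] for v in group_a)


# Hypothetical expression values for one gene under two conditions.
control = [1.2, 2.5, 3.1]
treated = [4.0, 5.5, 6.3]
print(rank_sum(control, treated))
```

Under the null hypothesis of no difference, the expected value of W is n_a(n_a + n_b + 1)/2, which is 10.5 for two groups of three. The smallest possible value of 6 obtained here reflects complete separation of the two groups.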
Affiliation(s)
- Wen-Juan Shan
  - The Key Laboratory of Forest Genetics and Gene Engineering of the State Administration and Jiangsu Province, Nanjing Forestry University, Nanjing 210037, China
10
Abstract
Metabolomics describes the measurement of the full complement of the products of metabolism in a single biological sample and correlating these metabolomic profiles with known physiological or pathological states. The metabolome offers the possibility of finding unique fingerprints responsible for different phenotypes. Analytical techniques such as nuclear magnetic resonance or mass spectrometry measure thousands of compounds within the metabolome simultaneously and appropriate data mining and database tools allow the finding of significant correlations between the measured metabolomes. The first direct outcome of nutritional metabolomics will be the discovery of biomarkers, which can reveal changes in health and disease but also indicate short term and long-term dietary intake. The concerted actions of nutrigenomics and metabolomics will play a crucial role in understanding how specific interactions of single nucleotide polymorphisms (SNP) influence a person's response to a diet. Finally, systems biology approaches to human nutrition combine transcriptomics, proteomics and metabolomics with the aim of understanding how diets interact within the human being.
Affiliation(s)
- A Koulman
  - Medical Research Council Human Nutrition Research, Cambridge, UK
11
Abstract
The desire for biomarkers for diagnosis and prognosis of diseases has never been greater. With the availability of genome data and an increased availability of proteome data, the discovery of biomarkers has become increasingly feasible. However, the task is daunting and requires collaborations among researchers working in the fields of transplantation, immunology, genetics, molecular biology, biostatistics and bioinformatics. With the advancement of high throughput omic techniques such as genomics and proteomics (collectively known as proteogenomics), efforts have been made to develop diagnostic tools from new and to-be discovered biomarkers. Yet biomarker validation, particularly in organ transplantation, remains challenging because of the lack of a true gold standard for diagnostic categories and analytical bottlenecks that face high-throughput data deconvolution. Even though microarray technique is relatively mature, proteomics is still growing with regards to data normalization and analysis methods. Study design, sample selection and rigorous data analysis are the critical issues for biomarker discovery using high-throughput proteogenomic technologies that combine the use and strengths of both genomics and proteomics. In this review, we look into the current status and latest developments in the field of biomarker discovery using genomics and proteomics related to organ transplantation, with an emphasis on the evolution of proteomic technologies.
Affiliation(s)
- Tara K Sigdel
  - Department of Pediatrics-Nephrology, Stanford University Medical School, Stanford University, Stanford, CA 94305, USA