1
Hui TX, Kasim S, Aziz IA, Fudzee MFM, Haron NS, Sutikno T, Hassan R, Mahdin H, Sen SC. Robustness evaluations of pathway activity inference methods on gene expression data. BMC Bioinformatics 2024; 25:23. [PMID: 38216898 PMCID: PMC10785356 DOI: 10.1186/s12859-024-05632-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2023] [Accepted: 01/02/2024] [Indexed: 01/14/2024] Open
Abstract
BACKGROUND With the exponential growth of high-throughput technologies, multiple pathway analysis methods have been proposed to estimate pathway activities from gene expression profiles. These pathway activity inference methods can be divided into two main categories: non-Topology-Based (non-TB) and Pathway Topology-Based (PTB) methods. Although some review and survey articles have discussed the topic from different aspects, a systematic assessment and comparison of the robustness of these approaches is lacking. RESULTS This study presents comprehensive robustness evaluations of seven widely used pathway activity inference methods using six cancer datasets, based on two assessments. The first assessment investigates the robustness of the pathway activities inferred by these methods, while the second assesses the robustness of the risk-active pathways and genes they predict. The mean reproducibility power and the total number of identified informative pathways and genes were evaluated. In the first assessment, the mean reproducibility power of the pathway activity inference methods generally decreased as the number of selected pathways increased. Entropy-based Directed Random Walk (e-DRW) distinctly outperformed the other methods, exhibiting the greatest reproducibility power across all cancer datasets. The second assessment, on the other hand, showed that no method provides satisfactory results across datasets. CONCLUSION PTB methods nevertheless generally appear to perform better than non-TB methods in producing greater reproducibility power and identifying potential cancer markers.
Affiliation(s)
- Tay Xin Hui
  - Soft Computing and Data Mining Center, Faculty of Computer Sciences and Information Technology, Universiti Tun Hussein Onn Malaysia (UTHM), 83000, Batu Pahat, Malaysia
- Shahreen Kasim
  - Soft Computing and Data Mining Center, Faculty of Computer Sciences and Information Technology, Universiti Tun Hussein Onn Malaysia (UTHM), 83000, Batu Pahat, Malaysia
- Izzatdin Abdul Aziz
  - Computer and Information Sciences Department (CISD), Universiti Teknologi PETRONAS (UTP), 32610, Seri Iskandar, Malaysia
- Mohd Farhan Md Fudzee
  - Soft Computing and Data Mining Center, Faculty of Computer Sciences and Information Technology, Universiti Tun Hussein Onn Malaysia (UTHM), 83000, Batu Pahat, Malaysia
- Nazleeni Samiha Haron
  - Computer and Information Sciences Department (CISD), Universiti Teknologi PETRONAS (UTP), 32610, Seri Iskandar, Malaysia
- Tole Sutikno
  - Department of Electrical Engineering, Universitas Ahmad Dahlan (UAD), 55166, Yogyakarta, Indonesia
- Rohayanti Hassan
  - Faculty of Electrical Engineering, Universiti Teknologi Malaysia (UTM), 81310, Johor Bahru, Malaysia
- Hairulnizam Mahdin
  - Soft Computing and Data Mining Center, Faculty of Computer Sciences and Information Technology, Universiti Tun Hussein Onn Malaysia (UTHM), 83000, Batu Pahat, Malaysia
- Seah Choon Sen
  - Faculty of Computing, Universiti Teknologi Malaysia (UTM), 81310, Johor Bahru, Malaysia
2
Shah I, Bundy J, Chambers B, Everett LJ, Haggard D, Harrill J, Judson RS, Nyffeler J, Patlewicz G. Navigating Transcriptomic Connectivity Mapping Workflows to Link Chemicals with Bioactivities. Chem Res Toxicol 2022; 35:1929-1949. [PMID: 36301716 PMCID: PMC10483698 DOI: 10.1021/acs.chemrestox.2c00245] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 01/09/2023]
Abstract
Screening new compounds for potential bioactivities against cellular targets is vital for drug discovery and chemical safety. Transcriptomics offers an efficient approach for assessing global gene expression changes, but interpreting chemical mechanisms from these data is often challenging. Connectivity mapping is a potential data-driven avenue for linking chemicals to mechanisms based on the observation that many biological processes are associated with unique gene expression signatures (gene signatures). However, mining the effects of a chemical on gene signatures for biological mechanisms is challenging because transcriptomic data contain thousands of noisy genes. New connectivity mapping approaches seeking to distinguish signal from noise continue to be developed, spurred by the promise of discovering chemical mechanisms, new drugs, and disease targets from burgeoning transcriptomic data. Here, we analyze these approaches in terms of diverse transcriptomic technologies, public databases, gene signatures, pattern-matching algorithms, and statistical evaluation criteria. To navigate the complexity of connectivity mapping, we propose a harmonized scheme to coherently organize and compare published workflows. We first standardize concepts underlying transcriptomic profiles and gene signatures based on various transcriptomic technologies such as microarrays, RNA-Seq, and L1000 and discuss the widely used data sources such as Gene Expression Omnibus, ArrayExpress, and MSigDB. Next, we generalize connectivity mapping as a pattern-matching task for finding similarity between a query (e.g., transcriptomic profile for new chemical) and a reference (e.g., gene signature of known target). Published pattern-matching approaches fall into two main categories: vector-based approaches, which use metrics such as correlation or the Jaccard index, and aggregation-based approaches, which use parametric and nonparametric statistics (e.g., gene set enrichment analysis).
The statistical methods for evaluating the performance of different approaches are described, along with comparisons reported in the literature on benchmark transcriptomic data sets. Lastly, we review connectivity mapping applications in toxicology and offer guidance on evaluating chemical-induced toxicity with concentration-response transcriptomic data. In addition to serving as a high-level guide and tutorial for understanding and implementing connectivity mapping workflows, we hope this review will stimulate new algorithms for evaluating chemical safety and drug discovery using transcriptomic data.
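The vector-based category described in this abstract can be illustrated with a minimal Python sketch. It is not the workflow proposed in the review: the function names, the gene names and the fold-change values below are all hypothetical, and the two metrics (Jaccard index on gene sets, Pearson correlation on shared genes) are only the simplest representatives of the category.

```python
from math import sqrt


def jaccard(a, b):
    """Jaccard index between two gene sets (1.0 = identical, 0.0 = disjoint)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_references(query, references):
    """Rank reference signatures by correlation with the query profile.

    query: dict mapping gene -> log fold change (hypothetical input format).
    references: dict mapping signature name -> dict of gene -> log fold change.
    Only genes present in both query and reference are compared.
    """
    scores = {}
    for name, ref in references.items():
        shared = sorted(set(query) & set(ref))
        if len(shared) < 2:
            scores[name] = 0.0  # too few shared genes to correlate
            continue
        scores[name] = pearson([query[g] for g in shared],
                               [ref[g] for g in shared])
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical query profile and reference signatures.
query = {"TP53": 1.0, "MYC": -1.0, "EGFR": 0.5}
references = {
    "ref_a": {"TP53": 2.0, "MYC": -2.0, "EGFR": 1.0},
    "ref_b": {"TP53": -1.0, "MYC": 1.0, "EGFR": -0.5},
}
ranked = rank_references(query, references)
```

Here `ranked` lists the positively connected signature first and the reversed one last. An aggregation-based approach would instead summarize where the signature genes fall in the ranked query profile.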
Affiliation(s)
- Imran Shah
  - Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, North Carolina 27711, USA
- Joseph Bundy
  - Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, North Carolina 27711, USA
- Bryant Chambers
  - Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, North Carolina 27711, USA
- Logan J. Everett
  - Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, North Carolina 27711, USA
- Derik Haggard
  - Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, North Carolina 27711, USA
- Joshua Harrill
  - Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, North Carolina 27711, USA
- Richard S. Judson
  - Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, North Carolina 27711, USA
- Johanna Nyffeler
  - Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, North Carolina 27711, USA
  - Oak Ridge Institute for Science and Education (ORISE) Postdoctoral Fellow, Oak Ridge, Tennessee, 37831, USA
- Grace Patlewicz
  - Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, North Carolina 27711, USA
3
Dhuli K, Bonetti G, Anpilogov K, Herbst KL, Connelly ST, Bellinato F, Gisondi P, Bertelli M. Validating methods for testing natural molecules on molecular pathways of interest in silico and in vitro. Journal of Preventive Medicine and Hygiene 2022; 63:E279-E288. [PMID: 36479497 PMCID: PMC9710400 DOI: 10.15167/2421-4248/jpmh2022.63.2s3.2770] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/25/2023]
Abstract
Differentially expressed genes can serve as drug targets and are used to predict drug response and disease progression. In silico drug analysis based on the expression of these genetic biomarkers allows the detection of putative therapeutic agents, which could be used to reverse a pathological gene expression signature. Indeed, a set of bioinformatics tools can increase the accuracy of drug discovery, helping in biomarker identification. Once a drug target is identified, in vitro cell line models of disease are used to evaluate and validate the therapeutic potential of putative drugs and novel natural molecules. This study describes the development of efficacious PCR primers that can be used to measure the expression of genes in specific genetic pathways, which can lead to the identification of natural molecules as therapeutic agents acting on specific molecular pathways. For this study, genes involved in health conditions and processes were considered. In particular, the expression of genes involved in obesity, xenobiotics metabolism, the endocannabinoid pathway, leukotriene B4 metabolism and signaling, inflammation, endocytosis, hypoxia, lifespan, and neurotrophins was evaluated. Exploiting the expression of specific genes in different cell lines can be useful for evaluating the therapeutic effects of small natural molecules in vitro.
Affiliation(s)
- Kristjana Dhuli
  - MAGI’S LAB, Rovereto (TN), Italy
  - Correspondence: Kristjana Dhuli, MAGI’S LAB, Rovereto (TN), 38068, Italy. E-mail:
- Karen L. Herbst
  - Total Lipedema Care, Beverly Hills, California and Tucson, Arizona, USA
- Stephen Thaddeus Connelly
  - San Francisco Veterans Affairs Health Care System, Department of Oral & Maxillofacial Surgery, University of California, San Francisco, CA, USA
- Francesco Bellinato
  - Section of Dermatology and Venereology, Department of Medicine, University of Verona, Verona, Italy
- Paolo Gisondi
  - Section of Dermatology and Venereology, Department of Medicine, University of Verona, Verona, Italy
- Matteo Bertelli
  - MAGI’S LAB, Rovereto (TN), Italy
  - MAGI EUREGIO, Bolzano, BZ, Italy
  - MAGISNAT, Peachtree Corners (GA), USA
4
Grassi M, Tarantino B. SEMgsa: topology-based pathway enrichment analysis with structural equation models. BMC Bioinformatics 2022; 23:344. [PMID: 35978279 PMCID: PMC9385099 DOI: 10.1186/s12859-022-04884-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2022] [Accepted: 08/09/2022] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND Pathway enrichment analysis is extensively used in high-throughput experimental studies to gain insight into the functional roles of pre-defined subsets of genes, proteins and metabolites. Methods that leverage information on the topology of the underlying pathways outperform simpler methods that only consider pathway membership. Among all the proposed software tools, there is still a need to combine high statistical power with a user-friendly framework, which makes it difficult to choose the best method for a particular experimental environment. RESULTS We propose SEMgsa, a topology-based algorithm developed within the framework of structural equation models. SEMgsa combines the SEM p values of node-specific group effect estimates, in terms of activation or inhibition, after statistically controlling for biological relations among genes within pathways. We used SEMgsa to identify biologically relevant results in a Coronavirus disease (COVID-19) RNA-seq dataset (GEO accession: GSE172114) together with a frontotemporal dementia (FTD) DNA methylation dataset (GEO accession: GSE53740) and compared its performance with some existing methods. SEMgsa is highly sensitive to the pathways designed for the specific disease, showing low p values ([Formula: see text]) and ranking in high positions, outperforming existing software tools. Three pathway dysregulation mechanisms were used to generate simulated expression data and evaluate the performance of methods in terms of type I error followed by their statistical power. Simulation results confirm the best overall performance of SEMgsa. CONCLUSIONS SEMgsa is a novel and powerful method for identifying enrichment in gene expression data. It takes into account topological information and exploits pathway perturbation statistics to reveal biological information. SEMgsa is implemented in the R package SEMgraph, available at https://CRAN.R-project.org/package=SEMgraph.
Affiliation(s)
- Mario Grassi
  - Department of Brain and Behavioral Sciences, University of Pavia, Pavia, Italy
- Barbara Tarantino
  - Department of Brain and Behavioral Sciences, University of Pavia, Pavia, Italy
5
NMR in Metabolomics: From Conventional Statistics to Machine Learning and Neural Network Approaches. Applied Sciences-Basel 2022. [DOI: 10.3390/app12062824] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Indexed: 12/17/2022]
Abstract
NMR measurements combined with chemometrics provide a great amount of information for identifying potential biomarkers responsible for a precise metabolic pathway. These kinds of data are useful in different fields, ranging from food to biomedical fields, including health science. The investigation of the whole set of metabolites in a sample, representing its fingerprint in the considered condition, is known as metabolomics and may take advantage of different statistical tools. The new frontier is to adopt self-learning techniques to enhance clustering or classification actions that can improve the predictive power over large amounts of data. Although machine learning is already employed in metabolomics, deep learning and artificial neural network approaches have only recently been applied successfully. In this work, we give an overview of the statistical approaches underlying the wide range of opportunities that machine learning and neural networks offer for accurate metabolite assignment and quantification. Various current challenges are discussed, such as proper metabolomics, deep learning architectures and model accuracy.
6
Fifteen Years of Gene Set Analysis for High-Throughput Genomic Data: A Review of Statistical Approaches and Future Challenges. Entropy 2020; 22:e22040427. [PMID: 33286201 PMCID: PMC7516904 DOI: 10.3390/e22040427] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 02/24/2020] [Revised: 03/18/2020] [Accepted: 04/03/2020] [Indexed: 12/22/2022]
Abstract
Over the last decade, gene set analysis has become the first choice for gaining insights into underlying complex biology of diseases through gene expression and gene association studies. It also reduces the complexity of statistical analysis and enhances the explanatory power of the obtained results. Although gene set analysis approaches are extensively used in gene expression and genome wide association data analysis, the statistical structure and steps common to these approaches have not yet been comprehensively discussed, which limits their utility. In this article, we provide a comprehensive overview, statistical structure and steps of gene set analysis approaches used for microarrays, RNA-sequencing and genome wide association data analysis. Further, we also classify the gene set analysis approaches and tools by the type of genomic study, null hypothesis, sampling model and nature of the test statistic, etc. Rather than reviewing the gene set analysis approaches individually, we provide the generation-wise evolution of such approaches for microarrays, RNA-sequencing and genome wide association studies and discuss their relative merits and limitations. Here, we identify the key biological and statistical challenges in current gene set analysis, which will be addressed by statisticians and biologists collectively in order to develop the next generation of gene set analysis approaches. Further, this study will serve as a catalog and provide guidelines to genome researchers and experimental biologists for choosing the proper gene set analysis approach based on several factors.
7
Nguyen TM, Shafi A, Nguyen T, Draghici S. Identifying significantly impacted pathways: a comprehensive review and assessment. Genome Biol 2019; 20:203. [PMID: 31597578 PMCID: PMC6784345 DOI: 10.1186/s13059-019-1790-4] [Citation(s) in RCA: 96] [Impact Index Per Article: 19.2] [Received: 02/15/2019] [Accepted: 08/13/2019] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND Many high-throughput experiments compare two phenotypes such as disease vs. healthy, with the goal of understanding the underlying biological phenomena characterizing the given phenotype. Because of the importance of this type of analysis, more than 70 pathway analysis methods have been proposed so far. These can be categorized into two main categories: non-topology-based (non-TB) and topology-based (TB). Although some review papers discuss this topic from different aspects, there is no systematic, large-scale assessment of such methods. Furthermore, the majority of the pathway analysis approaches rely on the assumption of uniformity of p values under the null hypothesis, which is often not true. RESULTS This article presents the most comprehensive comparative study on pathway analysis methods available to date. We compare the actual performance of 13 widely used pathway analysis methods in over 1085 analyses. These comparisons were performed using 2601 samples from 75 human disease data sets and 121 samples from 11 knockout mouse data sets. In addition, we investigate the extent to which each method is biased under the null hypothesis. Together, these data and results constitute a reliable benchmark against which future pathway analysis methods could and should be tested. CONCLUSION Overall, the result shows that no method is perfect. In general, TB methods appear to perform better than non-TB methods. This is somewhat expected since the TB methods take into consideration the structure of the pathway which is meant to describe the underlying phenomena. We also discover that most, if not all, listed approaches are biased and can produce skewed results under the null.
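The uniformity assumption mentioned in this abstract can be made concrete with a small Python sketch. It is an illustration only, not the bias assessment used in the paper: it measures how far a set of p values obtained under the null departs from Uniform(0, 1), using the Kolmogorov-Smirnov distance. The function name and both example inputs are hypothetical.

```python
def ks_distance_from_uniform(p_values):
    """Kolmogorov-Smirnov distance between the empirical distribution
    of the given p values and the Uniform(0, 1) distribution.

    A value near 0 is consistent with uniform null p values; a large
    value indicates a method that is biased under the null.
    """
    ps = sorted(p_values)
    n = len(ps)
    if n == 0:
        raise ValueError("need at least one p value")
    d = 0.0
    for i, p in enumerate(ps, start=1):
        # Compare the empirical CDF just after and just before p
        # with the uniform CDF, which equals p on [0, 1].
        d = max(d, i / n - p, p - (i - 1) / n)
    return d


# Hypothetical null p values: one set evenly spread, one piled up near zero.
evenly_spread = [(i - 0.5) / 10 for i in range(1, 11)]
skewed_to_zero = [0.001 * i for i in range(1, 11)]

d_even = ks_distance_from_uniform(evenly_spread)    # about 0.05
d_skew = ks_distance_from_uniform(skewed_to_zero)   # about 0.99
```

In practice the null p values would come from repeatedly running a pathway analysis method on data with no true signal, for example after permuting sample labels.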
Affiliation(s)
- Tuan-Minh Nguyen
  - Department of Computer Science, Wayne State University, Detroit, 48202 USA
- Adib Shafi
  - Department of Computer Science, Wayne State University, Detroit, 48202 USA
- Tin Nguyen
  - Department of Computer Science and Engineering, University of Nevada, Reno, 89557 USA
- Sorin Draghici
  - Department of Computer Science, Wayne State University, Detroit, 48202 USA
  - Department of Obstetrics and Gynecology, Wayne State University, Detroit, 48202 USA
8
Amadoz A, Hidalgo MR, Çubuk C, Carbonell-Caballero J, Dopazo J. A comparison of mechanistic signaling pathway activity analysis methods. Brief Bioinform 2019; 20:1655-1668. [PMID: 29868818 PMCID: PMC6917216 DOI: 10.1093/bib/bby040] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 01/08/2018] [Revised: 03/31/2018] [Indexed: 12/11/2022] Open
Abstract
Understanding the aspects of cell functionality that account for disease mechanisms or drug modes of action is a main challenge for precision medicine. Classical gene-based approaches ignore the modular nature of most human traits, whereas conventional pathway enrichment approaches produce only illustrative results of limited practical utility. Recently, a family of new methods has emerged that changes the focus from whole pathways to the definition of elementary subpathways within them that have mechanistic significance, and to the study of their activities. Thus, mechanistic pathway activity (MPA) methods constitute a new paradigm that allows recoding poorly informative genomic measurements into quantitative values of cell activity and relating them to phenotypes. Here we provide a review of the MPA methods available and explain their contribution to systems medicine approaches for addressing challenges in the diagnosis and treatment of complex diseases.
Affiliation(s)
- Alicia Amadoz
  - Department of Bioinformatics, Igenomix S.L., 46980 Valencia, Spain
- Marta R Hidalgo
  - Clinical Bioinformatics Area, Fundación Progreso y Salud (FPS), CDCA, Hospital Virgen del Rocio, Sevilla 41013, Spain
- Cankut Çubuk
  - Clinical Bioinformatics Area, Fundación Progreso y Salud (FPS), CDCA, Hospital Virgen del Rocio, Sevilla 41013, Spain
- José Carbonell-Caballero
  - Chromatin and Gene expression Lab, Gene Regulation, Stem Cells and Cancer Program, Centre de Regulació Genòmica (CRG), The Barcelona Institute of Science and Technology, PRBB, Barcelona 08003, Spain
- Joaquín Dopazo
  - Clinical Bioinformatics Area, Fundación Progreso y Salud (FPS), CDCA, Hospital Virgen del Rocio, Sevilla 41013, Spain
  - Chromatin and Gene expression Lab, Gene Regulation, Stem Cells and Cancer Program, Centre de Regulació Genòmica (CRG), The Barcelona Institute of Science and Technology, PRBB, Barcelona 08003, Spain
  - Clinical Bioinformatics Area, Fundación Progreso y Salud (FPS), CDCA, Hospital Virgen del Rocio, Sevilla 41013, Spain, Functional Genomics Node (INB), FPS, Hospital Virgen del Rocío, Sevilla 41013, Spain and Bioinformatics in Rare Diseases (BiER), Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER), FPS, Hospital Virgen del Rocío, Sevilla 41013, Spain
9
Li Y, Wu Y, Zhang X, Bai Y, Akthar LM, Lu X, Shi M, Zhao J, Jiang Q, Li Y. SCIA: A Novel Gene Set Analysis Applicable to Data With Different Characteristics. Front Genet 2019; 10:598. [PMID: 31293623 PMCID: PMC6603225 DOI: 10.3389/fgene.2019.00598] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 05/05/2019] [Accepted: 06/05/2019] [Indexed: 01/06/2023] Open
Abstract
Gene set analysis is commonly used in functional enrichment and molecular pathway analyses. Most of the present methods are based on competitive testing, which assumes that each gene is independent of the others. However, the false discovery rates of competitive methods are amplified when they are applied to datasets with high inter-gene correlations. Self-contained testing methods could solve this problem, but they impose other restrictions on data characteristics. Therefore, a statistically rigorous testing method applicable to different datasets with various complex characteristics is needed to obtain unbiased and comparable results. We propose a self-contained and competitive incorporated analysis (SCIA) to alleviate the bias caused by the limited application scope of existing gene set analysis methods. This is accomplished through a novel permutation strategy using a priori biological networks to selectively permute gene labels with different probabilities. In simulation studies, SCIA was compared with four representative analysis methods (GSEA, CAMERA, ROAST, and NES), and produced the best performance in both false discovery rate and sensitivity under most conditions with different parameter settings. Further, the KEGG pathway analysis on two real lung cancer datasets showed that the results found by SCIA in both datasets were far more numerous than those found by GSEA, and most of them are supported by the literature. Overall, SCIA promises to offer researchers more reliable and comparable results across different datasets.
Affiliation(s)
- Yiqun Li
  - Department of Laboratory of Cancer Biology, School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Ying Wu
  - Department of Biostatistics, School of Public Health, Southern Medical University, Guangzhou, China
- Xiaohan Zhang
  - Department of Laboratory of Cancer Biology, School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Yunfan Bai
  - Department of Laboratory of Cancer Biology, School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Luqman Muhammad Akthar
  - Department of Laboratory of Cancer Biology, School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Xin Lu
  - Department of Laboratory of Cancer Biology, School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Ming Shi
  - Department of Laboratory of Cancer Biology, School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Jianxiang Zhao
  - Department of Laboratory of Cancer Biology, School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Qinghua Jiang
  - Department of Laboratory of Cancer Biology, School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Yu Li
  - Department of Laboratory of Cancer Biology, School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
10
Statistical approach for selection of biologically informative genes. Gene 2018; 655:71-83. [PMID: 29458166 DOI: 10.1016/j.gene.2018.02.044] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 05/24/2017] [Revised: 11/26/2017] [Accepted: 02/14/2018] [Indexed: 11/23/2022]
Abstract
Selection of informative genes from high dimensional gene expression data has emerged as an important research area in genomics. Many gene selection techniques proposed so far are based on either a relevancy or a redundancy measure. Further, the performance of these techniques has been judged by post-selection classification accuracy, computed with a classifier using the selected genes. This performance metric may be statistically sound but may not be biologically relevant. A statistical approach, i.e. Boot-MRMR, was proposed based on a composite measure of maximum relevance and minimum redundancy, which is both statistically sound and biologically relevant for informative gene selection. For comparative evaluation of the proposed approach, we developed two biological sufficient criteria, i.e. Gene Set Enrichment with QTL (GSEQ) and biological similarity score based on Gene Ontology (GO). Further, a systematic and rigorous evaluation of the proposed technique against 12 existing gene selection techniques was carried out using five gene expression datasets. This evaluation was based on a broad spectrum of statistically sound (e.g. subject classification) and biologically relevant (based on QTL and GO) criteria under a multiple criteria decision-making framework. The performance analysis showed that the proposed technique selects informative genes which are more biologically relevant. The proposed technique was also found to be quite competitive with the existing techniques with respect to subject classification and computational time. Our results also showed that under the multiple criteria decision-making setup, the proposed technique is best for informative gene selection over the available alternatives. Based on the proposed approach, an R package, i.e. BootMRMR, has been developed and is available at https://cran.r-project.org/web/packages/BootMRMR.
This study will provide a practical guide to select statistical techniques for selecting informative genes from high dimensional expression data for breeding and system biology studies.
11
Statistical Approach for Gene Set Analysis with Trait Specific Quantitative Trait Loci. Sci Rep 2018; 8:2391. [PMID: 29402907 PMCID: PMC5799309 DOI: 10.1038/s41598-018-19736-w] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 05/03/2017] [Accepted: 12/06/2017] [Indexed: 11/20/2022] Open
Abstract
The analysis of gene sets is usually carried out based on gene ontology terms and known biological pathways. These approaches may not establish any formal relation between genotype and trait specific phenotype. In plant biology and breeding, analysis of gene sets with trait specific Quantitative Trait Loci (QTL) data is considered a great source for biological knowledge discovery. Therefore, we proposed an innovative statistical approach called Gene Set Analysis with QTLs (GSAQ) for interpreting gene expression data in the context of gene sets with traits. The utility of GSAQ was studied on five different complex abiotic and biotic stress scenarios in rice, which yielded specific trait/stress enriched gene sets. Further, the GSAQ approach was more innovative and effective in performing gene set analysis with underlying QTLs and identifying QTL candidate genes than the existing approach. The GSAQ approach also provided two potential biologically relevant criteria for performance analysis of gene selection methods. Based on this proposed approach, an R package, i.e., GSAQ (https://cran.r-project.org/web/packages/GSAQ), has been developed. The GSAQ approach provides a valuable platform for integrating gene expression data with genetically rich QTL data.
12
Gelli M, Konda AR, Liu K, Zhang C, Clemente TE, Holding DR, Dweikat IM. Validation of QTL mapping and transcriptome profiling for identification of candidate genes associated with nitrogen stress tolerance in sorghum. BMC Plant Biology 2017; 17:123. [PMID: 28697783 PMCID: PMC5505042 DOI: 10.1186/s12870-017-1064-9] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Received: 12/13/2016] [Accepted: 06/25/2017] [Indexed: 05/10/2023]
Abstract
BACKGROUND Quantitative trait loci (QTLs) detected in one mapping population may not always be detected in other mapping populations. Therefore, before being used for marker assisted breeding, QTLs need to be validated in different environments and/or genetic backgrounds to rule out statistical anomalies. In this regard, we mapped the QTLs controlling various agronomic traits in a recombinant inbred line (RIL) population in response to Nitrogen (N) stress and validated these against the QTLs reported in our earlier study to find the stable and consistent QTLs across populations. Also, with Illumina RNA-sequencing we checked the differential expression of gene (DEG) transcripts between parents and pools of RILs with high and low nitrogen use efficiency (NUE) and overlaid these DEGs onto the common validated QTLs to find candidate genes associated with N-stress tolerance in sorghum. RESULTS An F7 RIL population derived from a cross between CK60 (N-stress sensitive) and San Chi San (N-stress tolerant) inbred sorghum lines was used to map QTLs for 11 agronomic traits tested under different N-levels. Composite interval mapping analysis detected a total of 32 QTLs for 11 agronomic traits. Validation revealed that nine of the QTLs detected in this population were consistent with the QTLs reported in an earlier study using a CK60/China17 RIL population. The validated QTLs were located on chromosomes 1, 6, 7, 8, and 9. In addition, root transcriptomic profiling detected 55 and 20 differentially expressed gene (DEG) transcripts between parents and pools of RILs with high and low NUE, respectively. Also, overlay of these DEG transcripts onto the validated QTLs found candidate gene transcripts for NUE that also showed the expected differential expression.
For example, DEG transcripts encoding Lysine histidine transporter 1 (LHT1) had abundant expression in San Chi San and the tolerant RIL pool, whereas DEG transcripts encoding seed storage albumin, transcription factor IIIC (TFIIIC) and dwarfing gene (DW2) encoding multidrug resistance-associated protein-9 homolog showed abundant expression in the CK60 parent, similar to the earlier study. CONCLUSIONS The QTLs validated among different mapping populations would be the most reliable and stable QTLs across germplasm. The DEG transcripts found in the validated QTL regions will serve as future candidate genes for enhancing NUE in sorghum using molecular approaches.
Affiliation(s)
- Malleswari Gelli
  - Department of Agronomy and Horticulture, University of Nebraska, Lincoln, NE, 68583, USA
- Anji Reddy Konda
  - Department of Biochemistry, University of Nebraska, Lincoln, NE, 68588, USA
  - Center for Plant Science Innovation, University of Nebraska, Lincoln, NE, 68588, USA
- Kan Liu
  - Center for Plant Science Innovation, University of Nebraska, Lincoln, NE, 68588, USA
  - School of Biological Sciences, University of Nebraska, Lincoln, NE, 68588, USA
- Chi Zhang
  - Center for Plant Science Innovation, University of Nebraska, Lincoln, NE, 68588, USA
  - School of Biological Sciences, University of Nebraska, Lincoln, NE, 68588, USA
- Thomas E Clemente
  - Department of Agronomy and Horticulture, University of Nebraska, Lincoln, NE, 68583, USA
  - Center for Plant Science Innovation, University of Nebraska, Lincoln, NE, 68588, USA
- David R Holding
  - Department of Agronomy and Horticulture, University of Nebraska, Lincoln, NE, 68583, USA
  - Center for Plant Science Innovation, University of Nebraska, Lincoln, NE, 68588, USA
- Ismail M Dweikat
  - Department of Agronomy and Horticulture, University of Nebraska, Lincoln, NE, 68583, USA
|
13
|
Ren X, Hu Q, Liu S, Wang J, Miecznikowski JC. Gene set analysis controlling for length bias in RNA-seq experiments. BioData Min 2017; 10:5. [PMID: 28184252 PMCID: PMC5294840 DOI: 10.1186/s13040-017-0125-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 03/23/2016] [Accepted: 01/11/2017] [Indexed: 01/29/2023]
Abstract
Background In gene set analysis, researchers are interested in determining the gene sets that are significantly correlated with an outcome, e.g. disease status or treatment. With the rapid development of high-throughput sequencing technologies, ribonucleic acid sequencing (RNA-seq) has become an important alternative to traditional expression arrays in gene expression studies. Challenges exist in adapting existing algorithms to RNA-seq data, given the intrinsic differences between the technologies and their data. In RNA-seq experiments, the measure of gene expression is correlated with gene length. This inherent correlation may cause bias in gene set analysis. Results We develop SeqGSA, a new method for gene set analysis with length bias adjustment for RNA-seq data. It extends the R package GSA, which was designed for microarrays. Our method compares the gene set maxmean statistic against permutations, while also taking into account the statistics of the other gene sets. To adjust for the gene length bias, we implement a flexible weighted sampling scheme in the restandardization step of our algorithm. We show that our method improves the power to identify significant gene sets that are affected by the length bias. We also show that our method maintains the type I error rate when compared with another representative method for gene set enrichment testing. Conclusions SeqGSA is a promising tool for testing significant gene pathways with RNA-seq data while adjusting for the inherent gene length effect. It enhances the power to detect gene sets affected by the bias and maintains the type I error rate under various situations.
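The maxmean statistic named in this abstract can be illustrated with a minimal sketch. It assumes the usual definition from the GSA literature (the larger in magnitude of the averaged positive part and the averaged negative part of the gene-level scores). The function names and the plain gene-randomisation null below are hypothetical and are not the SeqGSA implementation; in particular, the length-weighted sampling in the restandardization step is not reproduced.

```python
import random


def maxmean(scores):
    """Maxmean statistic of a list of gene-level scores.

    The positive and negative parts are each averaged over the full set
    size; the part with the larger magnitude is returned, with its sign.
    """
    n = len(scores)
    pos = sum(s for s in scores if s > 0) / n
    neg = -sum(s for s in scores if s < 0) / n
    return pos if pos >= neg else -neg


def randomisation_p_value(gene_scores, gene_set, n_perm=1000, seed=0):
    """Compare |maxmean| of a gene set with random gene sets of equal size.

    ASSUMPTION: uniform gene randomisation is used as a stand-in null;
    it does not adjust for gene length as SeqGSA does.
    """
    rng = random.Random(seed)
    genes = list(gene_scores)
    observed = abs(maxmean([gene_scores[g] for g in gene_set]))
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(genes, len(gene_set))
        if abs(maxmean([gene_scores[g] for g in sample])) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Averaging each part over the whole set size, rather than over only the positive or negative genes, is what keeps a few extreme scores in a large set from dominating the statistic.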
Affiliation(s)
- Xing Ren
  - Department of Biostatistics, SUNY University at Buffalo, Buffalo, 14214 USA
- Qiang Hu
  - Department of Biostatistics and Bioinformatics, Roswell Park Cancer Institute, Buffalo, 14263 USA
- Song Liu
  - Department of Biostatistics and Bioinformatics, Roswell Park Cancer Institute, Buffalo, 14263 USA
- Jianmin Wang
  - Department of Biostatistics and Bioinformatics, Roswell Park Cancer Institute, Buffalo, 14263 USA
|
14
|
Du J, Li M, Yuan Z, Guo M, Song J, Xie X, Chen Y. A decision analysis model for KEGG pathway analysis. BMC Bioinformatics 2016; 17:407. [PMID: 27716040 PMCID: PMC5053338 DOI: 10.1186/s12859-016-1285-1] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.8] [Received: 12/24/2015] [Accepted: 09/28/2016] [Indexed: 11/18/2022]
Abstract
Background Knowledge base-driven pathway analysis is becoming the first choice for many investigators because it not only reduces the complexity of functional analysis by grouping thousands of genes into just several hundred pathways, but also increases the explanatory power of an experiment by identifying active pathways in different conditions. However, current approaches are designed to analyze a biological system assuming that each pathway is independent of the other pathways. Results A decision analysis model is developed in this article that accounts for dependence among pathways in time-course experiments and multiple-treatment experiments. This model introduces a decision coefficient, a designed index, to identify the most relevant pathways in a given experiment by taking into account not only the direct determination factor of each Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway itself, but also the indirect determination factors from its related pathways. Meanwhile, the direct and indirect determination factors of each pathway are employed to demonstrate the regulation mechanisms among KEGG pathways, and the sign of the decision coefficient can be used to preliminarily estimate the impact direction of each KEGG pathway. A simulation study demonstrated the application of the decision analysis model to KEGG pathway analysis. Conclusions A microarray dataset from bovine mammary tissue over the entire lactation cycle was used to further illustrate our strategy. The results showed that the decision analysis model can provide promising and more biologically meaningful results. Therefore, the decision analysis model is an initial attempt at optimizing pathway analysis methodology.
Affiliation(s)
- Junli Du
  - College of sciences, Northwest A&F University, Yangling, 712100, People's Republic of China
  - College of Animal Science and Technology, Northwest A&F University, Yangling, 712100, People's Republic of China
- Manlin Li
  - College of sciences, Northwest A&F University, Yangling, 712100, People's Republic of China
- Zhifa Yuan
  - College of sciences, Northwest A&F University, Yangling, 712100, People's Republic of China
- Mancai Guo
  - College of sciences, Northwest A&F University, Yangling, 712100, People's Republic of China
- Jiuzhou Song
  - Department of Animal and Avian Sciences, University of Maryland, College Park, MD, 20742, USA
- Xiaozhen Xie
  - College of sciences, Northwest A&F University, Yangling, 712100, People's Republic of China
- Yulin Chen
  - College of Animal Science and Technology, Northwest A&F University, Yangling, 712100, People's Republic of China
|
15
|
Ma J, Shojaie A, Michailidis G. Network-based pathway enrichment analysis with incomplete network information. Bioinformatics 2016; 32:3165-3174. [PMID: 27357170 DOI: 10.1093/bioinformatics/btw410] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.9] [Received: 01/22/2016] [Accepted: 06/22/2016] [Indexed: 11/12/2022]
Abstract
MOTIVATION Pathway enrichment analysis has become a key tool for biomedical researchers to gain insight into the underlying biology of differentially expressed genes, proteins and metabolites. It reduces complexity and provides a system-level view of changes in cellular activity in response to treatments and/or in disease states. Methods that use existing pathway network information have been shown to outperform simpler methods that only take into account pathway membership. However, despite significant progress in understanding the association amongst members of biological pathways, and expansion of databases containing information about interactions of biomolecules, the existing network information may be incomplete or inaccurate and is not cell-type or disease condition-specific. RESULTS We propose a constrained network estimation framework that combines network estimation based on cell- and condition-specific high-dimensional Omics data with interaction information from existing databases. The resulting pathway topology information is subsequently used to provide a framework for simultaneous testing of differences in expression levels of pathway members, as well as their interactions. We study the asymptotic properties of the proposed network estimator and the test for pathway enrichment, and investigate its small sample performance in simulated and real data settings. AVAILABILITY AND IMPLEMENTATION The proposed method has been implemented in the R-package netgsa available on CRAN. CONTACT jinma@upenn.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jing Ma
  - Department of Biostatistics and Epidemiology, University of Pennsylvania Perelman School of Medicine, PA 19104, USA
- Ali Shojaie
  - Department of Biostatistics, University of Washington, Seattle, WA 98915, USA
- George Michailidis
  - Department of Statistics, University of Florida, Gainesville, FL 32611, USA
|
16
|
Mass spectrometry analysis and transcriptome sequencing reveal glowing squid crystal proteins are in the same superfamily as firefly luciferase. Sci Rep 2016; 6:27638. [PMID: 27279452 PMCID: PMC4899746 DOI: 10.1038/srep27638] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 02/17/2016] [Accepted: 05/18/2016] [Indexed: 01/14/2023]
Abstract
The Japanese firefly squid Hotaru-ika (Watasenia scintillans) produces intense blue light from photophores at the tips of two arms. These photophores are densely packed with protein microcrystals that catalyse the bioluminescent reaction using ATP and the substrate coelenterazine disulfate. The squid is the only organism known to produce light using protein crystals. We extracted microcrystals from arm tip photophores and identified the constituent proteins using mass spectrometry and transcriptome libraries prepared from arm tip tissue. The crystals contain three proteins, wsluc1–3, all members of the ANL superfamily of adenylating enzymes. They share 19 to 21% sequence identity with firefly luciferases, which produce light using ATP and the unrelated firefly luciferin substrate. We propose that wsluc1–3 form a complex that crystallises inside the squid photophores, and that in the crystal one or more of the proteins catalyses the production of light using coelenterazine disulfate and ATP. These results suggest that ANL superfamily enzymes have independently evolved in distant species to produce light using unrelated substrates.
|
17
|
Rue-Albrecht K, McGettigan PA, Hernández B, Nalpas NC, Magee DA, Parnell AC, Gordon SV, MacHugh DE. GOexpress: an R/Bioconductor package for the identification and visualisation of robust gene ontology signatures through supervised learning of gene expression data. BMC Bioinformatics 2016; 17:126. [PMID: 26968614 PMCID: PMC4788925 DOI: 10.1186/s12859-016-0971-3] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Received: 05/01/2015] [Accepted: 02/25/2016] [Indexed: 02/06/2023]
Abstract
Background Identification of gene expression profiles that differentiate experimental groups is critical for discovery and analysis of key molecular pathways and also for selection of robust diagnostic or prognostic biomarkers. While integration of differential expression statistics has been used to refine gene set enrichment analyses, such approaches are typically limited to single gene lists resulting from simple two-group comparisons or time-series analyses. In contrast, functional class scoring and machine learning approaches provide powerful alternative methods to leverage molecular measurements for pathway analyses, and to compare continuous and multi-level categorical factors. Results We introduce GOexpress, a software package for scoring and summarising the capacity of gene ontology features to simultaneously classify samples from multiple experimental groups. GOexpress integrates normalised gene expression data (e.g., from microarray and RNA-seq experiments) and phenotypic information of individual samples with gene ontology annotations to derive a ranking of genes and gene ontology terms using a supervised learning approach. The default random forest algorithm allows interactions between all experimental factors, and competitive scoring of expressed genes to evaluate their relative importance in classifying predefined groups of samples. Conclusions GOexpress enables rapid identification and visualisation of ontology-related gene panels that robustly classify groups of samples and supports both categorical (e.g., infection status, treatment) and continuous (e.g., time-series, drug concentrations) experimental factors. The use of standard Bioconductor extension packages and publicly available gene ontology annotations facilitates straightforward integration of GOexpress within existing computational biology pipelines. 
Affiliation(s)
- Kévin Rue-Albrecht
  - Animal Genomics Laboratory, UCD School of Agriculture and Food Science, University College Dublin, Dublin 4, Ireland
  - Centre for Pharmacology and Therapeutics, Division of Experimental Medicine, Imperial College London, Hammersmith Hospital, London, W12 0NN, UK
- Paul A McGettigan
  - Animal Genomics Laboratory, UCD School of Agriculture and Food Science, University College Dublin, Dublin 4, Ireland
  - Novartis Pharmaceuticals, Elm Park Business Campus, Merrion Road, Dublin 4, Ireland
- Belinda Hernández
  - UCD School of Mathematics and Statistics, Insight Centre for Data Analytics, University College Dublin, Dublin 4, Ireland
  - UCD Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Dublin 4, Ireland
- Nicolas C Nalpas
  - Animal Genomics Laboratory, UCD School of Agriculture and Food Science, University College Dublin, Dublin 4, Ireland
  - Proteome Center Tübingen, Interfaculty Institute for Cell Biology, University of Tübingen, Auf der Morgenstelle 15, 72076, Tübingen, Germany
- David A Magee
  - Animal Genomics Laboratory, UCD School of Agriculture and Food Science, University College Dublin, Dublin 4, Ireland
- Andrew C Parnell
  - UCD School of Mathematics and Statistics, Insight Centre for Data Analytics, University College Dublin, Dublin 4, Ireland
- Stephen V Gordon
  - UCD School of Veterinary Medicine, University College Dublin, Dublin 4, Ireland
  - UCD Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Dublin 4, Ireland
- David E MacHugh
  - Animal Genomics Laboratory, UCD School of Agriculture and Food Science, University College Dublin, Dublin 4, Ireland
  - UCD Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Dublin 4, Ireland
|
18
|
Alonso R, Salavert F, Garcia-Garcia F, Carbonell-Caballero J, Bleda M, Garcia-Alonso L, Sanchis-Juan A, Perez-Gil D, Marin-Garcia P, Sanchez R, Cubuk C, Hidalgo MR, Amadoz A, Hernansaiz-Ballesteros RD, Alemán A, Tarraga J, Montaner D, Medina I, Dopazo J. Babelomics 5.0: functional interpretation for new generations of genomic data. Nucleic Acids Res 2015; 43:W117-21. [PMID: 25897133 PMCID: PMC4489263 DOI: 10.1093/nar/gkv384] [Citation(s) in RCA: 99] [Impact Index Per Article: 11.0] [Received: 01/31/2015] [Accepted: 04/11/2015] [Indexed: 02/02/2023]
Abstract
Babelomics has been running for more than a decade, offering a user-friendly interface for the functional analysis of gene expression and genomic data. Here we present its fifth release, which includes support for Next Generation Sequencing data, including gene expression (RNA-seq) and exome or genome resequencing. Babelomics has simplified its interface, which is now more intuitive. Improved visualization options, such as a genome viewer and an interactive network viewer, have been implemented. New technical enhancements on both the client and server sides make the user experience faster and more dynamic. Babelomics offers user-friendly access to a full range of methods that cover: (i) primary data analysis, (ii) a variety of tests for different experimental designs and (iii) different enrichment and network analysis algorithms for the interpretation of the results of such tests in the proper functional context. In addition to the public server, local copies of Babelomics can be downloaded and installed. Babelomics is freely available at: http://www.babelomics.org.
Affiliation(s)
- Roberto Alonso
  - Computational Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, 46012, Spain
  - Computational Genomics Chair, Bull-CIPF, Valencia, 46012, Spain
- Francisco Salavert
  - Computational Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, 46012, Spain
  - Bioinformatics of Rare Diseases (BIER), CIBER de Enfermedades Raras (CIBERER), Valencia, 46012, Spain
- Francisco Garcia-Garcia
  - Computational Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, 46012, Spain
- Jose Carbonell-Caballero
  - Computational Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, 46012, Spain
- Marta Bleda
  - Department of Medicine, University of Cambridge, School of Clinical Medicine, Addenbrooke's Hospital, Hills Road, Cambridge CB2 0QQ, UK
- Luz Garcia-Alonso
  - Computational Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, 46012, Spain
- Alba Sanchis-Juan
  - Fundación Investigación Clínico de Valencia-INCLIVA, Valencia, 46010, Spain
- Daniel Perez-Gil
  - Fundación Investigación Clínico de Valencia-INCLIVA, Valencia, 46010, Spain
- Pablo Marin-Garcia
  - Fundación Investigación Clínico de Valencia-INCLIVA, Valencia, 46010, Spain
- Ruben Sanchez
  - Computational Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, 46012, Spain
  - Functional Genomics Node, (INB) at CIPF, Valencia, 46012, Spain
- Cankut Cubuk
  - Computational Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, 46012, Spain
- Marta R Hidalgo
  - Computational Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, 46012, Spain
- Alicia Amadoz
  - Computational Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, 46012, Spain
- Alejandro Alemán
  - Computational Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, 46012, Spain
  - Bioinformatics of Rare Diseases (BIER), CIBER de Enfermedades Raras (CIBERER), Valencia, 46012, Spain
- Joaquin Tarraga
  - Computational Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, 46012, Spain
- David Montaner
  - Computational Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, 46012, Spain
- Ignacio Medina
  - HPC Services, University of Cambridge, Cambridge, CB3 0RB UK
- Joaquin Dopazo
  - Computational Genomics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, 46012, Spain
  - Computational Genomics Chair, Bull-CIPF, Valencia, 46012, Spain
  - Bioinformatics of Rare Diseases (BIER), CIBER de Enfermedades Raras (CIBERER), Valencia, 46012, Spain
  - Functional Genomics Node, (INB) at CIPF, Valencia, 46012, Spain
|
19
|
Rizza S, Conesa A, Juarez J, Catara A, Navarro L, Duran-Vila N, Ancillo G. Microarray analysis of Etrog citron (Citrus medica L.) reveals changes in chloroplast, cell wall, peroxidase and symporter activities in response to viroid infection. Molecular Plant Pathology 2012; 13:852-64. [PMID: 22420919 PMCID: PMC6638686 DOI: 10.1111/j.1364-3703.2012.00794.x] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Indexed: 05/25/2023]
Abstract
Viroids are small (246-401 nucleotides), single-stranded, circular RNA molecules that infect several crop plants and can cause diseases of economic importance. Citrus are the hosts in which the largest number of viroids have been identified. Citrus exocortis viroid (CEVd), the causal agent of citrus exocortis disease, induces considerable losses in citrus crops. Changes in the gene expression profile during the early (pre-symptomatic) and late (post-symptomatic) stages of Etrog citron infected with CEVd were investigated using a citrus cDNA microarray. MaSigPro analysis was performed and, on the basis of gene expression profiles as a function of the time after infection, the differentially expressed genes were classified into five clusters. FatiScan analysis revealed significant enrichment of functional categories for each cluster, indicating that viroid infection triggers important changes in chloroplast, cell wall, peroxidase and symporter activities.
Affiliation(s)
- Serena Rizza
  - Department of Phytosanitary Sciences and Technologies-University of Catania, Via S. Sofia 102, 95123 Catania, Italy
|
20
|
|
21
|
Wit EC, Bakewell DJG. Borrowing strength: a likelihood ratio test for related sparse signals. Bioinformatics 2012; 28:1980-9. [PMID: 22668791 DOI: 10.1093/bioinformatics/bts316] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/12/2022]
Abstract
MOTIVATION Cancer biology is a field where the complexity of the phenomena battles against the availability of data. Often only a few observations per signal source, i.e. genes, are available. Such scenarios are becoming increasingly relevant because modern sensing technologies generally have no trouble measuring many channels, while the number of subjects, such as patients or samples, remains limited. In statistics, this problem falls under the heading 'large p, small n'. Moreover, in such situations the use of asymptotic analytical results should generally be mistrusted. RESULTS We consider two cancer datasets, with the aim of mining the activity of functional groups of genes. We propose a hierarchical model with two layers in which the individual signals share a common variance component. A likelihood ratio test is defined for the difference between two collections of corresponding signals. The small number of observations requires careful consideration of the bias of the statistic, which is corrected through an explicit Bartlett correction. The test is validated on Monte Carlo simulations, which show improved detection of differences compared with other methods. In a leukaemia study and a cancerous fibroblast cell line, we find that the method also works better in practice, i.e. it gives a richer picture of the underlying biology. AVAILABILITY The MATLAB code is available from the authors or on http://www.math.rug.nl/stat/Software. CONTACT e.c.wit@rug.nl d.bakewell@liv.ac.uk.
Affiliation(s)
- Ernst C Wit
  - Johann Bernoulli Institute, University of Groningen, 9747 AG Groningen, The Netherlands
|
22
|
Ibrahim MAH, Jassim S, Cawthorne MA, Langlands K. A topology-based score for pathway enrichment. J Comput Biol 2012; 19:563-73. [PMID: 22468678 DOI: 10.1089/cmb.2011.0182] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.0] [Indexed: 01/06/2023]
Abstract
Investigators require intuitive tools to rationalize complex datasets generated by transcriptional profiling experiments. Pathway analysis methods, in which differentially expressed genes are mapped to databases of reference pathways to facilitate assessment of relative enrichment, lead investigators more effectively to biologically testable hypotheses. However, once a set of differentially expressed genes is isolated, pathway analysis approaches tend to ignore rich gene expression information and, moreover, do not exploit relationships between transcripts. In this article, we report the development of a new method in which both pathway topology and the magnitude of gene expression changes inform the scoring system, thereby providing a powerful filter in the enrichment of biologically relevant information. When four sample datasets were evaluated with this method, literature mining confirmed that those pathways germane to the physiological process under investigation were highlighted by our method relative to z-score overrepresentation calculations. Moreover, non-relevant processes were downgraded using the method described herein. The inclusion of expression and topological data in the calculation of a pathway regulation score (PRS) facilitated discrimination of key processes in real biological datasets. Specifically, by combining fold-change data for those transcripts exceeding a significance threshold, and by taking into account the potential for altered gene expression to impact upon downstream transcription, one may readily identify those pathways most relevant to pathophysiological processes.
|
23
|
Kvist J, Wheat CW, Kallioniemi E, Saastamoinen M, Hanski I, Frilander MJ. Temperature treatments during larval development reveal extensive heritable and plastic variation in gene expression and life history traits. Mol Ecol 2012; 22:602-19. [DOI: 10.1111/j.1365-294x.2012.05521.x] [Citation(s) in RCA: 50] [Impact Index Per Article: 4.2] [Indexed: 11/30/2022]
|
24
|
Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol 2012; 8:e1002375. [PMID: 22383865 PMCID: PMC3285573 DOI: 10.1371/journal.pcbi.1002375] [Citation(s) in RCA: 1005] [Impact Index Per Article: 83.8] [Indexed: 12/01/2022]
Abstract
Pathway analysis has become the first choice for gaining insight into the underlying biology of differentially expressed genes and proteins, as it reduces complexity and has increased explanatory power. We discuss the evolution of knowledge base–driven pathway analysis over its first decade, distinctly divided into three generations. We also discuss the limitations that are specific to each generation, and how they are addressed by successive generations of methods. We identify a number of annotation challenges that must be addressed to enable development of the next generation of pathway analysis methods. Furthermore, we identify a number of methodological challenges that the next generation of methods must tackle to take advantage of the technological advances in genomics and proteomics in order to improve specificity, sensitivity, and relevance of pathway analysis.
Affiliation(s)
- Purvesh Khatri
  - Division of Systems Medicine, Department of Pediatrics, Stanford University School of Medicine, Stanford, California, United States of America
  - Lucile Packard Children's Hospital, Palo Alto, California, United States of America
- Marina Sirota
  - Division of Systems Medicine, Department of Pediatrics, Stanford University School of Medicine, Stanford, California, United States of America
  - Lucile Packard Children's Hospital, Palo Alto, California, United States of America
- Atul J. Butte
  - Division of Systems Medicine, Department of Pediatrics, Stanford University School of Medicine, Stanford, California, United States of America
  - Lucile Packard Children's Hospital, Palo Alto, California, United States of America
|
25
|
Ji RR, Ott KH, Yordanova R, Bruccoleri RE. FDR-FET: an optimizing gene set enrichment analysis method. Adv Appl Bioinform Chem 2011; 4:37-42. [PMID: 21918636 PMCID: PMC3169954 DOI: 10.2147/aabc.s15840] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/23/2022]
Abstract
Gene set enrichment analysis of large profiling and screening experiments can reveal unifying biological schemes based on previously accumulated knowledge represented as “gene sets”. Most existing implementations use a fixed fold-change or P value cutoff to generate regulated gene lists. However, the threshold selection is in most cases arbitrary and has a significant effect on the test outcome and the interpretation of the experiment. We developed a new gene set enrichment analysis method, FDR-FET, which dynamically optimizes the threshold choice and improves the sensitivity and selectivity of gene set enrichment analysis. The procedure translates experimental results into a series of regulated gene lists at multiple false discovery rate (FDR) cutoffs, and computes the P value of the overrepresentation of a gene set using a Fisher’s exact test (FET) in each of these gene lists. The lowest P value is retained to represent the significance of the gene set. We also implemented improved methods to define a more relevant global reference set for the FET. We demonstrate the validity of the method using a published microarray study of three protease inhibitors of the human immunodeficiency virus and compare the results with those from other popular gene set enrichment analysis algorithms. Our results show that combining FDR with multiple cutoffs allows us to control the error while retaining genes that increase information content. We conclude that FDR-FET can selectively identify significantly affected biological processes. Our method can be used for any user-generated gene list in transcriptomic, proteomic, and other biological and scientific applications.
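The procedure described in this abstract (regulated lists at several FDR cutoffs, a Fisher's exact test per list, lowest P value retained) can be sketched as follows. This is an illustration under stated assumptions, not the published FDR-FET code: the function names and cutoff values are hypothetical, Benjamini-Hochberg adjustment is assumed as the FDR procedure, and the reference set is simply all tested genes rather than the refined global reference set the authors describe.

```python
from math import comb


def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def fisher_right_tail(k, n_set, n_list, n_total):
    """One-sided Fisher's exact test: P(overlap >= k) under the
    hypergeometric null for a set of n_set genes and a regulated list
    of n_list genes, both drawn from n_total genes."""
    upper = min(n_set, n_list)
    total = comb(n_total, n_list)
    tail = sum(comb(n_set, x) * comb(n_total - n_set, n_list - x)
               for x in range(k, upper + 1))
    return tail / total


def fdr_fet(gene_p_values, gene_set, cutoffs=(0.01, 0.05, 0.1, 0.2)):
    """Lowest Fisher P value over regulated lists at several FDR cutoffs.

    ASSUMPTION: cutoffs are illustrative defaults, not the paper's.
    """
    genes = list(gene_p_values)
    q = dict(zip(genes, bh_adjust([gene_p_values[g] for g in genes])))
    in_set = set(gene_set) & set(genes)
    best = 1.0
    for cutoff in cutoffs:
        regulated = {g for g in genes if q[g] <= cutoff}
        if not regulated:
            continue  # no regulated genes at this cutoff
        k = len(regulated & in_set)
        best = min(best, fisher_right_tail(k, len(in_set),
                                           len(regulated), len(genes)))
    return best
```

Because the minimum is taken over several cutoffs, the retained value is best read as a ranking score for gene sets; the sketch applies no further correction for having tried multiple thresholds.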
Affiliation(s)
- Rui-Ru Ji
  - Applied Genomics, Research and Development, Bristol-Myers Squibb, Pennington, NJ, USA
|
26
|
Gallego-Bartolomé J, Alabadí D, Blázquez MA. DELLA-induced early transcriptional changes during etiolated development in Arabidopsis thaliana. PLoS One 2011; 6:e23918. [PMID: 21904598 PMCID: PMC3164146 DOI: 10.1371/journal.pone.0023918] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.5] [Received: 06/16/2011] [Accepted: 08/01/2011] [Indexed: 11/24/2022]
Abstract
The hormones gibberellins (GAs) control a wide variety of processes in plants, including stress and developmental responses. This task largely relies on the activity of the DELLA proteins, nuclear-localized transcriptional regulators that do not seem to have DNA binding capacity. The identification of early target genes of DELLA action is key not only to understand how GAs regulate physiological responses, but also to get clues about the molecular mechanisms by which DELLAs regulate gene expression. Here, we have investigated the global, early transcriptional response triggered by the Arabidopsis DELLA protein GAI during skotomorphogenesis, a developmental program tightly regulated by GAs. Our results show that the induction of GAI activity has an almost immediate effect on gene expression. Although this transcriptional regulation is largely mediated by the PIFs and HY5 transcription factors based on target meta-analysis, additional evidence points to other transcription factors that would be directly involved in DELLA regulation of gene expression. First, we have identified cis elements recognized by Dofs and type-B ARRs among the sequences enriched in the promoters of GAI targets; and second, an enrichment in additional cis elements appeared when this analysis was extended to a dataset of early targets of the DELLA protein RGA: CArG boxes, bound by MADS-box proteins, and the E-box CACATG that links the activity of DELLAs to circadian transcriptional regulation. Finally, Gene Ontology analysis highlights the impact of DELLA regulation upon the homeostasis of the GA, auxin, and ethylene pathways, as well as upon pre-existing transcriptional networks.
Affiliation(s)
- Javier Gallego-Bartolomé
- Instituto de Biología Molecular y Celular de Plantas (CSIC-Universidad Politécnica de Valencia), Valencia, Spain
27
Natural selection on functional modules, a genome-wide analysis. PLoS Comput Biol 2011; 7:e1001093. [PMID: 21390268 PMCID: PMC3048381 DOI: 10.1371/journal.pcbi.1001093] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 04/25/2010] [Accepted: 01/27/2011] [Indexed: 12/24/2022] Open
Abstract
Classically, the functional consequences of natural selection over genomes have been analyzed as the compound effects of individual genes. The current paradigm for large-scale analysis of adaptation is based on the observed significant deviations of rates of individual genes from neutral evolutionary expectation. This approach, which assumed independence among genes, has not been able to identify biological functions significantly enriched in positively selected genes in individual species. Alternatively, pooling related species has enhanced the search for signatures of selection. However, grouping signatures does not allow testing for adaptive differences between species. Here we introduce the Gene-Set Selection Analysis (GSSA), a new genome-wide approach to test for evidences of natural selection on functional modules. GSSA is able to detect lineage specific evolutionary rate changes in a notable number of functional modules. For example, in nine mammal and Drosophilae genomes GSSA identifies hundreds of functional modules with significant associations to high and low rates of evolution. Many of the detected functional modules with high evolutionary rates have been previously identified as biological functions under positive selection. Notably, GSSA identifies conserved functional modules with many positively selected genes, which questions whether they are exclusively selected for fitting genomes to environmental changes. Our results agree with previous studies suggesting that adaptation requires positive selection, but not every mutation under positive selection contributes to the adaptive dynamical process of the evolution of species.
28
Functional analysis: evaluation of response intensities--tailoring ANOVA for lists of expression subsets. BMC Bioinformatics 2010; 11:510. [PMID: 20942918 PMCID: PMC2964684 DOI: 10.1186/1471-2105-11-510] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 03/30/2010] [Accepted: 10/13/2010] [Indexed: 02/06/2023] Open
Abstract
Background Microarray data is frequently used to characterize the expression profile of a whole genome and to compare the characteristics of that genome under several conditions. Geneset analysis methods have been described previously to analyze the expression values of several genes related by known biological criteria (metabolic pathway, pathology signature, co-regulation by a common factor, etc.) at the same time and the cost of these methods allows for the use of more values to help discover the underlying biological mechanisms. Results As several methods assume different null hypotheses, we propose to reformulate the main question that biologists seek to answer. To determine which genesets are associated with expression values that differ between two experiments, we focused on three ad hoc criteria: expression levels, the direction of individual gene expression changes (up or down regulation), and correlations between genes. We introduce the FAERI methodology, tailored from a two-way ANOVA to examine these criteria. The significance of the results was evaluated according to the self-contained null hypothesis, using label sampling or by inferring the null distribution from normally distributed random data. Evaluations performed on simulated data revealed that FAERI outperforms currently available methods for each type of set tested. We then applied the FAERI method to analyze three real-world datasets on hypoxia response. FAERI was able to detect more genesets than other methodologies, and the genesets selected were coherent with current knowledge of cellular response to hypoxia. Moreover, the genesets selected by FAERI were confirmed when the analysis was repeated on two additional related datasets. Conclusions The expression values of genesets are associated with several biological effects. The underlying mathematical structure of the genesets allows for analysis of data from several genes at the same time. 
Focusing on expression levels, the direction of the expression changes, and correlations, we showed that two-step data reduction allowed us to significantly improve the performance of geneset analysis using a modified two-way ANOVA procedure, and to detect genesets that current methods fail to detect.
29
Minguez P, Dopazo J. Functional genomics and networks: new approaches in the extraction of complex gene modules. Expert Rev Proteomics 2010; 7:55-63. [PMID: 20121476 DOI: 10.1586/epr.09.103] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Indexed: 12/14/2022]
Abstract
The engine that makes the cell work is made of an intricate network of molecular interactions. Nowadays, the elements and relationships of this complex network can be studied with several types of high-throughput techniques. The dream of having a global picture of the cell from different perspectives that can jointly explain cell behavior is, at least technically, feasible. However, this task can only be accomplished by filling the gap between data and information. The availability of methods capable of accurately managing, integrating and analyzing the results from these experiments is crucial for this purpose. Here, we review the new challenges raised by the availability of different genomic data, as well as the new proposals presented to cope with the increasing data complexity. Special emphasis is given to approaches that explore the transcriptome trying to describe the modules of genes that account for the traits studied.
Affiliation(s)
- Pablo Minguez
- Department of Bioinformatics and Genomics, Centro de Investigación Príncipe Felipe, Valencia, Spain
30
Montaner D, Dopazo J. Multidimensional gene set analysis of genomic data. PLoS One 2010; 5:e10348. [PMID: 20436964 PMCID: PMC2860497 DOI: 10.1371/journal.pone.0010348] [Citation(s) in RCA: 60] [Impact Index Per Article: 4.3] [Received: 12/03/2009] [Accepted: 03/30/2010] [Indexed: 11/27/2022] Open
Abstract
Understanding the functional implications of changes in gene expression, mutations, etc., is the aim of most genomic experiments. To achieve this, several functional profiling methods have been proposed. Such methods study the behaviour of different gene modules (e.g. gene ontology terms) in response to one particular variable (e.g. differential gene expression). In spite of the wealth of information provided by functional profiling methods, a limitation common to all of them is their inherently unidimensional nature. To overcome this restriction we present a multidimensional logistic model that allows studying the relationship of gene modules with different genome-scale measurements (e.g. differential expression, genotyping association, methylation, copy number alterations, heterozygosity, etc.) simultaneously. Moreover, the relationship of such functional modules with the interactions among the variables can also be studied, which produces novel results that cannot be derived from conventional unidimensional functional profiling methods. We report sound gene set associations that remained undetected by conventional one-dimensional gene set analysis in several examples. Our findings demonstrate the potential of the proposed approach for the discovery of new cell functionalities with complex dependences on more than one variable.
Affiliation(s)
- David Montaner
- Department of Bioinformatics and Genomics, Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain
- Functional Genomics Node (INB), Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain
- Joaquín Dopazo
- Department of Bioinformatics and Genomics, Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain
- Functional Genomics Node (INB), Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain
- CIBER de Enfermedades Raras (CIBERER), Valencia, Spain
31
Abstract
The introduction of new high-throughput methodologies such as DNA microarrays constitutes a major breakthrough in cancer research. The unprecedented amount of data produced by such technologies has opened new avenues for interrogating living systems although, at the same time, it has demanded the development of new data analytical methods as well as new strategies for testing hypotheses. A history of early successful applications in cancer boosted the use of microarrays and fostered further applications in other fields. Keeping pace with these technologies, bioinformatics offers new solutions for data analysis and, more importantly, permits the formulation of a new class of hypotheses inspired by systems biology, more oriented to pathways or, in general, to modules of functionally related genes. Although these analytical methodologies are new, some options are already available and are discussed in this chapter.
Affiliation(s)
- Joaquín Dopazo
- Bioinformatics Department, Centro de Investigación Príncipe Felipe, Valencia, Spain
32
Stavang JA, Gallego-Bartolomé J, Gómez MD, Yoshida S, Asami T, Olsen JE, García-Martínez JL, Alabadí D, Blázquez MA. Hormonal regulation of temperature-induced growth in Arabidopsis. THE PLANT JOURNAL : FOR CELL AND MOLECULAR BIOLOGY 2009; 60:589-601. [PMID: 19686536 DOI: 10.1111/j.1365-313x.2009.03983.x] [Citation(s) in RCA: 192] [Impact Index Per Article: 12.8] [Indexed: 05/03/2023]
Abstract
Successful plant survival depends upon the proper integration of information from the environment with endogenous cues to regulate growth and development. We have investigated the interplay between ambient temperature and hormone action during the regulation of hypocotyl elongation, and we have found that gibberellins (GAs) and auxin are quickly and independently recruited by temperature to modulate growth rate, whereas activity of brassinosteroids (BRs) seems to be required later on. Impairment of GA biosynthesis blocked the increased elongation caused at higher temperatures, but hypocotyls of pentuple DELLA knockout mutants still reduced their response to higher temperatures when BR synthesis or auxin polar transport were blocked. The expression of several key genes involved in the biosynthesis of GAs and auxin was regulated by temperature, which indirectly resulted in coherent variations in the levels of accumulation of nuclear GFP-RGA (repressor of GA1) and in the activity of the DR5 reporter. DNA microarray and genetic analyses allowed the identification of the transcription factor PIF4 (phytochrome-interacting factor 4) as a major target in the promotion of growth at higher temperature. These results suggest that temperature regulates hypocotyl growth by individually impinging on several elements of a pre-existing network of signaling pathways involving auxin, BRs, GAs, and PIF4.
Affiliation(s)
- Jon A Stavang
- Department of Plant and Environmental Sciences, Norwegian University of Life Sciences, N1432 As, Norway
33
Bartholomé K, Kreutz C, Timmer J. Estimation of gene induction enables a relevance-based ranking of gene sets. J Comput Biol 2009; 16:959-67. [PMID: 19580524 DOI: 10.1089/cmb.2008.0226] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 01/25/2023] Open
Abstract
In order to handle and interpret the vast amounts of data produced by microarray experiments, the analysis of sets of genes with a common biological functionality has been shown to be advantageous compared to single gene analyses. Some statistical methods have been proposed to analyse the differential gene expression of gene sets in microarray experiments. However, most of these methods either require threshold values to be chosen for the analysis, or they need some reference set for the determination of significance. We present a method that estimates the number of differentially expressed genes in a gene set without requiring a threshold value for significance of genes. The method is self-contained (i.e., it does not require a reference set for comparison). In contrast to other methods which are focused on significance, our approach emphasizes the relevance of the regulation of gene sets. The presented method measures the degree of regulation of a gene set and is a useful tool to compare the induction of different gene sets and place the results of microarray experiments into the biological context. An R-package is available.
34
Zhang L, Hammell M, Kudlow BA, Ambros V, Han M. Systematic analysis of dynamic miRNA-target interactions during C. elegans development. Development 2009; 136:3043-55. [PMID: 19675127 PMCID: PMC2730362 DOI: 10.1242/dev.039008] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.7] [Accepted: 06/24/2009] [Indexed: 11/20/2022]
Abstract
Although microRNA (miRNA)-mediated functions have been implicated in many aspects of animal development, the majority of miRNA::mRNA regulatory interactions remain to be characterized experimentally. We used an AIN/GW182 protein immunoprecipitation approach to systematically analyze miRNA::mRNA interactions during C. elegans development. We characterized the composition of miRNAs in functional miRNA-induced silencing complexes (miRISCs) at each developmental stage and identified three sets of miRNAs with distinct stage-specificity of function. We then identified thousands of miRNA targets in each developmental stage, including a significant portion that is subject to differential miRNA regulation during development. By identifying thousands of miRNA family-mRNA pairs with temporally correlated patterns of AIN-2 association, we gained valuable information on the principles of physiological miRNA::target recognition and predicted 1589 high-confidence miRNA family::mRNA interactions. Our data support the idea that miRNAs preferentially target genes involved in signaling processes and avoid genes with housekeeping functions, and that miRNAs orchestrate temporal developmental programs by coordinately targeting or avoiding genes involved in particular biological functions.
Affiliation(s)
- Liang Zhang
- Howard Hughes Medical Institute and Department of MCDB, University of Colorado, Boulder, CO 80309, USA
35
Moreno-Manzano V, Rodríguez-Jiménez FJ, García-Roselló M, Laínez S, Erceg S, Calvo MT, Ronaghi M, Lloret M, Planells-Cases R, Sánchez-Puelles JM, Stojkovic M. Activated spinal cord ependymal stem cells rescue neurological function. Stem Cells 2009; 27:733-43. [PMID: 19259940 DOI: 10.1002/stem.24] [Citation(s) in RCA: 115] [Impact Index Per Article: 7.7] [Indexed: 12/16/2022]
Abstract
Spinal cord injury (SCI) is a major cause of paralysis. Currently, there are no effective therapies to reverse this disabling condition. The presence of ependymal stem/progenitor cells (epSPCs) in the adult spinal cord suggests that endogenous stem cell-associated mechanisms might be exploited to repair spinal cord lesions. epSPC cells that proliferate after SCI are recruited by the injured zone, and can be modulated by innate and adaptive immune responses. Here we demonstrate that when epSPCs are cultured from rats with a SCI (ependymal stem/progenitor cells injury [epSPCi]), these cells proliferate 10 times faster in vitro than epSPC derived from control animals and display enhanced self renewal. Genetic profile analysis revealed an important influence of inflammation on signaling pathways in epSPCi after injury, including the upregulation of Jak/Stat and mitogen activated protein kinase pathways. Although neurospheres derived from either epSPCs or epSPCi differentiated efficiently to oligodendrocytes and functional spinal motoneurons, a better yield of differentiated cells was consistently obtained from epSPCi cultures. Acute transplantation of undifferentiated epSPCi or the resulting oligodendrocyte precursor cells into a rat model of severe spinal cord contusion produced a significant recovery of motor activity 1 week after injury. These transplanted cells migrated long distances from the rostral and caudal regions of the transplant to the neurofilament-labeled axons in and around the lesion zone. Our findings demonstrate that modulation of endogenous epSPCs represents a viable cell-based strategy for restoring neuronal dysfunction in patients with spinal cord damage.
36
Guan P, Huang D, He M, Zhou B. Lung cancer gene expression database analysis incorporating prior knowledge with support vector machine-based classification method. JOURNAL OF EXPERIMENTAL & CLINICAL CANCER RESEARCH : CR 2009; 28:103. [PMID: 19615083 PMCID: PMC2719616 DOI: 10.1186/1756-9966-28-103] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.7] [Received: 06/03/2009] [Accepted: 07/18/2009] [Indexed: 01/13/2023]
Abstract
Background A reliable and precise classification is essential for successful diagnosis and treatment of cancer. Gene expression microarrays have provided a high-throughput platform to discover genomic biomarkers for cancer diagnosis and prognosis. Rational use of the available bioinformation can not only effectively remove or suppress noise in gene chips, but also avoid the one-sided results of separate experiments. However, only some studies have been aware of the importance of prior information in cancer classification. Methods Using a support vector machine as the discriminant approach, we proposed a modified method that incorporates prior knowledge into cancer classification based on gene expression data to improve accuracy. A well-known public dataset, the malignant pleural mesothelioma and lung adenocarcinoma gene expression database, was used in this study. Prior knowledge is viewed here as a means of directing the classifier using known lung adenocarcinoma related genes. The procedures were performed with the software R 2.80. Results The modified method performed better after incorporating prior knowledge. Accuracy of the modified method improved from 98.86% to 100% in the training set and from 98.51% to 99.06% in the test set. The standard deviations of the modified method decreased from 0.26% to 0 in the training set and from 3.04% to 2.10% in the test set. Conclusion The method that incorporates prior knowledge into discriminant analysis could effectively improve the capacity and reduce the impact of noise. This idea may have a good future not only in practice but also in methodology.
Affiliation(s)
- Peng Guan
- Department of Epidemiology, School of Public Health, China Medical University, Shenyang 110001, PR China.
37
Nueda MJ, Sebastián P, Tarazona S, García-García F, Dopazo J, Ferrer A, Conesa A. Functional assessment of time course microarray data. BMC Bioinformatics 2009; 10 Suppl 6:S9. [PMID: 19534758 PMCID: PMC2697656 DOI: 10.1186/1471-2105-10-s6-s9] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.3] [Indexed: 11/27/2022] Open
Abstract
Motivation Time-course microarray experiments study the progress of gene expression along time across one or several experimental conditions. Most developed analysis methods focus on the clustering or the differential expression analysis of genes and do not integrate functional information. The assessment of the functional aspects of time-course transcriptomics data requires the use of approaches that exploit the activation dynamics of the functional categories to which genes are annotated. Methods We present three novel methodologies for the functional assessment of time-course microarray data. i) maSigFun derives from the maSigPro method, a regression-based strategy to model time-dependent expression patterns and identify genes with differences across series. maSigFun fits a regression model for groups of genes labeled by a functional class and selects those categories which have a significant model. ii) PCA-maSigFun fits a PCA model of each functional class-defined expression matrix to extract orthogonal patterns of expression change, which are then assessed for their fit to a time-dependent regression model. iii) ASCA-functional uses the ASCA model to rank genes according to their correlation to principal time expression patterns and assesses functional enrichment in a GSA fashion. We used simulated and experimental datasets to study these novel approaches. Results were compared to alternative methodologies. Results Synthetic and experimental data showed that the different methods are able to capture different aspects of the relationship between genes, functions and co-expression that are biologically meaningful. The methods should not be considered as competing; rather, they provide different insights into the molecular and functional dynamic events taking place within the biological system under study.
Affiliation(s)
- María José Nueda
- Department of Statistics and Operation Research, University of Alicante, Ctra, San Vicente del Raspeig, S/N 03690 Alicante, Spain.
38
Medina I, Montaner D, Bonifaci N, Pujana MA, Carbonell J, Tarraga J, Al-Shahrour F, Dopazo J. Gene set-based analysis of polymorphisms: finding pathways or biological processes associated to traits in genome-wide association studies. Nucleic Acids Res 2009; 37:W340-4. [PMID: 19502494 PMCID: PMC2703970 DOI: 10.1093/nar/gkp481] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.7] [Indexed: 01/29/2023] Open
Abstract
Genome-wide association studies have become a popular strategy to find associations of genes to traits of interest. Despite the high resolution available today to carry out genotyping studies, the success of their application in real studies has been limited by the testing strategy used. As an alternative to brute force solutions involving the use of very large cohorts, we propose the use of Gene Set Analysis (GSA), a different analysis strategy based on testing the association of modules of functionally related genes. We show here how the Gene Set-based Analysis of Polymorphisms (GeSBAP), which is a simple implementation of the GSA strategy for the analysis of genome-wide association studies, provides a significant increase in testing power for this type of study. GeSBAP is freely available at http://bioinfo.cipf.es/gesbap/
Affiliation(s)
- Ignacio Medina
- Department of Bioinformatics and Genomics, CIPF, Valencia, Spain
39
Montaner D, Minguez P, Al-Shahrour F, Dopazo J. Gene set internal coherence in the context of functional profiling. BMC Genomics 2009; 10:197. [PMID: 19397819 PMCID: PMC2680416 DOI: 10.1186/1471-2164-10-197] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.9] [Received: 09/10/2008] [Accepted: 04/27/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Functional profiling methods have been extensively used in the context of high-throughput experiments and, in particular, in microarray data analysis. Such methods use available biological information to define different types of functional gene modules (e.g. gene ontology -GO-, KEGG pathways, etc.) whose representation in a pre-defined list of genes is further studied. In the most popular type of microarray experimental designs (e.g. up- or down-regulated genes, clusters of co-expressing genes, etc.) or in other genomic experiments (e.g. Chip-on-chip, epigenomics, etc.) these lists are composed of genes with a high degree of co-expression. Therefore, an implicit assumption in the application of functional profiling methods within this context is that the genes corresponding to the modules tested are effectively defining sets of co-expressing genes. Nevertheless, not all functional modules are biologically coherent entities in terms of co-expression, which will eventually hinder their detection with conventional methods of functional enrichment. RESULTS Using a large collection of microarray data we have carried out a detailed survey of internal correlation in GO terms and KEGG pathways, providing a coherence index to be used for measuring functional module co-regulation. An unexpectedly low level of internal correlation was found among the modules studied. Only around 30% of the modules defined by GO terms and 57% of the modules defined by KEGG pathways display an internal correlation higher than that expected by chance. This information on the internal correlation of the genes within the functional modules can be used in the context of a logistic regression model in a simple way to improve their detection in gene expression experiments. CONCLUSION For the first time, an exhaustive study on the internal co-expression of the most popular functional categories has been carried out. Interestingly, the real level of co-expression within many of them is lower than expected (or even nonexistent), which will preclude their detection by means of most conventional functional profiling methods. If the gene-to-function correlation information is used in functional profiling methods, the results improve on those obtained by conventional enrichment methods.
Affiliation(s)
- David Montaner
- Department of Bioinformatics and Genomics, Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain.
40
Jantus Lewintre E, Reinoso Martín C, Montaner D, Marín M, José Terol M, Farrás R, Benet I, Calvete JJ, Dopazo J, García-Conde J. Analysis of chronic lymphotic leukemia transcriptomic profile: differences between molecular subgroups. Leuk Lymphoma 2009; 50:68-79. [PMID: 19127482 DOI: 10.1080/10428190802541807] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Indexed: 10/21/2022]
Abstract
B cell chronic lymphocytic leukemia (CLL) is a lymphoproliferative disorder with a variable clinical course. Patients with unmutated IgV(H) gene show a shorter progression-free and overall survival than patients with immunoglobulin heavy chain variable regions (IgV(H)) gene mutated. In addition, BCL6 mutations identify a subgroup of patients with high risk of progression. Gene expression was analysed in 36 early-stage patients using high-density microarrays. Around 150 genes differentially expressed were found according to IgV(H) mutations, whereas no difference was found according to BCL6 mutations. Functional profiling methods allowed us to distinguish KEGG and gene ontology terms showing coordinated gene expression changes across subgroups of CLL. We validated a set of differentially expressed genes according to IgV(H) status, scoring them as putative prognostic markers in CLL. Among them, CRY1, LPL, CD82 and DUSP22 are the ones with at least equal or superior performance to ZAP70 which is actually the most used surrogate marker of IgV(H) status.
41
Zhu M, Yu M, Zhao S. Understanding quantitative genetics in the systems biology era. Int J Biol Sci 2009; 5:161-70. [PMID: 19173038 PMCID: PMC2631226 DOI: 10.7150/ijbs.5.161] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 12/03/2008] [Accepted: 01/21/2009] [Indexed: 01/06/2023] Open
Abstract
Biology is now entering the new era of systems biology and is exerting a growing influence on the future development of various disciplines within the life sciences. In the early classical and molecular periods of biology, the theoretical frameworks of classical and molecular quantitative genetics were systematically established, respectively. With the advent of systems biology, a paradigm shift is occurring in the field of quantitative genetics. Where and how will quantitative genetics develop after having undergone its classical and molecular periods? This is a difficult question to answer exactly. In this perspective article, we discuss the possible development of quantitative genetics in the systems biology era, in which there is high potential for it to develop towards "systems quantitative genetics". In our opinion, systems quantitative genetics can be defined as a new discipline that addresses the generalized genetic laws of bioalleles controlling the heritable phenotypes of complex traits following a new dynamic network model. Other issues from a quantitative genetic perspective relating to genetical genomics, updates of the network model, and future research prospects are also discussed.
Affiliation(s)
- Shuhong Zhao
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, P. R. China
|
42
|
Hamid JS, Hu P, Roslin NM, Ling V, Greenwood CMT, Beyene J. Data integration in genetics and genomics: methods and challenges. Hum Genomics Proteomics 2009; 2009. [PMID: 20948564 PMCID: PMC2950414 DOI: 10.4061/2009/869093] [Citation(s) in RCA: 87] [Impact Index Per Article: 5.8] [Received: 09/25/2008] [Accepted: 12/01/2008] [Indexed: 01/18/2023]
Abstract
Due to rapid technological advances, various types of genomic and proteomic data with different sizes, formats, and structures have become available. Among them are gene expression, single nucleotide polymorphism, copy number variation, and protein-protein/gene-gene interactions. Each of these distinct data types provides a different, partly independent and complementary, view of the whole genome. However, understanding functions of genes, proteins, and other aspects of the genome requires more information than is provided by any one of the datasets. Integrating data from different sources is, therefore, an important part of current research in genomics and proteomics. Data integration also plays important roles in combining clinical, environmental, and demographic data with high-throughput genomic data. Nevertheless, the concept of data integration is not well defined in the literature and it may mean different things to different researchers. In this paper, we first propose a conceptual framework for integrating genetic, genomic, and proteomic data. The framework captures fundamental aspects of data integration and is developed around the key steps in genetic, genomic, and proteomic data fusion. Secondly, we provide a review of some of the most commonly used current methods and approaches for combining genomic data, with a focus on the statistical aspects.
Affiliation(s)
- Jemila S Hamid
- Biostatistics Methodology Unit, The Hospital for Sick Children Research Institute, 555 University Avenue, Toronto, ON, Canada M5G 1X8
|
43
|
Bonifaci N, Berenguer A, Díez J, Reina O, Medina I, Dopazo J, Moreno V, Pujana MA. Biological processes, properties and molecular wiring diagrams of candidate low-penetrance breast cancer susceptibility genes. BMC Med Genomics 2008; 1:62. [PMID: 19094230 PMCID: PMC2628924 DOI: 10.1186/1755-8794-1-62] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 06/20/2008] [Accepted: 12/18/2008] [Indexed: 12/24/2022] Open
Abstract
Background Recent advances in whole-genome association studies (WGASs) for human cancer risk are beginning to provide the parts lists of low-penetrance susceptibility genes. However, statistical analysis in these studies is complicated by the vast number of genetic variants examined and the weak effects observed, as a result of which constraints must be incorporated into the study design and analytical approach. In this scenario, biological attributes beyond the adjusted statistics generally receive little attention and, more importantly, the fundamental biological characteristics of low-penetrance susceptibility genes have yet to be determined. Methods We applied an integrative approach for identifying candidate low-penetrance breast cancer susceptibility genes, their characteristics and molecular networks through the analysis of diverse sources of biological evidence. Results First, examination of the distribution of Gene Ontology terms in ordered WGAS results identified an asymmetrical distribution of Cell Communication and Cell Death processes linked to risk. Second, analysis of 11 different types of molecular or functional relationships in genomic and proteomic data sets defined the "omic" properties of candidate genes: (i) differential expression in tumors relative to normal tissue; (ii) somatic genomic copy number changes correlating with gene expression levels; (iii) differential expression across age at diagnosis; and (iv) expression changes after BRCA1 perturbation. Finally, network modeling of the effects of variants on germline gene expression showed higher connectivity than expected by chance between novel candidates and with known susceptibility genes, which supports functional relationships and provides mechanistic hypotheses of risk. Conclusion This study proposes that cell communication and cell death are major biological processes perturbed in risk of breast cancer conferred by low-penetrance variants, and defines the common omic properties, molecular interactions and possible functional effects of candidate genes and proteins.
Affiliation(s)
- Núria Bonifaci
- Bioinformatics and Biostatistics Unit, and Translational Research Laboratory, Catalan Institute of Oncology, Bellvitge Biomedical Research Institute (IDIBELL), L'Hospitalet, Barcelona, Spain.
|
44
|
Tintle NL, Best AA, DeJongh M, Van Bruggen D, Heffron F, Porwollik S, Taylor RC. Gene set analyses for interpreting microarray experiments on prokaryotic organisms. BMC Bioinformatics 2008; 9:469. [PMID: 18986519 PMCID: PMC2587482 DOI: 10.1186/1471-2105-9-469] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 05/23/2008] [Accepted: 11/05/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Despite the widespread usage of DNA microarrays, questions remain about how best to interpret the wealth of gene-by-gene transcriptional levels that they measure. Recently, methods have been proposed which use biologically defined sets of genes in interpretation, instead of examining results gene-by-gene. Despite a serious limitation, a method based on Fisher's exact test remains one of the few plausible options for gene set analysis when an experiment has few replicates, as is typically the case for prokaryotes. RESULTS We extend five methods of gene set analysis from use on experiments with multiple replicates to use on experiments with few replicates. We then use simulated and real data to compare these methods with each other and with the Fisher's exact test (FET) method. As a result of the simulation we find that a method named MAXMEAN-NR maintains the nominal rate of false positive findings (type I error rate) while offering good statistical power and robustness to a variety of gene set distributions for set sizes of at least 10. Other methods (ABSSUM-NR or SUM-NR) are shown to be powerful for set sizes less than 10. Analysis of three sets of experimental data shows similar results. Furthermore, the MAXMEAN-NR method is shown to be able to detect biologically relevant sets as significant, when other methods (including FET) cannot. We also find that the popular GSEA-NR method performs poorly when compared to MAXMEAN-NR. CONCLUSION MAXMEAN-NR is a method of gene set analysis for experiments with few replicates, as is common for prokaryotes. Results of simulation and real data analysis suggest that the MAXMEAN-NR method offers increased robustness and biological relevance of findings as compared to FET and other methods, while maintaining the nominal type I error rate.
Affiliation(s)
- Nathan L Tintle
- Department of Mathematics, Hope College, Holland, Michigan, USA.
|
45
|
Larsson O, Diebold D, Fan D, Peterson M, Nho RS, Bitterman PB, Henke CA. Fibrotic myofibroblasts manifest genome-wide derangements of translational control. PLoS One 2008; 3:e3220. [PMID: 18795102 PMCID: PMC2528966 DOI: 10.1371/journal.pone.0003220] [Citation(s) in RCA: 80] [Impact Index Per Article: 5.0] [Received: 08/15/2008] [Accepted: 08/20/2008] [Indexed: 11/19/2022] Open
Abstract
Background As a group, fibroproliferative disorders of the lung, liver, kidney, heart, vasculature and integument are common, progressive and refractory to therapy. They can emerge following toxic insults, but are frequently idiopathic. Their enigmatic propensity to resist therapy and progress to organ failure has focused attention on the myofibroblast, the primary effector of the fibroproliferative response. We have recently shown that aberrant beta 1 integrin signaling in fibrotic fibroblasts results in defective PTEN function, unrestrained Akt signaling and subsequent activation of the translation initiation machinery. How this pathological integrin signaling alters the gene expression pathway has not been elucidated. Results Using a systems approach to study this question in a prototype fibrotic disease, idiopathic pulmonary fibrosis (IPF), we show organized changes in the gene expression pathway of primary lung myofibroblasts that persist for up to 9 sub-cultivations in vitro. When comparing IPF and control myofibroblasts in a 3-dimensional type I collagen matrix, more genes differed at the level of ribosome recruitment than at the level of transcript abundance, indicating pathological translational control as a major characteristic of IPF myofibroblasts. To determine the effect of matrix state on translational control, myofibroblasts were permitted to contract the matrix. Ribosome recruitment in control myofibroblasts was relatively stable. In contrast, IPF cells manifested large alterations in the ribosome recruitment pattern. Pathological studies suggest an epithelial origin for IPF myofibroblasts through the epithelial to mesenchymal transition (EMT). In accord with this, we found systems-level indications for TGF-β-driven EMT as one source of IPF myofibroblasts. Conclusions These findings establish the power of systems-level genome-wide analysis to provide mechanistic insights into fibrotic disorders such as IPF. Our data point to derangements of translational control downstream of aberrant beta 1 integrin signaling as a fundamental component of IPF pathobiology and indicate that TGF-β-driven EMT is one source of IPF myofibroblasts.
Affiliation(s)
- Ola Larsson
- Pulmonary Division, Department of Medicine, University of Minnesota, Minneapolis, Minnesota, United States of America.
|
46
|
Conesa A, Bro R, García-García F, Prats JM, Götz S, Kjeldahl K, Montaner D, Dopazo J. Direct functional assessment of the composite phenotype through multivariate projection strategies. Genomics 2008; 92:373-83. [PMID: 18652888 DOI: 10.1016/j.ygeno.2008.05.015] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 02/22/2008] [Revised: 05/26/2008] [Accepted: 05/28/2008] [Indexed: 01/11/2023]
Abstract
We present a novel approach for the analysis of transcriptomics data that integrates functional annotation of gene sets with expression values in a multivariate fashion, and directly assesses the relation of functional features to a multivariate space of response phenotypical variables. Multivariate projection methods are used to obtain new correlated variables for a set of genes that share a given function. These new functional variables are then related to the response variables of interest. The analysis of the principal directions of the multivariate regression allows for the identification of gene function features correlated with the phenotype. Two different transcriptomics studies are used to illustrate the statistical and interpretative aspects of the methodology. We demonstrate the superiority of the proposed method over equivalent approaches.
Affiliation(s)
- Ana Conesa
- Bioinformatics Department, Centro de Investigación Principe Felipe, Valencia, Spain
|
47
|
Dopazo J. Formulating and testing hypotheses in functional genomics. Artif Intell Med 2008; 45:97-107. [PMID: 18789659 DOI: 10.1016/j.artmed.2008.08.003] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.3] [Received: 02/08/2008] [Revised: 08/04/2008] [Accepted: 08/04/2008] [Indexed: 01/08/2023]
Abstract
OBJECTIVE The ultimate goal of any genome-scale experiment is to provide a functional interpretation of the results, relating the available genomic information to the hypotheses that originated the experiment. METHODS AND RESULTS Initially, this interpretation was made on a pre-selection of relevant genes, based on the experimental values, followed by the study of the enrichment in some functional properties. Nevertheless, functional enrichment methods were shown to have a flaw: the first step of gene selection was too stringent, given that the cooperation among genes was ignored. The assumption that modules of genes related by relevant biological properties (functionality, co-regulation, chromosomal location, etc.) are the real actors of cell biology led to the development of new procedures, inspired by systems biology criteria, generically known as gene-set methods. These methods have been successfully used to analyze transcriptomic and large-scale genotyping experiments, as well as to test other genome-scale hypotheses in fields such as phylogenomics.
Affiliation(s)
- Joaquin Dopazo
- Department of Bioinformatics, and Functional Genomics Node (INB), Valencia E-46013, Spain.
|
48
|
Agudelo-Romero P, Carbonell P, de la Iglesia F, Carrera J, Rodrigo G, Jaramillo A, Pérez-Amador MA, Elena SF. Changes in the gene expression profile of Arabidopsis thaliana after infection with Tobacco etch virus. Virol J 2008; 5:92. [PMID: 18684336 PMCID: PMC2518140 DOI: 10.1186/1743-422x-5-92] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.1] [Received: 05/29/2008] [Accepted: 08/07/2008] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND Tobacco etch potyvirus (TEV) has been extensively used as a model system for the study of positive-sense RNA viruses infecting plants. The ability of TEV to infect Arabidopsis thaliana varies among ecotypes. In this study, changes in gene expression of A. thaliana ecotype Ler infected with TEV have been explored using long-oligonucleotide arrays. A. thaliana Ler is a susceptible host that allows systemic movement, although the viral load is low and the induced syndrome ranges from asymptomatic to mild. Gene expression profiles were monitored in whole plants 21 days post-inoculation (dpi). Microarrays contained 26,173 protein-coding genes and 87 miRNAs. RESULTS Expression analysis identified 1727 genes that displayed significant and consistent changes in expression levels, either up or down, in infected plants. Identified TEV-responsive genes span a diverse array of functional categories that include responses to biotic stresses (such as the systemic acquired resistance pathway and hypersensitive responses) and abiotic stresses (drought, salinity, temperature, and wounding). The expression of many different transcription factors was also significantly affected, including members of the R2R3-MYB family and ABA-inducible TFs. In concordance with several other plant and animal viruses, the expression of heat-shock proteins (HSP) was also increased. Finally, we have associated functional GO categories with KEGG biochemical pathways, and found that many of the altered biological functions are controlled by changes in basal metabolism. CONCLUSION TEV infection significantly impacts a wide array of cellular processes, in particular stress-response pathways, including the systemic acquired resistance and hypersensitive responses. However, many of the observed alterations may represent a global response to viral infection rather than being specific to TEV.
Affiliation(s)
- Patricia Agudelo-Romero
- Instituto de Biología Molecular y Celular de Plantas, Consejo Superior de Investigaciones Científicas-UPV, 46022, València, Spain
- Pablo Carbonell
- Instituto de Biología Molecular y Celular de Plantas, Consejo Superior de Investigaciones Científicas-UPV, 46022, València, Spain
- Francisca de la Iglesia
- Instituto de Biología Molecular y Celular de Plantas, Consejo Superior de Investigaciones Científicas-UPV, 46022, València, Spain
- Javier Carrera
- Instituto de Biología Molecular y Celular de Plantas, Consejo Superior de Investigaciones Científicas-UPV, 46022, València, Spain
- Guillermo Rodrigo
- Instituto de Biología Molecular y Celular de Plantas, Consejo Superior de Investigaciones Científicas-UPV, 46022, València, Spain
- Alfonso Jaramillo
- Laboratoire de Biochimie, École Polytechnique, 91128, Palaiseau, France
- Miguel A Pérez-Amador
- Instituto de Biología Molecular y Celular de Plantas, Consejo Superior de Investigaciones Científicas-UPV, 46022, València, Spain
- Santiago F Elena
- Instituto de Biología Molecular y Celular de Plantas, Consejo Superior de Investigaciones Científicas-UPV, 46022, València, Spain
|
49
|
Wei P, Pan W. Incorporating gene functions into regression analysis of DNA-protein binding data and gene expression data to construct transcriptional networks. IEEE/ACM Trans Comput Biol Bioinform 2008; 5:401-415. [PMID: 18670043 DOI: 10.1109/tcbb.2007.1062] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 05/26/2023]
Abstract
Useful information on transcriptional networks has been extracted by regression analyses of gene expression data and DNA-protein binding data. However, a potential limitation of these approaches is their assumption of a common and constant activity level of a transcription factor (TF) on all the genes in any given experimental condition; for example, any TF is assumed to be either an activator or a repressor, but not both, while it is known that some TFs can be dual regulators. Rather than assuming a common linear regression model for all the genes, we propose using separate regression models for various gene groups; the genes can be grouped based on their functions or some clustering results. Furthermore, to take advantage of the hierarchical structure of many existing gene function annotation systems, such as Gene Ontology (GO), we propose a shrinkage method that borrows information from relevant gene groups. Applications to a yeast dataset and simulations lend support to our proposed methods. In particular, we find that the shrinkage method consistently works well under various scenarios. We recommend the shrinkage method as a useful alternative to the existing methods.
Affiliation(s)
- Peng Wei
- Division of Biostatistics, School of Public Health, University of Minnesota, A460 Mayo Building, MMC 303, Minneapolis, MN 55455-0378, USA.
|
50
|
Denton AM, Wu J, Townsend MK, Sule P, Prüss BM. Relating gene expression data on two-component systems to functional annotations in Escherichia coli. BMC Bioinformatics 2008; 9:294. [PMID: 18578884 PMCID: PMC2478693 DOI: 10.1186/1471-2105-9-294] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 11/30/2007] [Accepted: 06/25/2008] [Indexed: 11/30/2022] Open
Abstract
Background Obtaining physiological insights from microarray experiments requires computational techniques that relate gene expression data to functional information. Traditionally, this has been done in two consecutive steps. The first step identifies important genes through clustering or statistical techniques, while the second step assigns biological functions to the identified groups. Recently, techniques have been developed that identify such relationships in a single step. Results We have developed an algorithm that relates patterns of gene expression in a set of microarray experiments to functional groups in one step. Our only assumption is that patterns co-occur frequently. The effectiveness of the algorithm is demonstrated as part of a study of regulation by two-component systems in Escherichia coli. The significance of the relationships between expression data and functional annotations is evaluated based on density histograms that are constructed using product similarity among expression vectors. We present a biological analysis of three of the resulting functional groups of proteins, develop hypotheses for further biological studies, and test one of these hypotheses experimentally. A comparison with other algorithms and a different data set is presented. Conclusion Our new algorithm is able to find interesting and biologically meaningful relationships, not found by other algorithms, in previously analyzed data sets. Scaling of the algorithm to large data sets can be achieved based on a theoretical model.
Affiliation(s)
- Anne M Denton
- Department of Computer Science and Operations Research, North Dakota State University, Fargo, ND 58105, USA.
|