1
Azevedo CF, Ferrão LFV, Benevenuto J, de Resende MDV, Nascimento M, Nascimento ACC, Munoz PR. Using visual scores for genomic prediction of complex traits in breeding programs. TAG. THEORETICAL AND APPLIED GENETICS. THEORETISCHE UND ANGEWANDTE GENETIK 2023; 137:9. [PMID: 38102495 DOI: 10.1007/s00122-023-04512-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2023] [Accepted: 11/21/2023] [Indexed: 12/17/2023]
Abstract
KEY MESSAGE: An approach for handling visual scores with potential errors and subjectivity was evaluated in simulated and blueberry recurrent selection breeding schemes to assist breeders in their decision-making.
Most genomic prediction methods are based on assumptions of normality due to their simplicity and ease of implementation. However, in plant and animal breeding, continuous traits are often visually scored as categorical traits and analyzed as Gaussian variables, thus violating the normality assumption, which could affect the prediction of breeding values and the estimation of genetic parameters. In this study, we examined the main challenges of visual scores for genomic prediction and genetic parameter estimation using mixed models, Bayesian, and machine learning methods. We evaluated these approaches using simulated and real breeding data sets. Our contribution in this study is a five-fold demonstration: (i) collecting data using an intermediate number of categories (1-3 and 1-5) is the best strategy, even considering errors associated with visual scores; (ii) Linear Mixed Models and Bayesian Linear Regression are robust to the normality violation, but marginal gains can be achieved when using Bayesian Ordinal Regression Models (BORM) and Random Forest Classification; (iii) genetic parameters are better estimated using BORM; (iv) our conclusions using simulated data are also applicable to real data in autotetraploid blueberry; and (v) a comparison of continuous and categorical phenotypes found that investing in the evaluation of 600-1000 categorical data points with low error, when it is not feasible to collect continuous phenotypes, is a strategy for improving predictive abilities. Our findings suggest the best approaches for effectively using visual score traits to explore genetic information in breeding programs and highlight the importance of investing in the training of evaluator teams and in high-quality phenotyping.
Affiliation(s)
- Camila Ferreira Azevedo
- Statistics Department, Federal University of Viçosa, Viçosa, Minas Gerais, Brazil
- Horticultural Sciences Department, Blueberry Breeding and Genomics Lab, University of Florida, Gainesville, FL, USA
- Luis Felipe Ventorim Ferrão
- Horticultural Sciences Department, Blueberry Breeding and Genomics Lab, University of Florida, Gainesville, FL, USA
- Juliana Benevenuto
- Horticultural Sciences Department, Blueberry Breeding and Genomics Lab, University of Florida, Gainesville, FL, USA
- Marcos Deon Vilela de Resende
- Statistics Department, Federal University of Viçosa, Viçosa, Minas Gerais, Brazil
- Department of Forestry Engineering, Federal University of Viçosa, Viçosa, Minas Gerais, Brazil
- Embrapa Café, Brasília, Distrito Federal, Brazil
- Moyses Nascimento
- Statistics Department, Federal University of Viçosa, Viçosa, Minas Gerais, Brazil
- Patricio R Munoz
- Horticultural Sciences Department, Blueberry Breeding and Genomics Lab, University of Florida, Gainesville, FL, USA.
2
Hornung R, Boulesteix AL. Interaction forests: Identifying and exploiting interpretable quantitative and qualitative interaction effects. Comput Stat Data Anal 2022. [DOI: 10.1016/j.csda.2022.107460] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
3
Acharjee A, Larkman J, Xu Y, Cardoso VR, Gkoutos GV. A random forest based biomarker discovery and power analysis framework for diagnostics research. BMC Med Genomics 2020; 13:178. [PMID: 33228632 PMCID: PMC7685541 DOI: 10.1186/s12920-020-00826-6] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Received: 04/03/2020] [Accepted: 11/15/2020] [Indexed: 11/25/2022] Open
Abstract
Background: Biomarker identification is one of the major goals of functional genomics and translational medicine studies. Large-scale omics data are increasingly being accumulated and can provide vital means for the identification of biomarkers for the early diagnosis of complex disease and/or for advanced patient/disease stratification. These tasks are clearly interlinked, and it is essential that an unbiased and stable methodology is applied in order to address them. Although many biomarker identification approaches, primarily machine learning based, have recently been developed, the exploration of potential associations between biomarker identification and the design of future experiments remains a challenge.
Methods: In this study, using both simulated and published experimentally derived datasets, we assessed the performance of several state-of-the-art Random Forest (RF) based decision approaches, namely the Boruta method, the permutation based feature selection without correction method, the permutation based feature selection with correction method, and the backward elimination based feature selection method. Moreover, we conducted a power analysis to estimate the number of samples required for potential future studies.
Results: We present a number of different RF based stable feature selection methods and compare their performances using simulated, as well as published, experimentally derived, datasets. Across all of the scenarios considered, we found the Boruta method to be the most stable methodology, whilst the Permutation (Raw) approach offered the largest number of relevant features, when allowed to stabilise over a number of iterations. Finally, we developed and made available a web interface (https://joelarkman.shinyapps.io/PowerTools/) to streamline power calculations, thereby aiding the design of potential future studies within a translational medicine context.
Conclusions: We developed an RF-based biomarker discovery framework and provide a web interface for our framework, termed PowerTools, that supports the design of appropriate and cost-effective future omics studies.
Affiliation(s)
- Animesh Acharjee
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham, B15 2TT, UK
- Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, B15 2TT, UK
- NIHR Surgical Reconstruction and Microbiology Research Centre, University Hospital Birmingham, Birmingham, B15 2WB, UK
- Joseph Larkman
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham, B15 2TT, UK
- Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, B15 2TT, UK
- Yuanwei Xu
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham, B15 2TT, UK
- Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, B15 2TT, UK
- Victor Roth Cardoso
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham, B15 2TT, UK
- Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, B15 2TT, UK
- MRC Health Data Research UK (HDR UK), London, UK
- Georgios V Gkoutos
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham, B15 2TT, UK
- Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, B15 2TT, UK
- NIHR Surgical Reconstruction and Microbiology Research Centre, University Hospital Birmingham, Birmingham, B15 2WB, UK
- MRC Health Data Research UK (HDR UK), London, UK
- NIHR Experimental Cancer Medicine Centre, Birmingham, B15 2TT, UK
- NIHR Biomedical Research Centre, University Hospital Birmingham, Birmingham, B15 2TT, UK
4
Harel T, Peshes-Yaloz N, Bacharach E, Gat-Viks I. Predicting Phenotypic Diversity from Molecular and Genetic Data. Genetics 2019; 213:297-311. [PMID: 31352366 PMCID: PMC6727812 DOI: 10.1534/genetics.119.302463] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 12/20/2018] [Accepted: 07/04/2019] [Indexed: 01/03/2023] Open
Abstract
Despite the importance of complex phenotypes, an in-depth understanding of the combined molecular and genetic effects on a phenotype has yet to be achieved. Here, we introduce InPhenotype, a novel computational approach for complex phenotype prediction, where gene-expression data and genotyping data are integrated to yield quantitative predictions of complex physiological traits. Unlike existing computational methods, InPhenotype makes it possible to model potential regulatory interactions between gene expression and genomic loci without compromising the continuous nature of the molecular data. We applied InPhenotype to synthetic data, exemplifying its utility for different data parameters, as well as its superiority compared to current methods in both prediction quality and the ability to detect regulatory interactions of genes and genomic loci. Finally, we show that InPhenotype can provide biological insights into both mouse and yeast datasets.
Affiliation(s)
- Tom Harel
- School of Molecular Cell Biology and Biotechnology, The George S. Wise Faculty of Life Sciences, Tel Aviv University, 6997801 Israel
- Naama Peshes-Yaloz
- School of Molecular Cell Biology and Biotechnology, The George S. Wise Faculty of Life Sciences, Tel Aviv University, 6997801 Israel
- Eran Bacharach
- School of Molecular Cell Biology and Biotechnology, The George S. Wise Faculty of Life Sciences, Tel Aviv University, 6997801 Israel
- Irit Gat-Viks
- School of Molecular Cell Biology and Biotechnology, The George S. Wise Faculty of Life Sciences, Tel Aviv University, 6997801 Israel
5
Degenhardt F, Seifert S, Szymczak S. Evaluation of variable selection methods for random forests and omics data sets. Brief Bioinform 2019; 20:492-503. [PMID: 29045534 PMCID: PMC6433899 DOI: 10.1093/bib/bbx124] [Citation(s) in RCA: 245] [Impact Index Per Article: 49.0] [Received: 04/28/2017] [Revised: 09/06/2017] [Indexed: 12/28/2022] Open
Abstract
Machine learning methods and in particular random forests are promising approaches for prediction based on high dimensional omics data sets. They provide variable importance measures to rank predictors according to their predictive power. If building a prediction model is the main goal of a study, often a minimal set of variables with good prediction performance is selected. However, if the objective is the identification of involved variables to find active networks and pathways, approaches that aim to select all relevant variables should be preferred. We evaluated several variable selection procedures based on simulated data as well as publicly available experimental methylation and gene expression data. Our comparison included the Boruta algorithm, the Vita method, recurrent relative variable importance, a permutation approach and its parametric variant (Altmann) as well as recursive feature elimination (RFE). In our simulation studies, Boruta was the most powerful approach, followed closely by the Vita method. Both approaches demonstrated similar stability in variable selection, while Vita was the most robust approach under a pure null model without any predictor variables related to the outcome. In the analysis of the different experimental data sets, Vita demonstrated slightly better stability in variable selection and was less computationally intensive than Boruta. In conclusion, we recommend the Boruta and Vita approaches for the analysis of high-dimensional data sets. Vita is considerably faster than Boruta and thus more suitable for large data sets, but only Boruta can also be applied in low-dimensional settings.
Affiliation(s)
- Stephan Seifert
- Institute of Medical Informatics and Statistics, Kiel University, Germany
- Silke Szymczak
- Institute of Medical Informatics and Statistics, Kiel University, Germany
6
DeGregory KW, Kuiper P, DeSilvio T, Pleuss JD, Miller R, Roginski JW, Fisher CB, Harness D, Viswanath S, Heymsfield SB, Dungan I, Thomas DM. A review of machine learning in obesity. Obes Rev 2018; 19:668-685. [PMID: 29426065 PMCID: PMC8176949 DOI: 10.1111/obr.12667] [Citation(s) in RCA: 101] [Impact Index Per Article: 16.8] [Received: 10/22/2017] [Revised: 11/18/2017] [Accepted: 11/28/2017] [Indexed: 12/15/2022]
Abstract
Rich sources of obesity-related data arising from sensors, smartphone apps, electronic medical health records and insurance data can bring new insights for understanding, preventing and treating obesity. For such large datasets, machine learning provides sophisticated and elegant tools to describe, classify and predict obesity-related risks and outcomes. Here, we review machine learning methods that predict and/or classify such as linear and logistic regression, artificial neural networks, deep learning and decision tree analysis. We also review methods that describe and characterize data such as cluster analysis, principal component analysis, network science and topological data analysis. We introduce each method with a high-level overview followed by examples of successful applications. The algorithms were then applied to National Health and Nutrition Examination Survey to demonstrate methodology, utility and outcomes. The strengths and limitations of each method were also evaluated. This summary of machine learning algorithms provides a unique overview of the state of data analysis applied specifically to obesity.
Affiliation(s)
- K W DeGregory
- Department of Mathematical Sciences, United States Military Academy, West Point, NY, USA
- P Kuiper
- Department of Mathematical Sciences, United States Military Academy, West Point, NY, USA
- T DeSilvio
- Case Western Reserve University, Cleveland, OH, USA
- J D Pleuss
- Department of Mathematical Sciences, United States Military Academy, West Point, NY, USA
- R Miller
- Department of Mathematical Sciences, United States Military Academy, West Point, NY, USA
- J W Roginski
- Department of Mathematical Sciences, United States Military Academy, West Point, NY, USA
- C B Fisher
- Department of Mathematical Sciences, United States Military Academy, West Point, NY, USA
- D Harness
- Department of Mathematical Sciences, United States Military Academy, West Point, NY, USA
- S Viswanath
- Case Western Reserve University, Cleveland, OH, USA
- S B Heymsfield
- Pennington Biomedical Research Center, Baton Rouge, LA, USA
- I Dungan
- Department of Mathematical Sciences, United States Military Academy, West Point, NY, USA
- D M Thomas
- Department of Mathematical Sciences, United States Military Academy, West Point, NY, USA
7
Shi M, He J. SNRFCB: sub-network based random forest classifier for predicting chemotherapy benefit on survival for cancer treatment. MOLECULAR BIOSYSTEMS 2016; 12:1214-23. [PMID: 26864276 DOI: 10.1039/c5mb00399g] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 01/04/2023]
Abstract
Adjuvant chemotherapy (CTX) should be individualized to provide potential survival benefit and avoid potential harm to cancer patients. Our goal was to establish a computational approach for making personalized estimates of the survival benefit from adjuvant CTX. We developed the Sub-Network based Random Forest classifier for predicting Chemotherapy Benefit (SNRFCB) based on gene expression datasets of lung cancer. The SNRFCB approach was then validated in independent test cohorts for identifying chemotherapy responder cohorts and chemotherapy non-responder cohorts. SNRFCB involved the pre-selection of gene sub-network signatures based on the mutations and on protein-protein interaction data as well as the application of the random forest algorithm to gene expression datasets. Adjuvant CTX was significantly associated with the prolonged overall survival of lung cancer patients in the chemotherapy responder group (P = 0.008), but it was not beneficial to patients in the chemotherapy non-responder group (P = 0.657). Adjuvant CTX was significantly associated with the prolonged overall survival of lung cancer squamous cell carcinoma (SQCC) subtype patients in the chemotherapy responder cohorts (P = 0.024), but it was not beneficial to patients in the chemotherapy non-responder cohorts (P = 0.383). SNRFCB improved prediction performance compared with a support vector machine (SVM). To test the general applicability of the predictive model, we further applied the SNRFCB approach to human breast cancer datasets and also observed superior performance. SNRFCB could provide recurrent probability for individual patients and identify which patients may benefit from adjuvant CTX in clinical trials.
Affiliation(s)
- Mingguang Shi
- School of Electric Engineering and Automation, Hefei University of Technology, Hefei, Anhui 230009, China.
8
Anděl M, Kléma J, Krejčík Z. Network-constrained forest for regularized classification of omics data. Methods 2015; 83:88-97. [PMID: 25872185 DOI: 10.1016/j.ymeth.2015.04.006] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 01/30/2015] [Revised: 04/01/2015] [Accepted: 04/02/2015] [Indexed: 12/28/2022] Open
Abstract
Contemporary molecular biology deals with wide and heterogeneous sets of measurements to model and understand underlying biological processes including complex diseases. Machine learning is frequently used to build such models. However, models built solely from measured data often suffer from overfitting, as the sample size is typically much smaller than the number of measured features. In this paper, we propose a random forest-based classifier that reduces this overfitting with the aid of prior knowledge in the form of a feature interaction network. We illustrate the proposed method in the task of disease classification based on measured mRNA and miRNA profiles complemented by the interaction network composed of the miRNA-mRNA target relations and mRNA-mRNA interactions corresponding to the interactions between their encoded proteins. We demonstrate that the proposed network-constrained forest employs prior knowledge to increase learning bias and consequently to improve classification accuracy, stability and comprehensibility of the resulting model. The experiments are carried out in the domain of myelodysplastic syndrome, which is our long-term research focus. We validate our approach in the public domain of ovarian carcinoma, with the same data form. We believe that the idea of a network-constrained forest can straightforwardly be generalized towards arbitrary omics data with an available and non-trivial feature interaction network. The proposed method is publicly available as part of the miXGENE system (http://mixgene.felk.cvut.cz); the workflow that implements the myelodysplastic syndrome experiments is presented as a dedicated case study.
Affiliation(s)
- Michael Anděl
- Department of Computer Science, Czech Technical University, Technická 2, Prague, Czech Republic.
- Jiří Kléma
- Department of Computer Science, Czech Technical University, Technická 2, Prague, Czech Republic.
- Zdeněk Krejčík
- Department of Molecular Genetics, Institute of Hematology and Blood Transfusion, U Nemocnice 1, Prague, Czech Republic.
9
Žitnik M, Zupan B. Data Fusion by Matrix Factorization. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2015; 37:41-53. [PMID: 26353207 DOI: 10.1109/tpami.2014.2343973] [Citation(s) in RCA: 90] [Impact Index Per Article: 10.0] [Indexed: 06/05/2023]
Abstract
For most problems in science and engineering we can obtain data sets that describe the observed system from various perspectives and record the behavior of its individual components. Heterogeneous data sets can be collectively mined by data fusion. Fusion can focus on a specific target relation and exploit directly associated data together with contextual data and data about the system's constraints. In the paper we describe a data fusion approach with penalized matrix tri-factorization (DFMF) that simultaneously factorizes data matrices to reveal hidden associations. The approach can directly consider any data that can be expressed in a matrix, including those from feature-based representations, ontologies, associations and networks. We demonstrate the utility of DFMF for a gene function prediction task with eleven different data sources and for prediction of pharmacologic actions by fusing six data sources. Our data fusion algorithm compares favorably to alternative data integration approaches and achieves higher accuracy than can be obtained from any single data source alone.
10
Predicting the phenotypic values of physiological traits using SNP genotype and gene expression data in mice. PLoS One 2014; 9:e115532. [PMID: 25541966 PMCID: PMC4277360 DOI: 10.1371/journal.pone.0115532] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 06/09/2014] [Accepted: 11/25/2014] [Indexed: 01/22/2023] Open
Abstract
Predicting phenotypes using genome-wide genetic variation and gene expression data is useful in several fields, such as human biology and medicine, as well as in crop and livestock breeding. However, for phenotype prediction using gene expression data for mammals, studies remain scarce, as the available data on gene expression profiling are currently limited. By integrating a few sources of relevant data that are available in mice, this study investigated the accuracy of phenotype prediction for several physiological traits. Gene expression data from two tissues as well as single nucleotide polymorphisms (SNPs) were used. For the studied traits, the variance of the effects of the expression levels was more likely to differ among the genes than were the effects of SNPs. For the glucose concentration, the total cholesterol amount, and the total tidal volume, the accuracy by cross validation tended to be higher when the gene expression data rather than the SNP genotype data were used, and a statistically significant increase in the accuracy was obtained when the gene expression data from the liver were used alone or jointly with the SNP genotype data. For these traits, there were no additional gains in accuracy from using the gene expression data of both the liver and lung compared to that of individual use. The accuracy of prediction using genes that were selected differently was examined; the use of genes with a higher tissue specificity tended to result in an accuracy that was similar to or greater than that associated with the use of all of the available genes for traits such as the glucose concentration and total cholesterol amount. Although relatively few animals were evaluated, the current results suggest that gene expression levels could be used as explanatory variables. However, further studies are essential to confirm our findings using additional animal samples.
11
Borland AM, Hartwell J, Weston DJ, Schlauch KA, Tschaplinski TJ, Tuskan GA, Yang X, Cushman JC. Engineering crassulacean acid metabolism to improve water-use efficiency. TRENDS IN PLANT SCIENCE 2014; 19:327-38. [PMID: 24559590 PMCID: PMC4065858 DOI: 10.1016/j.tplants.2014.01.006] [Citation(s) in RCA: 122] [Impact Index Per Article: 12.2] [Received: 10/06/2013] [Revised: 01/01/2014] [Accepted: 01/13/2014] [Indexed: 05/19/2023]
Abstract
Climatic extremes threaten agricultural sustainability worldwide. One approach to increase plant water-use efficiency (WUE) is to introduce crassulacean acid metabolism (CAM) into C3 crops. Such a task requires comprehensive systems-level understanding of the enzymatic and regulatory pathways underpinning this temporal CO2 pump. Here we review the progress that has been made in achieving this goal. Given that CAM arose through multiple independent evolutionary origins, comparative transcriptomics and genomics of taxonomically diverse CAM species are being used to define the genetic 'parts list' required to operate the core CAM functional modules of nocturnal carboxylation, diurnal decarboxylation, and inverse stomatal regulation. Engineered CAM offers the potential to sustain plant productivity for food, feed, fiber, and biofuel production in hotter and drier climates.
Affiliation(s)
- Anne M Borland
- School of Biology, Newcastle University, Newcastle upon Tyne NE1 7RU, UK
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6407, USA
- James Hartwell
- Department of Plant Sciences, Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, UK
- David J Weston
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6407, USA
- Karen A Schlauch
- Department of Biochemistry and Molecular Biology, MS330, University of Nevada, Reno, NV 89557-0330, USA
- Gerald A Tuskan
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6407, USA
- Xiaohan Yang
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6407, USA
- John C Cushman
- Department of Biochemistry and Molecular Biology, MS330, University of Nevada, Reno, NV 89557-0330, USA.
12
Tomescu OA, Mattanovich D, Thallinger GG. Integrative omics analysis. A study based on Plasmodium falciparum mRNA and protein data. BMC SYSTEMS BIOLOGY 2014; 8 Suppl 2:S4. [PMID: 25033389 PMCID: PMC4101701 DOI: 10.1186/1752-0509-8-s2-s4] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Indexed: 11/10/2022]
Abstract
Background: Technological improvements have shifted the focus from data generation to data analysis. The availability of large amounts of data from transcriptomics, proteomics and metabolomics experiments raises new questions concerning suitable integrative analysis methods. We compare three integrative analysis techniques (co-inertia analysis, generalized singular value decomposition and integrative biclustering) by applying them to gene and protein abundance data from the six life cycle stages of Plasmodium falciparum. Co-inertia analysis (CIA) is an analysis method used to visualize and explore gene and protein data. The generalized singular value decomposition (GSVD) has shown its potential in the analysis of two transcriptome data sets. Integrative biclustering (IBC) applies biclustering to gene and protein data.
Results: Using CIA, we visualize the six life cycle stages of Plasmodium falciparum, as well as GO terms in a 2D plane and interpret the spatial configuration. With GSVD, we decompose the transcriptomic and proteomic data sets into matrices with biologically meaningful interpretations and explore the processes captured by the data sets. IBC identifies groups of genes, proteins, GO terms and life cycle stages of Plasmodium falciparum. We show method-specific results as well as a network view of the life cycle stages based on the results common to all three methods. Additionally, by combining the results of the three methods, we create a three-fold validated network of life cycle stage specific GO terms: sporozoites are associated with transcription and transport; merozoites with entry into host cell as well as biosynthetic and metabolic processes; rings with oxidation-reduction processes; trophozoites with glycolysis and energy production; schizonts with antigenic variation and immune response; gametocytes with DNA packaging and mitochondrial transport. Furthermore, the network connectivity underlines the separation of the intraerythrocytic cycle from the gametocyte and sporozoite stages.
Conclusion: Using integrative analysis techniques, we can integrate knowledge from different levels and obtain a wider view of the system under study. The overlap between method-specific and common results is considerable, even if the basic mathematical assumptions are very different. The three-fold validated network of life cycle stage characteristics of Plasmodium falciparum could identify a large amount of the known associations from literature in only one study.
13
Echeverry-Galvis MA, Peterson JK, Sulo-Caceres R. The social nestwork: tree structure determines nest placement in Kenyan weaverbird colonies. PLoS One 2014; 9:e88761. [PMID: 24551157 PMCID: PMC3923826 DOI: 10.1371/journal.pone.0088761] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 09/17/2013] [Accepted: 01/11/2014] [Indexed: 11/19/2022] Open
Abstract
Group living is a life history strategy employed by many organisms. This strategy is often difficult to study because the exact boundaries of a group can be unclear. Weaverbirds present an ideal model for the study of group living, because their colonies occupy a space with discrete boundaries: a single tree. We examined one aspect of group living, nest placement, in three Kenyan weaverbird species: the Black-capped Weaver (Pseudonigrita cabanisi), Grey-capped Weaver (P. arnaudi) and White-browed Sparrow Weaver (Ploceropasser mahali). We asked which environmental, biological, and/or abiotic factors influenced their nest arrangement and location in a given tree. We used machine learning to analyze measurements taken from 16 trees and 516 nests outside the breeding season at the Mpala Research Station in Laikipia, Kenya, along with climate data for the area. We found that tree architecture, number of nests per tree, and nest-specific characteristics were the main variables driving nest placement. Our results suggest that different Kenyan weaverbird species have similar priorities driving the selection of where a nest is placed within a given tree. Our work illustrates the advantage of using machine learning techniques to investigate biological questions.
Affiliation(s)
- Maria Angela Echeverry-Galvis
- Department of Ecology and Evolutionary Biology, Princeton University, Princeton, New Jersey, United States of America
- Departamento de Ecologia y Territorio, Pontificia Universidad Javeriana, Bogotá, Colombia
- Jennifer K. Peterson
- Department of Ecology and Evolutionary Biology, Princeton University, Princeton, New Jersey, United States of America
- Rajmonda Sulo-Caceres
- Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, Illinois, United States of America
|
14
|
Discovering disease-disease associations by fusing systems-level molecular data. Sci Rep 2013; 3:3202. [PMID: 24232732 PMCID: PMC3828568 DOI: 10.1038/srep03202] [Citation(s) in RCA: 81] [Impact Index Per Article: 7.4] [Received: 09/10/2013] [Accepted: 10/23/2013] [Indexed: 12/12/2022]
Abstract
The advent of genome-scale genetic and genomic studies allows new insight into disease classification. Recently, a shift was made from linking diseases simply based on their shared genes towards systems-level integration of molecular data. Here, we aim to find relationships between diseases based on evidence from fusing all available molecular interaction and ontology data. We propose a multi-level hierarchy of disease classes that significantly overlaps with existing disease classification. In it, we find 14 disease-disease associations currently not present in Disease Ontology and provide evidence for their relationships through comorbidity data and literature curation. Interestingly, even though the number of known human genetic interactions is currently very small, we find they are the most important predictor of a link between diseases. Finally, we show that omission of any one of the included data sources reduces prediction quality, further highlighting the importance of the paradigm shift towards systems-level data fusion.
|
15
|
Seoane JA, Day INM, Gaunt TR, Campbell C. A pathway-based data integration framework for prediction of disease progression. Bioinformatics 2013; 30:838-45. [PMID: 24162466 PMCID: PMC3957070 DOI: 10.1093/bioinformatics/btt610] [Citation(s) in RCA: 56] [Impact Index Per Article: 5.1] [Indexed: 11/17/2022]
Abstract
Motivation: Within medical research there is an increasing trend toward deriving multiple types of data from the same individual. The most effective prognostic prediction methods should use all available data, as this maximizes the amount of information used. In this article, we consider a variety of learning strategies to boost prediction performance based on the use of all available data. Implementation: We consider data integration via supervised multiple kernel learning methods. We propose a scheme in which feature selection by statistical score is performed separately per data type and by pathway membership. We further consider the introduction of a confidence measure for the class assignment, both to remove some ambiguously labeled datapoints from the training data and to implement a cautious classifier that only makes predictions when the associated confidence is high. Results: We use the METABRIC dataset for breast cancer, with prediction of survival at 2000 days from diagnosis. Predictive accuracy is improved by using kernels that exclusively use, as features, those genes that are known members of particular pathways. We show that yet further improvements can be made by using a range of additional kernels based on clinical covariates such as Estrogen Receptor (ER) status. Using this range of measures to improve prediction performance, we show that the test accuracy on new instances is nearly 80%, though predictions are only made on 69.2% of the patient cohort. Availability: https://github.com/jseoane/FSMKL Contact: J.Seoane@bristol.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
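The trade-off between accuracy and coverage described in this abstract can be illustrated with a minimal sketch of a cautious classifier. This is not the authors' FSMKL implementation: the function names, the 0.8 threshold, and the toy probabilities are all assumptions made for illustration, and the confidence measure here is simply the larger of the two class probabilities.

```python
from typing import List, Optional, Tuple


def cautious_predict(probs: List[float], threshold: float = 0.8) -> List[Optional[int]]:
    """Return 1 or 0 when confident, None when the classifier abstains.

    probs: hypothetical predicted probabilities of the positive class.
    threshold: hypothetical confidence cut-off (not taken from the paper).
    """
    labels: List[Optional[int]] = []
    for p in probs:
        confidence = max(p, 1.0 - p)  # confidence in the more likely class
        labels.append(None if confidence < threshold else int(p >= 0.5))
    return labels


def accuracy_and_coverage(preds: List[Optional[int]], truth: List[int]) -> Tuple[float, float]:
    """Accuracy over the predictions actually made, and the fraction made."""
    made = [(p, t) for p, t in zip(preds, truth) if p is not None]
    coverage = len(made) / len(truth)
    accuracy = sum(p == t for p, t in made) / len(made) if made else float("nan")
    return accuracy, coverage


# Toy data, invented purely for illustration.
probs = [0.95, 0.55, 0.10, 0.85, 0.40]
truth = [1, 0, 0, 0, 1]
preds = cautious_predict(probs, threshold=0.8)
acc, cov = accuracy_and_coverage(preds, truth)
```

Raising the threshold generally lowers coverage while restricting predictions to the cases the model is most sure about, which is the kind of trade-off behind reporting accuracy on only part of a cohort.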
Affiliation(s)
- José A Seoane
- MRC Centre for Causal Analyses in Translational Epidemiology, MRC Integrative Epidemiology Unit, School of Social and Community Medicine, University of Bristol, Clifton BS8 2BN, UK and Intelligent Systems Laboratory, University of Bristol, Bristol BS8 1UB, UK
|