1
Aponte JD, Katz DC, Roth DM, Vidal-García M, Liu W, Andrade F, Roseman CC, Murray SA, Cheverud J, Graf D, Marcucio RS, Hallgrímsson B. Relating multivariate shapes to genescapes using phenotype-biological process associations for craniofacial shape. eLife 2021; 10:68623. [PMID: 34779766 PMCID: PMC8631940 DOI: 10.7554/elife.68623] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/21/2021] [Accepted: 11/12/2021] [Indexed: 12/20/2022]
Abstract
Realistic mappings of genes to morphology are inherently multivariate on both sides of the equation. The importance of coordinated gene effects on morphological phenotypes is clear from the intertwining of gene actions in signaling pathways, gene regulatory networks, and developmental processes underlying the development of shape and size. Yet, current approaches tend to focus on identifying and localizing the effects of individual genes and rarely leverage the information content of high-dimensional phenotypes. Here, we explicitly model the joint effects of biologically coherent collections of genes on a multivariate trait – craniofacial shape – in a sample of n = 1145 mice from the Diversity Outbred (DO) experimental line. We use biological process Gene Ontology (GO) annotations to select skeletal and facial development gene sets and solve for the axis of shape variation that maximally covaries with gene set marker variation. We use our process-centered, multivariate genotype-phenotype (process MGP) approach to determine the overall contributions to craniofacial variation of genes involved in relevant processes and how variation in different processes corresponds to multivariate axes of shape variation. Further, we compare the directions of effect in phenotype space of mutations to the primary axis of shape variation associated with broader pathways within which they are thought to function. Finally, we leverage the relationship between mutational and pathway-level effects to predict phenotypic effects beyond craniofacial shape in specific mutants. We also introduce an online application that provides users the means to customize their own process-centered craniofacial shape analyses in the DO. The process-centered approach is generally applicable to any continuously varying phenotype and thus has wide-reaching implications for complex trait genetics.
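As a rough illustration of what "the axis of shape variation that maximally covaries with gene set marker variation" can mean computationally, here is a minimal sketch. It assumes the axis pair is taken as the leading singular vectors of the marker-by-shape cross-covariance matrix, as in two-block partial least squares; the abstract does not state the solver. The function names and the toy data are hypothetical.

```python
import math

def center(matrix):
    """Subtract each column's mean (rows are individuals)."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]

def leading_covariance_axes(markers, shape, iterations=100):
    """Power iteration for the leading singular vectors of cov(markers, shape).

    Assumes the cross-covariance matrix is not all zeros.
    """
    X, Y = center(markers), center(shape)
    n, p, q = len(X), len(X[0]), len(Y[0])
    # Cross-covariance matrix C (p markers x q shape variables).
    C = [[sum(X[i][a] * Y[i][b] for i in range(n)) / (n - 1)
          for b in range(q)] for a in range(p)]
    v = [1.0 / math.sqrt(q)] * q
    sigma = 0.0
    for _ in range(iterations):
        u = [sum(C[a][b] * v[b] for b in range(q)) for a in range(p)]
        norm_u = math.sqrt(sum(x * x for x in u))
        u = [x / norm_u for x in u]
        v = [sum(C[a][b] * u[a] for a in range(p)) for b in range(q)]
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / sigma for x in v]
    return u, v, sigma

# Toy data: 4 individuals, 2 markers, 2 shape variables (all hypothetical).
markers = [[0, 0], [1, 0], [1, 1], [2, 1]]
shape = [[0.0, 0.1], [2.0, -0.1], [2.0, -0.1], [4.0, 0.1]]
marker_axis, shape_axis, covariance = leading_covariance_axes(markers, shape)
```

Here `marker_axis` weights the markers, `shape_axis` is the unit-length direction in shape space, and `covariance` is the covariance between the two sets of scores along those axes.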
Affiliation(s)
- Jose D Aponte
- Department of Cell Biology & Anatomy, Alberta Children's Hospital Research Institute and McCaig Bone and Joint Institute, Cumming School of Medicine, University of Calgary, Calgary, Canada
- David C Katz
- Department of Cell Biology & Anatomy, Alberta Children's Hospital Research Institute and McCaig Bone and Joint Institute, Cumming School of Medicine, University of Calgary, Calgary, Canada
- Daniela M Roth
- School of Dentistry, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, Canada
- Marta Vidal-García
- Department of Cell Biology & Anatomy, Alberta Children's Hospital Research Institute and McCaig Bone and Joint Institute, Cumming School of Medicine, University of Calgary, Calgary, Canada
- Wei Liu
- Department of Cell Biology & Anatomy, Alberta Children's Hospital Research Institute and McCaig Bone and Joint Institute, Cumming School of Medicine, University of Calgary, Calgary, Canada
- Fernando Andrade
- Department of Biology, Loyola University Chicago, Chicago, United States
- Charles C Roseman
- Department of Biology, Loyola University Chicago, Chicago, United States
- James Cheverud
- Department of Biology, Loyola University Chicago, Chicago, United States
- Daniel Graf
- School of Dentistry, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, Canada; Department of Medical Genetics, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, Canada
- Ralph S Marcucio
- Department of Orthopaedic Surgery, School of Medicine, University of California, San Francisco, San Francisco, United States
- Benedikt Hallgrímsson
- Department of Cell Biology & Anatomy, Alberta Children's Hospital Research Institute and McCaig Bone and Joint Institute, Cumming School of Medicine, University of Calgary, Calgary, Canada; Department of Animal Biology, University of Illinois Urbana Champaign, Urbana, United States
2
Guo C, Wang H, Feng G, Li J, Su C, Zhang J, Wang Z, Du W, Zhang B. Spatiotemporal predictions of obesity prevalence in Chinese children and adolescents: based on analyses of obesogenic environmental variability and Bayesian model. Int J Obes (Lond) 2019; 43:1380-1390. [PMID: 30568273 PMCID: PMC6584073 DOI: 10.1038/s41366-018-0301-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 06/08/2018] [Revised: 11/03/2018] [Accepted: 11/30/2018] [Indexed: 01/22/2023]
Abstract
OBJECTIVE To identify variations in Chinese obesogenic environmental priorities from 2000 to 2011, predict the spatiotemporal distribution of obesity prevalence among those aged 7-17 years in 31 provinces, and provide a foundation for policy-makers to reduce obesity in children and adolescents. METHODS Based on an examination of provincial obesity prevalence among those aged 7-17 years from three rounds of the China Health and Nutrition Surveys (in 9 [2000], 9 [2006], and 12 [2011] provinces) and the corresponding years' environmental data for 31 provinces from China Statistical Yearbooks and other sources, 12 predictors were selected. We used the 30 province samples surveyed across the three rounds as training samples to fit three analytic models with partial least-squares regression and prioritized predictors by variable importance in projection to find variations. We then fitted a spatiotemporal prediction model with Bayesian analysis to make inferences in space and time. RESULTS Obesogenic environmental priorities varied over time. A Bayesian spatiotemporal prediction model was optimized, with a deviance information criterion of 155.60 and statistically significant (P < 0.05) parameter estimates for the intercept (-717.0400, 95% confidence intervals [CI]: -1186.0300, -248.0480), year (0.3584, CI: 0.1245, 0.5924), square of food industry level (0.0003, CI: 0.0002, 0.0004), and log (healthcare) (5.3742, CI: 2.5138, 8.2347). The inferred average obesity prevalence among children and adolescents in 31 provinces in China was 2.23%, 5.11%, 10.77%, 12.20%, 13.99%, and 17.58% in 2000, 2006, 2011, 2015, 2020, and 2030, respectively. Obesity clusters in the north and east of China on the predicted maps. CONCLUSIONS Obesity prevalence in children and adolescents in China is rapidly increasing, growing at 0.3584% annually from 2000 to 2011. From longitudinal observation, prevalence was significantly influenced by the food industry ("Amplifier") and healthcare services ("Balancer"). Targeted interventions in the north and east of China are pressing. Further research on the mechanisms underlying the influence of the food industry, healthcare services, and other factors in children and adolescents is needed.
Affiliation(s)
- C Guo
- National Institute for Nutrition and Health, Chinese Center for Disease Control and Prevention, No. 29 Nanwei Road, Xicheng District, Beijing, 100050, China
- H Wang
- National Institute for Nutrition and Health, Chinese Center for Disease Control and Prevention, No. 29 Nanwei Road, Xicheng District, Beijing, 100050, China
- G Feng
- Center for Clinical Epidemiology & Evidence-based Medicine of Beijing Children Hospital, Capital Medical University, National Center for Children's Health, No. 56 Nanlishi Road, Xicheng District, Beijing, 100045, China
- J Li
- School of Statistics, Shanxi University of Finance and Economics, No. 696 Wucheng Road, Taiyuan, 030006, Shanxi, China
- C Su
- National Institute for Nutrition and Health, Chinese Center for Disease Control and Prevention, No. 29 Nanwei Road, Xicheng District, Beijing, 100050, China
- J Zhang
- National Institute for Nutrition and Health, Chinese Center for Disease Control and Prevention, No. 29 Nanwei Road, Xicheng District, Beijing, 100050, China
- Z Wang
- National Institute for Nutrition and Health, Chinese Center for Disease Control and Prevention, No. 29 Nanwei Road, Xicheng District, Beijing, 100050, China
- W Du
- National Institute for Nutrition and Health, Chinese Center for Disease Control and Prevention, No. 29 Nanwei Road, Xicheng District, Beijing, 100050, China
- B Zhang
- National Institute for Nutrition and Health, Chinese Center for Disease Control and Prevention, No. 29 Nanwei Road, Xicheng District, Beijing, 100050, China
3
Hanssen EN, Liland KH, Gill P, Snipen L. Optimizing body fluid recognition from microbial taxonomic profiles. Forensic Sci Int Genet 2018; 37:13-20. [PMID: 30071492 DOI: 10.1016/j.fsigen.2018.07.012] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 03/26/2018] [Revised: 07/12/2018] [Accepted: 07/13/2018] [Indexed: 12/17/2022]
Abstract
In forensics, the DNA profile is used to identify the person who left a biological trace, but information on body fluid can also be essential in the evidence evaluation process. Microbial composition data could potentially be used for body fluid recognition as an improved alternative to the currently used presumptive tests. We have developed a customized workflow for interpretation of bacterial 16S sequence data based on a model composed of Partial Least Squares (PLS) in combination with Linear Discriminant Analysis (LDA). Large data sets from the Human Microbiome Project (HMP) and the American Gut Project (AGP) were used to test different settings in order to optimize performance. From the initial cross-validation of body fluid recognition within the HMP data, the optimal overall accuracy was close to 98%. Sensitivity values for the fecal and oral samples were ≥0.99, followed by the vaginal samples with 0.98 and the skin and nasal samples with 0.96 and 0.81, respectively. Specificity values were high for all 5 categories, mostly >0.99. This optimal performance was achieved by using the following settings: taxonomic profiles based on operational taxonomic units (OTUs) with 0.98 identity (OTU98), Aitchison's simplex transform with a pseudo-count of C = 1, and no regularization (r = 1) in the PLS step. Variable selection did not improve the performance further. To test for robustness across sequencing platforms, we also trained the classifier on HMP data and tested it on the AGP data set. In this case, the standard OTU-based approach showed a moderate decline in accuracy. However, by using taxonomic profiles made by direct assignment of reads to a genus, we were able to nearly maintain the high accuracy levels. The optimal combination of settings was still used, except that the taxonomic level was genus instead of OTU98. The performance may be improved even further by using higher-resolution taxonomic bins.
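The preprocessing step named in the abstract can be sketched as follows. This is a minimal illustration, assuming that the simplex transform refers to the centred log-ratio (clr) form commonly associated with Aitchison and that the pseudo-count is added to every taxon count before taking proportions; the function name and the example counts are hypothetical.

```python
import math

def clr_transform(counts, pseudo_count=1.0):
    """Centred log-ratio transform of one sample's taxon read counts."""
    # The pseudo-count keeps zero counts from producing log(0).
    shifted = [c + pseudo_count for c in counts]
    total = sum(shifted)
    log_props = [math.log(s / total) for s in shifted]
    mean_log = sum(log_props) / len(log_props)
    # Centring makes the transformed values of a sample sum to zero.
    return [lp - mean_log for lp in log_props]

# Hypothetical read counts for four taxa in one sample.
profile = clr_transform([0, 3, 12, 85])
```

The transformed profiles, rather than raw proportions, would then be passed to the downstream classifier.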
Affiliation(s)
- Eirik Nataas Hanssen
- Department of Forensic Biology, Oslo University Hospital, P.O. Box 4950 Nydalen, N-0424 Oslo, Norway; Department of Forensic Medicine, University of Oslo, P.O. Box 4950 Nydalen, N-0424 Oslo, Norway
- Kristian Hovde Liland
- Faculty of Science and Technology, Norwegian University of Life Sciences, P.O. Box 5003, N-1432 Ås, Norway; Faculty of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, P.O. Box 5003, N-1432 Ås, Norway
- Peter Gill
- Department of Forensic Biology, Oslo University Hospital, P.O. Box 4950 Nydalen, N-0424 Oslo, Norway; Department of Forensic Medicine, University of Oslo, P.O. Box 4950 Nydalen, N-0424 Oslo, Norway
- Lars Snipen
- Faculty of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, P.O. Box 5003, N-1432 Ås, Norway
4
Li C, Gong W, Zhang L, Yang Z, Nong W, Bian Y, Kwan HS, Cheung MK, Xiao Y. Association Mapping Reveals Genetic Loci Associated with Important Agronomic Traits in Lentinula edodes, Shiitake Mushroom. Front Microbiol 2017; 8:237. [PMID: 28261189 PMCID: PMC5314409 DOI: 10.3389/fmicb.2017.00237] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 08/26/2016] [Accepted: 02/03/2017] [Indexed: 12/28/2022]
Abstract
Association mapping is a robust approach for the detection of quantitative trait loci (QTLs). Here, by genotyping 297 genome-wide molecular markers in 89 Lentinula edodes cultivars in China, the genetic diversity, population structure and genetic loci associated with 11 agronomic traits were examined. A total of 873 alleles were detected in the tested strains with a mean of 2.939 alleles per locus, and the Shannon's information index was 0.734. Population structure analysis revealed two robustly differentiated groups among the Chinese L. edodes cultivars (FST = 0.247). Using the mixed linear model, a total of 43 markers were found to be significantly associated with four traits. The number of markers associated with each trait ranged from 9 to 26, and the phenotypic variation explained by each marker varied from 12.07% to 31.32%. Apart from five previously reported markers, the remaining 38 markers are newly reported here. Twenty-one markers were identified as simultaneously linked to two to four traits, and five markers were associated with the same traits in cultivation tests performed in two consecutive years. The 43 trait-associated markers were related to 97 genes, and 24 of them were related to 10 trait-associated markers detected in both years or identified previously, 13 of which had a >2-fold expression change between the mycelium and primordium stages. Our study has provided candidate markers for marker-assisted selection (MAS) and useful clues for understanding the genetic architecture of agronomic traits in the shiitake mushroom.
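For readers unfamiliar with the diversity statistic quoted above, the following is a minimal sketch of Shannon's information index for one locus and its mean over loci. It assumes the usual definition with natural logarithms, I = -Σ p_i ln p_i over allele frequencies p_i; the abstract does not state the log base, and the allele counts below are invented for illustration.

```python
import math

def shannon_index(allele_counts):
    """Shannon's information index for one locus from its allele counts."""
    total = sum(allele_counts)
    freqs = [c / total for c in allele_counts if c > 0]
    return -sum(f * math.log(f) for f in freqs)

def mean_shannon_index(loci):
    """Average the per-locus index over all genotyped loci."""
    return sum(shannon_index(counts) for counts in loci) / len(loci)

# Hypothetical allele counts at three loci scored in 89 cultivars.
example_loci = [[45, 44], [60, 20, 9], [89]]
mean_index = mean_shannon_index(example_loci)
```

A monomorphic locus contributes zero, and a locus with k equally frequent alleles contributes ln k, so the mean rises with both allele number and evenness.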
Affiliation(s)
- Chuang Li
- Institute of Applied Mycology, Huazhong Agricultural University, Hubei, China
- Wenbing Gong
- Institute of Applied Mycology, Huazhong Agricultural University, Hubei, China; Institute of Bast Fiber Crops, Chinese Academy of Agricultural Sciences, Changsha, China
- Lin Zhang
- Institute of Applied Mycology, Huazhong Agricultural University, Hubei, China
- Zhiquan Yang
- College of Informatics, Huazhong Agricultural University, Hubei, China
- Wenyan Nong
- School of Life Sciences, The Chinese University of Hong Kong, Hong Kong
- Yinbing Bian
- Institute of Applied Mycology, Huazhong Agricultural University, Hubei, China
- Hoi-Shan Kwan
- School of Life Sciences, The Chinese University of Hong Kong, Hong Kong
- Man-Kit Cheung
- School of Life Sciences, The Chinese University of Hong Kong, Hong Kong
- Yang Xiao
- Institute of Applied Mycology, Huazhong Agricultural University, Hubei, China
5
Märtens K, Hallin J, Warringer J, Liti G, Parts L. Predicting quantitative traits from genome and phenome with near perfect accuracy. Nat Commun 2016; 7:11512. [PMID: 27160605 PMCID: PMC4866306 DOI: 10.1038/ncomms11512] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Received: 10/30/2015] [Accepted: 04/01/2016] [Indexed: 12/20/2022]
Abstract
In spite of decades of linkage and association studies and its potential impact on human health, reliable prediction of an individual's risk for heritable disease remains difficult. Large numbers of mapped loci do not explain substantial fractions of heritable variation, leaving an open question of whether accurate complex trait predictions can be achieved in practice. Here, we use a genome-sequenced population of ∼7,000 yeast strains of high but varying relatedness, and predict growth traits from family information, effects of segregating genetic variants and growth in other environments with an average coefficient of determination R² of 0.91. This accuracy exceeds narrow-sense heritability, approaches limits imposed by measurement repeatability and is higher than achieved with a single assay in the laboratory. Our results prove that very accurate prediction of complex traits is possible, and suggest that additional data from families rather than reference cohorts may be more useful for this purpose.
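The accuracy measure quoted above is the coefficient of determination. The sketch below shows how it is typically computed from observed and predicted trait values, assuming the standard definition (one minus the ratio of residual to total sum of squares); the function name and numbers are illustrative only and are not taken from the study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Hypothetical growth measurements and model predictions for four strains.
observed_growth = [1.0, 2.0, 3.0, 4.0]
predicted_growth = [1.1, 1.9, 3.2, 3.8]
accuracy = r_squared(observed_growth, predicted_growth)
```

A value of 1 means the predictions reproduce the observations exactly, while predicting the mean for every strain gives 0.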
Affiliation(s)
- Kaspar Märtens
- Institute of Computer Science, University of Tartu, Tartu 50409, Estonia
- Johan Hallin
- Institute for Research on Cancer and Aging, University of Sophia Antipolis, Nice 02 06107, France
- Jonas Warringer
- Department of Chemistry and Molecular Biology, Gothenburg University, Gothenburg 40530, Sweden
- Centre for Integrative Genetics (CIGENE), Department of Animal and Aquacultural Sciences, Norwegian University of Life Sciences, Ås N-1432, Norway
- Gianni Liti
- Institute for Research on Cancer and Aging, University of Sophia Antipolis, Nice 02 06107, France
- Leopold Parts
- Institute of Computer Science, University of Tartu, Tartu 50409, Estonia
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB101SA, UK
6
Multivariate Analysis of Genotype-Phenotype Association. Genetics 2016; 202:1345-63. [PMID: 26896328 DOI: 10.1534/genetics.115.181339] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 07/29/2015] [Accepted: 02/15/2016] [Indexed: 11/18/2022]
Abstract
With the advent of modern imaging and measurement technology, complex phenotypes are increasingly represented by large numbers of measurements, which may not bear biological meaning one by one. For such multivariate phenotypes, studying the pairwise associations between all measurements and all alleles is highly inefficient and prevents insight into the genetic pattern underlying the observed phenotypes. We present a new method for identifying patterns of allelic variation (genetic latent variables) that are maximally associated (in terms of effect size) with patterns of phenotypic variation (phenotypic latent variables). This multivariate genotype-phenotype mapping (MGP) separates phenotypic features under strong genetic control from less genetically determined features and thus permits an analysis of the multivariate structure of genotype-phenotype association, including its dimensionality and the clustering of genetic and phenotypic variables within this association. Different variants of MGP maximize different measures of genotype-phenotype association: genetic effect, genetic variance, or heritability. In an application to a mouse sample, scored for 353 SNPs and 11 phenotypic traits, the first dimension of genetic and phenotypic latent variables accounted for >70% of genetic variation present in all 11 measurements; 43% of variation in this phenotypic pattern was explained by the corresponding genetic latent variable. The first three dimensions together sufficed to account for almost 90% of genetic variation in the measurements and for all the interpretable genotype-phenotype association. Each dimension can be tested as a whole against the hypothesis of no association, thereby reducing the number of statistical tests from 7766 to 3, the maximal number of meaningful independent tests. Important alleles can be selected based on their effect size (additive or nonadditive effect on the phenotypic latent variable). This low dimensionality of the genotype-phenotype map has important consequences for gene identification and may shed light on the evolvability of organisms.
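One reading that reproduces the pairwise test count quoted in the abstract is shown below. It is an inference from the stated numbers, not something the abstract spells out: it assumes that each SNP-trait pair is tested once for an additive and once for a nonadditive effect.

```latex
\[
  \underbrace{353}_{\text{SNPs}} \times
  \underbrace{11}_{\text{traits}} \times
  \underbrace{2}_{\text{additive, nonadditive}} = 7766
\]
```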
7
Dumancas GG, Ramasahayam S, Bello G, Hughes J, Kramer R. Chemometric regression techniques as emerging, powerful tools in genetic association studies. Trends Analyt Chem 2015. [DOI: 10.1016/j.trac.2015.05.007] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Indexed: 12/23/2022]
8
Mehmood T, Rasheed Z. Multivariate Procedure for Variable Selection and Classification of High Dimensional Heterogeneous Data. Communications for Statistical Applications and Methods 2015. [DOI: 10.5351/csam.2015.22.6.575] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/11/2022]
Affiliation(s)
- Tahir Mehmood
- Statistics, Department of Basic Sciences, Riphah International University, Pakistan
- Biostatistics, Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, Norway
- Zahid Rasheed
- Statistics, Department of Basic Sciences, Riphah International University, Pakistan
9
Comparing K-mer based methods for improved classification of 16S sequences. BMC Bioinformatics 2015; 16:205. [PMID: 26130333 PMCID: PMC4487979 DOI: 10.1186/s12859-015-0647-4] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Received: 02/13/2015] [Accepted: 06/06/2015] [Indexed: 11/10/2022]
Abstract
Background: The need for precise and stable taxonomic classification is highly relevant in modern microbiology. Parallel to the explosion in the amount of sequence data accessible, there has also been a shift in focus for classification methods. Previously, alignment-based methods were the most applicable tools. Now, methods based on counting K-mers by sliding windows are the most interesting classification approach with respect to both speed and accuracy. Here, we present a systematic comparison of five different K-mer based classification methods for the 16S rRNA gene. The methods differ from each other both in data usage and modelling strategies. We have based our study on the commonly known and well-used naïve Bayes classifier from the RDP project, and four other methods were implemented and tested on two different data sets, on full-length sequences as well as fragments of typical read-length. Results: The differences in classification error between the methods seemed to be small, but they were stable for both data sets tested. The Preprocessed nearest-neighbour (PLSNN) method performed best for full-length 16S rRNA sequences, significantly better than the naïve Bayes RDP method. On fragmented sequences the naïve Bayes Multinomial method performed best, significantly better than all other methods. For both data sets explored, and on both full-length and fragmented sequences, all five methods reached an error plateau. Conclusions: We conclude that no K-mer based method is universally best for classifying both full-length sequences and fragments (reads). All methods approach an error plateau, indicating that improved training data is needed to improve classification from here. Classification errors occur most frequently for genera with few sequences present. For improving the taxonomy and testing new classification methods, a better, more universal and more robust training data set is crucial.
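The feature extraction shared by the methods above, counting K-mers with a sliding window, can be sketched in a few lines. This is a generic illustration rather than the code used in the study; the function name, the window length and the example sequence are hypothetical.

```python
from collections import Counter

def count_kmers(sequence, k):
    """Count overlapping K-mers by sliding a window of width k one base at a time."""
    sequence = sequence.upper()
    windows = (sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return Counter(windows)

# Hypothetical short fragment; real 16S reads are far longer.
kmer_counts = count_kmers("ACGTACGT", 4)
```

Each sequence is thereby turned into a vector of K-mer counts, which a classifier such as naïve Bayes can use as input. A sequence shorter than `k` yields an empty counter.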
10
Franco-Duarte R, Mendes I, Umek L, Drumonde-Neves J, Zupan B, Schuller D. Computational models reveal genotype-phenotype associations in Saccharomyces cerevisiae. Yeast 2014; 31:265-77. [PMID: 24752995 DOI: 10.1002/yea.3016] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Received: 11/26/2013] [Revised: 04/09/2014] [Accepted: 04/10/2014] [Indexed: 11/11/2022]
Abstract
Genome sequencing is essential to understand individual variation and to study the mechanisms that explain relations between genotype and phenotype. The accumulated knowledge from large-scale genome sequencing projects of Saccharomyces cerevisiae isolates is being used to study the mechanisms that explain such relations. Our objective was to undertake genetic characterization of 172 S. cerevisiae strains from different geographical origins and technological groups, using 11 polymorphic microsatellites, and computationally relate these data with the results of 30 phenotypic tests. Genetic characterization revealed 280 alleles, with the microsatellite ScAAT1 contributing most to intrastrain variability, together with alleles 20, 9 and 16 from the microsatellites ScAAT4, ScAAT5 and ScAAT6. These microsatellite allelic profiles are characteristic of both the phenotype and origin of yeast strains. We confirm the strength of these associations by construction and cross-validation of computational models that can predict the technological application and origin of a strain from the microsatellite allelic profile. Associations between microsatellites and specific phenotypes were scored using information gain ratios, and significant findings were confirmed by permutation tests and estimation of false discovery rates. The phenotypes associated with a higher number of alleles were the capacity to resist sulphur dioxide (tested by the capacity to grow in the presence of potassium bisulphite) and the presence of galactosidase activity. Our study demonstrates the utility of computational modelling to estimate a strain's technological group and phenotype from microsatellite allelic combinations as tools for preliminary yeast strain selection.
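The association score mentioned above can be illustrated with a small sketch. It assumes the standard definition of the information gain ratio (information gain of the phenotype given the allele, divided by the entropy of the allele itself); the abstract does not give the exact formula, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(allele_present, phenotype):
    """Information gain of phenotype given an allele, normalised by split entropy."""
    n = len(phenotype)
    groups = {}
    for allele, label in zip(allele_present, phenotype):
        groups.setdefault(allele, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(phenotype) - conditional
    split_info = entropy(allele_present)
    # An allele carried by all or no strains carries no information.
    return info_gain / split_info if split_info > 0 else 0.0

# Toy data: allele presence (1/0) and a resistant/sensitive phenotype.
perfect = gain_ratio([1, 1, 0, 0], ["res", "res", "sens", "sens"])
unrelated = gain_ratio([1, 0, 1, 0], ["res", "res", "sens", "sens"])
```

In practice the observed score would be compared against scores from permuted phenotype labels, as the abstract describes, before an association is called significant.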
Affiliation(s)
- Ricardo Franco-Duarte
- Centre of Molecular and Environmental Biology (CBMA), Department of Biology, University of Minho, Braga, Portugal
11
Vinje H, Almøy T, Liland KH, Snipen L. A systematic search for discriminating sites in the 16S ribosomal RNA gene. Microbial Informatics and Experimentation 2014; 4:2. [PMID: 24467869 PMCID: PMC3910680 DOI: 10.1186/2042-5783-4-2] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 05/16/2013] [Accepted: 12/16/2013] [Indexed: 02/01/2023]
Abstract
Background: The 16S rRNA is by far the most common genomic marker used for prokaryotic classification, and has been used extensively in metagenomic studies over recent years. Along the 16S gene there are regions with more or less variation across the kingdom of bacteria. Nine variable regions have been identified, flanked by more conserved parts of the sequence. It has been stated that the discriminatory power of the 16S marker lies in these variable regions. In the present study we wanted to examine this more closely, and used a supervised learning method to search systematically for sites that contribute to correct classification at either the phylum or genus level. Results: When classifying phyla, the site selection algorithm located 50 discriminative sites. These were scattered over most of the alignments and only around half of them were located in the variable regions. The selected sites did, however, have an entropy significantly larger than expected, meaning they are sites of large variation. We found that the discriminative sites typically have a large entropy compared to their closest neighbours along the alignments. When classifying genera, the site selection algorithm needed around 80% of the sites in the 16S gene before the classification error reached a minimum. This means that all variation, in both variable and conserved regions, is needed in order to separate genera. Conclusions: Our findings do not support the statement that the discriminative power of the 16S gene is located only in the variable regions. Variable regions are important, but just as many discriminative sites are found in the more conserved parts. The discriminative power is typically found in sites of large variation located inside shorter regions of higher conservation.
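The per-site entropy referred to above can be computed directly from an alignment. The sketch below assumes Shannon entropy in bits over the symbols observed in each alignment column, with gap characters treated as ordinary symbols; both choices are assumptions, and the function names and the tiny alignment are invented for illustration.

```python
import math
from collections import Counter

def site_entropy(column):
    """Shannon entropy (bits) of the symbols in one alignment column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def alignment_entropies(aligned_sequences):
    """Entropy of every site; all sequences must have equal aligned length."""
    return [site_entropy(column) for column in zip(*aligned_sequences)]

# Hypothetical four-sequence alignment of length four.
alignment = ["ACGT", "ACGA", "ACCT", "ATGC"]
entropies = alignment_entropies(alignment)
```

A fully conserved site has entropy 0, so comparing each selected site with its neighbours shows whether it is a local peak of variation inside a conserved stretch.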
Affiliation(s)
- Hilde Vinje
- Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, Ås N-1432, Norway.
12
Mehmood T, Warringer J, Snipen L, Sæbø S. Improving stability and understandability of genotype-phenotype mapping in Saccharomyces using regularized variable selection in L-PLS regression. BMC Bioinformatics 2012; 13:327. [PMID: 23216988 PMCID: PMC3598729 DOI: 10.1186/1471-2105-13-327] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 04/24/2012] [Accepted: 12/05/2012] [Indexed: 11/26/2022]
Abstract
Background: Multivariate approaches have been successfully applied to genome wide association studies. Recently, a Partial Least Squares (PLS) based approach was introduced for mapping yeast genotype-phenotype relations, where background information such as gene function classification, gene dispensability, recent or ancient gene copy number variations and the presence of premature stop codons or frameshift mutations in reading frames was used post hoc to explain selected genes. One of the latest advancements in PLS, named L-Partial Least Squares (L-PLS), where ‘L’ represents the data structure used, enables the use of background information at the modeling level. Here, a modification of L-PLS with variable importance on projection (VIP) was implemented using a stepwise regularized procedure for gene and background information selection. Results were compared to PLS-based procedures in which no background information was used. Results: Applying the proposed methodology to yeast Saccharomyces cerevisiae data, we found the genotype-phenotype relationship to have improved understandability. Phenotypic variations were explained by the variations of relatively stable genes and stable background variations. The suggested procedure provides an automatic way for genotype-phenotype mapping. The selected phenotype-influencing genes were evolving 29% faster than non-influential genes, and the current results are supported by a recently conducted study. Further power analysis on simulated data verified that the proposed methodology selects relevant variables. Conclusions: A modification of L-PLS with VIP in a stepwise regularized elimination procedure can improve the understandability and stability of selected genes and background information. The approach is recommended for genome wide association studies where background information is available.
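The VIP criterion used for ranking can be sketched as below. This assumes the commonly used definition for ordinary PLS, VIP_j = sqrt(p · Σ_a SS_a (w_aj / ‖w_a‖)² / Σ_a SS_a), where w_a are the loading weights of component a and SS_a is the response variance that component explains. The L-PLS variant in the paper may differ in detail, and the inputs here are hypothetical numbers rather than the output of a fitted model.

```python
import math

def vip_scores(loading_weights, explained_ss):
    """VIP per variable from PLS loading weights and per-component explained SS.

    loading_weights[a][j] is the weight of variable j on component a.
    """
    p = len(loading_weights[0])
    total_ss = sum(explained_ss)
    scores = []
    for j in range(p):
        weighted = 0.0
        for w_a, ss_a in zip(loading_weights, explained_ss):
            norm_sq = sum(w * w for w in w_a)
            weighted += ss_a * (w_a[j] ** 2) / norm_sq
        scores.append(math.sqrt(p * weighted / total_ss))
    return scores

# Hypothetical two-component model with two variables.
vip = vip_scores([[0.6, 0.8], [1.0, 0.0]], [3.0, 1.0])
```

Because the squared VIP values average to 1 across variables, a threshold near 1 is a common rule of thumb for calling a variable influential.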
Affiliation(s)
- Tahir Mehmood
- Biostatistics, Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, Ås, Norway.
13
Abstract
Advances in sequencing technology have enabled whole-genome sequences to be obtained from multiple individuals within species, particularly in model organisms with compact genomes. For example, 36 genome sequences of Saccharomyces cerevisiae are now publicly available, and SNP data are available for even larger collections of strains. One potential use of these resources is mapping the genetic basis of phenotypic variation through genome-wide association (GWA) studies, with the benefit that associated variants can be studied experimentally with greater ease than in outbred populations such as humans. Here, we evaluate the prospects of GWA studies in S. cerevisiae strains through extensive simulations and a GWA study of mitochondrial copy number. We demonstrate that the complex and heterogeneous patterns of population structure present in yeast populations can lead to a high type I error rate in GWA studies of quantitative traits, and that methods typically used to control for population stratification do not provide adequate control of the type I error rate. Moreover, we show that while GWA studies of quantitative traits in S. cerevisiae may be difficult depending on the particular set of strains studied, association studies to map cis-acting quantitative trait loci (QTL) and Mendelian phenotypes are more feasible. We also discuss sampling strategies that could enable GWA studies in yeast and illustrate the utility of this approach in Saccharomyces paradoxus. Thus, our results provide important practical insights into the design and interpretation of GWA studies in yeast, and other model organisms that possess complex patterns of population structure.
14
Mehmood T, Martens H, Sæbø S, Warringer J, Snipen L. A Partial Least Squares based algorithm for parsimonious variable selection. Algorithms Mol Biol 2011; 6:27. [PMID: 22142365 PMCID: PMC3287970 DOI: 10.1186/1748-7188-6-27] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.7] [Received: 09/28/2011] [Accepted: 12/05/2011] [Indexed: 11/15/2022]
Abstract
Background: In genomics, a commonly encountered problem is to extract a subset of variables out of a large set of explanatory variables associated with one or several quantitative or qualitative response variables. An example is to identify associations between codon usage and phylogeny-based definitions of taxonomic groups at different taxonomic levels. Maximum understandability with the smallest number of selected variables, consistency of the selected variables, as well as variation of model performance on test data, are issues to be addressed for such problems. Results: We present an algorithm balancing the parsimony and the predictive performance of a model. The algorithm is based on variable selection using reduced-rank Partial Least Squares with a regularized elimination. Allowing a marginal decrease in model performance results in a substantial decrease in the number of selected variables. This significantly improves the understandability of the model. Within the approach we have tested and compared three different criteria commonly used in the Partial Least Squares modeling paradigm for variable selection: loading weights, regression coefficients and variable importance on projections. The algorithm is applied to a problem of identifying codon variations discriminating different bacterial taxa, which is of particular interest in classifying metagenomics samples. The results are compared with a classical forward selection algorithm, the much-used Lasso algorithm, as well as Soft-threshold Partial Least Squares variable selection. Conclusions: A regularized elimination algorithm based on Partial Least Squares produces results that increase understandability and consistency and reduce the classification error on test data compared to standard approaches.
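The trade-off described above, accepting a marginal loss in performance for a much smaller variable set, can be sketched as a generic backward elimination loop. This is not the authors' algorithm: the PLS fit is replaced by two hypothetical callables (`importance`, standing in for a ranking criterion such as VIP, and `score`, standing in for cross-validated performance), and `drop_fraction` and `tolerance` are illustrative parameters.

```python
def regularized_elimination(variables, importance, score,
                            drop_fraction=0.25, tolerance=0.02):
    """Return the smallest subset whose score is within `tolerance` of the best."""
    current = list(variables)
    history = []
    while current:
        history.append((list(current), score(current)))
        if len(current) == 1:
            break
        ranks = importance(current)  # re-rank on the current subset
        ordered = sorted(current, key=lambda v: ranks[v], reverse=True)
        n_drop = max(1, int(len(ordered) * drop_fraction))
        current = ordered[:-n_drop]  # remove the least important variables
    best = max(s for _, s in history)
    acceptable = [subset for subset, s in history if s >= best - tolerance]
    return min(acceptable, key=len)

# Toy stand-ins for a fitted model (all values hypothetical).
toy_weights = {"a": 0.9, "b": 0.8, "c": 0.05, "d": 0.01, "e": 0.0}
toy_gain = {"a": 0.5, "b": 0.45, "c": 0.01}

def toy_importance(subset):
    return {v: toy_weights[v] for v in subset}

def toy_score(subset):
    return sum(toy_gain.get(v, 0.0) for v in subset)

selected = regularized_elimination(list(toy_weights), toy_importance, toy_score)
```

With the toy inputs, a tolerance of 0.02 lets the procedure give up variable `c`, which adds little to the score, while a tolerance of 0 keeps it.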