1
Zhai S, Guo B, Wu B, Mehrotra DV, Shen J. Integrating multiple traits for improving polygenic risk prediction in disease and pharmacogenomics GWAS. Brief Bioinform 2023:7169140. [PMID: 37200155 DOI: 10.1093/bib/bbad181] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/22/2023] [Revised: 03/30/2023] [Accepted: 04/21/2023] [Indexed: 05/20/2023]
Abstract
Polygenic risk score (PRS) has been recently developed for predicting complex traits and drug responses. It remains unknown whether multi-trait PRS (mtPRS) methods, by integrating information from multiple genetically correlated traits, can improve prediction accuracy and power for PRS analysis compared with single-trait PRS (stPRS) methods. In this paper, we first review commonly used mtPRS methods and find that they do not directly model the underlying genetic correlations among traits, which has been shown to be useful in guiding multi-trait association analysis in the literature. To overcome this limitation, we propose a mtPRS-PCA method to combine PRSs from multiple traits with weights obtained from performing principal component analysis (PCA) on the genetic correlation matrix. To accommodate various genetic architectures covering different effect directions, signal sparseness and across-trait correlation structures, we further propose an omnibus mtPRS method (mtPRS-O) by combining P values from mtPRS-PCA, mtPRS-ML (mtPRS based on machine learning) and stPRSs using Cauchy Combination Test. Our extensive simulation studies show that mtPRS-PCA outperforms other mtPRS methods in both disease and pharmacogenomics (PGx) genome-wide association studies (GWAS) contexts when traits are similarly correlated, with dense signal effects and in similar effect directions, and mtPRS-O is consistently superior to most other methods due to its robustness under various genetic architectures. We further apply mtPRS-PCA, mtPRS-O and other methods to PGx GWAS data from a randomized clinical trial in the cardiovascular domain and demonstrate performance improvement of mtPRS-PCA in both prediction accuracy and patient stratification as well as the robustness of mtPRS-O in PRS association test.
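The Cauchy Combination Test named in this abstract can be illustrated with a minimal sketch. It assumes the standard form of the test: each P value is mapped to a Cauchy variate, the variates are averaged with non-negative weights that sum to one, and the result is mapped back to a P value. The function name `cauchy_combination` and the equal default weights are illustrative assumptions, not the authors' implementation.

```python
import math


def cauchy_combination(p_values, weights=None):
    """Combine P values with the Cauchy combination statistic.

    Hypothetical helper for illustration only. Weights default to
    equal weights and are assumed to be non-negative and sum to one.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("at least one P value is required")
    if any(not (0.0 < p < 1.0) for p in p_values):
        raise ValueError("P values must lie strictly between 0 and 1")
    if weights is None:
        weights = [1.0 / n] * n
    if len(weights) != n:
        raise ValueError("weights and p_values must have the same length")

    # Map each P value to a standard Cauchy variate and take the weighted sum.
    statistic = sum(
        w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values)
    )
    # Map the statistic back to a P value with the Cauchy survival function.
    return 0.5 - math.atan(statistic) / math.pi
```

With equal weights, combining identical P values returns that same value, and one very small P value dominates the combined result, which is one reason an omnibus test of this kind can stay robust when only one of the component tests carries signal.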
Affiliation(s)
- Song Zhai
  - Biostatistics and Research Decision Sciences, Merck & Co., Inc., Rahway, NJ 07065, USA
- Bin Guo
  - Data and Genome Science, Merck & Co., Inc., Cambridge, MA 02141, USA
- Baolin Wu
  - Department of Epidemiology and Biostatistics, University of California Irvine, Irvine, CA 92697, USA
- Devan V Mehrotra
  - Biostatistics and Research Decision Sciences, Merck & Co., Inc., North Wales, PA 19454, USA
- Judong Shen
  - Biostatistics and Research Decision Sciences, Merck & Co., Inc., Rahway, NJ 07065, USA
2
Jalilvand A, Yari K, Heydarpour F. Role of polymorphisms on the Recurrent Pregnancy Loss: A systematic review, Meta-analysis and bioinformatic analysis. Gene 2022; 844:146804. [PMID: 35998845 DOI: 10.1016/j.gene.2022.146804] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/31/2022] [Revised: 07/16/2022] [Accepted: 08/06/2022] [Indexed: 02/08/2023]
Abstract
Recurrent miscarriage (RM), also termed recurrent spontaneous abortion (RSA), is a major reproductive health issue. RM is a multifactorial disease affected by environmental, genetic, and epigenetic factors. Genetic factors commonly contribute to its occurrence, and molecular genetics appears to play a large role in RSA incidence, so RM has become a major subject of genetics research in recent years. Many genes are involved in each phase of successful reproduction. This research aimed to evaluate the effect of all polymorphisms studied in RSA that have not been included in any meta-analysis. The PubMed, Scopus, and Web of Science databases were searched for related articles. The systematic review identified 143 studies worldwide. Thirteen genes were included in assessing the case-control studies. Sixty-four SNPs were used to assess the association between genetic factors and RSA susceptibility. Ninety-two studies containing twenty-two SNPs (from 10 genes) were included in the quantitative analysis. Bioinformatic analysis indicated that rs12722482 showed "Damaging Status" on two servers, and rs315952 and rs854560 had "Possibly damaging" status on the PolyPhen-2 server. The MethPrimer server indicated a "CpG Island" at the rs10895068, rs1130355, and rs41557518 variants, and the rs10895068-G allele creates a CpG dinucleotide that can change gene methylation and thereby alter gene expression. Further studies on rs12722482 and rs10895068 may therefore yield valuable results. To the best of our knowledge, this systematic review is the first to cover all studied polymorphisms of HLA-C, HLA-G, PON1, AGTR1, TAFI, FAS, FAS-L, ESR1, PGR, CTLA-4, MMP-2, MMP-3, MMP-9, and IL1RN. We also performed a novel meta-analysis for AGTR1 rs5186, TAFI rs1926447, rs3742264, HLA-G rs1063320, rs1233334, rs1736936, rs2249863, PON1 rs662, rs854560, FAS rs2234767, rs1800682, FAS-L rs763110, ESR1 rs9340799, rs3798759, PGR rs1042838, CTLA4 rs4553808, rs5742909, rs231775, rs3087243, and MMP-2 rs243865, and updated the statistical findings for rs2234693 and rs371194629. The variants rs2234693, rs9340799, rs231775, and rs371194629 demonstrated a significant association with RSA risk. Some variants showed significant association, although further studies are needed to confirm these results. Finally, rs4553808 and rs5742909 revealed no significant deviation in the results. It is suggested that these SNPs may be excluded from subsequent case-control studies or other analyses.
Affiliation(s)
- Amin Jalilvand
  - Researcher in Molecular Genetics, Kermanshah ACECR Institute of Higher Education, Kermanshah, Iran
- Kheirollah Yari
  - Medical Biology Research Center, Health Technology Institute, Kermanshah University of Medical Sciences, Kermanshah, Iran
- Fatemeh Heydarpour
  - Social Development and Health Promotion Research Center, Health Institute, Kermanshah University of Medical Sciences, Kermanshah, Iran
3
Caudai C, Galizia A, Geraci F, Le Pera L, Morea V, Salerno E, Via A, Colombo T. AI applications in functional genomics. Comput Struct Biotechnol J 2021; 19:5762-5790. [PMID: 34765093 PMCID: PMC8566780 DOI: 10.1016/j.csbj.2021.10.009] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 04/16/2021] [Revised: 10/05/2021] [Accepted: 10/05/2021] [Indexed: 12/13/2022]
Abstract
We review the current applications of artificial intelligence (AI) in functional genomics. The recent explosion of AI follows the remarkable achievements made possible by "deep learning", along with a burst of "big data" that can meet its hunger. Biology is about to overthrow astronomy as the paradigmatic representative of big data producer. This has been made possible by huge advancements in the field of high throughput technologies, applied to determine how the individual components of a biological system work together to accomplish different processes. The disciplines contributing to this bulk of data are collectively known as functional genomics. They consist in studies of: i) the information contained in the DNA (genomics); ii) the modifications that DNA can reversibly undergo (epigenomics); iii) the RNA transcripts originated by a genome (transcriptomics); iv) the ensemble of chemical modifications decorating different types of RNA transcripts (epitranscriptomics); v) the products of protein-coding transcripts (proteomics); and vi) the small molecules produced from cell metabolism (metabolomics) present in an organism or system at a given time, in physiological or pathological conditions. After reviewing main applications of AI in functional genomics, we discuss important accompanying issues, including ethical, legal and economic issues and the importance of explainability.
Affiliation(s)
- Claudia Caudai
  - CNR, Institute of Information Science and Technologies “A. Faedo” (ISTI), Pisa, Italy
- Antonella Galizia
  - CNR, Institute of Applied Mathematics and Information Technologies (IMATI), Genoa, Italy
- Filippo Geraci
  - CNR, Institute for Informatics and Telematics (IIT), Pisa, Italy
- Loredana Le Pera
  - CNR, Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies (IBIOM), Bari, Italy
  - CNR, Institute of Molecular Biology and Pathology (IBPM), Rome, Italy
- Veronica Morea
  - CNR, Institute of Molecular Biology and Pathology (IBPM), Rome, Italy
- Emanuele Salerno
  - CNR, Institute of Information Science and Technologies “A. Faedo” (ISTI), Pisa, Italy
- Allegra Via
  - CNR, Institute of Molecular Biology and Pathology (IBPM), Rome, Italy
- Teresa Colombo
  - CNR, Institute of Molecular Biology and Pathology (IBPM), Rome, Italy
4
Liu L, Zhu A, Shu C, Zeng Y, Ji JS. Gene-Environment Interaction of FOXO and Residential Greenness on Mortality Among Older Adults. Rejuvenation Res 2020; 24:49-61. [PMID: 32364002 DOI: 10.1089/rej.2019.2301] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/12/2022]
Abstract
Residential greenness is an important environmental factor that is strongly associated with mortality. To our knowledge, there was no previous study on the gene-environment interaction analysis between residential greenness and forkhead box O (FOXO) gene, a candidate longevity gene. Our sample consisted of 3179 participants aged 65 and older from the Chinese Longitudinal Healthy Longevity Survey. Residential greenness was measured by satellite-derived normalized difference vegetation index (NDVI) using a 500-m radius around each residential location. Contemporaneous NDVI, cumulative NDVI, and changes in NDVI over time were calculated. We used Cox-proportional hazard regression models to assess the main effect and gene-environment interaction effect of FOXO single nucleotide polymorphism (SNP) and residential greenness on mortality. We found that participants carrying two minor alleles of the three studied FOXO3A SNPs had lower mortality risk than those without minor allele (hazard ratio [HR]: 0.803 95% confidence interval [CI]: 0.654-0.987 for rs4946936, HR: 0.807 95% CI: 0.669-0.974 for rs2802292, HR: 0.803 95% CI: 0.666-0.968 for rs2253310). We found no difference in mortality among the genotypes of the other three FOXO1A SNPs (rs17630266, rs2755213, or rs2755209). Higher contemporaneous NDVI was associated with lower mortality risk (HR: 0.887 95% CI: 0.863-0.911 for 0.1-U of NDVI). The protective effect of both contemporaneous NDVI and cumulative NDVI was stronger for two minor allele carriers compared with zero minor allele carriers of the three FOXO3A SNPs. Compared with the zero minor allele genotype of the three FOXO3A SNPs, the protective effect on the mortality risk of minor allele homozygotes also increased with the increasing NDVI level at percentile 25, 50, and 75 (interaction term coefficient p < 0.05). We found gene-environment interaction between FOXO and residential greenness on mortality in this population study. 
A higher level of greenness may interact with FOXO pathways.
Affiliation(s)
- Linxin Liu
  - Environmental Research Center, Duke Kunshan University, Kunshan, China
- Anna Zhu
  - Environmental Research Center, Duke Kunshan University, Kunshan, China
- Chang Shu
  - School of Medicine, Yale University, New Haven, Connecticut, USA
- Yi Zeng
  - Center for the Study of Aging and Human Development, Duke Medical School, Durham, North Carolina, USA
  - Center for Healthy Aging and Development Studies, National School of Development, Peking University, Beijing, China
- John S Ji
  - Environmental Research Center, Duke Kunshan University, Kunshan, China
  - Nicholas School of the Environment, Duke University, Durham, North Carolina, USA
5
Grinberg NF, Orhobor OI, King RD. An evaluation of machine-learning for predicting phenotype: studies in yeast, rice, and wheat. Mach Learn 2019; 109:251-277. [PMID: 32174648 PMCID: PMC7048706 DOI: 10.1007/s10994-019-05848-5] [Citation(s) in RCA: 47] [Impact Index Per Article: 9.4] [Received: 10/28/2015] [Revised: 09/17/2019] [Accepted: 09/19/2019] [Indexed: 11/01/2022]
Abstract
In phenotype prediction the physical characteristics of an organism are predicted from knowledge of its genotype and environment. Such studies, often called genome-wide association studies, are of the highest societal importance, as they are of central importance to medicine, crop-breeding, etc. We investigated three phenotype prediction problems: one simple and clean (yeast), and the other two complex and real-world (rice and wheat). We compared standard machine learning methods; elastic net, ridge regression, lasso regression, random forest, gradient boosting machines (GBM), and support vector machines (SVM), with two state-of-the-art classical statistical genetics methods; genomic BLUP and a two-step sequential method based on linear regression. Additionally, using the clean yeast data, we investigated how performance varied with the complexity of the biological mechanism, the amount of observational noise, the number of examples, the amount of missing data, and the use of different data representations. We found that for almost all the phenotypes considered, standard machine learning methods outperformed the methods from classical statistical genetics. On the yeast problem, the most successful method was GBM, followed by lasso regression, and the two statistical genetics methods; with greater mechanistic complexity GBM was best, while in simpler cases lasso was superior. In the wheat and rice studies the best two methods were SVM and BLUP. The most robust method in the presence of noise, missing data, etc. was random forests. The classical statistical genetics method of genomic BLUP was found to perform well on problems where there was population structure. This suggests that standard machine learning methods need to be refined to include population structure information when this is present. 
We conclude that the application of machine learning methods to phenotype prediction problems holds great promise, but that determining which methods are likely to perform well on any given problem is elusive and non-trivial.
Affiliation(s)
- Nastasiya F. Grinberg
  - School of Computer Science, University of Manchester, Oxford Road, Manchester, M13 9PL UK
  - Present Address: Department of Medicine, Cambridge Institute of Therapeutic Immunology & Infectious Disease, Jeffrey Cheah Biomedical Centre, Cambridge Biomedical Campus, University of Cambridge, Cambridge, CB2 0AW UK
- Ross D. King
  - Department of Biology and Biological Engineering, Division of Systems and Synthetic Biology, Chalmers University of Technology, Kemivägen 10, SE-412 96 Gothenburg, Sweden
6
A novel linkage-disequilibrium corrected genomic relationship matrix for SNP-heritability estimation and genomic prediction. Heredity (Edinb) 2017; 120:356-368. [PMID: 29238077 PMCID: PMC5842222 DOI: 10.1038/s41437-017-0023-4] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Received: 07/18/2017] [Revised: 10/13/2017] [Accepted: 10/23/2017] [Indexed: 12/15/2022]
Abstract
Single nucleotide polymorphism (SNP)-heritability estimation is an important topic in several research fields, including animal, plant and human genetics, as well as in ecology. Linear mixed model estimation of SNP-heritability uses the structure of genomic relationships between individuals, which is constructed from genome-wide sets of SNP markers that are generally weighted equally in their contributions. Proposed methods to handle dependence between SNPs include “thinning” the marker set by linkage disequilibrium (LD)-pruning, haplotype-tagging of SNPs, and LD-weighting of the SNP contributions. For improved estimation, we propose a new conceptual framework for the genomic relationship matrix, in which Mahalanobis distance-based LD-correction is used in linear mixed model estimation of SNP-heritability. The superiority of the presented method is illustrated by comparison with mixed-model analyses using a VanRaden genomic relationship matrix, a matrix used by GCTA and a matrix employing LD-weighting (as implemented in the LDAK software) in simulated (using real human, rice and cattle genotypes) and real (maize, rice and mice) datasets. Despite the computational difficulties, our results suggest that the proposed method can improve the accuracy of SNP-heritability estimates in datasets with high LD.
7
Hayete B, Wuest D, Laramie J, McDonagh P, Church B, Eberly S, Lang A, Marek K, Runge K, Shoulson I, Singleton A, Tanner C, Khalil I, Verma A, Ravina B. A Bayesian mathematical model of motor and cognitive outcomes in Parkinson's disease. PLoS One 2017; 12:e0178982. [PMID: 28604798 PMCID: PMC5467836 DOI: 10.1371/journal.pone.0178982] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 11/04/2016] [Accepted: 05/22/2017] [Indexed: 12/15/2022]
Abstract
BACKGROUND: There are few established predictors of the clinical course of Parkinson's disease (PD). Prognostic markers would be useful for clinical care and research. OBJECTIVE: To identify predictors of long-term motor and cognitive outcomes and rate of progression in PD. METHODS: Newly diagnosed PD participants were followed for 7 years in a prospective study conducted at 55 centers in the United States and Canada. Analyses were conducted in 244 participants with complete demographic, clinical, genetic, and dopamine transporter imaging data. Machine learning dynamic Bayesian graphical models were used to identify and simulate predictors and outcomes. The outcomes were the rate of cognitive change, assessed by Montreal Cognitive Assessment scores, and the rate of motor change, assessed by Unified Parkinson's Disease Rating Scale (UPDRS) part III. RESULTS: The most robust and consistent longitudinal predictors of cognitive function included older age, baseline UPDRS parts I and II, Schwab and England activities of daily living scale, striatal dopamine transporter binding, and SNP rs11724635 in the gene BST1. The most consistent predictor of UPDRS part III was baseline level of activities of daily living (part II). Key findings were replicated using long-term data from an independent cohort study. CONCLUSIONS: Baseline function near the time of Parkinson's disease diagnosis, as measured by activities of daily living, is a consistent predictor of long-term motor and cognitive outcomes. Additional predictors identified may further characterize the expected course of Parkinson's disease and suggest mechanisms underlying disease progression. The prognostic model developed in this study can be used to simulate the effects of the prognostic variables on motor and cognitive outcomes, and can be replicated and refined with data from independent longitudinal studies.
Affiliation(s)
- Boris Hayete
  - GNS Healthcare, Cambridge, Massachusetts, United States of America
- Diane Wuest
  - GNS Healthcare, Cambridge, Massachusetts, United States of America
- Jason Laramie
  - Novartis, Cambridge, Massachusetts, United States of America
- Paul McDonagh
  - Alexion Pharmaceuticals, Cambridge, Massachusetts, United States of America
- Bruce Church
  - GNS Healthcare, Cambridge, Massachusetts, United States of America
- Shirley Eberly
  - University of Rochester, Rochester, New York, United States of America
- Anthony Lang
  - Morton and Gloria Movement Disorders Clinic and the Edmond J. Safra Program in Parkinson’s Disease, Toronto Western Hospital and the University of Toronto, Toronto, Ontario, Canada
- Kenneth Marek
  - Institute for Neurodegenerative Disorders, New Haven, Connecticut, United States of America
- Karl Runge
  - GNS Healthcare, Cambridge, Massachusetts, United States of America
- Ira Shoulson
  - Georgetown University, Washington, DC, United States of America
- Andrew Singleton
  - National Institute on Aging, NIH, Bethesda, Maryland, United States of America
- Caroline Tanner
  - University of San Francisco & San Francisco Veterans Affairs Medical Center, San Francisco, California, United States of America
- Iya Khalil
  - GNS Healthcare, Cambridge, Massachusetts, United States of America
- Ajay Verma
  - Biogen Idec, Cambridge, Massachusetts, United States of America
- Bernard Ravina
  - Voyager Therapeutics, Cambridge, Massachusetts, United States of America
8
Karvanen J, Sillanpää MJ. Prioritizing covariates in the planning of future studies in the meta-analytic framework. Biom J 2016; 59:110-125. [PMID: 27740692 DOI: 10.1002/bimj.201600067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2016] [Revised: 05/30/2016] [Accepted: 07/13/2016] [Indexed: 11/08/2022]
Abstract
Science can be seen as a sequential process where each new study augments evidence to the existing knowledge. To have the best prospects to make an impact in this process, a new study should be designed optimally taking into account the previous studies and other prior information. We propose a formal approach for the covariate prioritization, that is the decision about the covariates to be measured in a new study. The decision criteria can be based on conditional power, change of the p-value, change in lower confidence limit, Kullback-Leibler divergence, Bayes factors, Bayesian false discovery rate or difference between prior and posterior expectation. The criteria can be also used for decisions on the sample size. As an illustration, we consider covariate prioritization based on genome-wide association studies for C-reactive protein levels and make suggestions on the genes to be studied further.
Affiliation(s)
- Juha Karvanen
  - Department of Mathematics and Statistics, University of Jyvaskyla, Jyväskylä, Finland
- Mikko J Sillanpää
  - Department of Mathematical Sciences and Biocenter Oulu, University of Oulu, Oulu, Finland
9
Chaves JA, Cooper EA, Hendry AP, Podos J, De León LF, Raeymaekers JAM, MacMillan W, Uy JAC. Genomic variation at the tips of the adaptive radiation of Darwin's finches. Mol Ecol 2016; 25:5282-5295. [DOI: 10.1111/mec.13743] [Citation(s) in RCA: 77] [Impact Index Per Article: 9.6] [Received: 11/23/2015] [Revised: 06/27/2016] [Accepted: 06/28/2016] [Indexed: 01/22/2023]
Affiliation(s)
- Jaime A. Chaves
  - Department of Biology; University of Miami; Coral Gables FL 33146 USA
  - Universidad San Francisco de Quito, USFQ; Colegio de Ciencias Biológicas y Ambientales; y Extensión Galápagos Campus Cumbayá Quito Ecuador
- Elizabeth A. Cooper
  - Department of Biology; University of Miami; Coral Gables FL 33146 USA
  - Department of Genetics and Biochemistry; Clemson University; Clemson SC 29634 USA
- Andrew P. Hendry
  - Redpath Museum; Department of Biology; McGill University; Montréal QC Canada
- Jeffrey Podos
  - Department of Biology; University of Massachusetts Amherst; Amherst MA 01003 USA
- Luis F. De León
  - Centro de Biodiversidad y Descubrimiento de Drogas; Instituto de Investigaciones Científicas y Servicios de Alta Tecnología (INDICASAT-AIP); Ciudad del Saber Panama Panama
  - Department of Biology; University of Massachusetts Boston; 100 Morrissey Blvd Boston MA 02125 USA
- Joost A. M. Raeymaekers
  - Laboratory of Biodiversity and Evolutionary Genomics; University of Leuven; B-3000 Leuven Belgium
  - Center for Biodiversity Dynamics; Department of Biology; Norwegian University of Science and Technology; N-7491 Trondheim Norway
- J. Albert C. Uy
  - Department of Biology; University of Miami; Coral Gables FL 33146 USA
10
Wei B, Zhao J. Haplotype inference using a novel binary particle swarm optimization algorithm. Appl Soft Comput 2014. [DOI: 10.1016/j.asoc.2014.03.034] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 10/25/2022]
11
Kostem E, Eskin E. Efficiently identifying significant associations in genome-wide association studies. J Comput Biol 2013; 20:817-30. [PMID: 24033261 DOI: 10.1089/cmb.2013.0087] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022]
Abstract
Over the past several years, genome-wide association studies (GWAS) have implicated hundreds of genes in common disease. More recently, the GWAS approach has been utilized to identify regions of the genome that harbor variation affecting gene expression or expression quantitative trait loci (eQTLs). Unlike GWAS applied to clinical traits, where only a handful of phenotypes are analyzed per study, in eQTL studies, tens of thousands of gene expression levels are measured, and the GWAS approach is applied to each gene expression level. This leads to computing billions of statistical tests and requires substantial computational resources, particularly when applying novel statistical methods such as mixed models. We introduce a novel two-stage testing procedure that identifies all of the significant associations more efficiently than testing all the single nucleotide polymorphisms (SNPs). In the first stage, a small number of informative SNPs, or proxies, across the genome are tested. Based on their observed associations, our approach locates the regions that may contain significant SNPs and only tests additional SNPs from those regions. We show through simulations and analysis of real GWAS datasets that the proposed two-stage procedure increases the computational speed by a factor of 10. Additionally, efficient implementation of our software increases the computational speed relative to the state-of-the-art testing approaches by a factor of 75.
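The two-stage idea in this abstract (test a few proxy SNPs first, then follow up only the regions that look promising) can be sketched in simplified form. Everything here is an assumption made for illustration: the function name `two_stage_scan`, the fixed-size regions, the choice of the middle SNP as proxy, and the two thresholds. Unlike the published procedure, which is described as identifying all significant associations, this naive version can miss a significant SNP whose proxy shows only a weak signal.

```python
def two_stage_scan(n_snps, test_snp, region_size, proxy_threshold, final_threshold):
    """Naive two-stage association scan (illustrative sketch only).

    n_snps          -- number of SNPs, indexed 0..n_snps-1 along the genome
    test_snp        -- callable returning an association statistic for a SNP index
    region_size     -- number of consecutive SNPs represented by one proxy
    proxy_threshold -- proxy statistic at or above which a region is followed up
    final_threshold -- statistic at or above which a SNP is reported as a hit

    Returns (sorted list of hit indices, number of tests performed).
    """
    hits = []
    n_tests = 0
    for start in range(0, n_snps, region_size):
        end = min(start + region_size, n_snps)

        # Stage 1: test a single proxy SNP for the region.
        proxy = start + (end - start) // 2
        proxy_stat = test_snp(proxy)
        n_tests += 1
        if proxy_stat >= final_threshold:
            hits.append(proxy)
        if proxy_stat < proxy_threshold:
            continue  # region looks uninteresting, skip its remaining SNPs

        # Stage 2: test every other SNP in the flagged region.
        for i in range(start, end):
            if i == proxy:
                continue
            n_tests += 1
            if test_snp(i) >= final_threshold:
                hits.append(i)
    return sorted(hits), n_tests
```

The returned test count makes the saving visible: when few regions are flagged, the number of tests stays close to the number of proxies rather than the number of SNPs.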
Affiliation(s)
- Emrah Kostem
  - Computer Science Department, University of California, Los Angeles, California
12
Jindrová A, Poláčková J. Dimensionality reduction of quality of life indicators. Acta Universitatis Agriculturae et Silviculturae Mendelianae Brunensis 2013. [DOI: 10.11118/actaun201260070147] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/24/2022]
13
İlhan İ, Tezel G. How to Select Tag SNPs in Genetic Association Studies? The CLONTagger Method with Parameter Optimization. OMICS: A Journal of Integrative Biology 2013; 17:368-83. [DOI: 10.1089/omi.2012.0100] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 12/31/2022]
Affiliation(s)
- İlhan İlhan
  - Akören Vocational School, Selçuk University, Konya, Turkey
- Gülay Tezel
  - Department of Computer Engineering, Faculty of Engineering and Architecture, Selçuk University, Konya, Turkey
14
Guttery DS, Blighe K, Page K, Marchese SD, Hills A, Coombes RC, Stebbing J, Shaw JA. Hide and seek: tell-tale signs of breast cancer lurking in the blood. Cancer Metastasis Rev 2013; 32:289-302. [PMID: 23108389 DOI: 10.1007/s10555-012-9414-4] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Indexed: 02/06/2023]
Abstract
Breast cancer treatment is improving due to the introduction of new drugs, guided by molecular testing of the primary tumour for mutations/oncogenic drivers (e.g. HER2 gene amplification). However, tumour tissue is not always available for molecular analysis, intra-tumoural heterogeneity is common and the "cancer genome" is known to evolve with time, particularly following treatment as resistance develops. After resection, those patients with only residual micrometastases are likely to be cured but those with radiologically detectable overt disease are not. Thus, the discovery of blood test(s) that could (1) alert clinicians to early primary or recurrent disease and (2) monitor response to treatment could impact significantly on mortality. Towards this, we and others have focused on molecular profiling of circulating nucleic acids isolated from plasma, both cell-free DNA (cfDNA) and microRNAs, and the relationship of these to circulating tumour cells (CTCs). This review considers the utility of each as circulating biomarkers in breast cancer with particular emphasis on the bioinformatic tools available to support molecular profiling.
Affiliation(s)
- David S Guttery
  - Department of Cancer Studies and Molecular Medicine, Leicester Royal Infirmary, Leicester, UK
15
İlhan İ, Tezel G. A genetic algorithm–support vector machine method with parameter optimization for selecting the tag SNPs. J Biomed Inform 2013; 46:328-40. [DOI: 10.1016/j.jbi.2012.12.002] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Received: 01/10/2012] [Revised: 10/13/2012] [Accepted: 12/11/2012] [Indexed: 01/06/2023]
16
Mahoney M. Randomized Algorithms for Matrices and Data. Advances in Machine Learning and Data Mining for Astronomy 2012. [DOI: 10.1201/b11822-37] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 12/11/2022]
17
Javed A, Drineas P, Mahoney MW, Paschou P. Efficient genomewide selection of PCA-correlated tSNPs for genotype imputation. Ann Hum Genet 2011; 75:707-22. [PMID: 21902678 DOI: 10.1111/j.1469-1809.2011.00673.x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/29/2022]
Abstract
The linkage disequilibrium structure of the human genome allows identification of small sets of single nucleotide polymorphisms (SNPs) (tSNPs) that efficiently represent dense sets of markers. This structure can be translated into linear algebraic terms as evidenced by the well documented principal components analysis (PCA)-based methods. Here we apply, for the first time, PCA-based methodology for efficient genomewide tSNP selection; and explore the linear algebraic structure of the human genome. Our algorithm divides the genome into contiguous nonoverlapping windows of high linear structure. Coupling this novel window definition with a PCA-based tSNP selection method, we analyze 2.5 million SNPs from the HapMap phase 2 dataset. We show that 10-25% of these SNPs suffice to predict the remaining genotypes with over 95% accuracy. A comparison with other popular methods in the ENCODE regions indicates significant genotyping savings. We evaluate the portability of genome-wide tSNPs across a diverse set of populations (HapMap phase 3 dataset). Interestingly, African populations are good reference populations for the rest of the world. Finally, we demonstrate the applicability of our approach in a real genome-wide disease association study. The chosen tSNP panels can be used toward genotype imputation using either a simple regression-based algorithm or more sophisticated genotype imputation methods.
Affiliation(s)
- Asif Javed
- Computational Biology Center, IBM TJ Watson Research, Yorktown Heights, NY 10598, USA
|
18
|
Sillanpää MJ, Pikkuhookana P, Abrahamsson S, Knürr T, Fries A, Lerceteau E, Waldmann P, García-Gil MR. Simultaneous estimation of multiple quantitative trait loci and growth curve parameters through hierarchical Bayesian modeling. Heredity (Edinb) 2011; 108:134-46. [PMID: 21792229 DOI: 10.1038/hdy.2011.56] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Indexed: 02/01/2023] Open
Abstract
A novel hierarchical quantitative trait locus (QTL) mapping method using a polynomial growth function and a multiple-QTL model (with no dependence in time) in a multitrait framework is presented. The method considers a population-based sample where individuals have been phenotyped (over time) with respect to some dynamic trait and genotyped at a given set of loci. A specific feature of the proposed approach is that, instead of an average functional curve, each individual has its own functional curve. Moreover, each QTL can modify the dynamic characteristics of the trait value of an individual through its influence on one or more growth curve parameters. Apparent advantages of the approach include: (1) assumption of time-independent QTL and environmental effects, (2) alleviating the necessity for an autoregressive covariance structure for residuals and (3) the flexibility to use variable selection methods. As a by-product of the method, heritabilities and genetic correlations can also be estimated for individual growth curve parameters, which are considered as latent traits. For selecting trait-associated loci in the model, we use a modified version of the well-known Bayesian adaptive shrinkage technique. We illustrate our approach by analysing a sub sample of 500 individuals from the simulated QTLMAS 2009 data set, as well as simulation replicates and a real Scots pine (Pinus sylvestris) data set, using temporal measurements of height as dynamic trait of interest.
Affiliation(s)
- M J Sillanpää
- Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland.
|
19
|
Liang Y, Kelemen A. Sequential Support Vector Regression with Embedded Entropy for SNP Selection and Disease Classification. Stat Anal Data Min 2011; 4:301-312. [PMID: 21666834 DOI: 10.1002/sam.10110] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 01/18/2023]
Abstract
Comprehensive evaluation of common genetic variations through association of SNP structure with common diseases on the genome-wide scale is currently a hot area in human genome research. For less costly and faster diagnostics, advanced computational approaches are needed to select the minimum SNPs with the highest prediction accuracy for common complex diseases. In this paper, we present a sequential support vector regression model with embedded entropy algorithm to deal with the redundancy for the selection of the SNPs that have best prediction performance of diseases. We implemented our proposed method for both SNP selection and disease classification, and applied it to simulation data sets and two real disease data sets. Results show that on the average, our proposed method outperforms the well known methods of Support Vector Machine Recursive Feature Elimination, logistic regression, CART, and logic regression based SNP selections for disease classification.
Affiliation(s)
- Yulan Liang
- Department of Family and Community Health, University of Maryland, Baltimore 655 W. Lombard Street, Baltimore, MD 21201-1579
|
20
|
Increasing power of genome-wide association studies by collecting additional single-nucleotide polymorphisms. Genetics 2011; 188:449-60. [PMID: 21467568 DOI: 10.1534/genetics.111.128595] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Indexed: 01/24/2023] Open
Abstract
Genome-wide association studies (GWASs) have been effectively identifying the genomic regions associated with a disease trait. In a typical GWAS, an informative subset of the single-nucleotide polymorphisms (SNPs), called tag SNPs, is genotyped in case/control individuals. Once the tag SNP statistics are computed, the genomic regions that are in linkage disequilibrium (LD) with the most significantly associated tag SNPs are believed to contain the causal polymorphisms. However, such LD regions are often large and contain many additional polymorphisms. Following up all the SNPs included in these regions is costly and infeasible for biological validation. In this article we address how to characterize these regions cost effectively with the goal of providing investigators a clear direction for biological validation. We introduce a follow-up study approach for identifying all untyped associated SNPs by selecting additional SNPs, called follow-up SNPs, from the associated regions and genotyping them in the original case/control individuals. We introduce a novel SNP selection method with the goal of maximizing the number of associated SNPs among the chosen follow-up SNPs. We show how the observed statistics of the original tag SNPs and human genetic variation reference data such as the HapMap Project can be utilized to identify the follow-up SNPs. We use simulated and real association studies based on the HapMap data and the Wellcome Trust Case Control Consortium to demonstrate that our method shows superior performance to the correlation- and distance-based traditional follow-up SNP selection approaches. Our method is publicly available at http://genetics.cs.ucla.edu/followupSNPs.
|
21
|
Mourad R, Sinoquet C, Leray P. Probabilistic graphical models for genetic association studies. Brief Bioinform 2011; 13:20-33. [PMID: 21450805 DOI: 10.1093/bib/bbr015] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Indexed: 11/14/2022] Open
Abstract
Probabilistic graphical models have been widely recognized as a powerful formalism in the bioinformatics field, especially in gene expression studies and linkage analysis. Although less well known in association genetics, many successful methods have recently emerged to dissect the genetic architecture of complex diseases. In this review article, we cover the applications of these models to the population association studies' context, such as linkage disequilibrium modeling, fine mapping and candidate gene studies, and genome-scale association studies. Significant breakthroughs of the corresponding methods are highlighted, but emphasis is also given to their current limitations, in particular, to the issue of scalability. Finally, we give promising directions for future research in this field.
Affiliation(s)
- Raphaël Mourad
- Ecole Polytechnique de l'Université de Nantes, rue Christian Pauc, BP 50609, 44306 Nantes Cedex 3, France.
|
22
|
Nahlawi LI, Mousavi P. Fast orthogonal search for genetic feature selection. Annu Int Conf IEEE Eng Med Biol Soc 2011; 2010:1077-80. [PMID: 21096555 DOI: 10.1109/iembs.2010.5627300] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/10/2022]
Abstract
In this paper, we present the application of a multivariate regression approach, fast orthogonal search, to select the most informative features in Single Nucleotide Polymorphism data, and to use these features to accurately model the entire data. Our results on two published datasets show very high accuracies in capturing the hidden information in the sequence of studied SNPs. The execution time for our developed methodology is very short and paves the way for its application to large-scale genome wide datasets.
|
23
|
Abstract
Complex diseases such as hypertension are inherently multifactorial and involve many factors of mild-to-minute effect sizes. A genome-wide association study (GWAS) typically tests hundreds of thousands of single-nucleotide polymorphisms (SNPs), and offers an opportunity to evaluate aggregated effects of many genetic variants with effects that are too small to detect individually. The gene-set-enrichment analysis (GSEA) is a pathway-based approach that tests for such aggregated effects of genes that are linked by biological functions. A key step in GSEA is the summary statistic (gene score) used to measure the overall relevance of a gene based on all SNPs tested in the gene. Existing GSEA methods use maximum statistics sensitive to gene size and linkage equilibrium. We propose the approach of variable set enrichment analysis (VSEA) and study new gene score methods that are less dependent on gene size. The new method treats groups of variables (SNPs or other variants) as base units for summarizing gene scores and relies less on gene definition itself. The power of VSEA is analyzed by simulation studies modeling various scenarios of complex multiloci interactions. Results show that the new gene scores generally performed better, some substantially so, than the existing GSEA extension to GWAS. The new methods are implemented in an R package and, when applied to a real GWAS data set, demonstrated their practical utility in a GWAS setting.
|
24
|
Nahlawi LI, Mousavi P. Single nucleotide polymorphism selection using independent component analysis. Annu Int Conf IEEE Eng Med Biol Soc 2011; 2010:6186-9. [PMID: 21097155 DOI: 10.1109/iembs.2010.5627753] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/07/2022]
Abstract
Bioinformatics research in genome wide association studies necessitates the development of algorithms capable of manipulating very-large datasets of Single Nucleotide Polymorphisms (SNP). To facilitate such association studies, we propose a novel framework for SNP selection using Independent Component Analysis (ICA). Compared to previous ICA-based methods, our framework works as a filtering technique to reduce the number of SNPs in a dataset, without the need for any class labels. We evaluate the proposed method by applying it on three published SNP datasets, and comparing the results to SNP selection methods based on Principal Component Analysis (PCA). Our results show the capability of ICA to capture an increased or matching amount of information from the datasets.
|
25
|
Jiang J, Tang NLS, Ohlsson C, Eriksson AL, Vandenput L, Liao C, Wang X, Chan FWK, Kwok A, Orwoll E, Kwok TCY, Woo J, Leung PC. Association of SRD5A2 Variants and Serum Androstane-3α,17β-Diol Glucuronide Concentration in Chinese Elderly Men. Clin Chem 2010; 56:1742-9. [DOI: 10.1373/clinchem.2010.150607] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 01/11/2023]
Abstract
BACKGROUND
Results of recent studies have demonstrated that genetic variants of the enzyme steroid 5α reductase type II (SRD5A2) are associated with serum concentrations of major androgen metabolites such as conjugates of androstane-3α,17β-diol-glucuronide (3α-diol-G). However, this association was not consistently found among different ethnic groups. Thus, we aimed to determine whether the association with SRD5A2 genetic variations exists in a cohort of healthy Chinese elderly men, by examining 2 metabolite conjugates: androstane-3α,17β-diol-3-glucuronide (3α-diol-3G) and androstane-3α,17β-diol-17-glucuronide (3α-diol-17G).
METHODS
We used GC-MS and LC-MS to measure serum sex steroid concentrations, including testosterone and dihydrotestosterone, and 3α-diol-3G and 3α-diol-17G in 1182 Chinese elderly men age 65 and older. Genotyping of the 3 SRD5A2 tagSNPs [rs3731586, rs12470143, and rs523349 (V89L)] was performed by using melting-temperature–shift allele-specific PCR.
RESULTS
The well-described SRD5A2 missense variant rs523349 (V89L) was modestly associated with the 3α-diol-17G concentration (P = 0.040). On the other hand, SNP rs12470143 was found to be significantly correlated with 3α-diol-3G concentration (P = 0.021). Results of haplotype analysis suggested that the presence of an A-C-G haplotype leads to an increased 3α-diol-3G concentration, a finding consistent with results of single SNP analysis.
CONCLUSIONS
The genetic variation of SRD5A2 is associated with circulating 3α-diol-3G and 3α-diol-17G concentrations in Chinese elderly men. In addition, we showed that SRD5A2 haplotypic association, rather than a single SNP alone, might be a better predictor of the 3α-diol-G concentration. Thus, the effect of either the haplotype itself or of other ungenotyped SNPs in linkage disequilibrium with the haplotype is responsible for the interindividual variation of 3α-diol-G.
Affiliation(s)
- Nelson LS Tang
- Departments of Chemical Pathology
- Laboratory of Genetics of Disease Susceptibility, Li Ka Shing Institute of Health Sciences, Hong Kong SAR, China
- Claes Ohlsson
- Center for Bone Research at the Sahlgrenska Academy, Institute of Medicine and
- Anna L Eriksson
- Center for Bone Research at the Sahlgrenska Academy, Departments of Internal Medicine and Geriatrics, Gothenburg University, Gothenburg, Sweden
- Liesbeth Vandenput
- Center for Bone Research at the Sahlgrenska Academy, Departments of Internal Medicine and Geriatrics, Gothenburg University, Gothenburg, Sweden
- Anthony Kwok
- Jockey Club Centre for Osteoporosis Care and Control, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong SAR, China
- Eric Orwoll
- Oregon Health and Science University, Portland, OR
- Jean Woo
- Medicine and Therapeutics, Faculty of Medicine, and
- Community and Family Medicine, Faculty of Medicine, and
- Ping Chung Leung
- Jockey Club Centre for Osteoporosis Care and Control, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong SAR, China
|
26
|
Vounou M, Nichols TE, Montana G. Discovering genetic associations with high-dimensional neuroimaging phenotypes: A sparse reduced-rank regression approach. Neuroimage 2010; 53:1147-59. [PMID: 20624472 DOI: 10.1016/j.neuroimage.2010.07.002] [Citation(s) in RCA: 142] [Impact Index Per Article: 10.1] [Received: 03/31/2010] [Revised: 06/24/2010] [Accepted: 07/01/2010] [Indexed: 10/19/2022] Open
Abstract
There is growing interest in performing genome-wide searches for associations between genetic variants and brain imaging phenotypes. While much work has focused on single scalar valued summaries of brain phenotype, accounting for the richness of imaging data requires a brain-wide, genome-wide search. In particular, the standard approach based on mass-univariate linear modelling (MULM) does not account for the structured patterns of correlations present in each domain. In this work, we propose sparse reduced rank regression (sRRR), a strategy for multivariate modelling of high-dimensional imaging responses (measurements taken over regions of interest or individual voxels) and genetic covariates (single nucleotide polymorphisms or copy number variations), which enforces sparsity in the regression coefficients. Such sparsity constraints ensure that the model performs simultaneous genotype and phenotype selection. Using simulation procedures that accurately reflect realistic human genetic variation and imaging correlations, we present detailed evaluations of the sRRR method in comparison with the more traditional MULM approach. In all settings considered, sRRR has better power to detect deleterious genetic variants compared to MULM. Important issues concerning model selection and connections to existing latent variable models are also discussed. This work shows that sRRR offers a promising alternative for detecting brain-wide, genome-wide associations.
Affiliation(s)
- Maria Vounou
- Statistics Section, Department of Mathematics, Imperial College London, UK
|
27
|
Liu L, Wu Y, Lonardi S, Jiang T. Efficient genome-wide TagSNP selection across populations via the linkage disequilibrium criterion. J Comput Biol 2010; 17:21-37. [PMID: 20078395 DOI: 10.1089/cmb.2007.0228] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/13/2022] Open
Abstract
In this article, we studied the tag single-nucleotide polymorphism (tagSNP) selection problem on multiple populations using the pairwise r² linkage disequilibrium criterion. We proposed a novel combinatorial optimization model for the tagSNP selection problem, called the minimum common tagSNP selection (MCTS) problem, and presented efficient solutions for MCTS. Our approach consists of the following three main steps: (i) partitioning the SNP markers into small disjoint components, (ii) applying some data reduction rules to simplify the problem, and (iii) applying either a fast greedy algorithm or a Lagrangian relaxation algorithm to solve the remaining (general) MCTS. These algorithms also provide lower bounds on tagging (i.e., the minimum number of tagSNPs needed). The lower bounds allow us to evaluate how far our solution is from the optimum. To the best of our knowledge, it is the first time the tagging lower bounds are discussed in the literature. We assessed the performance of our algorithms on real HapMap data for genome-wide tagging. The experiments demonstrated that our algorithms run 3-4 orders of magnitude faster than the existing single-population tagging programs such as FESTA, LD-Select, and the multiple-population tagging method MultiPop-TagSelect. Our method also greatly reduced the required tagSNPs compared with LD-Select on a single population and MultiPop-TagSelect on multiple populations. Moreover, the numbers of tagSNPs selected by our algorithms are almost optimal because they are very close to the corresponding lower bounds obtained by our method.
Affiliation(s)
- Lan Liu
- Department of Computer Science and Engineering, University of California, Riverside, California, USA.
|
28
|
Tag SNP selection based on clustering according to dominant sets found using replicator dynamics. Adv Data Anal Classif 2010. [DOI: 10.1007/s11634-010-0059-2] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Indexed: 12/23/2022]
|
29
|
Kim ED, Buckley R, Learman S, Richard J, Parke C, Worthylake DK, Wojcik EJ, Walker RA, Kim S. Allosteric drug discrimination is coupled to mechanochemical changes in the kinesin-5 motor core. J Biol Chem 2010; 285:18650-61. [PMID: 20299460 DOI: 10.1074/jbc.m109.092072] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.0] [Indexed: 11/06/2022] Open
Abstract
Essential in mitosis, the human Kinesin-5 protein is a target for >80 classes of allosteric compounds that bind to a surface-exposed site formed by the L5 loop. Not established is why there are differing efficacies in drug inhibition. Here we compare the ligand-bound states of two L5-directed inhibitors against 15 Kinesin-5 mutants by ATPase assays and IR spectroscopy. Biochemical kinetics uncovers functional differences between individual residues at the N or C termini of the L5 loop. Infrared evaluation of solution structures and multivariate analysis of the vibrational spectra reveal that mutation and/or ligand binding not only can remodel the allosteric binding surface but also can transmit long range effects. Changes in L5-localized 3₁₀ helix and disordered content, regardless of substitution or drug potency, are experimentally detected. Principal component analysis couples these local structural events to two types of rearrangements in beta-sheet hydrogen bonding. These transformations in beta-sheet contacts are correlated with inhibitory drug response and are corroborated by wild type Kinesin-5 crystal structures. Despite considerable evolutionary divergence, our data directly support a theorized conserved element for long distance mechanochemical coupling in kinesin, myosin, and F₁-ATPase. These findings also suggest that these relatively rapid IR approaches can provide structural biomarkers for clinical determination of drug sensitivity and drug efficacy in nucleotide triphosphatases.
Affiliation(s)
- Elizabeth D Kim
- Department of Biochemistry and Molecular Biology, Louisiana State University Health Sciences Center, New Orleans, Louisiana 70112, USA
|
30
|
van Heerwaarden Joost, Ross-Ibarra Jeffrey, Doebley John, Glaubitz Jeffrey C, de Jesús Sánchez González Jose, Gaut Brandon S, Eguiarte Luis E. Fine scale genetic structure in the wild ancestor of maize (Zea mays ssp. parviglumis). Mol Ecol 2010; 19:1162-73. [DOI: 10.1111/j.1365-294x.2010.04559.x] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.2] [Indexed: 12/13/2022]
|
31
|
Tang NLS, Liao CD, Ching JKL, Suen EWC, Chan IHS, Orwoll E, Ho SC, Chan FWK, Kwok AWL, Kwok T, Woo J, Leung PC. Sex-specific effect of Pirin gene on bone mineral density in a cohort of 4000 Chinese. Bone 2010; 46:543-50. [PMID: 19766747 DOI: 10.1016/j.bone.2009.09.012] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Received: 06/24/2009] [Revised: 08/22/2009] [Accepted: 09/12/2009] [Indexed: 11/30/2022]
Abstract
BACKGROUND Osteoporosis is a common condition among the elderly. Genetic mapping studies repeatedly located the distal short arm of the X chromosome as a quantitative trait locus (QTL) for BMD in mice. Fine mapping of a syntenic segment on Xp22 in a Caucasian female population suggested a moderate association between lumbar spine (LS) BMD and 2 intronic SNPs in the Pirin (PIR) gene, which encodes an iron-binding nuclear protein. This study aimed to examine genetic variations in the PIR gene by a comprehensive tagging method and its sex-specific effects on BMD and osteoporotic risk. METHODS Two thousand men and 2000 women aged 65 or above were recruited from the community. BMDs at the LS, femoral neck, total hip and whole body were measured and followed up at 4 years. Genotyping was performed for tagSNPs of the PIR gene including adjacent regions, and the PIR haplotypes were inferred using the PHASE program. RESULTS Analysis by linear regression showed a significant association between SNP rs5935970 and LS-BMD, while haplotype T-T-A was significantly associated with BMD of all measured sites. However, no such associations were found in men. Linear Mixed Model also confirmed the same sex-specific and site-specific effect for longitudinal BMD changes. CONCLUSION In addition to confirming the association between BMDs and the PIR gene, we also revealed that this finding is sex-specific, possibly due to an X-linked effect. This study demonstrated the importance of considering sex and genetic interactions in studies of disease predisposition and complex traits.
Affiliation(s)
- Nelson L S Tang
- Department of Chemical Pathology, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong SAR, China.
|
32
|
Rebaï M, Kharrat N, Ayadi I, Rebaï A. Haplotype structure of five SNPs within the ACE gene in the Tunisian population. Ann Hum Biol 2009; 33:319-29. [PMID: 17092869 DOI: 10.1080/03014460600621977] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 10/24/2022]
Abstract
BACKGROUND The Angiotensin-Converting Enzyme (ACE) is a candidate gene in the aetiology of several common diseases. The study of the haplotype structure of this gene is of interest in diagnosis and in pharmacogenomics. AIM The study investigated the haplotype profile of single nucleotide polymorphisms (SNPs) within the ACE gene in the Tunisian population and compared it with other populations. SUBJECTS AND METHODS Five SNPs (rs1800764, rs4291, rs4309, rs4331, rs4340) covering a region of 15.6 kb of the ACE gene were typed by PCR-digestion in a sample of 100 healthy subjects. RESULTS All SNPs were polymorphic and in Hardy-Weinberg equilibrium. A total of 21 haplotypes were identified but only eight had a frequency of more than 1%. The four most common haplotypes had a cumulative frequency of 87.4%. The 'Yin-Yang' phenomenon (the two major haplotypes are complementary at all sites) was found. Linkage disequilibrium between all pairs of loci was highly significant (p < 10⁻⁵). A simple and efficient statistical procedure was used to identify three important SNPs. CONCLUSION The Tunisian population showed a different haplotype structure from the European one for the ACE gene and three important SNPs were identified. These will be very helpful in future association studies in the Tunisian and North African populations.
Affiliation(s)
- Maha Rebaï
- Bioinformatics Unit, Centre of Biotechnology of Sfax, Sfax, Tunisia
|
33
|
Kelemen A, Vasilakos AV, Liang Y. Computational intelligence in bioinformatics: SNP/haplotype data in genetic association study for common diseases. IEEE Trans Inf Technol Biomed 2009; 13:841-7. [PMID: 19556205 DOI: 10.1109/titb.2009.2024144] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 11/08/2022]
Abstract
Comprehensive evaluation of common genetic variations through association of single-nucleotide polymorphism (SNP) structure with common complex disease in the genome-wide scale is currently a hot area in human genome research due to the recent development of the Human Genome Project and HapMap Project. Computational science, which includes computational intelligence (CI), has recently become the third method of scientific enquiry besides theory and experimentation. There have been fast growing interests in developing and applying CI in disease mapping using SNP and haplotype data. Some of the recent studies have demonstrated the promise and importance of CI for common complex diseases in genomic association study using SNP/haplotype data, especially for tackling challenges, such as gene-gene and gene-environment interactions, and the notorious "curse of dimensionality" problem. This review provides coverage of recent developments of CI approaches for complex diseases in genetic association study with SNP/haplotype data.
Affiliation(s)
- Arpad Kelemen
- Department of Organizational Systems and Adult Health, University of Maryland, Baltimore, MD 21201, USA.
|
34
|
|
35
|
Liu J, Pearlson G, Windemuth A, Ruano G, Perrone-Bizzozero NI, Calhoun V. Combining fMRI and SNP data to investigate connections between brain function and genetics using parallel ICA. Hum Brain Mapp 2009; 30:241-55. [PMID: 18072279 DOI: 10.1002/hbm.20508] [Citation(s) in RCA: 169] [Impact Index Per Article: 11.3] [Indexed: 01/14/2023] Open
Abstract
There is current interest in understanding genetic influences on both healthy and disordered brain function. We assessed brain function with functional magnetic resonance imaging (fMRI) data collected during an auditory oddball task--detecting an infrequent sound within a series of frequent sounds. Then, task-related imaging findings were utilized as potential intermediate phenotypes (endophenotypes) to investigate genomic factors derived from a single nucleotide polymorphism (SNP) array. Our target is the linkage of these genomic factors to normal/abnormal brain functionality. We explored parallel independent component analysis (paraICA) as a new method for analyzing multimodal data. The method was aimed to identify simultaneously independent components of each modality and the relationships between them. When 43 healthy controls and 20 schizophrenia patients, all Caucasian, were studied, we found a correlation of 0.38 between one fMRI component and one SNP component. This fMRI component consisted mainly of parietal lobe activations. The relevant SNP component was contributed to significantly by 10 SNPs located in genes, including those coding for the nicotinic alpha-7 cholinergic receptor, aromatic amino acid decarboxylase, disrupted in schizophrenia 1, among others. Both fMRI and SNP components showed significant differences in loading parameters between the schizophrenia and control groups (P = 0.0006 for the fMRI component; P = 0.001 for the SNP component). In summary, we constructed a framework to identify interactions between brain functional and genetic information; our findings provide a proof-of-concept that genomic SNP factors can be investigated by using endophenotypic imaging findings in a multivariate format.
Affiliation(s)
- Jingyu Liu
- The Mind Research Network, Albuquerque, New Mexico, USA.
|
36
|
Abstract
Background Recent studies have shown that genetic variation is the basis of genome-wide disease association research. However, due to the high cost of genotyping large numbers of single nucleotide polymorphisms (SNPs), it is essential to choose a small subset of informative SNPs (tagSNPs), which are able to capture most variation in a population, to represent the remaining SNPs. Several methods have been proposed to find the minimum set of tagSNPs, but most of them still have some disadvantages such as information loss and block-partition limit. Results This paper proposes a new hybrid method named CGTS which combines the ideas of the clustering and the graph algorithms to select tagSNPs on genotype data. This method aims to maximize the number of discarded non-tagSNPs in the given set. CGTS integrates the information of the LD association and the genotype diversity using the site graphs, and discards redundant SNPs using an algorithm based on these graph structures. The clustering algorithm is used to reduce the running time of CGTS. The efficiency of the algorithm and quality of solutions are evaluated on biological data and the comparisons with three popular selection methods are shown in the paper. Conclusion Our theoretical analysis and experimental results show that our algorithm CGTS is not only more efficient than other methods but also achieves higher accuracy in tagSNP selection.
|
37
|
Deng L, Zhang Y, Kang J, Liu T, Zhao H, Gao Y, Li C, Pan H, Tang X, Wang D, Niu T, Yang H, Zeng C. An unusual haplotype structure on human chromosome 8p23 derived from the inversion polymorphism. Hum Mutat 2008; 29:1209-16. [PMID: 18473345 DOI: 10.1002/humu.20775] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.7] [Indexed: 11/10/2022]
Abstract
Chromosomal inversion is an important type of genomic variation involved in both evolution and disease pathogenesis. Here, we describe the refined genetic structure of a 3.8-Mb inversion polymorphism at chromosome 8p23. Using HapMap data of 1,073 SNPs generated from 209 unrelated samples (CEPH-Utah residents with ancestry from northern and western Europe (CEU); Yoruba in Ibadan, Nigeria (YRI); and Asian (ASN) samples, comprising Han Chinese from Beijing, China (CHB) and Japanese from Tokyo, Japan (JPT)), we successfully deduced the inversion orientations of all 418 haplotypes. In particular, distinct haplotype subgroups were identified based on principal component analysis (PCA). Such genetic substructures were consistent with clustering patterns based on neighbor-joining tree reconstruction, which revealed a total of four haplotype clades across all samples. Metaphase fluorescence in situ hybridization (FISH) in a subset of 10 HapMap samples verified the inversion orientations predicted by PCA or phylogenetic tree reconstruction. Positioning of the outgroup haplotype within one of the YRI clades suggested that the Human NCBI Build 36-inverted order is most likely the ancestral orientation. Furthermore, the population differentiation test and the relative extended haplotype homozygosity (REHH) analysis in this region discovered multiple selection signals, also in a population-specific manner. A positive selection signal was detected at XKR6 in the ASN population. These results revealed the correlation of inversion polymorphisms with population-specific genetic structures, and various selection patterns as possible mechanisms for the maintenance of a large chromosomal rearrangement at the 8p23 region during evolution. In addition, our study showed that haplotype-based clustering methods, such as PCA, can be applied in scanning for cryptic inversion polymorphisms at a genome-wide scale.
Affiliation(s)
- Libin Deng
- Beijing Institute of Genomics, Chinese Academy of Sciences, P.R. China
|
38
|
Nott SL, Huang Y, Fluharty BR, Sokolov AM, Huang M, Cox C, Muyan M. Do Estrogen Receptor beta Polymorphisms Play A Role in the Pharmacogenetics of Estrogen Signaling? ACTA ACUST UNITED AC 2008; 6:239-259. [PMID: 19337586 DOI: 10.2174/187569208786733820] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Indexed: 12/31/2022]
Abstract
Estrogen hormones play critical roles in the regulation of many tissue functions. The effects of estrogens are primarily mediated by the estrogen receptors (ER) alpha and beta. ERs are ligand-activated transcription factors that regulate a complex array of genomic events that orchestrate cellular growth, differentiation and death. Although many factors contribute to their etiology, estrogens are thought to be the primary agents for the development and/or progression of target tissue malignancies. Many of the current modalities for the treatment of estrogen target tissue malignancies are based on agents with diverse pharmacology that alter or prevent ER functions by acting as estrogen competitors. Although these compounds have been successfully used in clinical settings, the efficacy of treatment shows variability. An increasing body of evidence implicates ERalpha polymorphisms as one of the contributory factors for differential responses to estrogen competitors. This review aims to highlight the recent findings on polymorphisms of the lately identified ERbeta in order to provide a functional perspective with potential pharmacogenomic implications.
Affiliation(s)
- Stephanie L Nott
- Department of Biochemistry & Biophysics, University of Rochester Medical School, Rochester, NY, 14642, USA
|
39
|
Han B, Kang HM, Seo MS, Zaitlen N, Eskin E. Efficient association study design via power-optimized tag SNP selection. Ann Hum Genet 2008; 72:834-47. [PMID: 18702637 DOI: 10.1111/j.1469-1809.2008.00469.x] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Indexed: 11/27/2022]
Abstract
Discovering statistical correlation between causal genetic variation and clinical traits through association studies is an important method for identifying the genetic basis of human diseases. Since fully resequencing a cohort is prohibitively costly, genetic association studies take advantage of local correlation structure (or linkage disequilibrium) between single nucleotide polymorphisms (SNPs) by selecting a subset of SNPs to be genotyped (tag SNPs). While many current association studies are performed using commercially available high-throughput genotyping products that define a set of tag SNPs, choosing tag SNPs remains an important problem for both custom follow-up studies as well as designing the high-throughput genotyping products themselves. The most widely used tag SNP selection method optimizes the correlation between SNPs (r²). However, tag SNPs chosen based on an r² criterion do not necessarily maximize the statistical power of an association study. We propose a study design framework that chooses SNPs to maximize power and efficiently measures the power through empirical simulation. Empirical results based on the HapMap data show that our method gains considerable power over a widely used r²-based method, or equivalently reduces the number of tag SNPs required to attain the desired power of a study. Our power-optimized 100k whole genome tag set provides equivalent power to the Affymetrix 500k chip for the CEU population. For the design of custom follow-up studies, our method provides up to twice the power increase using the same number of tag SNPs as r²-based methods. Our method is publicly available via web server at http://design.cs.ucla.edu.
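For orientation, the r²-based baseline that this abstract contrasts with can be sketched as a greedy covering procedure. This is a minimal illustration and not the authors' power-optimized method. The function names, the 0.8 threshold, and the 0/1/2 genotype coding are assumptions made for the example.

```python
import numpy as np

def pairwise_r2(genotypes):
    """Squared Pearson correlation between SNP columns (samples x SNPs, coded 0/1/2)."""
    corr = np.corrcoef(genotypes, rowvar=False)
    return corr ** 2

def greedy_tag_snps(genotypes, r2_threshold=0.8):
    """Greedily pick tags until every SNP has r^2 >= threshold with some tag.

    Hypothetical helper for illustration only; the threshold is an assumed default.
    """
    r2 = pairwise_r2(genotypes)
    covers = r2 >= r2_threshold            # covers[i, j]: SNP i would tag SNP j
    np.fill_diagonal(covers, True)         # every SNP always tags itself
    untagged = np.ones(r2.shape[0], dtype=bool)
    tags = []
    while untagged.any():
        # Count how many still-untagged SNPs each candidate would capture.
        gain = (covers & untagged).sum(axis=1)
        best = int(np.argmax(gain))
        tags.append(best)
        untagged &= ~covers[best]
    return tags
```

A power-based design would replace the coverage count with an estimate of association power for each candidate tag set, which is the change the abstract argues for.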
Affiliation(s)
- B Han
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093, USA
|
40
|
Zhou N, Wang L. A modified T-test feature selection method and its application on the HapMap genotype data. GENOMICS PROTEOMICS & BIOINFORMATICS 2008; 5:242-9. [PMID: 18267305 PMCID: PMC5054219 DOI: 10.1016/s1672-0229(08)60011-x] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.4] [Indexed: 11/22/2022]
Abstract
Single nucleotide polymorphisms (SNPs) are genetic variations that determine the differences between any two unrelated individuals. Various population groups can be distinguished from each other using SNPs. For instance, the HapMap dataset has four population groups with about ten million SNPs. For more insights on human evolution, ethnic variation, and population assignment, we propose to find out which SNPs are significant in determining the population groups and then to classify different populations using these relevant SNPs as input features. In this study, we developed a modified t-test ranking measure and applied it to the HapMap genotype data. Firstly, we rank all SNPs in comparison with other feature importance measures including F-statistics and the informativeness for assignment. Secondly, we select different numbers of the most highly ranked SNPs as the input to a classifier, such as the support vector machine, so as to find the best feature subset corresponding to the best classification accuracy. Experimental results showed that the proposed method is very effective in finding SNPs that are significant in determining the population groups, with reduced computational burden and better classification accuracy.
Affiliation(s)
- Nina Zhou
- School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
|
41
|
Tracing sub-structure in the European American population with PCA-informative markers. PLoS Genet 2008; 4:e1000114. [PMID: 18797516 PMCID: PMC2537989 DOI: 10.1371/journal.pgen.1000114] [Citation(s) in RCA: 56] [Impact Index Per Article: 3.5] [Received: 02/06/2008] [Accepted: 06/04/2008] [Indexed: 11/20/2022] Open
Abstract
Genetic structure in the European American population reflects waves of migration and recent gene flow among different populations. This complex structure can introduce bias in genetic association studies. Using Principal Components Analysis (PCA), we analyze the structure of two independent European American datasets (1,521 individuals–307,315 autosomal SNPs). Individual variation lies across a continuum with some individuals showing high degrees of admixture with non-European populations, as demonstrated through joint analysis with HapMap data. The CEPH Europeans only represent a small fraction of the variation encountered in the larger European American datasets we studied. We interpret the first eigenvector of this data as correlated with ancestry, and we apply an algorithm that we have previously described to select PCA-informative markers (PCAIMs) that can reproduce this structure. Importantly, we develop a novel method that can remove redundancy from the selected SNP panels and show that we can effectively remove correlated markers, thus increasing genotyping savings. Only 150–200 PCAIMs suffice to accurately predict fine structure in European American datasets, as identified by PCA. Simulating association studies, we couple our method with a PCA-based stratification correction tool and demonstrate that a small number of PCAIMs can efficiently remove false correlations with almost no loss in power. The structure informative SNPs that we propose are an important resource for genetic association studies of European Americans. Furthermore, our redundancy removal algorithm can be applied to sets of ancestry informative markers selected with any method in order to select the most uncorrelated SNPs, and significantly decreases genotyping costs.
Genetic association studies seek to identify disease susceptibility genes through the analysis of genetic markers such as single nucleotide polymorphisms (SNPs) in large numbers of cases and controls. In such settings, the existence of sub-structure in the population under study (i.e. differences in ancestry among cases and controls) may lead to spurious results. It is therefore imperative to control for this possible bias. Such biases may arise for example when studying the European American population, which consists of individuals of diverse ancestry proportions from different European countries and to some degree also from African and Native American populations. Here, we study the genetic sub-structure of the European American population, analyzing 1,521 individuals for over 300,000 SNPs across the entire genome. Applying a powerful method that is based on dimensionality reduction (Principal Components Analysis), we are able to identify 200 SNPs that successfully represent the complete dataset. Importantly, we introduce a novel method that effectively removes redundancy from any set of genetic markers, and may prove extremely useful in a variety of different research scenarios, in order to significantly reduce the cost of a study.
|
42
|
Liu Q, Yang J, Chen Z, Yang MQ, Sung AH, Huang X. Supervised learning-based tagSNP selection for genome-wide disease classifications. BMC Genomics 2008; 9 Suppl 1:S6. [PMID: 18366619 PMCID: PMC2386071 DOI: 10.1186/1471-2164-9-s1-s6] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 12/16/2022] Open
Abstract
Background Comprehensive evaluation of common genetic variations through association of single nucleotide polymorphisms (SNPs) with complex human diseases on the genome-wide scale is an active area in human genome research. One of the fundamental questions in a SNP-disease association study is to find an optimal subset of SNPs with predicting power for disease status. To find that subset while reducing study burden in terms of time and costs, one can potentially reconcile information redundancy from associations between SNP markers. Results We have developed a feature selection method named Supervised Recursive Feature Addition (SRFA). This method combines supervised learning and statistical measures for the chosen candidate features/SNPs to reconcile the redundancy information and, in doing so, improve the classification performance in association studies. Additionally, we have proposed a Support Vector based Recursive Feature Addition (SVRFA) scheme in SNP-disease association analysis. Conclusions We have proposed using SRFA with different statistical learning classifiers and SVRFA for both SNP selection and disease classification and then applying them to two complex disease data sets. In general, our approaches outperform the well-known feature selection method of Support Vector Machine Recursive Feature Elimination and logic regression-based SNP selection for disease classification in genetic association studies. Our study further indicates that both genetic and environmental variables should be taken into account when doing disease predictions and classifications for the most complex human diseases that have gene-environment interactions.
Affiliation(s)
- Qingzhong Liu
- Department of Computer Science, New Mexico Institute of Mining and Technology, Socorro, NM 87801, USA.
|
43
|
Abstract
The question of tagging single nucleotide polymorphism (tagSNP) transferability is an important one because many ongoing and upcoming Genome-Wide Association studies rely critically upon the validity, and practical feasibility of using a universal core set of tagSNPs. A series of recent studies analyzed performance of tagSNPs selected based on the HapMap. While these studies showed largely satisfactory transferability of the tagSNPs, they also reported that the level of transferability varies, substantively sometimes, especially when tagSNPs selected in one population were used in another distant population. We present a review of the literature about where and why tagSNP transferability may become a problem and suggest research directions that may help the resolution.
Affiliation(s)
- C Charles Gu
- Division of Biostatistics, Washington University School of Medicine, St Louis, MO 63110, USA.
|
44
|
Gao X, Starmer J, Martin ER. A multiple testing correction method for genetic association studies using correlated single nucleotide polymorphisms. Genet Epidemiol 2008; 32:361-9. [DOI: 10.1002/gepi.20310] [Citation(s) in RCA: 508] [Impact Index Per Article: 31.8] [Indexed: 11/06/2022]
|
45
|
Liang Y, Kelemen A. Statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic study for complex diseases. STATISTICS SURVEYS 2008. [DOI: 10.1214/07-ss026] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.8] [Indexed: 11/19/2022]
|
46
|
Gu CC, Yu K, Rao DC. Characterization of LD structures and the utility of HapMap in genetic association studies. ADVANCES IN GENETICS 2008; 60:407-35. [PMID: 18358328 DOI: 10.1016/s0065-2660(07)00415-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 12/13/2022]
Abstract
The observed distribution of, and variation in, linkage disequilibrium (LD) with respect to the evolutionary history and disease transmission in a population is the driving force behind the current wave of genome-wide association (GWA) studies of complex human diseases. An extensive literature covers topics from haplotype analysis that utilizes local LD structures in candidate genes and regions to the genome-wide organization of LD blocks (neighborhoods) that led to the development of the International HapMap Project and the panels of "tagSNPs" used by current GWA studies. In this chapter, we examine the scenarios where each of the major types of analysis methods may be applicable and where the currently popular genotyping platforms for GWA might fall short. We discuss current association analysis methods by emphasizing their reliance on the local LD structures or the global organization of the LD structures, and highlight the need to consider individual marker information content in large-scale association mapping.
Affiliation(s)
- C Charles Gu
- Division of Biostatistics and Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, USA
|
47
|
Abstract
Haplotypes are composed of specific combinations of alleles at several loci on the same chromosome. Because haplotypes incorporate linkage disequilibrium (LD) information from multiple loci, haplotype-based association analyses can provide greater power than single-marker analysis in association studies. However, when we construct haplotypes using many markers simultaneously, we may be confronted with a sparseness problem due to the large number of haplotypes. In this paper, we propose the principal-component (PC) association test as an alternative to the haplotype-based association test. We define the PC scores from the LD blocks and perform the association test using logistic regression. The proposed PC test was applied to the analysis of the Genetic Analysis Workshop 15 simulated data set. Knowing the answers to Problem 3, we evaluated the performance of the PC test and the haplotype-based association test using the Akaike Information Criterion (AIC), power, and type I error. The PC test performed better than the haplotype-based association test in the sense that the former tends to have smaller AIC values and slightly greater power than the latter.
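The two steps named in this abstract, PC scores computed per LD block followed by logistic regression, can be sketched as below. This is a rough illustration under stated assumptions, not the authors' implementation. The function names, the 90% variance cutoff, and the plain Newton-Raphson fit are all choices made for the example, and the block is assumed to contain at least one polymorphic SNP.

```python
import numpy as np

def block_pc_scores(genotypes, var_explained=0.9):
    """Project one LD block (samples x SNPs) onto its leading principal components.

    Keeps the smallest number of components whose cumulative variance
    reaches var_explained (an assumed cutoff, not taken from the paper).
    """
    centered = genotypes - genotypes.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(ratio, var_explained) + 1)
    return centered @ vt[:k].T

def fit_logistic(x, y, n_iter=25):
    """Newton-Raphson logistic regression; returns coefficients, intercept first."""
    design = np.column_stack([np.ones(len(y)), x])
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        w = p * (1.0 - p)
        grad = design.T @ (y - p)                  # score vector
        hess = design.T @ (design * w[:, None])    # observed information
        beta += np.linalg.solve(hess, grad)
    return beta
```

Because the PC scores are few and uncorrelated, the regression avoids the sparse haplotype categories that the abstract identifies as the main difficulty. Note that the sign of each PC is arbitrary, so only the magnitude of a coefficient is comparable across runs.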
Affiliation(s)
- Sohee Oh
- Department of Statistics, Seoul National University, 56-1 Shillim-Dong, Kownak-Gu, Seoul 151-747, South Korea.
|
48
|
|
49
|
Paschou P, Ziv E, Burchard EG, Choudhry S, Rodriguez-Cintron W, Mahoney MW, Drineas P. PCA-correlated SNPs for structure identification in worldwide human populations. PLoS Genet 2007; 3:1672-86. [PMID: 17892327 PMCID: PMC1988848 DOI: 10.1371/journal.pgen.0030160] [Citation(s) in RCA: 163] [Impact Index Per Article: 9.6] [Received: 04/04/2007] [Accepted: 08/01/2007] [Indexed: 12/12/2022] Open
Abstract
Existing methods to ascertain small sets of markers for the identification of human population structure require prior knowledge of individual ancestry. Based on Principal Components Analysis (PCA), and recent results in theoretical computer science, we present a novel algorithm that, applied to genome-wide data, selects small subsets of SNPs (PCA-correlated SNPs) to reproduce the structure found by PCA on the complete dataset, without use of ancestry information. Evaluating our method on a previously described dataset (10,805 SNPs, 11 populations), we demonstrate that a very small set of PCA-correlated SNPs can be effectively employed to assign individuals to particular continents or populations, using a simple clustering algorithm. We validate our methods on the HapMap populations and achieve perfect intercontinental differentiation with 14 PCA-correlated SNPs. The Chinese and Japanese populations can be easily differentiated using fewer than 100 PCA-correlated SNPs ascertained after evaluating 1.7 million SNPs from HapMap. We show that, in general, structure informative SNPs are not portable across geographic regions. However, we manage to identify a general set of 50 PCA-correlated SNPs that effectively assigns individuals to one of nine different populations. Compared to analysis with the measure of informativeness, our methods, although unsupervised, achieved similar results. We proceed to demonstrate that our algorithm can be effectively used for the analysis of admixed populations without having to trace the origin of individuals. Analyzing a Puerto Rican dataset (192 individuals, 7,257 SNPs), we show that PCA-correlated SNPs can be used to successfully predict structure and ancestry proportions. We subsequently validate these SNPs for structure identification in an independent Puerto Rican dataset. The algorithm that we introduce runs in seconds and can be easily applied to large genome-wide datasets, facilitating the identification of population substructure, stratification assessment in multi-stage whole-genome association studies, and the study of demographic history in human populations.
Genetic markers can be used to infer population structure, a task that remains a central challenge in many areas of genetics such as population genetics, and the search for susceptibility genes for common disorders. In such settings, it is often desirable to reduce the number of markers needed for structure identification. Existing methods to identify structure informative markers demand prior knowledge of the membership of the studied individuals to predefined populations. In this paper, based on the properties of a powerful dimensionality reduction technique (Principal Components Analysis), we develop a novel algorithm that does not depend on any prior assumptions and can be used to identify a small set of structure informative markers. Our method is very fast even when applied to datasets of hundreds of individuals and millions of markers. We evaluate this method on a large dataset of 11 populations from around the world, as well as data from the HapMap project. We show that, in most cases, we can achieve 99% genotyping savings while at the same time recovering the structure of the studied populations. Finally, we show that our algorithm can also be successfully applied for the identification of structure informative markers when studying populations of complex ancestry.
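One plausible reading of "PCA-correlated SNPs" is to score each SNP by how strongly it loads on the leading principal components and keep the highest-scoring ones. The sketch below follows that reading and should not be taken as the published algorithm. The function name, the scoring rule (squared loadings summed over the top components), and the parameters are assumptions for illustration.

```python
import numpy as np

def pca_informative_snps(genotypes, n_components, n_select):
    """Rank SNPs by their squared loadings on the top principal components.

    genotypes: samples x SNPs array. Hypothetical helper; no ancestry labels
    are used, matching the unsupervised setting described in the abstract.
    """
    centered = genotypes - genotypes.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # One score per SNP: how much of the top-k PC directions it accounts for.
    scores = np.sum(vt[:n_components] ** 2, axis=0)
    order = np.argsort(scores)[::-1]
    return order[:n_select], scores
```

Since the rows of `vt` are orthonormal, the scores sum to `n_components`, so each score can be read as a SNP's share of the retained structure.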
Affiliation(s)
- Peristera Paschou
- Department of Molecular Biology and Genetics, Democritus University of Thrace, Alexandroupoli, Greece.
|
50
|
Descatha A, Roquelaure Y, Evanoff B, Niedhammer I, Chastang JF, Mariot C, Ha C, Imbernon E, Goldberg M, Leclerc A. Selected questions on biomechanical exposures for surveillance of upper-limb work-related musculoskeletal disorders. Int Arch Occup Environ Health 2007; 81:1-8. [PMID: 17476519 PMCID: PMC2080671 DOI: 10.1007/s00420-007-0180-5] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Received: 09/05/2006] [Accepted: 03/01/2007] [Indexed: 11/25/2022]
Abstract
OBJECTIVE Questionnaires for assessment of biomechanical exposure are frequently used in surveillance programs, though few studies have evaluated which key questions are needed. We sought to reduce the number of variables on a surveillance questionnaire by identifying which variables best summarized biomechanical exposure in a survey of the French working population. METHODS We used data from the 2002 to 2003 French experimental network of Upper-limb work-related musculoskeletal disorders (UWMSD), collected from 2,685 subjects, in which 37 variables assessing biomechanical exposures were available (divided into four ordinal categories, according to the task frequency or duration). Principal Component Analysis (PCA) with orthogonal rotation was performed on these variables. Variables closely associated with factors derived from PCA were retained, except those highly correlated with another variable (rho > 0.70). In order to study the relevance of the final list of variables, correlations between a score based on retained variables (PCA score) and the exposure score suggested by the SALTSA group were calculated. The associations between the PCA score and the prevalence of UWMSD were also studied. In a final step, we added back to the list a few variables not retained by PCA, because of their established recognition as risk factors. RESULTS According to the results of the PCA, seven interpretable factors were identified: posture exposures, repetitiveness, handling of heavy loads, distal biomechanical exposures, computer use, forklift operator specific task, and recovery time. About 20 variables strongly correlated with the factors obtained from PCA were retained. The PCA score was strongly correlated both with the SALTSA score and with UWMSD prevalence (P < 0.0001). In the final step, six variables were reintegrated. 
CONCLUSION Twenty-six variables of 37 were efficiently selected according to their ability to summarize major biomechanical constraints in a working population, with an approach combining statistical analyses and existing knowledge.
Affiliation(s)
- Alexis Descatha
- INSERM U687-IFR69, HNSM, 14 rue du Val d'Osne, 94415 St-Maurice Cedex, France.
|