1
Wang X, Jiang X, Vaidya J. Efficient verification for outsourced genome-wide association studies. J Biomed Inform 2021; 117:103714. [PMID: 33711538 DOI: 10.1016/j.jbi.2021.103714] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/18/2019] [Revised: 02/09/2021] [Accepted: 02/10/2021] [Indexed: 11/17/2022]
Abstract
With cloud computing being widely adopted for conducting genome-wide association studies (GWAS), how to verify the integrity of outsourced GWAS computation remains to be addressed. Here, we propose two novel algorithms to generate synthetic SNPs that are indistinguishable from real SNPs. The first method creates synthetic SNPs based on the phenotype vector, while the second approach creates synthetic SNPs based on real SNPs that are most similar to the phenotype vector. The time complexity of the first approach and the second approach is O(m) and O(m log n^2), respectively, where m is the number of subjects and n is the number of SNPs. Furthermore, through a game theoretic analysis, we demonstrate that it is possible to incentivize honest behavior by the server by coupling appropriate payoffs with randomized verification. We conduct extensive experiments on our proposed methods, and the results show that beyond a formal adversarial model, when only a few synthetic SNPs are generated and mixed into the real data, they cannot be distinguished from the real SNPs even by a variety of predictive machine learning models. We demonstrate that the proposed approach can ensure that logistic regression for GWAS can be outsourced in an efficient and trustworthy way.
Affiliation(s)
- Xiaoqian Jiang
- University of Texas Health Science Center at Houston, TX, USA
2
Lau A, So HC. Turning genome-wide association study findings into opportunities for drug repositioning. Comput Struct Biotechnol J 2020; 18:1639-1650. [PMID: 32670504 PMCID: PMC7334463 DOI: 10.1016/j.csbj.2020.06.015] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 11/10/2019] [Revised: 06/05/2020] [Accepted: 06/05/2020] [Indexed: 02/02/2023] Open
Abstract
Drug development is a very costly and lengthy process, while repositioned or repurposed drugs could be brought into clinical practice within a shorter time frame and at a much reduced cost. Numerous computational approaches to drug repositioning have been developed, but methods utilizing genome-wide association study (GWAS) data are less explored. The past decade has seen massive growth in the amount of data from GWAS; the rich information contained in GWAS has great potential to guide drug repositioning or discovery. While multiple tools are available for finding the most relevant genes from GWAS hits, searching for top susceptibility genes is only one way to guide repositioning, and it has its own limitations. Here we provide a comprehensive review of different computational approaches that employ GWAS data to guide drug repositioning. These methods include selecting top candidate genes from GWAS as drug targets, deducing drug candidates based on drug-drug and disease-disease similarities, searching for reversed expression profiles between drugs and diseases, pathway-based methods, and approaches based on analysis of biological networks. Each method is illustrated with examples, and their respective strengths and limitations are discussed. We also discuss several areas for future research.
Affiliation(s)
- Alexandria Lau
- School of Biomedical Sciences, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong SAR, China
- Hon-Cheong So
- School of Biomedical Sciences, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong SAR, China
- KIZ-CUHK Joint Laboratory of Bioresources and Molecular Research of Common Diseases, Kunming Institute of Zoology and The Chinese University of Hong Kong, Hong Kong SAR, China
- Department of Psychiatry, The Chinese University of Hong Kong, Hong Kong SAR, China
- Margaret K.L. Cheung Research Centre for Management of Parkinsonism, The Chinese University of Hong Kong, Hong Kong SAR, China
- Shenzhen Research Institute, The Chinese University of Hong Kong, Shenzhen, China
- Brain and Mind Institute, The Chinese University of Hong Kong, Hong Kong SAR, China
- Hong Kong Branch of the Chinese Academy of Sciences Center for Excellence in Animal Evolution and Genetics, The Chinese University of Hong Kong, Hong Kong SAR, China
- Corresponding author at: School of Biomedical Sciences, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong SAR, China.
3
Chanda P, Costa E, Hu J, Sukumar S, Van Hemert J, Walia R. Information Theory in Computational Biology: Where We Stand Today. Entropy (Basel) 2020; 22:E627. [PMID: 33286399 PMCID: PMC7517167 DOI: 10.3390/e22060627] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 04/30/2020] [Revised: 05/31/2020] [Accepted: 06/03/2020] [Indexed: 12/30/2022]
Abstract
"A Mathematical Theory of Communication" was published in 1948 by Claude Shannon to address problems in the field of data compression and communication over (noisy) communication channels. Since then, the concepts and ideas developed in Shannon's work have formed the basis of information theory, a cornerstone of statistical learning and inference, and have been playing a key role in disciplines such as physics and thermodynamics, probability and statistics, computational sciences and biological sciences. In this article we review the basic information-theory-based concepts and describe their key applications in multiple major areas of research in computational biology: gene expression and transcriptomics, alignment-free sequence comparison, sequencing and error correction, genome-wide disease-gene association mapping, metabolic networks and metabolomics, and protein sequence, structure and interaction analysis.
Affiliation(s)
- Pritam Chanda
- Corteva Agriscience™, Indianapolis, IN 46268, USA
- Computer and Information Science, Indiana University-Purdue University, Indianapolis, IN 46202, USA
- Eduardo Costa
- Corteva Agriscience™, Mogi Mirim, Sao Paulo 13801-540, Brazil
- Jie Hu
- Corteva Agriscience™, Indianapolis, IN 46268, USA
- Rasna Walia
- Corteva Agriscience™, Johnston, IA 50131, USA
4
Dehghanzadeh H, Ghaderi-Zefrehei M, Mirhoseini SZ, Esmaeilkhaniyan S, Haruna IL, Amirpour Najafabadi H. A new DNA sequence entropy-based Kullback-Leibler algorithm for gene clustering. J Appl Genet 2020; 61:231-238. [PMID: 31981184 DOI: 10.1007/s13353-020-00543-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2019] [Revised: 09/07/2019] [Accepted: 01/08/2020] [Indexed: 11/29/2022]
Abstract
Information theory is a branch of mathematics that overlaps with communications, biology, and medical engineering. Entropy is a measure of uncertainty in a set of information. In this study, for each gene and its set of exons, the entropy was calculated in orders one to four. Based on the relative entropy of genes and exons, the Kullback-Leibler divergence was calculated. After obtaining the Kullback-Leibler distance for the gene and exon sets, the results were entered as input into 7 clustering algorithms: single, complete, average, weighted, centroid, median, and K-means. To aggregate the results of clustering, the AdaBoost algorithm was used. Finally, the results of the AdaBoost algorithm were investigated with the GeneMANIA prediction server to explore the results from a gene annotation point of view. All calculations were performed using the MATLAB Engineering Software (2015). Our investigation of the genes' metabolic pathways based on the gene annotations revealed that the proposed clustering method yielded correct, logical, and fast results. At the same time, because the method does not have the disadvantages of alignment, it allowed genes to be considered with their actual length and content and did not require high memory for long sequences. We believe that the proposed method could be used alongside other competitive gene clustering methods to group biologically relevant sets of genes. The proposed method can also be seen as a predictive method for genes with weak genomic annotations.
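As a rough illustration of the alignment-free distance described in this abstract, the sketch below estimates k-mer frequencies for orders one to four and computes a symmetrized Kullback-Leibler divergence between two sequences. It is not the authors' MATLAB implementation: the pseudocount smoothing, the symmetrization, the function names, and the toy sequences are all assumptions made for this example.

```python
from collections import Counter
from itertools import product
from math import log2

ALPHABET = "ACGT"

def kmer_distribution(seq, k, pseudocount=1.0):
    """Smoothed relative frequency of every DNA k-mer in seq.

    The pseudocount (an assumption of this sketch) keeps every
    probability positive so the divergence below is always finite.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts[w] for w in kmers) + pseudocount * len(kmers)
    return {w: (counts[w] + pseudocount) / total for w in kmers}

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(p[w] * log2(p[w] / q[w]) for w in p)

def symmetric_kl(seq_a, seq_b, k):
    """Symmetrized divergence, usable as a pairwise distance for clustering."""
    p = kmer_distribution(seq_a, k)
    q = kmer_distribution(seq_b, k)
    return kl_divergence(p, q) + kl_divergence(q, p)

# Hypothetical toy sequences, not real genes.
gene_a = "ATGGCGTACGCTAGCTAGGCTA"
gene_b = "TTTTTTTTAAAAAAAATTTTTT"
for order in range(1, 5):
    print(order, symmetric_kl(gene_a, gene_b, order))
```

A matrix of such pairwise distances could then be passed to any hierarchical or K-means style clustering routine, which is the role the seven clustering algorithms play in the abstract.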
Affiliation(s)
- Houshang Dehghanzadeh
- Department of Animal Science Research, Guilan Agricultural and Natural Resources Research and Education Center, AREEO, Rasht, Iran
- Mostafa Ghaderi-Zefrehei
- Department of Animal Science, Faculty of Agriculture, University of Yasouj, P. O. Box: 75914, Yasouj, Iran.
- Saeid Esmaeilkhaniyan
- Animal Science Research Institute of Iran, Agricultural Research, Education and Extension Organization (AREEO), Karaj, Iran
- Ishaku Lemu Haruna
- Faculty of Agriculture and Life Sciences, Lincoln University, Lincoln, New Zealand
5
Bi W, Kang G, Pounds SB. Statistical selection of biological models for genome-wide association analyses. Methods 2018; 145:67-75. [PMID: 29803781 DOI: 10.1016/j.ymeth.2018.05.019] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 01/10/2018] [Revised: 04/13/2018] [Accepted: 05/22/2018] [Indexed: 01/23/2023] Open
Abstract
Genome-wide association studies have discovered many biologically important associations of genes with phenotypes. Typically, genome-wide association analyses formally test the association of each genetic feature (SNP, CNV, etc) with the phenotype of interest and summarize the results with multiplicity-adjusted p-values. However, very small p-values only provide evidence against the null hypothesis of no association without indicating which biological model best explains the observed data. Correctly identifying a specific biological model may improve the scientific interpretation and can be used to more effectively select and design a follow-up validation study. Thus, statistical methodology to identify the correct biological model for a particular genotype-phenotype association can be very useful to investigators. Here, we propose a general statistical method to summarize how accurately each of five biological models (null, additive, dominant, recessive, co-dominant) represents the data observed for each variant in a GWAS study. We show that the new method stringently controls the false discovery rate and asymptotically selects the correct biological model. Simulations of two-stage discovery-validation studies show that the new method has these properties and that its validation power is similar to or exceeds that of simple methods that use the same statistical model for all SNPs. Example analyses of three data sets also highlight these advantages of the new method. An R package is freely available at www.stjuderesearch.org/site/depts/biostats/maew.
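The five biological models named in this abstract are usually distinguished by how a genotype (0, 1, or 2 copies of the minor allele) enters a regression as predictors. The sketch below shows the conventional codings only; it is not the model-selection procedure or the R package from the paper, and the function name and tuple layout are assumptions of this example.

```python
def encode_genotype(minor_allele_count, model):
    """Return regression predictors for one genotype under a genetic model.

    Conventional codings (the function itself is a hypothetical helper):
      null        -> no genotype term
      additive    -> allele count 0/1/2
      dominant    -> 1 if at least one minor allele
      recessive   -> 1 only for two minor alleles
      co-dominant -> separate indicators for heterozygote and minor homozygote
    """
    g = minor_allele_count
    if g not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1, or 2 minor alleles")
    if model == "null":
        return ()
    if model == "additive":
        return (g,)
    if model == "dominant":
        return (int(g >= 1),)
    if model == "recessive":
        return (int(g == 2),)
    if model == "co-dominant":
        return (int(g == 1), int(g == 2))
    raise ValueError(f"unknown model: {model}")

for model in ("null", "additive", "dominant", "recessive", "co-dominant"):
    print(model, [encode_genotype(g, model) for g in (0, 1, 2)])
```

Only the heterozygote (genotype 1) separates the additive, dominant, and recessive codings, which is why distinguishing these models depends on how heterozygous carriers behave.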
Affiliation(s)
- Wenjian Bi
- Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
- Guolian Kang
- Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
- Stanley B Pounds
- Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105, USA.
6
Borowska A, Szwaczkowski T, Kamiński S, Hering DM, Kordan W, Lecewicz M. Identification of genome regions determining semen quality in Holstein-Friesian bulls using information theory. Anim Reprod Sci 2018; 192:206-215. [PMID: 29572044 DOI: 10.1016/j.anireprosci.2018.03.012] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 01/04/2018] [Revised: 02/16/2018] [Accepted: 03/09/2018] [Indexed: 10/17/2022]
Abstract
Use of information theory can be an alternative statistical approach to detect genome regions and candidate genes that are associated with livestock traits. The aim of this study was to verify the validity of SNP effects on some semen quality variables of bulls using entropy analysis. Records from 288 Holstein-Friesian bulls from one AI station were included. The following semen quality variables were analyzed: CASA kinematic variables of sperm (total motility, average path velocity, straight line velocity, curvilinear velocity, amplitude of lateral head displacement, beat cross frequency, straightness, linearity), sperm membrane integrity (plasmalemma, mitochondrial function), and sperm ATP content. Molecular data included 48,192 SNPs. After filtering (call rate = 0.95 and MAF = 0.05), 34,794 SNPs were included in the entropy analysis. The entropy and conditional entropy were estimated for each SNP. Conditional entropy quantifies the remaining uncertainty about the values of a variable given knowledge of the SNP. The most informative SNPs for each variable were determined. The computations were performed using the R statistical package. A majority of the loci had relatively small contributions. The most informative SNPs for all variables were mainly located on chromosomes 3, 4, 5 and 16. The results from the study indicate that important genome regions and candidate genes that determine semen quality variables in bulls are located on a number of chromosomes. Some detected clusters of SNPs were located in RNA (U6 and 5S_rRNA) for all the analyzed variables. Associations between the PARK2 and GALNT13 genes and some semen characteristics were also detected.
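The entropy and conditional entropy described in this abstract can be sketched as follows. This is a minimal illustration rather than the study's R code: it assumes the semen quality variable has already been discretized into classes (the abstract does not say how), and the function names and toy data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy H(X) in bits of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def conditional_entropy(trait, genotype):
    """H(trait | genotype): uncertainty about the trait left once the SNP is known."""
    n = len(trait)
    by_genotype = {}
    for t, g in zip(trait, genotype):
        by_genotype.setdefault(g, []).append(t)
    return sum(len(ts) / n * entropy(ts) for ts in by_genotype.values())

# Hypothetical toy data: a trait split into two classes and two SNPs coded 0/2.
trait = ["low", "low", "high", "high"]
snp_informative = [0, 0, 2, 2]      # genotype determines the trait class
snp_uninformative = [0, 2, 0, 2]    # genotype tells nothing about the trait

print(entropy(trait))
print(conditional_entropy(trait, snp_informative))
print(conditional_entropy(trait, snp_uninformative))
```

Ranking SNPs by the drop from `entropy(trait)` to `conditional_entropy(trait, snp)` is one plausible reading of "most informative SNPs"; the exact ranking criterion used in the study is not stated in the abstract.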
Affiliation(s)
- Alicja Borowska
- Division of Horse Breeding, Poznan University of Life Sciences, Wolynska st. 33, 60-637 Poznan, Poland
- Tomasz Szwaczkowski
- Department of Genetics and Animal Breeding, Poznan University of Life Sciences, Wolynska st. 33, 60-637 Poznan, Poland.
- Stanisław Kamiński
- Department of Animal Genetics, University of Warmia and Mazury in Olsztyn, M. Oczapowski st. 5, 10-718 Olsztyn, Poland
- Dorota M Hering
- Department of Animal Genetics, University of Warmia and Mazury in Olsztyn, M. Oczapowski st. 5, 10-718 Olsztyn, Poland
- Władysław Kordan
- Department of Animal Biochemistry and Biotechnology, University of Warmia and Mazury in Olsztyn, M. Oczapowski st. 5, 10-718 Olsztyn, Poland
- Marek Lecewicz
- Department of Animal Biochemistry and Biotechnology, University of Warmia and Mazury in Olsztyn, M. Oczapowski st. 5, 10-718 Olsztyn, Poland
7
Porfiri M, Ruiz Marín M. Symbolic dynamics of animal interaction. J Theor Biol 2017; 435:145-156. [PMID: 28916452 DOI: 10.1016/j.jtbi.2017.09.005] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 05/09/2017] [Revised: 09/01/2017] [Accepted: 09/07/2017] [Indexed: 10/18/2022]
Abstract
Since its introduction nearly two decades ago, transfer entropy has contributed to an improved understanding of cause-and-effect relationships in coupled dynamical systems from raw time series. In the context of animal behavior, transfer entropy might help explain the determinants of leadership in social groups and elucidate escape response to predator attacks. Despite its promise, the potential of transfer entropy in animal behavior is yet to be fully tested, and a number of technical challenges in information theory and statistics remain open. Here, we examine an alternative approach to the computation of transfer entropy based on symbolic dynamics. In this context, a symbol is associated with a specific locomotory bout across two or more consecutive time instants, such as reversing the swimming direction. Symbols encapsulate salient locomotory patterns and the associated permutation transfer entropy quantifies the ability to predict the patterns of an individual given those of another individual. We demonstrate this framework on an existing dataset on fish, for which we have knowledge of the underlying cause-and-effect relationship between the focal subject and the stimulus. Symbolic dynamics offers an intuitive and robust approach to study animal behavior, which could enable the inference of causal relationship from noisy experimental observations of limited duration.
Affiliation(s)
- Maurizio Porfiri
- Department of Mechanical and Aerospace Engineering, New York University Tandon School of Engineering, Six MetroTech Center, Brooklyn, NY 11201, USA.
- Manuel Ruiz Marín
- Department of Quantitative Methods and Informatics, Technical University of Cartagena, Murcia, Spain.
8
Liu J, Beyene J. Entropy-based method for assessing the influence of genetic markers and covariates on hypertension: application to Genetic Analysis Workshop 18 data. BMC Proc 2014; 8:S97. [PMID: 25519419 PMCID: PMC4143731 DOI: 10.1186/1753-6561-8-s1-s97] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/30/2022] Open
Abstract
Many complex diseases are related to genetics, and it is of great interest to evaluate the association between single-nucleotide polymorphisms (SNPs) and disease outcome. The association of genetics with outcome can be modified by covariates such as age, sex, smoking status, and membership to the same pedigree. In this paper, we propose a block entropy method to separate two classes of SNPs, for which the association with hypertension is either sensitive or insensitive to the covariates. We also propose a consistency entropy method to further reduce the number of SNPs that might be associated with the outcome. Based on the data provided by the organizers of Genetic Analysis Workshop 18, we calculated the block entropies for six different blocking strategies. Using block entropy and consistency entropy, we identified 230 SNPs on chromosome 9 that are most likely to be associated with the outcome and whose associations with hypertension are sensitive to the covariates.
Affiliation(s)
- Jun Liu
- Department of Clinical Epidemiology & Biostatistics, McMaster University, Hamilton, Ontario, Canada L8S 4K1
- Joseph Beyene
- Department of Clinical Epidemiology & Biostatistics, McMaster University, Hamilton, Ontario, Canada L8S 4K1
9
Kang G, Bi W, Zhao Y, Zhang JF, Yang JJ, Xu H, Loh ML, Hunger SP, Relling MV, Pounds S, Cheng C. A new system identification approach to identify genetic variants in sequencing studies for a binary phenotype. Hum Hered 2014; 78:104-16. [PMID: 25096228 DOI: 10.1159/000363660] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 12/13/2013] [Accepted: 05/16/2014] [Indexed: 12/24/2022] Open
Abstract
We propose in this paper a set-valued (SV) system model, which is a generalized form of logistic (LG) and Probit (Probit) regression, to be considered as a method for discovering genetic variants, especially rare genetic variants in next-generation sequencing studies, for a binary phenotype. We propose a new SV system identification method to estimate all underlying key system parameters for the Probit model and compare it with the LG model in the setting of genetic association studies. Across an extensive series of simulation studies, the Probit method maintained type I error control and had similar or greater power than the LG method, and it is robust to different distributions of noise: logistic, normal, or t distributions. Additionally, the Probit association parameter estimate was 2.7- to 46.8-fold less variable than the LG log-odds ratio association parameter estimate. Less variability in the association parameter estimate translates to greater power and robustness across the spectrum of minor allele frequencies (MAFs), and these advantages are the most pronounced for rare variants. For instance, in a simulation that generated data from an additive logistic model with an odds ratio of 7.4 for a rare single nucleotide polymorphism with a MAF of 0.005 and a sample size of 2,300, the Probit method had 60% power whereas the LG method had 25% power at the α = 10^-6 level. Consistent with these simulation results, the set of variants identified by the LG method was a subset of those identified by the Probit method in two example analyses. Thus, we suggest the Probit method may be a competitive alternative to the LG method in genetic association studies such as candidate gene, genome-wide, or next-generation sequencing studies for a binary phenotype.
Affiliation(s)
- Guolian Kang
- Department of Biostatistics and Pharmaceutical Sciences, St. Jude Children's Research Hospital, Memphis, Tenn., USA
10
Kwon MS, Park M, Park T. IGENT: efficient entropy based algorithm for genome-wide gene-gene interaction analysis. BMC Med Genomics 2014; 7 Suppl 1:S6. [PMID: 25077411 PMCID: PMC4101351 DOI: 10.1186/1755-8794-7-s1-s6] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Indexed: 02/04/2023] Open
Abstract
Background: With the development of high-throughput genotyping and sequencing technology, there is growing evidence of association between genetic variants and complex traits. In spite of the thousands of genetic variants discovered, such genetic markers have been shown to explain only a very small proportion of the underlying genetic variance of complex traits. Gene-gene interaction (GGI) analysis is expected to unveil a large portion of the unexplained heritability of complex traits. Methods: In this work, we propose IGENT, Information theory-based GEnome-wide gene-gene iNTeraction method. IGENT is an efficient algorithm for identifying genome-wide gene-gene interactions (GGI) and gene-environment interactions (GEI). For detecting significant GGIs at genome-wide scale, it is important to reduce the computational burden significantly. Our method uses information gain (IG) and evaluates its significance without resampling. Results: Through our simulation studies, the power of IGENT is shown to be better than or equivalent to that of BOOST. The proposed method successfully detected GGI for bipolar disorder in the Wellcome Trust Case Control Consortium (WTCCC) and age-related macular degeneration (AMD). Conclusions: The proposed method is implemented in C++ and available on Windows, Linux and MacOSX.
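To make the idea of an information-gain score for a SNP pair concrete, the sketch below computes one common definition: the information the pair carries about the phenotype jointly, minus what each SNP carries alone. Whether IGENT uses exactly this definition, and how it assesses significance without resampling, is not stated in the abstract, so the formula choice, function names, and toy data here are assumptions.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy in bits of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def interaction_gain(snp_a, snp_b, phenotype):
    """Information about the phenotype available only from the SNP pair jointly."""
    pair = list(zip(snp_a, snp_b))
    return (mutual_information(pair, phenotype)
            - mutual_information(snp_a, phenotype)
            - mutual_information(snp_b, phenotype))

# Hypothetical toy data: disease status is the XOR of two binary genotypes,
# so neither SNP is informative alone but the pair is fully informative.
snp_a = [0, 0, 1, 1]
snp_b = [0, 1, 0, 1]
status = [0, 1, 1, 0]

print(mutual_information(snp_a, status))
print(mutual_information(snp_b, status))
print(interaction_gain(snp_a, snp_b, status))
```

Because the score needs only genotype-by-phenotype counts, it can be evaluated for every SNP pair in a single pass over contingency tables, which is consistent with the abstract's emphasis on reducing computational burden.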
11
Kang G, Jiang B, Cui Y. Gene-based Genomewide Association Analysis: A Comparison Study. Curr Genomics 2013; 14:250-5. [PMID: 24294105 PMCID: PMC3731815 DOI: 10.2174/13892029113149990001] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 02/06/2013] [Revised: 05/01/2013] [Accepted: 05/07/2013] [Indexed: 11/22/2022] Open
Abstract
The study of gene-based genetic associations has gained conceptual popularity recently. Biologic insight into the etiology of a complex disease can be gained by focusing on genes as testing units. Several gene-based methods (e.g., minimum p-value (or maximum test statistic) or entropy-based method) have been developed and have more power than a single nucleotide polymorphism (SNP)-based analysis. The objective of this study is to compare the performance of the entropy-based method with the minimum p-value and single SNP–based analysis and to explore their strengths and weaknesses. Simulation studies show that: 1) all three methods can reasonably control the false-positive rate; 2) the minimum p-value method outperforms the entropy-based and the single SNP–based method when only one disease-related SNP occurs within the gene; 3) the entropy-based method outperforms the other methods when there are more than two disease-related SNPs in the gene; and 4) the entropy-based method is computationally more efficient than the minimum p-value method. Application to a real data set shows that more significant genes were identified by the entropy-based method than by the other two methods.
Affiliation(s)
- Guolian Kang
- Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105
12
Abstract
In the analysis of dependencies between nominal traits, entropy and its function, mutual information, seem to be proper descriptive statistics. This is shown by characterizing the relationships between the prolificacy of dams and selected genetic attributes: the genotype of transferrin, the genotype of hemoglobin, and the type of birth, as well as the environmental attribute, i.e., year of birth. The entropy method may improve the precision of investigations concerning the influence of different factors on production traits. The index of relative uniformity, introduced in this study, proved to be an adequate tool for the determination of similarity in the examined flocks. The application of mutual information in the determination of values of the dependence measures in the analyzed experiment was justified.
Affiliation(s)
- Anita Dobek
- Department of Mathematical and Statistical Methods, Poznań University of Life Sciences, Wojska Polskiego 28, 60-637, Poznań, Poland.
13
Wu C, Li S, Cui Y. Genetic association studies: an information content perspective. Curr Genomics 2012; 13:566-73. [PMID: 23633916 PMCID: PMC3468889 DOI: 10.2174/138920212803251382] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Received: 05/14/2012] [Revised: 06/04/2012] [Accepted: 06/18/2012] [Indexed: 01/02/2023] Open
Abstract
The availability of high-density single nucleotide polymorphism (SNP) data has made it possible for human genetic association studies to identify common and rare variants underlying complex diseases on a genome-wide scale. A handful of novel genetic variants have been identified, which gives much hope and prospects for the future of genetic association studies. In this process, statistical and computational methods play key roles, among which information-based association tests have gained large popularity. This paper is intended to give a comprehensive review of the current literature in genetic association analysis cast in the framework of information theory. We focus our review on the following topics: (1) information theoretic approaches in genetic linkage and association studies; (2) entropy-based strategies for optimal SNP subset selection; and (3) the usage of theoretic information criteria in gene clustering and gene regulatory network construction.
Affiliation(s)
- Cen Wu
- Department of Statistics and Probability, Michigan State University, East Lansing, Michigan 48824
- Shaoyu Li
- Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105
- Yuehua Cui
- Department of Statistics and Probability, Michigan State University, East Lansing, Michigan 48824
- Center for Computational Biology, Beijing Forestry University, Beijing, China 100083
14
Abstract
The field of genomics has entered a new era in which the ability to identify genetic variants that impact complex human traits and disease in an unbiased fashion using genome-wide approaches is widely accessible. To date, the workhorse of these efforts has been the genome-wide association study (GWAS), which has quickly moved from novel to routine, and has provided key insights into aspects of the underlying allelic architecture of complex traits. The main lesson learned from the early GWAS efforts is that though many disease-associated variants are often discovered, most have only a minor effect on disease, and in total explain only a small amount of the apparent heritability. Here we provide a brief overview of the genetic variation classes that may harbor the heritability missing from GWAS, and touch on approaches that will be leveraged in the coming years as genomics, and by extension medicine, becomes increasingly personalized.
Affiliation(s)
- Brian D. Juran
- Division of Gastroenterology and Hepatology, Center for Basic Research in Digestive Diseases, Mayo Clinic College of Medicine, Rochester, Minnesota
- Konstantinos N. Lazaridis
- Division of Gastroenterology and Hepatology, Center for Basic Research in Digestive Diseases, Mayo Clinic College of Medicine, Rochester, Minnesota