1
|
Pevzner P, Vingron M, Reidys C, Sun F, Istrail S. Michael Waterman's Contributions to Computational Biology and Bioinformatics. J Comput Biol 2022; 29:601-615. [PMID: 35727100 DOI: 10.1089/cmb.2022.29066.pp] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
Abstract
On the occasion of Dr. Michael Waterman's 80th birthday, we review his major contributions to the field of computational biology and bioinformatics including the famous Smith-Waterman algorithm for sequence alignment, the probability and statistics theory related to sequence alignment, algorithms for sequence assembly, the Lander-Waterman model for genome physical mapping, combinatorics and predictions of ribonucleic acid structures, word counting statistics in molecular sequences, alignment-free sequence comparison, and algorithms for haplotype block partition and tagSNP selection related to the International HapMap Project. His books Introduction to Computational Biology: Maps, Sequences and Genomes for graduate students and Computational Genome Analysis: An Introduction geared toward undergraduate students played key roles in computational biology and bioinformatics education. We also highlight his efforts to build the computational biology and bioinformatics community as the founding editor of the Journal of Computational Biology and a founding member of the International Conference on Research in Computational Molecular Biology (RECOMB).
Affiliation(s)
- Pavel Pevzner
- Department of Computer Science and Engineering, University of California San Diego, San Diego, California, USA
- Martin Vingron
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
- Christian Reidys
- Department of Mathematics, Biocomplexity Institute & Initiative, University of Virginia, Charlottesville, Virginia, USA
- Fengzhu Sun
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, California, USA
- Sorin Istrail
- Department of Computer Science, Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, USA
|
2
|
Ho Jang G, Christie JD, Feng R. A method for calling copy number polymorphism using haplotypes. Front Genet 2013; 4:165. [PMID: 24069028 PMCID: PMC3780619 DOI: 10.3389/fgene.2013.00165] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/29/2013] [Accepted: 08/07/2013] [Indexed: 12/15/2022]
Abstract
Single nucleotide polymorphisms (SNPs) and copy number variations (CNVs) are both widespread characteristics of the human genome, but are often called separately on common genotyping platforms. To capture integrated SNP and CNV information, methods have been developed for calling allele-specific copy numbers, or so-called copy number polymorphisms (CNPs), using limited inter-marker correlation. In this paper, we proposed a haplotype-based maximum likelihood method to call CNPs, which takes advantage of the valuable multi-locus linkage disequilibrium (LD) information in the population. We also developed a computationally efficient algorithm to estimate haplotype frequencies and optimize individual CNP calls iteratively, even in the presence of missing data. Through simulations, we demonstrated that our model is more sensitive and accurate in detecting various CNV regions than commonly used CNV calling methods, including PennCNV, another hidden Markov model (HMM) using CNP, a scan statistic, segCNV, and cnvHap. Our method often performs better in regions with higher LD, in longer CNV regions, and for common CNVs than in the opposite settings. We applied our method to the genotypes of 90 HapMap CEU samples and 23 patients with acute lung injury (ALI). For each ALI patient the genotyping was performed twice. The CNPs from our method show good consistency and accuracy comparable to others.
Affiliation(s)
- Gun Ho Jang
- Department of Biostatistics and Epidemiology, Center for Clinical Epidemiology and Biostatistics, University of Pennsylvania, Philadelphia, PA, USA
|
3
|
İlhan İ, Tezel G. How to Select Tag SNPs in Genetic Association Studies? The CLONTagger Method with Parameter Optimization. OMICS: A Journal of Integrative Biology 2013; 17:368-83. [DOI: 10.1089/omi.2012.0100] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 12/31/2022]
Affiliation(s)
- İlhan İlhan
- Akören Vocational School, Selçuk University, Konya, Turkey
- Gülay Tezel
- Department of Computer Engineering, Faculty of Engineering and Architecture, Selçuk University, Konya, Turkey
|
4
|
İlhan İ, Tezel G. A genetic algorithm–support vector machine method with parameter optimization for selecting the tag SNPs. J Biomed Inform 2013; 46:328-40. [DOI: 10.1016/j.jbi.2012.12.002] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Received: 01/10/2012] [Revised: 10/13/2012] [Accepted: 12/11/2012] [Indexed: 01/06/2023]
|
5
|
Chuang LY, Yang CS, Ho CH, Yang CH. Tag SNP selection using particle swarm optimization. Biotechnol Prog 2010; 26:580-8. [PMID: 20039435 DOI: 10.1002/btpr.350] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/09/2022]
Abstract
Single nucleotide polymorphisms (SNPs) are the most abundant form of genetic variation among species. With genome-wide SNP discovery, many genome-wide association studies are likely to identify multiple genetic variants that are associated with complex diseases. However, genotyping all existing SNPs for a large number of samples is still challenging even though SNP arrays have been developed to facilitate the task. Therefore, it is essential to select only informative SNPs representing the original SNP distributions in the genome (tag SNP selection) for genome-wide association studies. These SNPs are usually chosen from haplotypes and called haplotype tag SNPs (htSNPs). Accordingly, the scale and cost of genotyping are expected to be largely reduced. We introduce binary particle swarm optimization (BPSO) with local search capability to improve the prediction accuracy of STAMPA. The proposed method does not rely on block partitioning of the genomic region, and consistently identified tag SNPs with higher prediction accuracy than either STAMPA or SVM/STSA. We compared the prediction accuracy and time complexity of BPSO to STAMPA and an SVM-based (SVM/STSA) method using publicly available data sets. Compared with STAMPA and SVM/STSA, BPSO effectively improved prediction accuracy for both smaller and larger scale data sets. These results demonstrate that the BPSO method selects tag SNPs with higher accuracy regardless of the scale of the data sets used.
Affiliation(s)
- Li-Yeh Chuang
- Dept. of Chemical Engineering, I-Shou University, Kaohsiung, Taiwan
|
6
|
Qanbari S, Pimentel ECG, Tetens J, Thaller G, Lichtner P, Sharifi AR, Simianer H. The pattern of linkage disequilibrium in German Holstein cattle. Anim Genet 2010; 41:346-56. [PMID: 20055813 DOI: 10.1111/j.1365-2052.2009.02011.x] [Citation(s) in RCA: 86] [Impact Index Per Article: 6.1] [Indexed: 11/29/2022]
Abstract
This study presents a second generation of linkage disequilibrium (LD) map statistics for the whole genome of the Holstein-Friesian population, which has a four times higher resolution compared with that of the maps available so far. We used DNA samples of 810 German Holstein-Friesian cattle genotyped by the Illumina Bovine SNP50K BeadChip to analyse LD structure. A panel of 40 854 (75.6%) markers was included in the final analysis. The pairwise r² statistic of SNPs up to 5 Mb apart across the genome was estimated. A mean value of r² = 0.30 ± 0.32 was observed in pairwise distances of <25 kb and it dropped to 0.20 ± 0.24 at 50-75 kb, which is nearly the average inter-marker space in this study. The proportion of SNPs in useful LD (r² ≥ 0.25) was 26% for the distance of 50 and 75 kb between SNPs. We found a lower level of LD for SNP pairs at the distance ≤100 kb than previously thought. Analysis revealed 712 haplo-blocks spanning 4.7% of the genome and containing 8.0% of all SNPs. Mean and median block length were estimated as 164 ± 117 kb and 144 kb respectively. Allele frequencies of the SNPs have a considerable and systematic impact on the estimate of r². It is shown that minimizing the allele frequency difference between SNPs reduces the influence of frequency on r² estimates. Analysis of past effective population size based on the direct estimates of recombination rates from SNP data showed a decline in effective population size to Ne = 103 up to approximately 4 generations ago. Systematic effects of marker density and effective population size on observed LD and haplotype structure are discussed.
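As a rough illustration of the pairwise r² statistic that this abstract reports, the following minimal Python sketch computes r² for two biallelic SNPs as D² / (pA(1 - pA) pB(1 - pB)), where D = pAB - pA·pB. The function name `pairwise_r2` and the assumption of phased haplotypes coded 0/1 are illustrative only; the abstract does not describe how the study estimated r² from its genotype data.

```python
def pairwise_r2(hap_a, hap_b):
    """Return r^2 between two biallelic SNPs.

    hap_a, hap_b: equal-length sequences of 0/1 alleles, one entry per
    phased haplotype (an assumption made for this sketch).
    """
    if len(hap_a) != len(hap_b) or len(hap_a) == 0:
        raise ValueError("need two equal-length, non-empty allele lists")
    n = len(hap_a)
    p_a = sum(hap_a) / n                      # frequency of allele 1 at SNP A
    p_b = sum(hap_b) / n                      # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a SNP is monomorphic")
    d = p_ab - p_a * p_b                      # linkage disequilibrium coefficient D
    return d * d / denom


if __name__ == "__main__":
    print(pairwise_r2([0, 0, 1, 1], [0, 0, 1, 1]))  # perfectly correlated SNPs
    print(pairwise_r2([0, 1, 0, 1], [0, 0, 1, 1]))  # independent SNPs
```

Because the denominator depends on both allele frequencies, SNP pairs with very different frequencies cannot reach r² = 1, which is consistent with the abstract's remark that allele frequencies systematically affect r² estimates.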
Affiliation(s)
- S Qanbari
- Animal Breeding and Genetics Group, Department of Animal Sciences, Georg-August University, 37075 Göttingen, Germany.
|
7
|
Huang YT, Chao KM. A new framework for the selection of tag SNPs by multimarker haplotypes. J Biomed Inform 2008; 41:953-61. [DOI: 10.1016/j.jbi.2008.04.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 10/02/2007] [Revised: 03/31/2008] [Accepted: 04/04/2008] [Indexed: 10/22/2022]
|
8
|
Fujisawa H, Isomura M, Eguchi S, Ushijima M, Miyata S, Miki Y, Matsuura M. Identifying haplotype block structure using an ancestor-derived model. J Hum Genet 2007; 52:738-746. [PMID: 17636360 DOI: 10.1007/s10038-007-0176-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/10/2007] [Accepted: 06/22/2007] [Indexed: 11/30/2022]
Abstract
Recently, haplotype-based association studies have become popular for detecting disease-related or drug-response-associated genes. In these studies, it has been gradually recognized that a haplotype block structure is important. A rational and automatic method for identifying the haplotype block structure from SNP data has been desired. We have developed a new method using an ancestor-derived model and the minimum description length principle. The proposed method was applied to real data on the TAP2 gene in which a recombination hotspot was previously reported in human sperm data. The proposed method could identify an appropriate haplotype block structure, while existing methods failed. The performance of the proposed method was also investigated in a simulation study. The proposed method presented a better performance in real data analysis and the simulation study than existing methods. The proposed method was powerful from the viewpoint of hotspot sensitivity and was robust to mutation except at the edge of a sequence.
Affiliation(s)
- Minoru Isomura
- Genome Center, Japanese Foundation for Cancer Research, Tokyo, 135-8550, Japan
- Shinto Eguchi
- The Institute of Statistical Mathematics, Tokyo, 106-8569, Japan
- Masaru Ushijima
- Genome Center, Japanese Foundation for Cancer Research, Tokyo, 135-8550, Japan
- Satoshi Miyata
- Genome Center, Japanese Foundation for Cancer Research, Tokyo, 135-8550, Japan
- Yoshio Miki
- Genome Center, Japanese Foundation for Cancer Research, Tokyo, 135-8550, Japan
- Masaaki Matsuura
- Genome Center, Japanese Foundation for Cancer Research, Tokyo, 135-8550, Japan
|
9
|
|
10
|
Phuong TM, Lin Z, Altman RB. Choosing SNPs using feature selection. Proceedings. IEEE Computational Systems Bioinformatics Conference 2007:301-9. [PMID: 16447987 DOI: 10.1109/csb.2005.22] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.9] [Indexed: 11/10/2022]
Abstract
A major challenge for genomewide disease association studies is the high cost of genotyping a large number of single nucleotide polymorphisms (SNPs). The correlations between SNPs, however, make it possible to select a parsimonious set of informative SNPs, known as "tagging" SNPs, able to capture most variation in a population. Considerable research interest has recently focused on the development of methods for finding such SNPs. In this paper, we present an efficient method for finding tagging SNPs. The method does not involve computation-intensive search for SNP subsets but discards redundant SNPs using a feature selection algorithm. In contrast to most existing methods, the method presented here does not limit itself to using only correlations between SNPs in local groups. By using correlations that occur across different chromosomal regions, the method can reduce the number of globally redundant SNPs. Experimental results show that the number of tagging SNPs selected by our method is smaller than by using block-based methods.
Affiliation(s)
- Tu Minh Phuong
- Department of Information Technology, Post & Telecom. Institute of Technology, Hanoi, Vietnam.
|
11
|
|
12
|
De La Vega FM. Selecting single-nucleotide polymorphisms for association studies with SNPbrowser software. Methods Mol Biol 2007; 376:177-93. [PMID: 17984546 DOI: 10.1007/978-1-59745-389-9_13] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Indexed: 01/25/2023]
Abstract
The design of genetic association studies using single-nucleotide polymorphisms (SNPs) requires the selection of subsets of the variants providing high statistical power at a reasonable cost. SNPs must be selected to maximize the probability that a causative mutation is in linkage disequilibrium (LD) with at least one marker genotyped in the study. The HapMap Project performed a genome-wide survey of genetic variation with over 3 million SNPs typed in four populations, providing a rich resource to inform the design of association studies. A number of strategies have been proposed for the selection of SNPs based on observed LD, including construction of metric LD maps and the selection of haplotype-tagging SNPs. Power calculations are important at the study design stage to ensure successful results. Integrating these methods and annotations can be challenging: the algorithms required to implement these methods are complex to deploy, and all the necessary data and annotations are deposited in disparate databases. Here, we review the typical workflows for the selection of markers for association studies utilizing the SNPbrowser software, a freely available, stand-alone application that incorporates the HapMap database together with gene and SNP annotations. Selected SNPs are screened for their conversion potential to genotyping platforms, expediting the set up of genetic studies with an increased probability of success.
|
13
|
Phuong TM, Lin Z, Altman RB. Choosing SNPs using feature selection. J Bioinform Comput Biol 2006; 4:241-57. [PMID: 16819782 DOI: 10.1142/s0219720006001941] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Received: 09/15/2005] [Accepted: 01/31/2006] [Indexed: 11/18/2022]
Abstract
A major challenge for genomewide disease association studies is the high cost of genotyping a large number of single nucleotide polymorphisms (SNPs). The correlations between SNPs, however, make it possible to select a parsimonious set of informative SNPs, known as "tagging" SNPs, able to capture most variation in a population. Considerable research interest has recently focused on the development of methods for finding such SNPs. In this paper, we present an efficient method for finding tagging SNPs. The method does not involve computation-intensive search for SNP subsets but discards redundant SNPs using a feature selection algorithm. In contrast to most existing methods, the method presented here does not limit itself to using only correlations between SNPs in local groups. By using correlations that occur across different chromosomal regions, the method can reduce the number of globally redundant SNPs. Experimental results show that the number of tagging SNPs selected by our method is smaller than by using block-based methods. Supplementary website: http://htsnp.stanford.edu/FSFS/.
Affiliation(s)
- Tu Minh Phuong
- Department of Information Technology, Posts & Telecommunications Institute of Technology, Hanoi, Vietnam.
|
14
|
Abstract
We propose a dictionary model for haplotypes. According to the model, a haplotype is constructed by randomly concatenating haplotype segments from a given dictionary of segments. A haplotype block is defined as a set of haplotype segments that begin and end with the same pair of markers. In this framework, haplotype blocks can overlap, and the model provides a setting for testing the accuracy of simpler models invoking only nonoverlapping blocks. Each haplotype segment in a dictionary has an assigned probability and alternate spellings that account for genotyping errors and mutation. The model also allows for missing data, unphased genotypes, and prior distribution of parameters. Likelihood evaluations rely on forward and backward recurrences similar to the ones encountered in hidden Markov models. Parameter estimation is carried out with an EM algorithm. The search for the optimal dictionary is particularly difficult because of the variable dimension of the model space. We define a minimum description length criterion to evaluate each dictionary and use a combination of greedy search and careful initialization to select a best dictionary for a given dataset. Application of the model to simulated data gives encouraging results. In a real dataset, we are able to reconstruct a parsimonious dictionary that captures patterns of linkage disequilibrium well.
Affiliation(s)
- Kristin L Ayers
- Department of Biomathematics, University of California, Los Angeles, CA 90095-1766, USA
|
15
|
Abstract
MOTIVATION: Recent studies have shown that a small subset of Single Nucleotide Polymorphisms (SNPs) (called tag SNPs) is sufficient to capture the haplotype patterns in a high linkage disequilibrium region. To find the minimum set of tag SNPs, exact algorithms for finding the optimal solution could take exponential time. On the other hand, approximation algorithms are more efficient but may fail to find the optimal solution. RESULTS: We propose a hybrid method that combines the ideas of the branch-and-bound method and the greedy algorithm. This method explores a larger solution space to obtain a better solution than a traditional greedy algorithm. It also allows the user to adjust the efficiency of the program and the quality of solutions. This algorithm has been implemented and tested on a variety of simulated and biological data. The experimental results indicate that our program can find better solutions than previous methods. This approach is quite general since it can be used to adapt other greedy algorithms to solve their corresponding problems. AVAILABILITY: The program is available upon request.
Affiliation(s)
- Chia-Jung Chang
- Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
|
16
|
Huang YT, Chao KM, Chen T. An Approximation Algorithm for Haplotype Inference by Maximum Parsimony. J Comput Biol 2005; 12:1261-74. [PMID: 16379533 DOI: 10.1089/cmb.2005.12.1261] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.6] [Indexed: 11/13/2022]
Abstract
This paper studies haplotype inference by maximum parsimony using population data. We define the optimal haplotype inference (OHI) problem as follows: given a set of genotypes and a set of related haplotypes, find a minimum subset of haplotypes that can resolve all the genotypes. We prove that OHI is NP-hard and can be formulated as an integer quadratic programming (IQP) problem. To solve the IQP problem, we propose an iterative semidefinite programming-based approximation algorithm (called SDPHapInfer). We show that this algorithm finds a solution within a factor of O(log n) of the optimal solution, where n is the number of genotypes. This algorithm has been implemented and tested on a variety of simulated and biological data. It was compared with three other methods: (1) HAPAR, which was implemented based on the branch-and-bound algorithm, (2) HAPLOTYPER, which was implemented based on the expectation-maximization algorithm, and (3) PHASE, which combined the Gibbs sampling algorithm with an approximate coalescent prior. The experimental results indicate that SDPHapInfer and HAPLOTYPER have similar error rates. In addition, the results generated by PHASE have lower error rates on some data but higher error rates on others. The error rates of HAPAR are higher than the others on biological data. In terms of efficiency, SDPHapInfer, HAPLOTYPER, and PHASE output a solution in a stable and consistent way, and they run much faster than HAPAR when the number of genotypes becomes large.
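To make the OHI problem statement above concrete, here is a small Python sketch that solves it by exhaustive search on toy inputs. This is not the SDPHapInfer algorithm from the abstract (which uses semidefinite programming); it is a brute-force baseline that takes exponential time, in keeping with the problem being NP-hard. The genotype coding (0 and 1 for the two homozygous states, 2 for heterozygous) and the function names `resolves` and `minimum_resolving_subset` are assumptions made for illustration.

```python
from itertools import combinations, combinations_with_replacement


def resolves(h1, h2, genotype):
    """True if the haplotype pair (h1, h2) is consistent with the genotype.

    Assumed coding: genotype site 0 or 1 means both haplotypes carry that
    allele; genotype site 2 means the two haplotypes differ at that site.
    """
    for a, b, g in zip(h1, h2, genotype):
        if g == 2:
            if a == b:
                return False
        elif not (a == b == g):
            return False
    return True


def minimum_resolving_subset(genotypes, haplotypes):
    """Smallest subset of haplotypes that resolves every genotype.

    Exhaustive search over subsets in increasing size; returns None when
    even the full haplotype set cannot resolve all genotypes.
    """
    for k in range(1, len(haplotypes) + 1):
        for subset in combinations(haplotypes, k):
            # A genotype may be resolved by two copies of one haplotype.
            pairs = list(combinations_with_replacement(subset, 2))
            if all(any(resolves(h1, h2, g) for h1, h2 in pairs) for g in genotypes):
                return list(subset)
    return None


if __name__ == "__main__":
    haps = [(0, 0), (1, 1), (0, 1), (1, 0)]
    genos = [(2, 2), (0, 0)]
    print(minimum_resolving_subset(genos, haps))
```

In the toy input, the doubly heterozygous genotype could be explained by either of two haplotype pairs, but only the pair containing (0, 0) also explains the homozygous genotype, so parsimony selects it.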
Affiliation(s)
- Yao-Ting Huang
- Department of Computer Science and Information Engineering, National Taiwan University, Taipei
|
17
|
Huang YT, Zhang K, Chen T, Chao KM. Selecting additional tag SNPs for tolerating missing data in genotyping. BMC Bioinformatics 2005; 6:263. [PMID: 16259642 PMCID: PMC1316880 DOI: 10.1186/1471-2105-6-263] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Received: 05/26/2005] [Accepted: 11/01/2005] [Indexed: 11/13/2022]
Abstract
Background: Recent studies have shown that the patterns of linkage disequilibrium observed in human populations have a block-like structure, and a small subset of SNPs (called tag SNPs) is sufficient to distinguish each pair of haplotype patterns in the block. In reality, some tag SNPs may be missing, and we may fail to distinguish two distinct haplotypes due to the ambiguity caused by missing data. Results: We show there exists a subset of SNPs (referred to as robust tag SNPs) which can still distinguish all distinct haplotypes even when some SNPs are missing. The problem of finding minimum robust tag SNPs is shown to be NP-hard. To find robust tag SNPs efficiently, we propose two greedy algorithms and one linear programming relaxation algorithm. The experimental results indicate that (1) the solutions found by these algorithms are quite close to the optimal solution; (2) the genotyping cost saved by using tag SNPs can be as high as 80%; and (3) genotyping additional tag SNPs for tolerating missing data is still cost-effective. Conclusion: Genotyping robust tag SNPs is more practical than just genotyping the minimum tag SNPs if we cannot avoid the occurrence of missing data. Our theoretical analysis and experimental results show that our algorithms are not only efficient but also find solutions close to the optimal solution.
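The idea of robust tag SNPs can be sketched with a simple greedy heuristic. The Python code below is an illustrative reading of the abstract, not a reproduction of either of the paper's greedy algorithms: it assumes that tolerating up to m missing SNPs means every pair of distinct haplotype patterns must differ at no fewer than m + 1 selected SNPs, and it repeatedly picks the SNP that helps the most pairs still short of that target. The function name `greedy_robust_tag_snps` and the parameter `m` are hypothetical.

```python
from itertools import combinations


def greedy_robust_tag_snps(haplotypes, m=0):
    """Greedily pick SNP indices so each pair of distinct haplotype
    patterns differs at m + 1 or more chosen SNPs (assumed robustness rule).

    haplotypes: equal-length strings or sequences of alleles.
    Raises ValueError if the requirement cannot be met.
    """
    if not haplotypes:
        return []
    n_snps = len(haplotypes[0])
    patterns = sorted(set(tuple(h) for h in haplotypes))
    # need[pair]: how many more distinguishing SNPs this pair still requires
    need = {pair: m + 1 for pair in combinations(range(len(patterns)), 2)}

    def separates(snp, pair):
        i, j = pair
        return patterns[i][snp] != patterns[j][snp]

    chosen = []
    remaining = set(range(n_snps))
    while any(need.values()):
        # Gain of a SNP = number of still-unsatisfied pairs it distinguishes.
        gains = {
            s: sum(1 for p, v in need.items() if v > 0 and separates(s, p))
            for s in remaining
        }
        best = max(sorted(gains), key=gains.get, default=None)
        if best is None or gains[best] == 0:
            raise ValueError("haplotypes cannot be distinguished with m missing SNPs")
        chosen.append(best)
        remaining.remove(best)
        for p in need:
            if need[p] > 0 and separates(best, p):
                need[p] -= 1
    return sorted(chosen)


if __name__ == "__main__":
    haps = ["000", "011", "101"]
    print(greedy_robust_tag_snps(haps, m=0))  # ordinary tag SNPs
    print(greedy_robust_tag_snps(haps, m=1))  # tolerate one missing SNP
```

In this toy block, two SNPs suffice when no data are missing, while tolerating one missing SNP requires genotyping all three, which illustrates the cost trade-off the abstract evaluates.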
Affiliation(s)
- Yao-Ting Huang
- Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
- Kui Zhang
- Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, USA
- Ting Chen
- Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Kun-Mao Chao
- Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
- Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan
|
18
|
Ke X, Miretti MM, Broxholme J, Hunt S, Beck S, Bentley DR, Deloukas P, Cardon LR. A comparison of tagging methods and their tagging space. Hum Mol Genet 2005; 14:2757-67. [PMID: 16103130 DOI: 10.1093/hmg/ddi309] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.6] [Indexed: 12/15/2022]
Abstract
Single-nucleotide polymorphism (SNP) tagging is widely used as a way of saving genotyping costs in association studies. A number of different tagging methods have been developed to reduce the number of markers to be genotyped while maintaining power for detecting effects on non-assayed SNPs. How the different methods perform in different settings, the degree to which they overlap and share common tags and how they differ are important questions. We investigated these questions by comparing three widely used tagging methods/algorithms: one haplotype r²-based method, one pair-wise r²-based method and one method which was based on haplotype diversity but focused on major haplotypes. Tagging efficiency was defined as the number of genotyped markers divided by the number of tagging SNPs. Tagging effectiveness was defined as the proportion of un-genotyped or 'hidden' SNPs being detected (having a pair-wise or haplotype r² with a set of tagging SNPs over a threshold, e.g. haplotype r² ≥ 0.80). The ENCODE regions genotyped on the HapMap CEPH individuals were examined in this study. Tagging effectiveness was generally poorer for rare SNPs than for common SNPs, for all three tagging methods. Inclusion of rare SNPs into the initial HapMap scheme could enhance the performance of tags on rare hidden SNPs at the expense of increased genotyping cost. At a moderate tagging efficiency, more than 90% of hidden SNPs detected by tagging SNPs selected by one method were also detected by tagging SNPs selected by another method, and this figure could be increased to 100% if tagging efficiency was allowed to drop. These results indicate that the tagging space is highly concordant between different tagging methods, despite the fact that they often involve different sets of tagging SNPs.
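The two metrics defined in this abstract are simple ratios, and a short Python sketch may help fix the definitions. The function names and the input format (one best r² value per hidden SNP, already computed against the tag set) are assumptions for illustration; the abstract does not specify how those r² values were obtained.

```python
def tagging_efficiency(n_genotyped_markers, n_tag_snps):
    """Number of genotyped markers divided by the number of tagging SNPs."""
    if n_tag_snps <= 0:
        raise ValueError("need one or more tagging SNPs")
    return n_genotyped_markers / n_tag_snps


def tagging_effectiveness(best_r2_per_hidden_snp, threshold=0.8):
    """Proportion of hidden SNPs whose best r^2 with the tag set meets the threshold.

    best_r2_per_hidden_snp: one value per hidden (un-genotyped) SNP, assumed
    to be precomputed as the highest pairwise or haplotype r^2 with the tags.
    """
    values = list(best_r2_per_hidden_snp)
    if not values:
        raise ValueError("need one or more hidden SNPs")
    detected = sum(1 for r2 in values if r2 >= threshold)
    return detected / len(values)


if __name__ == "__main__":
    print(tagging_efficiency(100, 25))                    # markers per tag SNP
    print(tagging_effectiveness([0.95, 0.8, 0.5, 0.1]))   # share of hidden SNPs detected
```

Under these definitions the two quantities pull against each other: choosing fewer tags raises efficiency but typically leaves more hidden SNPs below the r² threshold.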
Affiliation(s)
- Xiayi Ke
- Wellcome Trust Centre for Human Genetics, University of Oxford, UK.
|
19
|
Zhao J, Boerwinkle E, Xiong M. An entropy-based statistic for genomewide association studies. Am J Hum Genet 2005; 77:27-40. [PMID: 15931594 PMCID: PMC1226192 DOI: 10.1086/431243] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.2] [Received: 12/01/2004] [Accepted: 04/19/2005] [Indexed: 11/04/2022]
Abstract
Efficient genotyping methods and the availability of a large collection of single-nucleotide polymorphisms provide valuable tools for genetic studies of human disease. The standard χ² statistic for case-control studies, which uses a linear function of allele frequencies, has limited power when the number of marker loci is large. We introduce a novel test statistic for genetic association studies that uses Shannon entropy and a nonlinear function of allele frequencies to amplify the differences in allele and haplotype frequencies to maintain statistical power with large numbers of marker loci. We investigate the relationship between the entropy-based test statistic and the standard χ² statistic and show that, in most cases, the power of the entropy-based statistic is greater than that of the standard χ² statistic. The distribution of the entropy-based statistic and the type I error rates are validated using simulation studies. Finally, we apply the new entropy-based test statistic to two real data sets, one for the COMT gene and schizophrenia and one for the MMP-2 gene and esophageal carcinoma, to evaluate the performance of the new method for genetic association studies. The results show that the entropy-based statistic obtained smaller P values than did the standard χ² statistic.
Affiliation(s)
- Jinying Zhao
- Human Genetic Center, University of Texas, Health Science Center at Houston, Houston, TX 77225, USA
|
20
|
Halldórsson BV, Istrail S, De La Vega FM. Optimal Selection of SNP Markers for Disease Association Studies. Hum Hered 2005; 58:190-202. [PMID: 15812176 DOI: 10.1159/000083546] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.5] [Indexed: 01/28/2023]
Abstract
Genetic association studies with population samples hold the promise of uncovering the susceptibility genes underlying the heritability of complex or common disease. Most association studies rely on the use of surrogate markers, single-nucleotide polymorphisms (SNPs) being the most suitable due to their abundance and ease of scoring. SNP marker selection aims to increase the chances that at least one typed SNP would be in linkage disequilibrium (LD) with the disease causative variant, while at the same time controlling the cost of the study in terms of the number of markers genotyped and samples. Empirical studies reporting block-like segments in the genome with high LD and low haplotype diversity have motivated a marker selection strategy whereby subsets of SNPs that 'tag' the common haplotypes of a region are picked for genotyping, avoiding typing redundant SNPs. Based on these initial observations, a plethora of 'tagging' algorithms for selecting minimum informative subsets of SNPs has recently appeared in the literature. These differ mostly in two major aspects: the quality or correlation measure used to define tagging and the algorithm used for the minimization of the final number of tagging SNPs. In this review we describe the available tagging algorithms utilizing a 3-step unifying framework, point out their methodological and conceptual differences, and make an assessment of their assumptions, performance, and scalability.
|
21
|
Hu X, Schrodi SJ, Ross DA, Cargill M. Selecting tagging SNPs for association studies using power calculations from genotype data. Hum Hered 2005; 57:156-70. [PMID: 15297809 DOI: 10.1159/000079246] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Received: 12/17/2003] [Accepted: 04/13/2004] [Indexed: 11/19/2022]
Abstract
Recent studies have indicated that linkage disequilibrium (LD) between single nucleotide polymorphism (SNP) markers can be used to derive a reduced set of tagging SNPs (tSNPs) for genetic association studies. Previous strategies for identifying tSNPs have focused on LD measures or haplotype diversity, but the statistical power to detect disease-associated variants using tSNPs in genetic studies has not been fully characterized. We propose a new approach of selecting tSNPs based on determining the set of SNPs with the highest power to detect association. Two-locus genotype frequencies are used in the power calculations. To show utility, we applied this power method to a large number of SNPs that had been genotyped in Caucasian samples. We demonstrate that a significant reduction in genotyping efforts can be achieved although the reduction depends on genotypic relative risk, inheritance mode and the prevalence of disease in the human population. The tSNP sets identified by our method are remarkably robust to changes in the disease model when small relative risk and additive mode of inheritance are employed. We have also evaluated the ability of the method to detect unidentified SNPs. Our findings have important implications in applying tSNPs from different data sources in association studies.
Affiliation(s)
- Xiaolan Hu
- Celera Diagnostics, Harbor Bay Pkwy, Alameda, CA 94502, USA.
|
22
|
Abstract
Atherosclerosis, the primary cause of coronary artery disease (CAD) and stroke, is a disorder with multiple genetic and environmental contributions. Genetic-epidemiologic studies have identified a surprisingly long list of genetic and nongenetic risk factors for CAD. However, such studies indicate that family history is the most significant independent risk factor (15, 52, 77). Many Mendelian disorders associated with atherosclerosis, such as familial hypercholesterolemia (FH), have been characterized, but they explain only a small percentage of disease susceptibility (although a substantial fraction of early CAD). Most cases of myocardial infarction (MI) and stroke result from the interactions of multiple genetic and environmental factors, none of which can cause disease by itself. Successful discovery of these genetic factors will require using complementary approaches with animal models, large-scale human genetic studies, and functional experiments. This review emphasizes the common, complex forms of CAD.
Affiliation(s)
- Aldons J Lusis
- Department of Human Genetics, University of California, Los Angeles, California 90095, USA.
|
23
|
Halldórsson BV, Bafna V, Lippert R, Schwartz R, De La Vega FM, Clark AG, Istrail S. Optimal haplotype block-free selection of tagging SNPs for genome-wide association studies. Genome Res 2004; 14:1633-40. [PMID: 15289481 PMCID: PMC509273 DOI: 10.1101/gr.2570004] [Citation(s) in RCA: 100] [Impact Index Per Article: 5.0] [Indexed: 11/24/2022]
Abstract
It is widely hoped that the study of sequence variation in the human genome will provide a means of elucidating the genetic component of complex diseases and variable drug responses. A major stumbling block to the successful design and execution of genome-wide disease association studies using single-nucleotide polymorphisms (SNPs) and linkage disequilibrium is the enormous number of SNPs in the human genome. This results in unacceptably high costs for exhaustive genotyping and presents a challenging problem of statistical inference. Here, we present a new method for optimally selecting minimum informative subsets of SNPs, also known as "tagging" SNPs, that is efficient for genome-wide selection. We contrast this method to published methods including haplotype block tagging, that is, grouping SNPs into segments of low haplotype diversity and typing a subset of the SNPs that can discriminate all common haplotypes within the blocks. Because our method does not rely on a predefined haplotype block structure and makes use of the weaker correlations that occur across neighboring blocks, it can be effectively applied across chromosomal regions with both high and low local linkage disequilibrium. We show that the number of tagging SNPs selected is substantially smaller than previously reported using block-based approaches and that selecting tagging SNPs optimally can result in a two- to threefold savings over selecting random SNPs.
|
24
|
Ke X, Durrant C, Morris AP, Hunt S, Bentley DR, Deloukas P, Cardon LR. Efficiency and consistency of haplotype tagging of dense SNP maps in multiple samples. Hum Mol Genet 2004; 13:2557-65. [PMID: 15367493 DOI: 10.1093/hmg/ddh294] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.4] [Indexed: 11/13/2022]
Abstract
Haplotype tagging is a means of retaining most of the information in high density marker maps, while reducing genotyping requirements. Estimates of the numbers of tagging SNPs required to cover the human genome have varied widely, ranging from 100,000 to 1,000,000. Tagging has been applied to a number of gene-based datasets but has not been evaluated in contexts reflecting those of genome-wide association studies: large chromosome regions and multiple samples drawn from the same population. We analysed 5000 common markers across a 10 Mb segment of human chromosome 20 in three samples (UK Caucasian, CEPH Caucasian, African American) to evaluate tagging efficiency and consistency. Overall, the results indicate a high degree of efficiency, yielding 3-5-fold savings in Caucasians and 2-3-fold savings in African Americans. These levels varied according to linkage disequilibrium (LD) levels, tagging thresholds and allele frequencies, but in high LD regions they did not vary markedly due to marker density. However, a strong positive relationship between marker density and tagging was observed, relating to the fact that increasing marker density yields greater sequence coverage in high LD, thus requiring more tag SNPs to cover a greater fraction of the genome. Encouragingly, whatever the density employed, a high level of robustness was observed between UK and CEPH samples, as most of the htSNPs selected in one sample were also appropriate as tags in the other.
Affiliation(s)
- Xiayi Ke
- Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, OX3 7BN, UK
|
25
|
Zhang K, Qin Z, Chen T, Liu JS, Waterman MS, Sun F. HapBlock: haplotype block partitioning and tag SNP selection software using a set of dynamic programming algorithms. Bioinformatics 2004; 21:131-4. [PMID: 15333454 DOI: 10.1093/bioinformatics/bth482] [Citation(s) in RCA: 94] [Impact Index Per Article: 4.7] [Indexed: 11/13/2022]
Abstract
Recent studies have revealed that linkage disequilibrium (LD) patterns vary across the human genome with some regions of high LD interspersed with regions of low LD. Such LD patterns make it possible to select a set of single nucleotide polymorphisms (SNPs; tag SNPs) for genome-wide association studies. We have developed a suite of computer programs to analyze the block-like LD patterns and to select the corresponding tag SNPs. Compared to other programs for haplotype block partitioning and tag SNP selection, our program has several notable features. First, the dynamic programming algorithms implemented are guaranteed to find the block partition with minimum number of tag SNPs for the given criteria of blocks and tag SNPs. Second, both haplotype data and genotype data from unrelated individuals and/or from general pedigrees can be analyzed. Third, several existing measures/criteria for haplotype block partitioning and tag SNP selection have been implemented in the program. Finally, the programs provide flexibility to include specific SNPs (e.g. non-synonymous SNPs) as tag SNPs. Availability: The HapBlock program and its supplemental documents can be downloaded from the website http://www.cmb.usc.edu/~msms/HapBlock.
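The idea of a dynamic program that returns the block partition with the fewest tag SNPs "for the given criteria" can be sketched as follows. This is not the HapBlock implementation. The block criterion (haplotypes seen at least twice must account for at least a fraction alpha of the sample), the tag criterion (the smallest SNP subset that distinguishes those common haplotypes, found by brute force), and all function names are assumptions chosen to keep the example short.

```python
from collections import Counter
from itertools import combinations


def is_block(haps, i, j, alpha=0.8):
    """Assumed criterion: haplotypes seen at least twice in SNPs [i, j)
    must account for at least a fraction alpha of the sample."""
    counts = Counter(h[i:j] for h in haps)
    covered = sum(c for c in counts.values() if c >= 2)
    return covered >= alpha * len(haps)


def min_tags(haps, i, j):
    """Smallest number of SNPs in [i, j) that distinguishes all common
    haplotypes. Brute force, so only suitable for short toy blocks."""
    counts = Counter(h[i:j] for h in haps)
    common = [h for h, c in counts.items() if c >= 2]
    width = j - i
    for k in range(width + 1):
        for cols in combinations(range(width), k):
            keys = {tuple(h[c] for c in cols) for h in common}
            if len(keys) == len(common):
                return k
    return width


def partition(haps, alpha=0.8):
    """best[j] = fewest tag SNPs needed to cover SNPs [0, j) with valid blocks."""
    n = len(haps[0])
    inf = float("inf")
    best = [0] + [inf] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            if best[i] < inf and is_block(haps, i, j, alpha):
                cost = best[i] + min_tags(haps, i, j)
                if cost < best[j]:
                    best[j], back[j] = cost, i
    if best[n] == inf:
        raise ValueError("no valid block partition under the assumed criterion")
    # walk the back-pointers to recover the blocks as half-open intervals
    blocks, j = [], n
    while j > 0:
        blocks.append((back[j], j))
        j = back[j]
    return best[n], blocks[::-1]


# Hypothetical toy data: three perfectly correlated SNPs form one block.
print(partition(["000", "000", "111", "111"]))
```

The recurrence considers every candidate last block [i, j), so the result is optimal only with respect to the two assumed criteria; swapping in a different block or tag definition changes `is_block` and `min_tags` but leaves `partition` untouched.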
Affiliation(s)
- Kui Zhang
- Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL 35294, USA
|
26
|
Thomas DC, Stram DO, Conti D, Molitor J, Marjoram P. Bayesian spatial modeling of haplotype associations. Hum Hered 2004; 56:32-40. [PMID: 14614236 DOI: 10.1159/000073730] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.9] [Received: 04/28/2003] [Indexed: 11/19/2022]
Abstract
We review methods for relating the risk of disease to a collection of single nucleotide polymorphisms (SNPs) within a small region. Association studies using case-control designs with unrelated individuals could be used either to test for a direct effect of a candidate gene and characterize the responsible variant(s), or to fine map an unknown gene by exploiting the pattern of linkage disequilibrium (LD). We consider a flexible class of logistic penetrance models based on haplotypes and compare them with an alternative formulation based on unphased multilocus genotypes. The likelihood for haplotype-based models requires summation over all possible haplotype assignments consistent with the observed genotype data, and can be fitted using either Expectation-Maximization (E-M) or Markov chain Monte Carlo (MCMC) methods. Subtleties involving ascertainment correction for case-control studies are discussed. There has been great interest in methods for LD mapping based on the coalescent or ancestral recombination graphs as well as methods based on haplotype sharing, both of which we review briefly. Because of their computational complexity, we propose some alternative empirical modeling approaches using techniques borrowed from the Bayesian spatial statistics literature. Here, space is interpreted in terms of a distance metric describing the similarity of any pair of haplotypes to each other, and hence their presumed common ancestry. Specifically, we discuss the conditional autoregressive model and two spatial clustering models: Potts and Voronoi. We conclude with a discussion of the implications of these methods for modeling cryptic relatedness, haplotype blocks, and haplotype tagging SNPs, and suggest a Bayesian framework for the HapMap project.
Affiliation(s)
- Duncan C Thomas
- University of Southern California, Los Angeles, CA 90089-9011, USA.
|
27
|
Zhang K, Qin ZS, Liu JS, Chen T, Waterman MS, Sun F. Haplotype block partitioning and tag SNP selection using genotype data and their applications to association studies. Genome Res 2004; 14:908-16. [PMID: 15078859 PMCID: PMC479119 DOI: 10.1101/gr.1837404] [Citation(s) in RCA: 125] [Impact Index Per Article: 6.3] [Indexed: 11/24/2022]
Abstract
Recent studies have revealed that linkage disequilibrium (LD) patterns vary across the human genome with some regions of high LD interspersed by regions of low LD. A small fraction of SNPs (tag SNPs) is sufficient to capture most of the haplotype structure of the human genome. In this paper, we develop a method to partition haplotypes into blocks and to identify tag SNPs based on genotype data by combining a dynamic programming algorithm for haplotype block partitioning and tag SNP selection based on haplotype data with a variation of the expectation maximization (EM) algorithm for haplotype inference. We assess the effects of using either haplotype or genotype data in haplotype block identification and tag SNP selection as a function of several factors, including sample size, density or number of SNPs studied, allele frequencies, fraction of missing data, and genotyping error rate, using extensive simulations. We find that a modest number of haplotype or genotype samples will result in consistent block partitions and tag SNP selection. The power of association studies based on tag SNPs using genotype data is similar to that using haplotype data.
Affiliation(s)
- Kui Zhang
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, California 90089-1113, USA
|
28
|
Ke X, Hunt S, Tapper W, Lawrence R, Stavrides G, Ghori J, Whittaker P, Collins A, Morris AP, Bentley D, Cardon LR, Deloukas P. The impact of SNP density on fine-scale patterns of linkage disequilibrium. Hum Mol Genet 2004; 13:577-88. [PMID: 14734624 DOI: 10.1093/hmg/ddh060] [Citation(s) in RCA: 165] [Impact Index Per Article: 8.3] [Indexed: 11/14/2022]
Abstract
Linkage disequilibrium (LD) is a measure of the degree of association between alleles in a population. The detection of disease-causing variants by association with neighbouring single nucleotide polymorphisms (SNPs) depends on the existence of strong LD between them. Previous studies have indicated that the extent of LD is highly variable in different chromosome regions and different populations, demonstrating the importance of genome-wide accurate measurement of LD at high resolution throughout the human genome. A uniform feature of these studies has been the inability to detect LD in regions of low marker density. To investigate the dependence of LD patterns on marker selection we performed a high-resolution study in African-American, Asian and UK Caucasian populations. We selected over 5000 SNPs with an average spacing of approximately 1 SNP per 2 kb after validating ca 12 000 SNPs derived from a dense SNP collection (1 SNP per 0.3 kb on average). Applications of different statistical methods of LD assessment highlight similar areas of high and low LD. However, at high resolution, features such as overall sequence coverage in LD blocks and block boundaries vary substantially with respect to marker density. Model-based linkage disequilibrium unit (LDU) maps appear robust to marker density and consistently influenced by marker allele frequency. The results suggest that very dense marker sets will be required to yield stable views of fine-scale LD in the human genome.
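For readers unfamiliar with how the "degree of association between alleles" is quantified, the following minimal sketch computes three standard pairwise statistics (D, D′ and r²) for two biallelic SNPs. It assumes phased haplotypes encoded as strings of "0" and "1"; the function name is hypothetical, and the model-based LDU maps mentioned in the abstract are not covered.

```python
def ld_stats(haps, i, j):
    """Return (D, D', r^2) for biallelic SNPs i and j in phased haplotypes."""
    n = len(haps)
    p_a = sum(h[i] == "1" for h in haps) / n   # frequency of allele 1 at SNP i
    p_b = sum(h[j] == "1" for h in haps) / n   # frequency of allele 1 at SNP j
    p_ab = sum(h[i] == "1" and h[j] == "1" for h in haps) / n
    d = p_ab - p_a * p_b                       # raw disequilibrium coefficient
    # D' rescales |D| by its largest attainable value given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    # r^2 is the squared correlation between the two allele indicators
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Hypothetical toy data: two SNPs that always co-occur.
print(ld_stats(["00", "00", "11", "11"], 0, 1))
```

D′ and r² can disagree: when only three of the four possible two-SNP haplotypes are observed, D′ reaches 1 while r² stays below 1 unless the allele frequencies match, which is one reason results can differ across LD assessment methods.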
Affiliation(s)
- Xiayi Ke
- Wellcome Trust Centre for Human Genetics, University of Oxford, UK
|
29
|
Abstract
There is currently a broad effort to produce genome-wide high-density linkage disequilibrium (LD) maps with single nucleotide polymorphisms. The hope is that the resulting maps can be exploited to find genes that affect the onset and severity of at least some common human diseases. These maps may also be useful for identifying genes that affect drug response or the likelihood of drug toxicities. The goal of this review is to provide a broad overview of some of the key concerns motivating the design of a major international project called the International Haplotype Map Project. The process of map production requires the identification of very large numbers of polymorphic sites, implementation of facile, highly accurate and inexpensive genotyping production pipelines, and provision for public access to the genotype data. Great progress has been made recently in genotyping methods and these advances are allowing very large-scale data collection. A major goal of these efforts is to enable the selection of subsets of markers that capture useful genetic information in short genomic intervals, while optimally reducing the number of markers that must be genotyped. Standard measures of LD provide a starting point but may not fully capture the complexity of the information inherent in the data. Extremely dense genotype data in several broadly representative populations (European, Chinese, Japanese, and Yoruba) should yield important insights into the genetic structure of most genes. Further study is required to determine how broadly applicable the data will be to other population groups. Significant challenges lie ahead in determining the best methods for the selection of markers in disease/phenotype studies, large-scale genotyping, and analysis of the resulting genetic data.
Affiliation(s)
- John W Belmont
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA
|
30
|
Schulze TG, Zhang K, Chen YS, Akula N, Sun F, McMahon FJ. Defining haplotype blocks and tag single-nucleotide polymorphisms in the human genome. Hum Mol Genet 2003; 13:335-42. [PMID: 14681300 DOI: 10.1093/hmg/ddh035] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.4] [Indexed: 11/14/2022]
Abstract
Recent studies suggest that the genome is organized into blocks of haplotypes, and efforts to create a genome-wide haplotype map of single-nucleotide polymorphisms (SNPs) are already underway. Haplotype blocks are defined algorithmically and to date several algorithms have been proposed. However, little is known about their relative performance in real data or about the impact of allele frequencies and parameter choices on the detection of haplotype blocks and the markers that tag them. Here we present a formal comparison of two major algorithms, a linkage disequilibrium (LD)-based method and a dynamic programming algorithm (DPA), in three chromosomal regions differing in gene content and recombination rate. The two methods produced strikingly different results. DPA identified fewer and larger haplotype blocks as well as a smaller set of tag SNPs than the LD method. For both methods, the results were strongly dependent on the allele frequency. Decreasing the minor allele frequency led to an up to 3.7-fold increase in the number of haplotype blocks and tag SNPs. Definition of haplotype blocks and tag SNPs was also sensitive to parameter changes, but the results could not be reconciled simply by parameter adjustment. These results show that two major methods for detecting haplotype blocks and tag SNPs can produce different results in the same data and that these results are sensitive to marker allele frequencies and parameter choices. More information is needed to guide the choice of method, marker allele frequencies, and parameters in the development of a haplotype map.
Affiliation(s)
- Thomas G Schulze
- Division of Genetic Epidemiology in Psychiatry, Central Institute of Mental Health (ZI), 68159 Mannheim, Germany.
|
31
|
Thompson D, Stram D, Goldgar D, Witte JS. Haplotype Tagging Single Nucleotide Polymorphisms and Association Studies. Hum Hered 2003; 56:48-55. [PMID: 14614238 DOI: 10.1159/000073732] [Citation(s) in RCA: 54] [Impact Index Per Article: 2.6] [Received: 04/30/2003] [Accepted: 07/08/2003] [Indexed: 11/19/2022]
Abstract
Objectives: Discrete blocks of low haplotype diversity exist within the human genome. The non-redundant subset of 'haplotype tagging' single nucleotide polymorphisms (htSNPs) in such blocks can distinguish a majority of the haplotypes. Several approaches have been proposed to determine htSNPs, ranging from visual inspection to formal analytic procedures. Optimal htSNPs can be estimated using a small subgroup of an association study population that have been genotyped for a dense SNP map, and it is just these htSNPs that are genotyped in the remainder of the samples. We investigated by simulation how the size of the subsample affects the power of association studies, and what type of subjects it should include. Methods: We used the program tagSNPs [Stram et al., Hum Hered 2003;55:27-36], which selects htSNPs to minimize the uncertainty in predicting common haplotypes for individuals with unphased genotype data. Results: On average, 27% of the SNPs were designated as htSNPs. Genotyping as few as 25 unphased individuals to select the htSNPs did not appear to reduce the power of an association study, as compared with using all SNPs. For the disease models considered, selecting htSNPs based on cases, controls, or a mixture of both gave similar results. Conclusions: These results suggest that the genotyping effort in an association study can be substantially reduced with little loss of power by identifying htSNPs in a small subsample of individuals.
Affiliation(s)
- Deborah Thompson
- Unit of Genetic Cancer Epidemiology, International Agency for Cancer Research, Lyon, France
|
32
|
Abbott CA. 11th Intelligent Systems for Molecular Biology 2003 (ISMB 2003). Comp Funct Genomics 2003; 4:654-9. [PMID: 18629025 PMCID: PMC2447307 DOI: 10.1002/cfg.336] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Received: 09/22/2003] [Revised: 09/25/2003] [Accepted: 09/29/2003] [Indexed: 01/22/2023]
Abstract
This report profiles the keynote talks given at ISMB03 in Brisbane, Australia by Ron Shamir, David Haussler, John Mattick, Yoshihide Hayashizaki, Sydney Brenner, the Overton Prize winner, Jim Kent, and the ISCB Senior Accomplishment Awardee, David Sankoff.
Affiliation(s)
- Catherine A. Abbott
- School of Biological Sciences, Flinders University, GPO BOX 2100, Adelaide, SA, Australia
|