1
|
Abstract
Bioinformatics regularly poses new challenges to algorithm engineers and theoretical computer scientists. This work surveys recent developments of parameterized algorithms and complexity for important NP-hard problems in bioinformatics. We cover sequence assembly and analysis, genome comparison and completion, and haplotyping and phylogenetics. Aside from reporting the state of the art, we give challenges and open problems for each topic.
Collapse
|
2
|
Feizabadi R, Bagherian M, Vaziri HR, Salahi M. A new mathematical modeling for pure parsimony haplotyping problem. Math Biosci 2016; 281:92-97. [PMID: 27633948 DOI: 10.1016/j.mbs.2016.09.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2016] [Revised: 09/04/2016] [Accepted: 09/06/2016] [Indexed: 11/16/2022]
Abstract
Pure parsimony haplotyping (PPH) problem is important in bioinformatics because rational haplotyping inference plays important roles in analysis of genetic data, mapping complex genetic diseases such as Alzheimer's disease, heart disorders and etc. Haplotypes and genotypes are m-length sequences. Although several integer programing models have already been presented for PPH problem, its NP-hardness characteristic resulted in ineffectiveness of those models facing the real instances especially instances with many heterozygous sites. In this paper, we assign a corresponding number to each haplotype and genotype and based on those numbers, we set a mixed integer programing model. Using numbers, instead of sequences, would lead to less complexity of the new model in comparison with previous models in a way that there are neither constraints nor variables corresponding to heterozygous nucleotide sites in it. Experimental results approve the efficiency of the new model in producing better solution in comparison to two state-of-the art haplotyping approaches.
Collapse
Affiliation(s)
- R Feizabadi
- Department of Applied Mathematics, Faculty of Mathematical Science, University of Guilan, Namjoo street, Rasht, Iran
| | - M Bagherian
- Department of Applied Mathematics, Faculty of Mathematical Science, University of Guilan, Namjoo street, Rasht, Iran.
| | - H R Vaziri
- Department of Biology, Faculty of Science, University of Guilan, Najoo street, Rasht, Iran
| | - M Salahi
- Department of Applied Mathematics, Faculty of Mathematical Science, University of Guilan, Namjoo street, Rasht, Iran
| |
Collapse
|
3
|
Chung CS, Lee YC, Liou JM, Wang CP, Ko JY, Lee JM, Wu MS, Wang HP. Tag single nucleotide polymorphisms of alcohol-metabolizing enzymes modify the risk of upper aerodigestive tract cancers: HapMap database analysis. Dis Esophagus 2014; 27:493-503. [PMID: 23088731 DOI: 10.1111/j.1442-2050.2012.01437.x] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 12/11/2022]
Abstract
Although alcohol is associated with higher upper aerodigestive tract (UADT) cancer risk, only a small fraction of alcoholics develop cancers. There is a lack of evidence proving the association of tag single nucleotide polymorphisms of alcohol-metabolizing enzymes with cancer risk. The aim of this study was to determine the association of these genetic polymorphisms with UADT cancer risk in a Chinese population. It was a hospital-based case-control candidate gene study. The databases of the International HapMap Project were searched for haplotype tag single nucleotide polymorphisms of the genes alcohol dehydrogenase (ADH)1B, ADH1C, and aldehyde dehydrogenase (ALDH)2. The genotyping was performed by the Sequenom MassARRAY system. Totally, 120 head and neck squamous cell carcinoma, 138 esophageal squamous cell carcinoma patients, and 276 age- and gender-matched subjects were enrolled between June 2008 and June 2010.Minor alleles of ADH1B (rs1229984) and ALDH2(rs671) were not only associated with the risk of UADT cancers (odds ratio [OR] [95% confidence interval, CI]: 3.53 [2.14-5.80] and 2.59 [1.79-3.75], respectively) but also potentiated the carcinogenic effects of alcohol (OR [95% CI]: 53.44 [25.21-113.29] and 70.08 [33.65-145.95], respectively). Similar effects were observed for head/neck and esophageal cancer subgroups. Multivariate logistic regression analysis identified four significant risk factors, including habitual use of cigarettes, alcohol, betel quid, and lower body mass index (P < 0.001). The haplotypes GAGC (OR 1.61, 95% CI 1.08-2.40, P = 0.018) and CCAATG (OR 1.69, 95% CI 1.24-2.30, P < 0.001) on chromosomes 4 and 12, respectively, were associated with higher cancer risk. These findings suggested that risk allele or haplotype carriers who consume alcohol and other carcinogens should be advised to undergo endoscopy screening. The information can be used to determine the degree of susceptibility of each subject and can be combined with other environmental factors, like carcinogen consumption, in the screening analysis.
Collapse
Affiliation(s)
- C-S Chung
- Department of Internal Medicine, National Taiwan University Hospital, Taipei, Taiwan; Graduate Institute of Clinical Medicine, College of Medicine, National Taiwan University, Taipei, Taiwan; Department of Internal Medicine, Far Eastern Memorial Hospital, New Taipei City, Taiwan
| | | | | | | | | | | | | | | |
Collapse
|
4
|
Disentangling homeologous contigs in allo-tetraploid assembly: application to durum wheat. BMC Bioinformatics 2013; 14 Suppl 15:S15. [PMID: 24564644 PMCID: PMC3851826 DOI: 10.1186/1471-2105-14-s15-s15] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/02/2023] Open
Abstract
BACKGROUND Using Next Generation Sequencing, SNP discovery is relatively easy on diploid species and still hampered in polyploid species by the confusion due to homeology. We develop HomeoSplitter; a fast and effective solution to split original contigs obtained by RNAseq into two homeologous sequences. It uses the differential expression of the two homeologous genes in the RNA. We verify that the new sequences are closer to the diploid progenitors of the allopolyploid species than the original contig. By remapping original reads on these new sequences, we also verify that the number of valuable detected SNPs has significantly increased. RESULTS HomeoSplitter is a fast and effective solution to disentangle homeologous sequences based on a maximum likelihood optimization. On a benchmark set of 2,505 clusters containing homologous sequences of urartu, speltoides and durum, HomeoSplitter was efficient to build sequences closer to the diploid references and increased the number of valuable SNPs from 188 out of 1,360 SNPs detected when mapping the reads on the de novo durum assembly to 762 out of 1,620 SNPs when mapping on HomeoSplitter contigs. CONCLUSIONS The HomeoSplitter program is freely available at http://bioweb.supagro.inra.fr/homeoSplitter/. This work provides a practical solution to the complex problem of disentangling homeologous transcripts in allo-tetraploids, which further allows an improved SNP detection.
Collapse
|
5
|
Efros A, Halperin E. Haplotype reconstruction using perfect phylogeny and sequence data. BMC Bioinformatics 2012; 13 Suppl 6:S3. [PMID: 22537042 PMCID: PMC3330028 DOI: 10.1186/1471-2105-13-s6-s3] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022] Open
Abstract
Haplotype phasing is a well studied problem in the context of genotype data. With the recent developments in high-throughput sequencing, new algorithms are needed for haplotype phasing, when the number of samples sequenced is low and when the sequencing coverage is blow. High-throughput sequencing technologies enables new possibilities for the inference of haplotypes. Since each read is originated from a single chromosome, all the variant sites it covers must derive from the same haplotype. Moreover, the sequencing process yields much higher SNP density than previous methods, resulting in a higher correlation between neighboring SNPs. We offer a new approach for haplotype phasing, which leverages on these two properties. Our suggested algorithm, called Perfect Phlogeny Haplotypes from Sequencing (PPHS) uses a perfect phylogeny model and it models the sequencing errors explicitly. We evaluated our method on real and simulated data, and we demonstrate that the algorithm outperforms previous methods when the sequencing error rate is high or when coverage is low.
Collapse
Affiliation(s)
- Anatoly Efros
- The Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel.
| | | |
Collapse
|
6
|
Abstract
Mutation in DNA is the principal cause for differences among human beings, and Single Nucleotide Polymorphisms (SNPs) are the most common mutations. Hence, a fundamental task is to complete a map of haplotypes (which identify SNPs) in the human population. Associated with this effort, a key computational problem is the inference of haplotype data from genotype data, since in practice genotype data rather than haplotype data is usually obtained. Different haplotype inference approaches have been proposed, including the utilization of statistical methods and the utilization of the pure parsimony criterion. The problem of haplotype inference by pure parsimony (HIPP) is interesting not only because of its application to haplotype inference, but also because it is a challenging NP-hard problem, being APX-hard. Recent work has shown that a SAT-based approach is the most efficient approach for the problem of haplotype inference by pure parsimony (HIPP), being several orders of magnitude faster than existing integer linear programming and branch and bound solutions. This paper provides a detailed description of SHIPs, a SAT-based approach for the HIPP problem, and presents comprehensive experimental results comparing SHIPs with all other exact approaches for the HIPP problem. These results confirm that SHIPs is currently the most effective approach for the HIPP problem.
Collapse
Affiliation(s)
- INÊS LYNCE
- Instituto Superior Técnico/INESC-ID, Technical University of Lisbon, Rua Alves Redol, 9, 1000-029 Lisboa, Portugal
| | - JOÃO MARQUES-SILVA
- School of Electronics and Computer Science, University of Southampton, Highfield, Southampton SO17 1BJ, United Kingdom
| |
Collapse
|
7
|
Huang YT, Wu MH. Inference of chromosome-specific copy numbers using population haplotypes. BMC Bioinformatics 2011; 12:194. [PMID: 21605463 PMCID: PMC3128032 DOI: 10.1186/1471-2105-12-194] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2010] [Accepted: 05/24/2011] [Indexed: 12/28/2022] Open
Abstract
Background Using microarray and sequencing platforms, a large number of copy number variations (CNVs) have been identified in humans. In practice, because our human genome is a diploid, these platforms are limited to or more accurate for detecting total copy numbers rather than chromosome-specific copy numbers at each of the two homologous chromosomes. Nevertheless, the analysis of linkage disequilibrium (LD) between CNVs and SNPs indicates that distinct copy numbers often sit on their own background haplotypes. Results We propose new computational models for inferring chromosome-specific copy numbers by distinguishing background haplotypes of each copy number. The formulated problems are shown to be NP-hard and approximation/heuristic algorithms are developed. Simulation indicates that our method is accurate and outperforms the existing approach. By testing the program in 60 parent-offspring trios, the inferred chromosome-specific copy numbers are highly consistent with the law of Mendelian inheritance. The distributions of copy numbers at chromosomal level are provided for 270 individuals in three HapMap panels. Conclusions The estimation of chromosome-specific copy numbers using microarray or sequencing platforms was often confounded by a number of factors. This study showed that the integration of background haplotypes is able to improve the accuracies of copy number estimation at chromosome level, especially for the CNVs having strong LD with SNPs in proximity.
Collapse
Affiliation(s)
- Yao-Ting Huang
- Department of Computer Science and Information Engineering, National Chung Cheng University, Chia-Yi, Taiwan.
| | | |
Collapse
|
8
|
Wang IL, Chang CY. Mathematical properties and bounds on haplotyping populations by pure parsimony. Math Biosci 2011; 231:120-5. [PMID: 21354185 DOI: 10.1016/j.mbs.2011.02.008] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/08/2009] [Revised: 08/20/2010] [Accepted: 02/18/2011] [Indexed: 10/18/2022]
Abstract
Although the haplotype data can be used to analyze the function of DNA, due to the significant efforts required in collecting the haplotype data, usually the genotype data is collected and then the population haplotype inference (PHI) problem is solved to infer haplotype data from genotype data for a population. This paper investigates the PHI problem based on the pure parsimony criterion (HIPP), which seeks the minimum number of distinct haplotypes to infer a given genotype data. We analyze the mathematical structure and properties for the HIPP problem, propose techniques to reduce the given genotype data into an equivalent one of much smaller size, and analyze the relations of genotype data using a compatible graph. Based on the mathematical properties in the compatible graph, we propose a maximal clique heuristic to obtain an upper bound, and a new polynomial-sized integer linear programming formulation to obtain a lower bound for the HIPP problem.
Collapse
Affiliation(s)
- I-Lin Wang
- Department of Industrial and Information Management, National Cheng Kung University, No.1 University Rd., Tainan 701, Taiwan.
| | | |
Collapse
|
9
|
Graça A, Lynce I, Marques-Silva J, Oliveira AL. Haplotype inference by Pure Parsimony: a survey. J Comput Biol 2010; 17:969-92. [PMID: 20726791 DOI: 10.1089/cmb.2009.0101] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Given a set of genotypes from a population, the process of recovering the haplotypes that explain the genotypes is called haplotype inference. The haplotype inference problem under the assumption of pure parsimony consists in finding the smallest number of haplotypes that explain a given set of genotypes. This problem is NP-hard. The original formulations for solving the Haplotype Inference by Pure Parsimony (HIPP) problem were based on integer linear programming and branch-and-bound techniques. More recently, solutions based on Boolean satisfiability, pseudo-Boolean optimization, and answer set programming have been shown to be remarkably more efficient. HIPP can now be regarded as a feasible approach for haplotype inference, which can be competitive with other different approaches. This article provides an overview of the methods for solving the HIPP problem, including preprocessing, bounding techniques, and heuristic approaches. The article also presents an empirical evaluation of exact HIPP solvers on a comprehensive set of synthetic and real problem instances. Moreover, the bounding techniques to the exact problem are evaluated. The final section compares and discusses the HIPP approach with a well-established statistical method that represents the reference algorithm for this problem.
Collapse
Affiliation(s)
- Ana Graça
- Instituto Superior Técnico (IST), Technical University of Lisbon, INESC-ID Lisboa, Lisbon, Portugal.
| | | | | | | |
Collapse
|
10
|
Climer S, Jäger G, Templeton AR, Zhang W. How frugal is Mother Nature with haplotypes? Bioinformatics 2009; 25:68-74. [PMID: 18987010 DOI: 10.1093/bioinformatics/btn572] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/18/2023] Open
Abstract
MOTIVATION Inference of haplotypes from genotype data is crucial and challenging for many vitally important studies. The first, and most critical step, is the ascertainment of a biologically sound model to be optimized. Many models that have been proposed rely partially or entirely on reducing the number of unique haplotypes in the solution. RESULTS This article examines the parsimony of haplotypes using known haplotypes as well as genotypes from the HapMap project. Our study reveals that there are relatively few unique haplotypes, but not always the least possible, for the datasets with known solutions. Furthermore, we show that there are frequently very large numbers of parsimonious solutions, and the number increases exponentially with increasing cardinality. Moreover, these solutions are quite varied, most of which are not consistent with the true solutions. These results quantify the limitations of the Pure Parsimony model and demonstrate the imperative need to consider additional properties for haplotype inference models. At a higher level, and with broad applicability, this article illustrates the power of combinatorial methods to tease out imperfections in a given biological model.
Collapse
Affiliation(s)
- Sharlee Climer
- Department of Computer Science and Engineering, Washington University, St. Louis, MO, USA
| | | | | | | |
Collapse
|
11
|
Huang YT, Chao KM. A new framework for the selection of tag SNPs by multimarker haplotypes. J Biomed Inform 2008; 41:953-61. [DOI: 10.1016/j.jbi.2008.04.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2007] [Revised: 03/31/2008] [Accepted: 04/04/2008] [Indexed: 10/22/2022]
|
12
|
van Iersel L, Keijsper J, Kelk S, Stougie L. Shorelines of islands of tractability: algorithms for parsimony and minimum perfect phylogeny haplotyping problems. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2008; 5:301-312. [PMID: 18451439 DOI: 10.1109/tcbb.2007.70232] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/26/2023]
Abstract
The problem Parsimony Haplotyping (PH) asks for the smallest set of haplotypes which can explain a given set of genotypes, and the problem Minimum Perfect Phylogeny Haplotyping (MPPH) asks for the smallest such set which also allows the haplotypes to be embedded in a perfect phylogeny, an evolutionary tree with biologically-motivated restrictions. For PH, we extend recent work by further mapping the interface between ;;easy'' and ;;hard'' instances, within the framework of (k,l)-bounded instances where the number of 2's per column and row of the input matrix is restricted. By exploring, in the same way, the tractability frontier of MPPH we provide the first concrete, positive results for this problem. In addition, we construct for both PH and MPPH polynomial time approximation algorithms, based on properties of the columns of the input matrix.
Collapse
Affiliation(s)
- Leo van Iersel
- Technische Universiteit Eindhoven, Eindhoven, The Netherlands.
| | | | | | | |
Collapse
|
13
|
Huang ZS, Ji YJ, Zhang DX. Haplotype reconstruction for scnp DNA: a consensus vote approach with extensive sequence data from populations of the migratory locust (Locusta migratoria). Mol Ecol 2008; 17:1930-47. [PMID: 18346127 DOI: 10.1111/j.1365-294x.2008.03730.x] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
Single copy nuclear polymorphic (scnp) DNA is potentially a powerful molecular marker for evolutionary studies of populations. However, a practical obstacle to its employment is the general problem of haplotype determination due to the common occurrence of heterozygosity in diploid organisms. We explore here a 'consensus vote' (CV) approach to this question, combining statistical haplotype reconstruction and experimental verification using as an example an indel-free scnp DNA marker from the flanking region of a microsatellite locus of the migratory locust. The raw data comprise 251-bp sequences from 526 locust individuals (1052 chromosomes), with 71 (28.3%) polymorphic nucleotide sites (including seven triallelic sites) and 141 distinct genotypes (with frequencies ranging from 0.2 to 25.5%). Six representative statistical haplotype reconstruction algorithms are employed in our CV approach, including one parsimony method, two expectation-maximization (EM) methods and three Bayesian methods. The phases of 116 ambiguous individuals inferred by this approach are verified by molecular cloning experiments. We demonstrate the effectiveness of the CV approach compared to inferences based on individual statistical algorithms. First, it has the unique power to partition the inferrals into a reliable group and an uncertain group, thereby allowing the identification of the inferrals with greater uncertainty (12.7% of the total sample in this case). This considerably reduces subsequent efforts of experimental verification. Second, this approach is capable of handling genotype data pooled from many geographical populations, thus tolerating heterogeneity of genetic diversity among populations. Third, the performance of the CV approach is not influenced by the number of heterozygous sites in the ambiguous genotypes. Therefore, the CV approach is potentially a reliable strategy for effective haplotype determination of nuclear DNA markers. Our results also show that rare variations and rare inferrals tend to be more vulnerable to inference error, and hence deserve extra surveillance.
Collapse
Affiliation(s)
- Zu-Shi Huang
- State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
| | | | | |
Collapse
|
14
|
Abstract
Association methods based on linkage disequilibrium (LD) offer a promising approach for detecting genetic variations that are responsible for complex human diseases. Although methods based on individual single nucleotide polymorphisms (SNPs) may lead to significant findings, methods based on haplotypes comprising multiple SNPs on the same inherited chromosome may provide additional power for mapping disease genes and also provide insight on factors influencing the dependency among genetic markers. Such insights may provide information essential for understanding human evolution and also for identifying cis-interactions between two or more causal variants. Because obtaining haplotype information directly from experiments can be cost prohibitive in most studies, especially in large scale studies, haplotype analysis presents many unique challenges. In this chapter, we focus on two main issues: haplotype inference and haplotype-association analysis. We first provide a detailed review of methods for haplotype inference using unrelated individuals as well as related individuals from pedigrees. We then cover a number of statistical methods that employ haplotype information in association analysis. In addition, we discuss the advantages and limitations of different methods.
Collapse
Affiliation(s)
- Nianjun Liu
- Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL 35294, USA
| | | | | |
Collapse
|
15
|
Efficient and Tight Upper Bounds for Haplotype Inference by Pure Parsimony Using Delayed Haplotype Selection. PROGRESS IN ARTIFICIAL INTELLIGENCE 2007. [DOI: 10.1007/978-3-540-77002-2_52] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
|
16
|
Yoo YJ, Tang J, Kaslow RA, Zhang K. Haplotype inference for present-absent genotype data using previously identified haplotypes and haplotype patterns. ACTA ACUST UNITED AC 2007; 23:2399-406. [PMID: 17644820 DOI: 10.1093/bioinformatics/btm371] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/29/2023]
Abstract
MOTIVATION Killer immunoglobulin-like receptor (KIR) genes vary considerably in their presence or absence on a specific regional haplotype. Because presence or absence of these genes is largely detected using locus-specific genotyping technology, the distinction between homozygosity and hemizygosity is often ambiguous. The performance of methods for haplotype inference (e.g. PL-EM, PHASE) for KIR genes may be compromised due to the large portion of ambiguous data. At the same time, many haplotypes or partial haplotype patterns have been previously identified and can be incorporated to facilitate haplotype inference for unphased genotype data. To accommodate the increased ambiguity of present-absent genotyping of KIR genes, we developed a hybrid approach combining a greedy algorithm with the Expectation-Maximization (EM) method for haplotype inference based on previously identified haplotypes and haplotype patterns. RESULTS We implemented this algorithm in a software package named HAPLO-IHP (Haplotype inference using identified haplotype patterns) and compared its performance with that of HAPLORE and PHASE on simulated KIR genotypes. We compared five measures in order to evaluate the reliability of haplotype assignments and the accuracy in estimating haplotype frequency. Our method outperformed the two existing techniques by all five measures when either 60% or 25% of previously identified haplotypes were incorporated into the analyses. AVAILABILITY The HAPLO-IHP is available at http://www.soph.uab.edu/Statgenetics/People/KZhang/HAPLO-IHP/index.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Yun Joo Yoo
- Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL 35294, USA
| | | | | | | |
Collapse
|
17
|
Brown DG, Harrower IM. Integer programming approaches to haplotype inference by pure parsimony. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2006; 3:141-54. [PMID: 17048400 DOI: 10.1109/tcbb.2006.24] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/12/2023]
Abstract
In 2003, Gusfield introduced the Haplotype Inference by Pure Parsimony (HIPP) problem and presented an integer program (IP) that quickly solved many simulated instances of the problem. Although it solved well on small instances, Gusfield's IP can be of exponential size in the worst case. Several authors have presented polynomial-sized IPs for the problem. In this paper, we further the work on IP approaches to HIPP. We extend the existing polynomial-sized IPs by introducing several classes of valid cuts for the IP. We also present a new polynomial-sized IP formulation that is a hybrid between two existing IP formulations and inherits many of the strengths of both. Many problems that are too complex for the exponential-sized formulations can still be solved in our new formulation in a reasonable amount of time. We provide a detailed empirical comparison of these IP formulations on both simulated and real genotype sequences. Our formulation can also be extended in a variety of ways to allow errors in the input or model the structure of the population under consideration.
Collapse
Affiliation(s)
- Daniel G Brown
- David R. Cheriton School of Computer Science, University of Waterloo, 200 University Ave. West, Waterloo, ON N2L 3G1 Canada.
| | | |
Collapse
|