1. Aronoff-Spencer E, Mazrouee S, Graham R, Handcock MS, Nguyen K, Nebeker C, Malekinejad M, Longhurst CA. Exposure notification system activity as a leading indicator for SARS-COV-2 caseload forecasting. PLoS One 2023; 18:e0287368. [PMID: 37594936 PMCID: PMC10437830 DOI: 10.1371/journal.pone.0287368] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2022] [Accepted: 05/29/2023] [Indexed: 08/20/2023] Open
Abstract
PURPOSE: Digital methods to augment traditional contact tracing approaches were developed and deployed globally during the COVID-19 pandemic. These "Exposure Notification (EN)" systems present new opportunities to support public health interventions. To date, there have been attempts to model the impact of such systems, yet no reports have explored the value of real-time system data for predictive epidemiological modeling.
METHODS: We investigated the potential for short-term forecasting of COVID-19 caseloads using data from California's implementation of the Google Apple Exposure Notification (GAEN) platform, branded as CA Notify. CA Notify is a digital public health intervention leveraging residents' smartphones for anonymous EN. We extended a published statistical model that uses prior case counts to predict short-term future case counts, and then added EN activity to test for improved forecast performance. Additional predictive value was assessed by comparing the forecasts of models with and without EN activity against the actual reported caseloads 1 to 7 days in the future.
RESULTS: The time series show noticeable evidence of a temporal association between system activity and caseloads. Incorporating earlier ENs in our model improved prediction of the caseload counts. Using Bayesian inference, we found nonzero influence of EN terms with probability one. Furthermore, we found a reduction in both the mean absolute percentage error and the mean squared prediction error, the latter of at least 5% and up to 32%, when using ENs compared with the model without them.
CONCLUSIONS: This preliminary investigation suggests smartphone-based ENs can significantly improve the accuracy of short-term forecasting. These predictive models can be readily deployed as local early warning systems to triage resources and interventions.
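The two error measures named in the results can be illustrated with a minimal sketch. It uses the standard definitions of mean absolute percentage error (MAPE) and mean squared prediction error (MSPE); the function names and the `relative_reduction` helper are assumptions made for illustration and are not taken from the paper, which does not describe its implementation in this abstract.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent (standard definition)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def mspe(actual, predicted):
    """Mean squared prediction error (standard definition)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def relative_reduction(baseline_error, candidate_error):
    """Percent reduction of candidate_error relative to baseline_error.

    Hypothetical helper: one plausible way to express a statement such as
    "MSPE was reduced by X% when EN activity was added to the model".
    """
    return 100.0 * (baseline_error - candidate_error) / baseline_error
```

Under these assumptions, a comparison would compute `mspe` for the forecasts of the model without EN terms and for the model with them at each horizon from 1 to 7 days, then pass the two values to `relative_reduction`.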
Affiliation(s)
- Eliah Aronoff-Spencer
  - School of Medicine, Division of Infectious Diseases and Global Public Health, University of California San Diego, La Jolla, CA, United States of America
- Sepideh Mazrouee
  - School of Medicine, Division of Infectious Diseases and Global Public Health, University of California San Diego, La Jolla, CA, United States of America
- Rishi Graham
  - School of Medicine, Division of Infectious Diseases and Global Public Health, University of California San Diego, La Jolla, CA, United States of America
- Mark S. Handcock
  - University of California Los Angeles, Los Angeles, CA, United States of America
- Kevin Nguyen
  - Herbert Wertheim School of Public Health and Longevity Sciences, University of California San Diego, La Jolla, CA, United States of America
  - University of California San Diego Health, San Diego, CA, United States of America
- Camille Nebeker
  - Herbert Wertheim School of Public Health and Longevity Sciences, University of California San Diego, La Jolla, CA, United States of America
- Mohsen Malekinejad
  - California Department of Public Health, Sacramento, CA, United States of America
  - Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA, United States of America
2. Mazrouee S, Hallmark CJ, Mora R, Del Vecchio N, Carrasco Hernandez R, Carr M, McNeese M, Fujimoto K, Wertheim JO. Impact of molecular sequence data completeness on HIV cluster detection and a network science approach to enhance detection. Sci Rep 2022; 12:19230. [PMID: 36357480 PMCID: PMC9648870 DOI: 10.1038/s41598-022-21924-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/28/2022] [Accepted: 10/05/2022] [Indexed: 11/11/2022] Open
Abstract
Detection of viral transmission clusters using molecular epidemiology is critical to the response pillar of the Ending the HIV Epidemic initiative. Here, we studied whether inference with an incomplete dataset would influence the accuracy of the reconstructed molecular transmission network. We analyzed viral sequence data available from ~13,000 individuals with diagnosed HIV (2012-2019) from Houston Health Department surveillance data with 53% completeness (n = 6852 individuals with sequences). We extracted random subsamples and compared the resulting reconstructed networks with the full-size network. Increasing simulated completeness was associated with an increase in the number of detected clusters. We also subsampled based on each node's influence on transmission of the virus, measured as the Expected Force (ExF) of each node in the network. We simulated the removal of nodes with the highest and then the lowest ExF from the full dataset and found that 4.7% and 60% of priority clusters were detected, respectively. These results highlight the non-uniform impact of capturing high-influence nodes in identifying transmission clusters. Although increasing sequence reporting completeness is the way to fully detect HIV transmission patterns, reaching high completeness has remained challenging in the real world. Hence, we suggest taking a network science approach to enhance the performance of molecular cluster detection, augmented by node influence information.
Affiliation(s)
- Sepideh Mazrouee
  - Department of Medicine, University of California San Diego, San Diego, CA, USA
- Rocio Carrasco Hernandez
  - Department of Medicine, University of California San Diego, San Diego, CA, USA
  - Instituto Nacional de Enfermedades Respiratorias "Ismael Cosío Villegas", Mexico City, México
- Kayo Fujimoto
  - Department of Health Promotion and Behavioral Sciences, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Joel O Wertheim
  - Department of Medicine, University of California San Diego, San Diego, CA, USA
3. Mazrouee S. ARHap: Association Rule Haplotype Phasing. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:3281-3294. [PMID: 34648456 DOI: 10.1109/tcbb.2021.3119955] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
This article proposes a novel approach for individual human phasing through discovery of interesting hidden relations among single variant sites. The proposed framework, called ARHap, learns strong association rules among variant loci on the genome and develops a combinatorial approach for fast and accurate haplotype phasing based on the discovered associations. ARHap is composed of two main modules or processing phases. In the first phase, called association rule learning, ARHap identifies quantitative association rules from a collection of DNA reads of the organism under study, resulting in a set of strong rules that reveal the inter-dependency of alleles. In the next phase, called haplotype reconstruction, we develop algorithms that use the learned rules to construct highly reliable haplotypes at individual single nucleotide polymorphism (SNP) sites. ARHap has several features that lead to both fast and accurate haplotyping. It uses an incremental haplotype reconstruction approach that generates association rules according to the unreconstructed SNP sites during each round of the algorithm. During each round, the association rule learning module generates rules while constraining their length and limiting them to those that contribute to reconstruction of unreconstructed sites only. The framework begins by generating short, very strong rules. The rule length can increase and/or the rule-strength criteria can be adjusted gradually during subsequent rounds if some SNP sites remain unreconstructed. This adaptive approach, which uses feedback from the haplotype reconstruction module, eliminates generation of rules that do not contribute to haplotype reconstruction as well as weak rules that may introduce error in the final haplotypes. Extensive experimental analyses on datasets representing diploid organisms demonstrate the superiority of ARHap in diploid haplotyping compared to the state-of-the-art algorithms. In particular, we show that this novel approach to haplotype phasing not only is fast but also achieves significantly better accuracy compared to other read-based computational approaches.
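To make the idea of a "strong" association rule between alleles concrete, the following is a minimal sketch using the textbook definitions of support and confidence. The read representation (a dict from SNP site index to allele), the function name `rule_stats`, and the restriction to reads covering both sites are assumptions for illustration; the abstract does not specify ARHap's actual quantitative rule criteria or data structures.

```python
def rule_stats(reads, antecedent, consequent):
    """Support and confidence of the rule antecedent => consequent.

    reads      : list of dicts mapping SNP site index -> observed allele (0 or 1)
    antecedent : (site, allele) pair on the left-hand side of the rule
    consequent : (site, allele) pair on the right-hand side of the rule

    Only reads that cover both sites are counted (an assumption of this sketch).
    """
    (site_a, allele_a), (site_b, allele_b) = antecedent, consequent
    covering = [r for r in reads if site_a in r and site_b in r]
    if not covering:
        return 0.0, 0.0
    with_antecedent = [r for r in covering if r[site_a] == allele_a]
    with_both = [r for r in with_antecedent if r[site_b] == allele_b]
    support = len(with_both) / len(covering)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence
```

In a scheme like the one described, a rule would be kept only if both values clear chosen thresholds, and those thresholds are what could be relaxed in later rounds when some sites remain unreconstructed.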
4. Aronoff-Spencer E, Nebeker C, Wenzel AT, Nguyen K, Kunowski R, Zhu M, Adamos G, Goyal R, Mazrouee S, Reyes A, May N, Howard H, Longhurst CA, Malekinejad M. Defining Key Performance Indicators for the California COVID-19 Exposure Notification System (CA Notify). Public Health Rep 2022; 137:67S-75S. [PMID: 36314660 PMCID: PMC9678789 DOI: 10.1177/00333549221129354] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022] Open
Abstract
OBJECTIVES: Toward common methods for system monitoring and evaluation, we proposed a key performance indicator framework and discussed lessons learned while implementing a statewide exposure notification (EN) system in California during the COVID-19 epidemic.
MATERIALS AND METHODS: California deployed the Google Apple Exposure Notification framework, branded CA Notify, on December 10, 2020, to supplement traditional COVID-19 contact tracing programs. For system evaluation, we defined 6 key performance indicators: adoption, retention, sharing of unique codes, identification of potential contacts, behavior change, and impact. We aggregated and analyzed data from December 10, 2020, to July 1, 2021, in compliance with the CA Notify privacy policy.
RESULTS: We estimated CA Notify adoption at nearly 11 million smartphone activations during the study period. Among 1 654 201 CA Notify users who received a positive test result for SARS-CoV-2, 446 634 (27%) shared their unique code, leading to ENs for other CA Notify users who were in close proximity to the SARS-CoV-2-positive individual. We identified at least 122 970 CA Notify users as contacts through this process. Contact identification occurred a median of 4 days after symptom onset or specimen collection date of the user who received a positive test result for SARS-CoV-2.
PRACTICE IMPLICATIONS: Smartphone-based EN systems are promising new tools to supplement traditional contact tracing and public health interventions, particularly when efficient scaling is not feasible for other approaches. Methods to collect and interpret appropriate measures of system performance must be refined while maintaining trust and privacy.
Affiliation(s)
- Eliah Aronoff-Spencer
  - Division of Infectious Diseases and Global Public Health, School of Medicine, University of California San Diego, La Jolla, CA, USA
  - University of California San Diego Health, La Jolla, CA, USA
  - The Design Lab, University of California San Diego, La Jolla, CA, USA
- Camille Nebeker
  - The Design Lab, University of California San Diego, La Jolla, CA, USA
  - Herbert Wertheim School of Public Health and Human Longevity Science, University of California San Diego, La Jolla, CA, USA
- Alexander T. Wenzel
  - Department of Biomedical Informatics, School of Medicine, University of California San Diego, La Jolla, CA, USA
- Kevin Nguyen
  - University of California San Diego Health, La Jolla, CA, USA
  - Herbert Wertheim School of Public Health and Human Longevity Science, University of California San Diego, La Jolla, CA, USA
- Rachel Kunowski
  - University of California San Diego Health, La Jolla, CA, USA
- Mingjia Zhu
  - University of California San Diego Health, La Jolla, CA, USA
- Gary Adamos
  - University of California San Diego Health, La Jolla, CA, USA
- Ravi Goyal
  - Division of Infectious Diseases and Global Public Health, School of Medicine, University of California San Diego, La Jolla, CA, USA
- Sepideh Mazrouee
  - Division of Infectious Diseases and Global Public Health, School of Medicine, University of California San Diego, La Jolla, CA, USA
- Aaron Reyes
  - University of California San Diego Health, La Jolla, CA, USA
- Nicole May
  - University of California San Diego Health, La Jolla, CA, USA
- Holly Howard
  - California Connected, Center for Infectious Diseases, California Department of Public Health, Richmond, CA, USA
  - Institute for Global Health Sciences, University of California San Francisco, San Francisco, CA, USA
- Christopher A. Longhurst
  - Department of Biomedical Informatics, School of Medicine, University of California San Diego, La Jolla, CA, USA
  - Department of Pediatrics, School of Medicine, University of California San Diego, La Jolla, CA, USA
- Mohsen Malekinejad
  - California Connected, Center for Infectious Diseases, California Department of Public Health, Richmond, CA, USA
  - Institute for Global Health Sciences, University of California San Francisco, San Francisco, CA, USA
  - Department of Epidemiology and Biostatistics, University of California San Francisco, San Francisco, CA, USA
5. Mazrouee S, Little SJ, Wertheim JO. Incorporating metadata in HIV transmission network reconstruction: A machine learning feasibility assessment. PLoS Comput Biol 2021; 17:e1009336. [PMID: 34550966 PMCID: PMC8457453 DOI: 10.1371/journal.pcbi.1009336] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/23/2021] [Accepted: 08/09/2021] [Indexed: 12/30/2022] Open
Abstract
HIV molecular epidemiology estimates transmission patterns by clustering genetically similar viruses. The process involves connecting genetically similar genotyped viral sequences in a network, implying epidemiological transmissions. This technique relies on genotype data, which are collected only from HIV-diagnosed and in-care populations, and leaves many persons with HIV (PWH) who have no access to consistent care out of the tracking process. We use machine learning algorithms to learn the non-linear correlation patterns between patient metadata and transmissions between HIV-positive cases. This enables us to expand the transmission network reconstruction beyond the molecular network. We employed multiple commonly used supervised classification algorithms to analyze the San Diego Primary Infection Resource Consortium (PIRC) cohort dataset, consisting of genotypes and nearly 80 additional non-genetic features. First, we trained classification models to distinguish genetically unrelated individuals from related ones. Our results show that random forest and decision tree achieved over 80% in accuracy, precision, recall, and F1-score using only a subset of meta-features, including age, birth sex, sexual orientation, race, transmission category, estimated date of infection, and first viral load date, besides genetic data. Additionally, both algorithms achieved approximately 80% sensitivity and specificity. The area under the curve (AUC) was 97% and 94% for the random forest and decision tree classifiers, respectively. Next, we extended the models to identify clusters of similar viral sequences. A support vector machine demonstrated a one-order-of-magnitude improvement in the accuracy of assigning sequences to the correct cluster compared with a dummy uniform random classifier. These results confirm that metadata carries important information about the dynamics of HIV transmission as embedded in transmission clusters. Hence, novel computational approaches are needed to apply the non-trivial knowledge collected from inter-individual genetic information to metadata from PWH in order to expand the estimated transmissions. We note that feature extraction alone will not be effective in identifying patterns of transmission and will result in random clustering of the data, but its use in conjunction with genetic data and the right algorithm can contribute to the expansion of the reconstructed network beyond individuals with genetic data.
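For readers less familiar with the evaluation measures listed above, here is a minimal sketch of how accuracy, precision, recall (sensitivity), specificity, and F1-score follow from a binary confusion matrix. These are the standard definitions; the function names and the 0/1 label encoding are assumptions for illustration and do not reproduce the study's code or data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification task."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred, positive=1):
    """Standard metrics derived from the confusion matrix (0.0 when undefined)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # also called sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }
```

In a setting like the one described, `y_true` could encode whether a pair of individuals is genetically related and `y_pred` the classifier's call for that pair; that encoding is hypothetical.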
Affiliation(s)
- Sepideh Mazrouee
  - Department of Medicine, Division of Infectious Diseases and Global Public Health, University of California San Diego, San Diego, California, United States
- Susan J. Little
  - Department of Medicine, Division of Infectious Diseases and Global Public Health, University of California San Diego, San Diego, California, United States
- Joel O. Wertheim
  - Department of Medicine, Division of Infectious Diseases and Global Public Health, University of California San Diego, San Diego, California, United States
6.
Abstract
Phasing is an emerging area in computational biology with important applications in clinical decision making and biomedical sciences. While machine learning techniques have shown tremendous potential in many biomedical applications, their utility in phasing has not yet been fully understood. In this paper, we investigate the development of clustering-based techniques for phasing in polyploid organisms, where more than two copies of each chromosome exist in the cells of the organism under study. We develop a novel framework, called PolyCluster, based on the concept of correlation clustering followed by an effective cluster merging mechanism to minimize the amount of disagreement among short reads residing in each cluster. We first introduce a graph model to quantify the amount of similarity between each pair of DNA reads. We then present a combination of linear programming, rounding, region-growing, and cluster merging to group similar reads and reconstruct haplotypes. Our extensive analysis demonstrates the effectiveness of PolyCluster in accurate and scalable phasing. In particular, we show that PolyCluster reduces the switching error of H-PoP, HapColor, and HapTree by 44.4, 51.2, and 48.3 percent, respectively. Also, the running time of PolyCluster is several orders of magnitude less than that of HapTree, while it achieves a running time comparable to other algorithms.
7. Hezarjaribi N, Mazrouee S, Hemati S, Chaytor NS, Perrigue M, Ghasemzadeh H. Human-in-the-loop Learning for Personalized Diet Monitoring from Unstructured Mobile Data. ACM T INTERACT INTEL 2019. [DOI: 10.1145/3319370] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 01/04/2023]
Abstract
Lifestyle interventions with the focus on diet are crucial in self-management and prevention of many chronic conditions, such as obesity, cardiovascular disease, diabetes, and cancer. Such interventions require a diet monitoring approach to estimate overall dietary composition and energy intake. Although wearable sensors have been used to estimate eating context (e.g., food type and eating time), accurate monitoring of dietary intake has remained a challenging problem. In particular, because monitoring dietary intake is a self-administered task that involves the end-user to record or report their nutrition intake, current diet monitoring technologies are prone to measurement errors related to challenges of human memory, estimation, and bias. New approaches based on mobile devices have been proposed to facilitate the process of dietary intake recording. These technologies require individuals to use mobile devices such as smartphones to record nutrition intake by either entering text or taking images of the food. Such approaches, however, suffer from errors due to low adherence to technology adoption and time sensitivity to the dietary intake context.
In this article, we introduce EZNutriPal, an interactive diet monitoring system that operates on unstructured mobile data such as speech and free-text to facilitate dietary recording, real-time prompting, and personalized nutrition monitoring. EZNutriPal features a natural language processing unit that learns incrementally to add user-specific nutrition data and rules to the system. To prevent missing data that are required for dietary monitoring (e.g., calorie intake estimation), EZNutriPal devises an interactive operating mode that prompts the end-user to complete missing data in real-time. Additionally, we propose a combinatorial optimization approach to identify the most appropriate pairs of food names and food quantities in complex input sentences. We evaluate the performance of EZNutriPal using real data collected from 23 human subjects who participated in two user studies conducted in 13 days each. The results demonstrate that EZNutriPal achieves an accuracy of 89.7% in calorie intake estimation. We also assess the impacts of the incremental training and interactive prompting technologies on the accuracy of nutrient intake estimation and show that incremental training and interactive prompting improve the performance of diet monitoring by 49.6% and 29.1%, respectively, compared to a system without such computing units.
8. Hezarjaribi N, Dutta R, Xing T, Murdoch GK, Mazrouee S, Mortazavi BJ, Ghasemzadeh H. Monitoring Lung Mechanics during Mechanical Ventilation using Machine Learning Algorithms. Annu Int Conf IEEE Eng Med Biol Soc 2019; 2018:1160-1163. [PMID: 30440597 DOI: 10.1109/embc.2018.8512483] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/07/2022]
Abstract
Evaluation of lung mechanics is the primary component for designing lung protective optimal ventilation strategies. This paper presents a machine learning approach for bedside assessment of respiratory resistance (R) and compliance (C). We develop machine learning algorithms to track flow rate and airway pressure and estimate R and C continuously and in real-time. An experimental study is conducted, by connecting a pressure control ventilator to a test lung that simulates various R and C values, to gather sensor data for validation of the devised algorithms. We develop supervised learning algorithms based on decision tree, decision table, and Support Vector Machine (SVM) techniques to predict R and C values. Our experimental results demonstrate that the proposed algorithms achieve 90.3%, 93.1%, and 63.9% accuracy in assessing respiratory R and C using decision table, decision tree, and SVM, respectively. These results along with our ability to estimate R and C with 99.4% accuracy using a linear regression model demonstrate the potential of the proposed approach for constructing a new generation of ventilation technologies that leverage novel computational models to control their underlying parameters for personalized healthcare and context-aware interventions.
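One way to see why a linear regression can recover R and C from flow and airway pressure: under the standard single-compartment equation of motion, pressure is modeled as P = R * flow + volume / C + P0, which is linear in R, 1/C, and P0. The sketch below fits those three coefficients by ordinary least squares in plain Python. The model form, the function names, and the units (flow in L/s, volume in L, pressure in cmH2O) are assumptions for illustration; the abstract does not state which regression formulation or features the authors used.

```python
def solve_linear(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with partial pivoting.

    Raises ZeroDivisionError if the system is singular.
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_resistance_compliance(flow, volume, pressure):
    """Least-squares fit of P = R * flow + volume / C + P0.

    Returns (R, C, P0). Assumes the single-compartment model; the samples
    must vary enough in flow and volume for the fit to be well posed.
    """
    rows = [[f, v, 1.0] for f, v in zip(flow, volume)]
    # Normal equations: (A^T A) x = A^T p
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atp = [sum(r[i] * p for r, p in zip(rows, pressure)) for i in range(3)]
    resistance, elastance, baseline = solve_linear(ata, atp)
    return resistance, 1.0 / elastance, baseline
```

With noise-free synthetic samples generated from known R, C, and P0, the fit returns those values up to floating-point error; on real ventilator signals the residual of the fit gives a rough check of how well the single-compartment assumption holds.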
9. Hezarjaribi N, Mazrouee S, Ghasemzadeh H. Speech2Health: A Mobile Framework for Monitoring Dietary Composition From Spoken Data. IEEE J Biomed Health Inform 2018; 22:252-264. [PMID: 29300701 DOI: 10.1109/jbhi.2017.2709333] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Indexed: 11/08/2022]
Abstract
Diet and physical activity are known as important lifestyle factors in self-management and prevention of many chronic diseases. Mobile sensors such as accelerometers have been used to measure physical activity or detect eating time. In many intervention studies, however, stringent monitoring of overall dietary composition and energy intake is needed. Currently, such a monitoring relies on self-reported data by either entering text or taking an image that represents food intake. These approaches suffer from limitations such as low adherence in technology adoption and time sensitivity to the diet intake context. In order to address these limitations, we introduce development and validation of Speech2Health, a voice-based mobile nutrition monitoring system that devises speech processing, natural language processing (NLP), and text mining techniques in a unified platform to facilitate nutrition monitoring. After converting the spoken data to text, nutrition-specific data are identified within the text using an NLP-based approach that combines standard NLP with our introduced pattern mapping technique. We then develop a tiered matching algorithm to search the food name in our nutrition database and accurately compute calorie intake values. We evaluate Speech2Health using real data collected with 30 participants. Our experimental results show that Speech2Health achieves an accuracy of 92.2% in computing calorie intake. Furthermore, our user study demonstrates that Speech2Health achieves significantly higher scores on technology adoption metrics compared to text-based and image-based nutrition monitoring. Our research demonstrates that new sensor modalities such as voice can be used either standalone or as a complementary source of information to existing modalities to improve the accuracy and acceptability of mobile health technologies for dietary composition monitoring.
10. Lee CJ, Toven-Lindsey B, Shapiro C, Soh M, Mazrouee S, Levis-Fitzgerald M, Sanders ER. Error-Discovery Learning Boosts Student Engagement and Performance, while Reducing Student Attrition in a Bioinformatics Course. CBE Life Sci Educ 2018; 17:ar40. [PMID: 30040529 PMCID: PMC6234822 DOI: 10.1187/cbe.17-04-0061] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 04/05/2017] [Revised: 04/19/2018] [Accepted: 04/24/2018] [Indexed: 05/20/2023]
Abstract
We sought to test a hypothesis that systemic blind spots in active learning are a barrier both for instructors (who cannot see what every student is actually thinking on each concept in each class) and for students (who often cannot tell precisely whether their thinking is right or wrong, let alone exactly how to fix it). We tested a strategy for eliminating these blind spots by having students answer open-ended, conceptual problems using a Web-based platform, and measured the effects on student attrition, engagement, and performance. In 4 years of testing both in class and using an online platform, this approach revealed (and provided specific resolution lessons for) more than 200 distinct conceptual errors, dramatically increased average student engagement, and reduced student attrition by approximately fourfold compared with the original lecture course format (down from 48.3% to 11.4%), especially for women undergraduates (down from 73.1% to 7.4%). Median exam scores increased from 53% to 72-80%, and the bottom half of students boosted their scores to the range in which the top half had scored before the pedagogical switch. By contrast, in our control year with the same active-learning content (but without this "zero blind spots" approach), these gains were not observed.
Affiliation(s)
- Christopher J. Lee
  - Department of Chemistry and Biochemistry, University of California, Los Angeles, Los Angeles, CA 90095
  - Department of Computer Science, School of Engineering and Applied Sciences, University of California, Los Angeles, Los Angeles, CA 90095
- Brit Toven-Lindsey
  - Center for Educational Assessment, Office of Instructional Development, University of California, Los Angeles, Los Angeles, CA 90095
- Casey Shapiro
  - Center for Educational Assessment, Office of Instructional Development, University of California, Los Angeles, Los Angeles, CA 90095
- Michael Soh
  - Center for Educational Assessment, Office of Instructional Development, University of California, Los Angeles, Los Angeles, CA 90095
- Sepideh Mazrouee
  - Department of Computer Science, School of Engineering and Applied Sciences, University of California, Los Angeles, Los Angeles, CA 90095
- Marc Levis-Fitzgerald
  - Center for Educational Assessment, Office of Instructional Development, University of California, Los Angeles, Los Angeles, CA 90095
- Erin R. Sanders
  - Center for Education Innovation and Learning Sciences, College of Letters and Science, University of California, Los Angeles, Los Angeles, CA 90095
  - Department of Microbiology, Immunology and Molecular Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095
11.
Abstract
Motivation: Understanding the exact structure of an individual’s haplotype plays a significant role in various fields of human genetics. Despite tremendous research effort in recent years, fast and accurate haplotype reconstruction remains an active research topic, mainly owing to the computational challenges involved. Existing haplotype assembly algorithms focus primarily on improving accuracy of the assembly, making them computationally challenging for applications on large high-throughput sequence data. Therefore, there is a need to develop haplotype reconstruction algorithms that are not only accurate but also highly scalable.
Results: In this article, we introduce FastHap, a fast and accurate haplotype reconstruction approach, which is up to one order of magnitude faster than the state-of-the-art haplotype inference algorithms while also delivering higher accuracy than these algorithms. FastHap leverages a new similarity metric that allows us to precisely measure distances between pairs of fragments. The distance is then used in building the fuzzy conflict graphs of fragments. Given that optimal haplotype reconstruction based on minimum error correction is known to be NP-hard, we use our fuzzy conflict graphs to develop a fast heuristic for fragment partitioning and haplotype reconstruction.
Availability: An implementation of FastHap is available for sharing on request.
Contact: sepideh@cs.ucla.edu
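The minimum error correction (MEC) objective mentioned above can be illustrated with a short sketch that scores a candidate diploid haplotype against a set of fragments. It uses the standard MEC definition for biallelic sites (each fragment is charged the smaller of its mismatch counts against the haplotype and its complement). The fragment representation and function names are assumptions for illustration; this is not FastHap's similarity metric or partitioning heuristic, which the abstract does not detail.

```python
def mismatches(fragment, haplotype):
    """Count sites where the fragment's allele differs from the haplotype.

    fragment  : dict mapping site index -> allele (0 or 1), covered sites only
    haplotype : list of alleles (0 or 1), indexed by site
    """
    return sum(1 for site, allele in fragment.items() if haplotype[site] != allele)


def mec_score(fragments, haplotype):
    """MEC score of a diploid haplotype pair (haplotype, complement).

    Each fragment is assigned to whichever of the two haplotypes it matches
    better, and the remaining mismatches are the corrections it would need.
    """
    complement = [1 - allele for allele in haplotype]
    return sum(
        min(mismatches(f, haplotype), mismatches(f, complement))
        for f in fragments
    )
```

Evaluating the score for a given haplotype is cheap; the hardness lies in searching over haplotypes and fragment partitions for the minimum, which is why heuristics are used in practice.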
Affiliation(s)
- Sepideh Mazrouee
  - Computer Science Department, University of California Los Angeles (UCLA), 3551 Boelter Hall, Los Angeles, CA 90095-1596, USA
- Wei Wang
  - Computer Science Department, University of California Los Angeles (UCLA), 3551 Boelter Hall, Los Angeles, CA 90095-1596, USA