1
Wang L, Ding R, He S, Wang Q, Zhou Y. A Pipeline for Constructing Reference Genomes for Large Cohort-Specific Metagenome Compression. Microorganisms 2023; 11:2560. [PMID: 37894218 PMCID: PMC10609127 DOI: 10.3390/microorganisms11102560] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/05/2023] [Revised: 09/16/2023] [Accepted: 09/18/2023] [Indexed: 10/29/2023]
Abstract
Metagenomic data compression is increasingly important as metagenomic projects face larger data volumes per sample and growing numbers of samples. Reference-based compression is a promising way to obtain a high compression ratio. However, existing microbial reference genome databases are not suitable for direct use as compression references because of their large size and redundancy, and different metagenomic cohorts often differ in microbial composition. We present a novel pipeline that generates simplified and tailored reference genomes for large metagenomic cohorts, enabling reference-based compression of metagenomic data. We constructed customized reference genomes, ranging from 2.4 to 3.9 GB, for 29 real metagenomic datasets and evaluated their compression performance. Reference-based compression achieved a compression ratio of over 20 for human whole-genome data and up to 33.8 for all samples, a 4.5-fold improvement over standard Gzip compression. Our method provides new insights into reference-based metagenomic data compression and has broad application potential for faster and cheaper data transfer, storage, and analysis.
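To make the reported figures concrete, here is a minimal Python sketch of how a compression ratio against a Gzip baseline could be measured. It assumes the ratio is defined as original size divided by compressed size (higher is better), which is what the abstract's numbers suggest; the toy FASTQ-like input and the helper name `compression_ratio` are hypothetical and are not taken from the paper.

```python
import gzip


def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Original size divided by compressed size (higher is better)."""
    return len(original) / len(compressed)


# Hypothetical toy input: one FASTQ-like record repeated many times.
data = b"@read1\nACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIII\n" * 1000

# Gzip serves as the general-purpose baseline.
packed = gzip.compress(data)
ratio = compression_ratio(data, packed)
print(f"gzip ratio on toy data: {ratio:.1f}")
```

Some papers report the inverse quantity (compressed size divided by original size, lower is better), so it is worth checking which convention a given result uses before comparing numbers.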
Affiliation(s)
- Linqi Wang
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai 200438, China; (L.W.); (Q.W.)
- Renpeng Ding
  - MGI Tech, Shenzhen 518083, China; (R.D.); (S.H.)
- Shixu He
  - MGI Tech, Shenzhen 518083, China; (R.D.); (S.H.)
- Qinyu Wang
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai 200438, China; (L.W.); (Q.W.)
- Yan Zhou
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai 200438, China; (L.W.); (Q.W.)
  - MGI Tech, Shenzhen 518083, China; (R.D.); (S.H.)
2
Bobroske K, Larish C, Cattrell A, Bjarnadóttir MV, Huan L. The bird's-eye view: A data-driven approach to understanding patient journeys from claims data. J Am Med Inform Assoc 2021; 27:1037-1045. [PMID: 32521006 DOI: 10.1093/jamia/ocaa052] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 01/03/2020] [Revised: 03/31/2020] [Accepted: 04/09/2020] [Indexed: 12/29/2022]
Abstract
OBJECTIVE: In preference-sensitive conditions such as back pain, there can be high levels of variability in the trajectory of patient care. We sought to develop a methodology that extracts a realistic and comprehensive understanding of the patient journey using medical and pharmaceutical insurance claims data.
MATERIALS AND METHODS: We processed a sample of 10 000 patient episodes (comprising 113 215 back pain-related claims) into strings of characters, where each letter corresponds to a distinct encounter with the healthcare system. We customized the Levenshtein edit distance algorithm to evaluate the level of similarity between each pair of episodes based on both their content (types of events) and ordering (sequence of events). We then used clustering to extract the main variations of the patient journey.
RESULTS: The algorithm resulted in 12 comprehensive and clinically distinct patterns (clusters) of patient journeys that represent the main ways patients are diagnosed and treated for back pain. We further characterized demographic and utilization metrics for each cluster and observed clear differentiation between the clusters in terms of both clinical content and patient characteristics.
DISCUSSION: Despite being a complex and often noisy data source, administrative claims provide a unique longitudinal overview of patient care across multiple service providers and locations. This methodology leverages claims to capture a data-driven understanding of how patients traverse the healthcare system.
CONCLUSIONS: When tailored to various conditions and patient settings, this methodology can provide accurate overviews of patient journeys and facilitate a shift toward high-quality practice patterns.
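The idea of comparing episodes encoded as strings can be illustrated with a small weighted Levenshtein distance in Python. This is only a sketch under stated assumptions: the abstract does not say how the authors customized the algorithm, so the per-pair substitution costs, the `indel_cost` parameter, and the event letters used in the usage lines are all hypothetical.

```python
def weighted_edit_distance(a, b, sub_cost=None, indel_cost=1.0):
    """Levenshtein distance with optional per-pair substitution costs.

    sub_cost maps (char, char) pairs to a cost; pairs that are missing
    fall back to 1.0. Identical characters always cost 0.
    """
    def sub(x, y):
        if x == y:
            return 0.0
        if sub_cost is None:
            return 1.0
        return sub_cost.get((x, y), sub_cost.get((y, x), 1.0))

    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,        # delete ca
                curr[j - 1] + indel_cost,    # insert cb
                prev[j - 1] + sub(ca, cb),   # substitute or match
            ))
        prev = curr
    return prev[-1]


# Hypothetical episodes: V = office visit, P = physical therapy,
# C = chiropractic care, X = imaging. Treating P and C as similar
# events makes the two journeys closer than a plain edit distance would.
similar_events = {("P", "C"): 0.5}
print(weighted_edit_distance("VPX", "VCX", similar_events))
```

A pairwise distance matrix built this way could then be passed to any clustering routine that accepts precomputed distances.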
Affiliation(s)
- Katherine Bobroske
  - Cambridge Centre for Health and Leadership Enterprise, University of Cambridge, Cambridge, United Kingdom
- Christine Larish
  - Research and Development, Evolent Health, Arlington, Virginia, USA
- Anita Cattrell
  - Research and Development, Evolent Health, Arlington, Virginia, USA
- Lawrence Huan
  - Cambridge Centre for Health and Leadership Enterprise, University of Cambridge, Cambridge, United Kingdom
3
Guerra A, Lotero J, Aedo JÉ, Isaza S. Tackling the Challenges of FASTQ Referential Compression. Bioinform Biol Insights 2019; 13:1177932218821373. [PMID: 30792576 PMCID: PMC6376532 DOI: 10.1177/1177932218821373] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 11/08/2018] [Accepted: 11/26/2018] [Indexed: 11/16/2022]
Abstract
The exponential growth of genomic data has recently motivated the development of compression algorithms to tackle the storage capacity limitations in bioinformatics centers. Referential compressors could theoretically achieve much higher compression than their non-referential counterparts; however, the latest tools have not yet been able to harness this potential. To reach that goal, an efficient encoding model to represent the differences between the input and the reference is needed. In this article, we introduce a novel approach for referential compression of FASTQ files. The core of our compression scheme consists of a referential compressor based on the combination of local alignments with binary encoding optimized for long reads. Here we present the algorithms and performance tests developed for our read compression algorithm, named UdeACompress. Our compressor achieved the best results when compressing long reads and competitive compression ratios for shorter reads when compared to the best state-of-the-art programs. As an added value, it also showed reasonable execution times and memory consumption in comparison with similar tools.
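The general principle of representing a read as its differences from a reference can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the UdeACompress encoding: it assumes the alignment position is already known, handles substitutions only (no insertions or deletions), and stores the result as plain tuples instead of a binary format. The function names are hypothetical.

```python
def encode_read(read, reference, pos):
    """Encode a read as (position, length, mismatches) against a reference.

    Each mismatch is an (offset, base) pair giving the read's base where
    it differs from the reference window starting at pos.
    """
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    mismatches = [
        (i, base)
        for i, (base, ref_base) in enumerate(zip(read, window))
        if base != ref_base
    ]
    return pos, len(read), mismatches


def decode_read(record, reference):
    """Rebuild the read from its record and the same reference."""
    pos, length, mismatches = record
    bases = list(reference[pos:pos + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTAC"
record = encode_read("GTTC", reference, 2)
print(record)
print(decode_read(record, reference))
```

The closer a read matches the reference, the shorter its mismatch list, which is where the potential gain over non-referential compression comes from.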
Affiliation(s)
- Aníbal Guerra
  - Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela
  - Facultad de Ingeniería, Universidad de Antioquia (UdeA), Medellín, Colombia
- Jaime Lotero
  - Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela
- José Édinson Aedo
  - Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela
- Sebastián Isaza
  - Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela
4
Sarkar H, Patro R. Quark enables semi-reference-based compression of RNA-seq data. Bioinformatics 2017; 33:3380-3386. [DOI: 10.1093/bioinformatics/btx428] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 11/05/2016] [Accepted: 06/29/2017] [Indexed: 12/19/2022]
Affiliation(s)
- Hirak Sarkar
  - Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
- Rob Patro
  - Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
5
Huang Z, Ayday E, Lin H, Aiyar RS, Molyneaux A, Xu Z, Fellay J, Steinmetz LM, Hubaux JP. A privacy-preserving solution for compressed storage and selective retrieval of genomic data. Genome Res 2016; 26:1687-1696. [PMID: 27789525 PMCID: PMC5131820 DOI: 10.1101/gr.206870.116] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 03/31/2016] [Accepted: 10/20/2016] [Indexed: 01/08/2023]
Abstract
In clinical genomics, the continuous evolution of bioinformatic algorithms and sequencing platforms makes it beneficial to store patients’ complete aligned genomic data in addition to variant calls relative to a reference sequence. Due to the large size of human genome sequence data files (varying from 30 GB to 200 GB depending on coverage), two major challenges facing genomics laboratories are the costs of storage and the efficiency of the initial data processing. In addition, privacy of genomic data is becoming an increasingly serious concern, yet no standard data storage solutions exist that enable compression, encryption, and selective retrieval. Here we present a privacy-preserving solution named SECRAM (Selective retrieval on Encrypted and Compressed Reference-oriented Alignment Map) for the secure storage of compressed aligned genomic data. Our solution enables selective retrieval of encrypted data and improves the efficiency of downstream analysis (e.g., variant calling). Compared with BAM, the de facto standard for storing aligned genomic data, SECRAM uses 18% less storage. Compared with CRAM, one of the most compressed nonencrypted formats (using 34% less storage than BAM), SECRAM maintains efficient compression and downstream data processing, while allowing for unprecedented levels of security in genomic data storage. Compared with previous work, the distinguishing features of SECRAM are that (1) it is position-based instead of read-based, and (2) it allows random querying of a subregion from a BAM-like file in an encrypted form. Our method thus offers a space-saving, privacy-preserving, and effective solution for the storage of clinical genomic data.
Affiliation(s)
- Zhicong Huang
  - School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
- Erman Ayday
  - Department of Computer Engineering, Bilkent University, Bilkent 06800 Ankara, Turkey
- Huang Lin
  - School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
- Raeka S Aiyar
  - Stanford Genome Technology Center, Stanford University, Palo Alto, California 94304, USA
- Zhenyu Xu
  - Sophia Genetics, CH-1025 Saint-Sulpice, Switzerland
- Jacques Fellay
  - School of Life Sciences, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
- Lars M Steinmetz
  - Stanford Genome Technology Center, Stanford University, Palo Alto, California 94304, USA; Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
- Jean-Pierre Hubaux
  - School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
6
Patro R, Kingsford C. Data-dependent bucketing improves reference-free compression of sequencing reads. Bioinformatics 2015; 31:2770-7. [PMID: 25910696 PMCID: PMC4547610 DOI: 10.1093/bioinformatics/btv248] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Received: 11/16/2014] [Revised: 04/11/2015] [Accepted: 04/20/2015] [Indexed: 01/24/2023]
Abstract
MOTIVATION: The storage and transmission of high-throughput sequencing data consume significant resources. As our capacity to produce such data continues to increase, this burden will only grow. One approach to reducing storage and transmission requirements is to compress the sequencing data.
RESULTS: We present a novel technique to boost the compression of sequencing reads that is based on the concept of bucketing similar reads so that they appear nearby in the file. We demonstrate that, by adopting a data-dependent bucketing scheme and employing a number of encoding ideas, we can achieve substantially better compression ratios than existing de novo sequence compression tools, including other bucketing and reordering schemes. Our method, Mince, achieves up to a 45% reduction in file sizes (28% on average) compared with existing state-of-the-art de novo compression schemes.
AVAILABILITY AND IMPLEMENTATION: Mince is written in C++11, is open source and has been made available under the GPLv3 license. It is available at http://www.cs.cmu.edu/~ckingsf/software/mince.
CONTACT: carlk@cs.cmu.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
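The effect of placing similar reads next to each other can be sketched with a very simple bucketing rule in Python. Note the substitution: the code below groups reads by their lexicographically smallest k-mer (a minimizer), which is a fixed scheme chosen here for brevity and is not the data-dependent bucketing that Mince uses. The function names and the value of `k` are hypothetical.

```python
from collections import defaultdict


def minimizer(read, k):
    """Return the lexicographically smallest k-mer of a read."""
    if len(read) < k:
        return read  # too short to contain a k-mer; use the read itself
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def bucket_reads(reads, k=4):
    """Reorder reads so that reads sharing a minimizer are adjacent."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[minimizer(read, k)].append(read)
    # Emit buckets in sorted key order, keeping input order within a bucket.
    return [read for key in sorted(buckets) for read in buckets[key]]


reads = ["ACGTACGT", "TTTTGGGG", "CGTACGTA", "GGGGTTTT"]
print(bucket_reads(reads))
```

After reordering, overlapping reads sit side by side, so a downstream general-purpose compressor sees repeated substrings within a short window. The original read order is lost, which is acceptable only when order carries no information.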
Affiliation(s)
- Rob Patro
  - Department of Computer Science, Stony Brook University, Stony Brook, NY 11794-4400, USA
- Carl Kingsford
  - Department of Computational Biology, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA
7
Zhang Y, Li L, Yang Y, Yang X, He S, Zhu Z. Light-weight reference-based compression of FASTQ data. BMC Bioinformatics 2015; 16:188. [PMID: 26051252 PMCID: PMC4459677 DOI: 10.1186/s12859-015-0628-7] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Received: 03/09/2015] [Accepted: 05/27/2015] [Indexed: 01/23/2023]
Abstract
Background: The exponential growth of next generation sequencing (NGS) data has posed big challenges to data storage, management and archiving. Data compression is one of the effective solutions, where reference-based compression strategies can typically achieve superior compression ratios compared to the ones not relying on any reference.
Results: This paper presents a lossless light-weight reference-based compression algorithm named LW-FQZip to compress FASTQ data. The three components of any given input, i.e., metadata, short reads and quality score strings, are first parsed into three data streams in which redundant information is identified and eliminated independently. In particular, well-designed incremental and run-length-limited encoding schemes are used to compress the metadata and quality score streams, respectively. To handle the short reads, LW-FQZip uses a novel light-weight mapping model to map them quickly against external reference sequence(s) and produce concise alignment results for storage. The three processed data streams are then packed together with general-purpose compression algorithms such as LZMA. LW-FQZip was evaluated on eight real-world NGS data sets and achieved compression ratios in the range of 0.111-0.201. This is comparable or superior to other state-of-the-art lossless NGS data compression algorithms.
Conclusions: LW-FQZip is a program that enables efficient lossless FASTQ data compression. It contributes to state-of-the-art applications for NGS data storage and transmission. LW-FQZip is freely available online at: http://csse.szu.edu.cn/staff/zhuzx/LWFQZip.
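The two stream-level ideas named in the abstract can be sketched in Python under simplifying assumptions. The code below uses plain run-length encoding for quality strings and a common-prefix (front-coding) form of incremental encoding for read headers; the actual LW-FQZip schemes, including its run-length-limited variant, are not specified in the abstract and may differ. The function names and the sample headers are hypothetical.

```python
from itertools import groupby


def run_length_encode(quality):
    """Collapse runs of identical quality characters into (char, count)."""
    return [(ch, len(list(group))) for ch, group in groupby(quality)]


def run_length_decode(pairs):
    """Expand (char, count) pairs back into the quality string."""
    return "".join(ch * count for ch, count in pairs)


def incremental_encode(headers):
    """Store each header as (shared prefix length, remaining suffix)
    relative to the previous header."""
    encoded, prev = [], ""
    for header in headers:
        n = 0
        while n < min(len(prev), len(header)) and prev[n] == header[n]:
            n += 1
        encoded.append((n, header[n:]))
        prev = header
    return encoded


def incremental_decode(pairs):
    """Rebuild headers from (prefix length, suffix) pairs."""
    headers, prev = [], ""
    for n, suffix in pairs:
        prev = prev[:n] + suffix
        headers.append(prev)
    return headers


print(run_length_encode("IIIIHH#"))
print(incremental_encode(["@SRR1.1 len=100", "@SRR1.2 len=100"]))
```

Both encoders are lossless: decoding their output returns the original strings, so the encoded streams can be handed to a general-purpose compressor such as LZMA without losing information.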
Affiliation(s)
- Yongpeng Zhang
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060, China.
- Linsen Li
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060, China.
- Yanli Yang
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060, China.
- Xiao Yang
  - The Broad Institute, Cambridge, MA, 02142, USA.
- Shan He
  - School of Computer Science, University of Birmingham, Birmingham, B15 2TT, UK.
- Zexuan Zhu
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060, China.
8
Kingsford C, Patro R. Reference-based compression of short-read sequences using path encoding. Bioinformatics 2015; 31:1920-8. [PMID: 25649622 PMCID: PMC4481695 DOI: 10.1093/bioinformatics/btv071] [Citation(s) in RCA: 46] [Impact Index Per Article: 5.1] [Received: 07/23/2014] [Accepted: 01/29/2015] [Indexed: 12/31/2022]
Abstract
MOTIVATION: Storing, transmitting and archiving data produced by next-generation sequencing is a significant computational burden. New compression techniques tailored to short-read sequence data are needed.
RESULTS: We present here an approach to compression that reduces the difficulty of managing large-scale sequencing data. Our novel approach sits between pure reference-based compression and reference-free compression and combines much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs and context-dependent arithmetic coding. Supporting this method is a system to compactly store sets of k-mers that is of independent interest. We are able to encode RNA-seq reads using 3-11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than competing approaches. We also show that even if the reference is very poorly matched to the reads that are being encoded, good compression can still be achieved.
AVAILABILITY AND IMPLEMENTATION: Source code and binaries freely available for download at http://www.cs.cmu.edu/~ckingsf/software/pathenc/, implemented in Go and supported on Linux and Mac OS X.
Affiliation(s)
- Carl Kingsford
  - Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA and Department of Computer Science, Stony Brook University, Stony Brook, NY 11794-4400, USA
- Rob Patro
  - Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA and Department of Computer Science, Stony Brook University, Stony Brook, NY 11794-4400, USA
9
Hormozdiari F, Eskin E. Memory efficient assembly of human genome. J Bioinform Comput Biol 2015; 13:1550008. [PMID: 25603998 DOI: 10.1142/s0219720015500080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
The ability to detect the genetic variations between two individuals is an essential component of genetic studies. In these studies, obtaining the genome sequence of both individuals is the first step toward solving the variation detection problem. The emergence of high-throughput sequencing (HTS) technology has made DNA sequencing practical, and it is widely used by diagnosticians to increase their knowledge about the causal factors in genetics-related diseases. As HTS advances, more data are generated every day than scientists can process. Genome assembly is one of the existing methods to tackle the variation detection problem. The de Bruijn graph formulation of the assembly problem is widely used in the field. Furthermore, it is the only method which can assemble any genome in linear time. However, it requires an enormous amount of memory to assemble any mammalian-size genome. The high demand for sequencing more individuals and the urge to assemble them are the driving forces for a memory-efficient assembler. In this work, we propose a novel method which builds the de Bruijn graph while consuming less memory. Our proposed method can reduce memory usage by 37% compared to existing methods. In addition, we used a real data set (chromosome 17 of the A/J strain) to illustrate the performance of our method.
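For readers unfamiliar with the de Bruijn graph formulation, the following Python sketch shows the standard textbook construction, in which nodes are (k-1)-mers and every k-mer in a read contributes one edge. It is a baseline illustration only: it does not implement the memory-saving method proposed in the paper, and the function name and toy read are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph as {prefix (k-1)-mer: set of suffix (k-1)-mers}.

    Each k-mer found in a read adds an edge from its first k-1 bases
    to its last k-1 bases.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


graph = build_de_bruijn(["ACGT"], 3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Storing every node as a string in a hash map, as done here, is what makes the naive construction memory-hungry on mammalian-size genomes.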
Affiliation(s)
- Farhad Hormozdiari
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095, USA
10
Zhang Y, Li L, Xiao J, Yang Y, Zhu Z. FQZip: Lossless Reference-Based Compression of Next Generation Sequencing Data in FASTQ Format. Proceedings in Adaptation, Learning and Optimization 2015. [DOI: 10.1007/978-3-319-13356-0_11] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 01/21/2023]