1
|
Sun H, Zheng Y, Xie H, Ma H, Zhong C, Yan M, Liu X, Wang G. PQSDC: a parallel lossless compressor for quality scores data via sequences partition and run-length prediction mapping. Bioinformatics 2024; 40:btae323. [PMID: 38759114 PMCID: PMC11139522 DOI: 10.1093/bioinformatics/btae323] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/28/2024] [Revised: 04/22/2024] [Accepted: 05/16/2024] [Indexed: 05/19/2024] Open
Abstract
MOTIVATION The quality scores data (QSD) account for 70% in compressed FastQ files obtained from the short and long reads sequencing technologies. Designing effective compressors for QSD that counterbalance compression ratio, time cost, and memory consumption is essential in scenarios such as large-scale genomics data sharing and long-term data backup. This study presents a novel parallel lossless QSD-dedicated compression algorithm named PQSDC, which fulfills the above requirements well. PQSDC is based on two core components: a parallel sequences-partition model designed to reduce peak memory consumption and time cost during compression and decompression processes, as well as a parallel four-level run-length prediction mapping model to enhance compression ratio. Besides, the PQSDC algorithm is also designed to be highly concurrent using multicore CPU clusters. RESULTS We evaluate PQSDC and four state-of-the-art compression algorithms on 27 real-world datasets, including 61.857 billion QSD characters and 632.908 million QSD sequences. (1) For short reads, compared to baselines, the maximum improvement of PQSDC reaches 7.06% in average compression ratio, and 8.01% in weighted average compression ratio. During compression and decompression, the maximum total time savings of PQSDC are 79.96% and 84.56%, respectively; the maximum average memory savings are 68.34% and 77.63%, respectively. (2) For long reads, the maximum improvement of PQSDC reaches 12.51% and 13.42% in average and weighted average compression ratio, respectively. The maximum total time savings during compression and decompression are 53.51% and 72.53%, respectively; the maximum average memory savings are 19.44% and 17.42%, respectively. (3) Furthermore, PQSDC ranks second in compression robustness among the tested algorithms, indicating that it is less affected by the probability distribution of the QSD collections. Overall, our work provides a promising solution for QSD parallel compression, which balances storage cost, time consumption, and memory occupation primely. AVAILABILITY AND IMPLEMENTATION The proposed PQSDC compressor can be downloaded from https://github.com/fahaihi/PQSDC.
Collapse
Affiliation(s)
- Hui Sun
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
| | - Yingfeng Zheng
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
| | - Haonan Xie
- Institute of Artificial Intelligence, School of Electrical Engineering, Guangxi University, Nanning 530004, China
| | - Huidong Ma
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
| | - Cheng Zhong
- Key Laboratory of Parallel, Distributed and Intelligent of Guangxi Universities and Colleges, School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China
| | - Meng Yan
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
| | - Xiaoguang Liu
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
| | - Gang Wang
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
| |
Collapse
|
2
|
Ermini L, Driguez P. The Application of Long-Read Sequencing to Cancer. Cancers (Basel) 2024; 16:1275. [PMID: 38610953 PMCID: PMC11011098 DOI: 10.3390/cancers16071275] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/23/2024] [Revised: 03/20/2024] [Accepted: 03/21/2024] [Indexed: 04/14/2024] Open
Abstract
Cancer is a multifaceted disease arising from numerous genomic aberrations that have been identified as a result of advancements in sequencing technologies. While next-generation sequencing (NGS), which uses short reads, has transformed cancer research and diagnostics, it is limited by read length. Third-generation sequencing (TGS), led by the Pacific Biosciences and Oxford Nanopore Technologies platforms, employs long-read sequences, which have marked a paradigm shift in cancer research. Cancer genomes often harbour complex events, and TGS, with its ability to span large genomic regions, has facilitated their characterisation, providing a better understanding of how complex rearrangements affect cancer initiation and progression. TGS has also characterised the entire transcriptome of various cancers, revealing cancer-associated isoforms that could serve as biomarkers or therapeutic targets. Furthermore, TGS has advanced cancer research by improving genome assemblies, detecting complex variants, and providing a more complete picture of transcriptomes and epigenomes. This review focuses on TGS and its growing role in cancer research. We investigate its advantages and limitations, providing a rigorous scientific analysis of its use in detecting previously hidden aberrations missed by NGS. This promising technology holds immense potential for both research and clinical applications, with far-reaching implications for cancer diagnosis and treatment.
Collapse
Affiliation(s)
- Luca Ermini
- NORLUX Neuro-Oncology Laboratory, Department of Cancer Research, Luxembourg Institute of Health, L-1210 Luxembourg, Luxembourg
| | - Patrick Driguez
- Bioscience Core Lab, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia
| |
Collapse
|
3
|
Sun H, Zheng Y, Xie H, Ma H, Liu X, Wang G. PMFFRC: a large-scale genomic short reads compression optimizer via memory modeling and redundant clustering. BMC Bioinformatics 2023; 24:454. [PMID: 38036969 PMCID: PMC10691058 DOI: 10.1186/s12859-023-05566-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2023] [Accepted: 11/13/2023] [Indexed: 12/02/2023] Open
Abstract
BACKGROUND Genomic sequencing reads compressors are essential for balancing high-throughput sequencing short reads generation speed, large-scale genomic data sharing, and infrastructure storage expenditure. However, most existing short reads compressors rarely utilize big-memory systems and duplicative information between diverse sequencing files to achieve a higher compression ratio for conserving reads data storage space. RESULTS We employ compression ratio as the optimization objective and propose a large-scale genomic sequencing short reads data compression optimizer, named PMFFRC, through novelty memory modeling and redundant reads clustering technologies. By cascading PMFFRC, in 982 GB fastq format sequencing data, with 274 GB and 3.3 billion short reads, the state-of-the-art and reference-free compressors HARC, SPRING, Mstcom, and FastqCLS achieve 77.89%, 77.56%, 73.51%, and 29.36% average maximum compression ratio gains, respectively. PMFFRC saves 39.41%, 41.62%, 40.99%, and 20.19% of storage space sizes compared with the four unoptimized compressors. CONCLUSIONS PMFFRC rational usage big-memory of compression server, effectively saving the sequencing reads data storage space sizes, which relieves the basic storage facilities costs and community sharing transmitting overhead. Our work furnishes a novel solution for improving sequencing reads compression and saving storage space. The proposed PMFFRC algorithm is packaged in a same-name Linux toolkit, available un-limited at https://github.com/fahaihi/PMFFRC .
Collapse
Affiliation(s)
- Hui Sun
- Nankai-Baidu Joint Laboratory, College of Computer Science, Nankai University, Tianjin, China
| | - Yingfeng Zheng
- Nankai-Baidu Joint Laboratory, College of Computer Science, Nankai University, Tianjin, China
| | - Haonan Xie
- Institute of Artificial Intelligence, School of Electrical Engineering, Guangxi University, Nanning, China
| | - Huidong Ma
- Nankai-Baidu Joint Laboratory, College of Computer Science, Nankai University, Tianjin, China
| | - Xiaoguang Liu
- Nankai-Baidu Joint Laboratory, College of Computer Science, Nankai University, Tianjin, China.
| | - Gang Wang
- Nankai-Baidu Joint Laboratory, College of Computer Science, Nankai University, Tianjin, China.
| |
Collapse
|
4
|
Meng Q, Chandak S, Zhu Y, Weissman T. Reference-free lossless compression of nanopore sequencing reads using an approximate assembly approach. Sci Rep 2023; 13:2082. [PMID: 36747011 PMCID: PMC9902536 DOI: 10.1038/s41598-023-29267-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/20/2022] [Accepted: 02/01/2023] [Indexed: 02/08/2023] Open
Abstract
The amount of data produced by genome sequencing experiments has been growing rapidly over the past several years, making compression important for efficient storage, transfer and analysis of the data. In recent years, nanopore sequencing technologies have seen increasing adoption since they are portable, real-time and provide long reads. However, there has been limited progress on compression of nanopore sequencing reads obtained in FASTQ files since most existing tools are either general-purpose or specialized for short read data. We present NanoSpring, a reference-free compressor for nanopore sequencing reads, relying on an approximate assembly approach. We evaluate NanoSpring on a variety of datasets including bacterial, metagenomic, plant, animal, and human whole genome data. For recently basecalled high quality nanopore datasets, NanoSpring, which focuses only on the base sequences in the FASTQ file, uses just 0.35-0.65 bits per base which is 3-6[Formula: see text] lower than general purpose compressors like gzip. NanoSpring is competitive in compression ratio and compression resource usage with the state-of-the-art tool CoLoRd while being significantly faster at decompression when using multiple threads (> 4[Formula: see text] faster decompression with 20 threads). NanoSpring is available on GitHub at https://github.com/qm2/NanoSpring .
Collapse
Affiliation(s)
- Qingxi Meng
- Department of Electrical Engineering, Stanford University, Stanford, CA, 94305, USA.
| | - Shubham Chandak
- Department of Electrical Engineering, Stanford University, Stanford, CA, 94305, USA.
| | - Yifan Zhu
- Department of Electrical Engineering, Stanford University, Stanford, CA, 94305, USA
| | - Tsachy Weissman
- Department of Electrical Engineering, Stanford University, Stanford, CA, 94305, USA
| |
Collapse
|
5
|
Chen P, Sun Z, Wang J, Liu X, Bai Y, Chen J, Liu A, Qiao F, Chen Y, Yuan C, Sha J, Zhang J, Xu LQ, Li J. Portable nanopore-sequencing technology: Trends in development and applications. Front Microbiol 2023; 14:1043967. [PMID: 36819021 PMCID: PMC9929578 DOI: 10.3389/fmicb.2023.1043967] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2022] [Accepted: 01/03/2023] [Indexed: 02/04/2023] Open
Abstract
Sequencing technology is the most commonly used technology in molecular biology research and an essential pillar for the development and applications of molecular biology. Since 1977, when the first generation of sequencing technology opened the door to interpreting the genetic code, sequencing technology has been developing for three generations. It has applications in all aspects of life and scientific research, such as disease diagnosis, drug target discovery, pathological research, species protection, and SARS-CoV-2 detection. However, the first- and second-generation sequencing technology relied on fluorescence detection systems and DNA polymerization enzyme systems, which increased the cost of sequencing technology and limited its scope of applications. The third-generation sequencing technology performs PCR-free and single-molecule sequencing, but it still depends on the fluorescence detection device. To break through these limitations, researchers have made arduous efforts to develop a new advanced portable sequencing technology represented by nanopore sequencing. Nanopore technology has the advantages of small size and convenient portability, independent of biochemical reagents, and direct reading using physical methods. This paper reviews the research and development process of nanopore sequencing technology (NST) from the laboratory to commercially viable tools; discusses the main types of nanopore sequencing technologies and their various applications in solving a wide range of real-world problems. In addition, the paper collates the analysis tools necessary for performing different processing tasks in nanopore sequencing. Finally, we highlight the challenges of NST and its future research and application directions.
Collapse
Affiliation(s)
- Pin Chen
- Key Laboratory of DGHD, MOE, School of Life Science and Technology, Southeast University, Nanjing, China
| | - Zepeng Sun
- China Mobile (Chengdu) Industrial Research Institute, Chengdu, China
| | - Jiawei Wang
- School of Computer Science and Technology, Southeast University, Nanjing, China
| | - Xinlong Liu
- China Mobile (Chengdu) Industrial Research Institute, Chengdu, China
| | - Yun Bai
- Key Laboratory of DGHD, MOE, School of Life Science and Technology, Southeast University, Nanjing, China
| | - Jiang Chen
- Key Laboratory of DGHD, MOE, School of Life Science and Technology, Southeast University, Nanjing, China
| | - Anna Liu
- Key Laboratory of DGHD, MOE, School of Life Science and Technology, Southeast University, Nanjing, China
| | - Feng Qiao
- China Mobile (Chengdu) Industrial Research Institute, Chengdu, China
| | - Yang Chen
- Key Laboratory of DGHD, MOE, School of Life Science and Technology, Southeast University, Nanjing, China
| | - Chenyan Yuan
- Clinical Laboratory, Southeast University Zhongda Hospital, Nanjing, China
| | - Jingjie Sha
- School of Mechanical Engineering, Southeast University, Nanjing, China
| | - Jinghui Zhang
- School of Computer Science and Technology, Southeast University, Nanjing, China
| | - Li-Qun Xu
- China Mobile (Chengdu) Industrial Research Institute, Chengdu, China,*Correspondence: Li-Qun Xu, ✉
| | - Jian Li
- Key Laboratory of DGHD, MOE, School of Life Science and Technology, Southeast University, Nanjing, China,Jian Li, ✉
| |
Collapse
|
6
|
Rivara-Espasandín M, Balestrazzi L, Dufort y Álvarez G, Ochoa I, Seroussi G, Smircich P, Sotelo-Silveira J, Martín Á. Nanopore quality score resolution can be reduced with little effect on downstream analysis. BIOINFORMATICS ADVANCES 2022; 2:vbac054. [PMID: 36699360 PMCID: PMC9710687 DOI: 10.1093/bioadv/vbac054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 07/12/2022] [Revised: 07/13/2022] [Accepted: 08/08/2022] [Indexed: 01/28/2023]
Abstract
Motivation The use of high precision for representing quality scores in nanopore sequencing data makes these scores hard to compress and, thus, responsible for most of the information stored in losslessly compressed FASTQ files. This motivates the investigation of the effect of quality score information loss on downstream analysis from nanopore sequencing FASTQ files. Results We polished de novo assemblies for a mock microbial community and a human genome, and we called variants on a human genome. We repeated these experiments using various pipelines, under various coverage level scenarios and various quality score quantizers. In all cases, we found that the quantization of quality scores causes little difference (or even sometimes improves) on the results obtained with the original (non-quantized) data. This suggests that the precision that is currently used for nanopore quality scores may be unnecessarily high, and motivates the use of lossy compression algorithms for this kind of data. Moreover, we show that even a non-specialized compressor, such as gzip, yields large storage space savings after the quantization of quality scores. Availability and supplementary information Quantizers are freely available for download at: https://github.com/mrivarauy/QS-Quantizer.
Collapse
Affiliation(s)
- Martín Rivara-Espasandín
- Instituto de Computación, Facultad de Ingeniería, Universidad de la República, 11300 Montevideo, Uruguay
- Departamento de Genética, Facultad de Medicina, Universidad de la República, 11800 Montevideo, Uruguay
- Departamento de Genómica, Instituto de Investigaciones Biológicas Clemente Estable, 11600 Montevideo, Uruguay
| | - Lucía Balestrazzi
- Instituto de Computación, Facultad de Ingeniería, Universidad de la República, 11300 Montevideo, Uruguay
- Sección Bioinformática, Unidad de Genómica Evolutiva, Facultad de Ciencias, Universidad de la República, 11400 Montevideo, Uruguay
| | - Guillermo Dufort y Álvarez
- Instituto de Computación, Facultad de Ingeniería, Universidad de la República, 11300 Montevideo, Uruguay
| | - Idoia Ochoa
- Electrical and Electronics Department, Tecnun, University of Navarra, 20018 San Sebastián, Spain
- Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Champaign, IL 61801, USA
| | - Gadiel Seroussi
- Instituto de Computación, Facultad de Ingeniería, Universidad de la República, 11300 Montevideo, Uruguay
- Instituto de Ingeniería Eléctrica, Facultad de Ingeniería, Universidad de la República, 11300 Montevideo, Uruguay
| | - Pablo Smircich
- Departamento de Genómica, Instituto de Investigaciones Biológicas Clemente Estable, 11600 Montevideo, Uruguay
- Laboratorio de Interacciones Moleculares, Facultad de Ciencias, Universidad de la República, 11400 Montevideo, Uruguay
| | - José Sotelo-Silveira
- Departamento de Genómica, Instituto de Investigaciones Biológicas Clemente Estable, 11600 Montevideo, Uruguay
- Departamento de Biología Celular y Molecular, Sección Biología Celular, Facultad de Ciencias, Universidad de la República, 11400 Montevideo, Uruguay
| | - Álvaro Martín
- Instituto de Computación, Facultad de Ingeniería, Universidad de la República, 11300 Montevideo, Uruguay
| |
Collapse
|