1
|
Lu Z, Guo L, Chen J, Wang R. Reference-based genome compression using the longest matched substrings with parallelization consideration. BMC Bioinformatics 2023; 24:369. [PMID: 37777730 PMCID: PMC10544193 DOI: 10.1186/s12859-023-05500-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2023] [Accepted: 09/26/2023] [Indexed: 10/02/2023] Open
Abstract
BACKGROUND A large number of researchers have devoted to accelerating the speed of genome sequencing and reducing the cost of genome sequencing for decades, and they have made great strides in both areas, making it easier for researchers to study and analyze genome data. However, how to efficiently store and transmit the vast amount of genome data generated by high-throughput sequencing technologies has become a challenge for data compression researchers. Therefore, the research of genome data compression algorithms to facilitate the efficient representation of genome data has gradually attracted the attention of these researchers. Meanwhile, considering that the current computing devices have multiple cores, how to make full use of the advantages of the computing devices and improve the efficiency of parallel processing is also an important direction for designing genome compression algorithms. RESULTS We proposed an algorithm (LMSRGC) based on reference genome sequences, which uses the suffix array (SA) and the longest common prefix (LCP) array to find the longest matched substrings (LMS) for the compression of genome data in FASTA format. The proposed algorithm utilizes the characteristics of SA and the LCP array to select all appropriate LMSs between the genome sequence to be compressed and the reference genome sequence and then utilizes LMSs to compress the target genome sequence. To speed up the operation of the algorithm, we use GPUs to parallelize the construction of SA, while using multiple threads to parallelize the creation of the LCP array and the filtering of LMSs. CONCLUSIONS Experiment results demonstrate that our algorithm is competitive with the current state-of-the-art algorithms in compression ratio and compression time.
Collapse
Affiliation(s)
- Zhiwen Lu
- School of Information, Yunnan University, KunMing, China
| | - Lu Guo
- Yunnan Physical Science and Sports Professional College, KunMing, China
| | - Jianhua Chen
- School of Information, Yunnan University, KunMing, China.
| | - Rongshu Wang
- School of Information, Yunnan University, KunMing, China
| |
Collapse
|
2
|
Oujja A, Abid MR, Boumhidi J, Bourhnane S, Mourhir A, Merchant F, Benhaddou D. High-performance computing for SARS-CoV-2 RNAs clustering: a data science‒based genomics approach. Genomics Inform 2022; 19:e49. [PMID: 35012291 PMCID: PMC8752974 DOI: 10.5808/gi.21056] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2021] [Accepted: 12/08/2021] [Indexed: 11/20/2022] Open
Abstract
Nowadays, Genomic data constitutes one of the fastest growing datasets in the world. As of 2025, it is supposed to become the fourth largest source of Big Data, and thus mandating adequate high-performance computing (HPC) platform for processing. With the latest unprecedented and unpredictable mutations in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the research community is in crucial need for ICT tools to process SARS-CoV-2 RNA data, e.g., by classifying it (i.e., clustering) and thus assisting in tracking virus mutations and predict future ones. In this paper, we are presenting an HPC-based SARS-CoV-2 RNAs clustering tool. We are adopting a data science approach, from data collection, through analysis, to visualization. In the analysis step, we present how our clustering approach leverages on HPC and the longest common subsequence (LCS) algorithm. The approach uses the Hadoop MapReduce programming paradigm and adapts the LCS algorithm in order to efficiently compute the length of the LCS for each pair of SARS-CoV-2 RNA sequences. The latter are extracted from the U.S. National Center for Biotechnology Information (NCBI) Virus repository. The computed LCS lengths are used to measure the dissimilarities between RNA sequences in order to work out existing clusters. In addition to that, we present a comparative study of the LCS algorithm performance based on variable workloads and different numbers of Hadoop worker nodes.
Collapse
Affiliation(s)
- Anas Oujja
- School of Science and Engineering, Al Akhawayn University in Ifrane, Ifrane 53000, Morocco.,Computer Science, Signals, Automation and Cognitivism Laboratory (LISAC), Computer Science Department, Faculty of Science Dhar El Mahraz, Sidi Mohamed Ben Abdellah University, Fez 30000, Morocco
| | - Mohamed Riduan Abid
- School of Science and Engineering, Al Akhawayn University in Ifrane, Ifrane 53000, Morocco
| | - Jaouad Boumhidi
- Computer Science, Signals, Automation and Cognitivism Laboratory (LISAC), Computer Science Department, Faculty of Science Dhar El Mahraz, Sidi Mohamed Ben Abdellah University, Fez 30000, Morocco
| | - Safae Bourhnane
- School of Science and Engineering, Al Akhawayn University in Ifrane, Ifrane 53000, Morocco.,Faculty of Sciences, Chouaib Doukkali University, El Jadida 24000, Morocco
| | - Asmaa Mourhir
- School of Science and Engineering, Al Akhawayn University in Ifrane, Ifrane 53000, Morocco
| | - Fatima Merchant
- Computer Engineering Technology Faculty, University of Houston, Houston, TX 77204, USA
| | - Driss Benhaddou
- Computer Engineering Technology Faculty, University of Houston, Houston, TX 77204, USA
| |
Collapse
|
3
|
|
4
|
Kredens KV, Martins JV, Dordal OB, Ferrandin M, Herai RH, Scalabrin EE, Ávila BC. Vertical lossless genomic data compression tools for assembled genomes: A systematic literature review. PLoS One 2020; 15:e0232942. [PMID: 32453750 PMCID: PMC7250429 DOI: 10.1371/journal.pone.0232942] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2019] [Accepted: 04/25/2020] [Indexed: 11/19/2022] Open
Abstract
The recent decrease in cost and time to sequence and assemble of complete genomes created an increased demand for data storage. As a consequence, several strategies for assembled biological data compression were created. Vertical compression tools implement strategies that take advantage of the high level of similarity between multiple assembled genomic sequences for better compression results. However, current reviews on vertical compression do not compare the execution flow of each tool, which is constituted by phases of preprocessing, transformation, and data encoding. We performed a systematic literature review to identify and compare existing tools for vertical compression of assembled genomic sequences. The review was centered on PubMed and Scopus, in which 45726 distinct papers were considered. Next, 32 papers were selected according to the following criteria: to present a lossless vertical compression tool; to use the information contained in other sequences for the compression; to be able to manipulate genomic sequences in FASTA format; and no need prior knowledge. Although we extracted performance compression results, they were not compared as the tools did not use a standardized evaluation protocol. Thus, we conclude that there's a lack of definition of an evaluation protocol that must be applied by each tool.
Collapse
Affiliation(s)
- Kelvin V. Kredens
- Graduate Program in Informatics (PPGia), Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil
| | - Juliano V. Martins
- Graduate Program in Informatics (PPGia), Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil
| | - Osmar B. Dordal
- Polytechnic School, Centro Universitário UniDomBosco, Curitiba, Paraná, Brazil
| | - Mauri Ferrandin
- Department of Control, Automation and Computing Engineering, Universidade Federal de Santa Catarina (UFSC), Blumenau, Brazil
| | - Roberto H. Herai
- Graduate Program in Health Sciences, School of Medicine, Pontifícia Universidade Católica do Paraná (PUCPR), Curitiba, Paraná, Brazil
| | - Edson E. Scalabrin
- Graduate Program in Informatics (PPGia), Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil
| | - Bráulio C. Ávila
- Graduate Program in Informatics (PPGia), Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil
| |
Collapse
|
5
|
De Mattéo L, Holtz Y, Ranwez V, Bérard S. Efficient algorithms for Longest Common Subsequence of two bucket orders to speed up pairwise genetic map comparison. PLoS One 2018; 13:e0208838. [PMID: 30589848 PMCID: PMC6320017 DOI: 10.1371/journal.pone.0208838] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2018] [Accepted: 11/25/2018] [Indexed: 11/30/2022] Open
Abstract
Genetic maps order genetic markers along chromosomes. They are, for instance, extensively used in marker-assisted selection to accelerate breeding programs. Even for the same species, people often have to deal with several alternative maps obtained using different ordering methods or different datasets, e.g. resulting from different segregating populations. Having efficient tools to identify the consistency and discrepancy of alternative maps is thus essential to facilitate genetic map comparisons. We propose to encode genetic maps by bucket order, a kind of order, which takes into account the blurred parts of the marker order while being an efficient data structure to achieve low complexity algorithms. The main result of this paper is an O(n log(n)) procedure to identify the largest agreements between two bucket orders of n elements, their Longest Common Subsequence (LCS), providing an efficient solution to highlight discrepancies between two genetic maps. The LCS of two maps, being the largest set of their collinear markers, is used as a building block to compute pairwise map congruence, to visually emphasize maker collinearity and in some scaffolding methods relying on genetic maps to improve genome assembly. As the LCS computation is a key subroutine of all these genetic map related tools, replacing the current LCS subroutine of those methods by ours -to do the exact same work but faster- could significantly speed up those methods without changing their accuracy. To ease such transition we provide all required algorithmic details in this self contained paper as well as an R package implementing them, named LCSLCIS, which is freely available at: https://github.com/holtzy/LCSLCIS.
Collapse
Affiliation(s)
- Lisa De Mattéo
- ISEM, Université de Montpellier, CNRS, IRD, EPHE, Montpellier, France
| | - Yan Holtz
- Queensland Brain Institute, University of Queensland, Brisbane, Australia
| | - Vincent Ranwez
- AGAP, Univ Montpellier, CIRAD, INRA, Montpellier SupAgro, Montpellier, France
| | - Sèverine Bérard
- ISEM, Université de Montpellier, CNRS, IRD, EPHE, Montpellier, France
| |
Collapse
|
6
|
Lin J, Adjeroh DA, Jiang BH, Jiang Y. K2 and K2*: efficient alignment-free sequence similarity measurement based on Kendall statistics. Bioinformatics 2018; 34:1682-1689. [PMID: 29253072 PMCID: PMC6355110 DOI: 10.1093/bioinformatics/btx809] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/08/2017] [Revised: 12/11/2017] [Accepted: 12/14/2017] [Indexed: 11/13/2022] Open
Abstract
Motivation Alignment-free sequence comparison methods can compute the pairwise similarity between a huge number of sequences much faster than sequence-alignment based methods. Results We propose a new non-parametric alignment-free sequence comparison method, called K2, based on the Kendall statistics. Comparing to the other state-of-the-art alignment-free comparison methods, K2 demonstrates competitive performance in generating the phylogenetic tree, in evaluating functionally related regulatory sequences, and in computing the edit distance (similarity/dissimilarity) between sequences. Furthermore, the K2 approach is much faster than the other methods. An improved method, K2*, is also proposed, which is able to determine the appropriate algorithmic parameter (length) automatically, without first considering different values. Comparative analysis with the state-of-the-art alignment-free sequence similarity methods demonstrates the superiority of the proposed approaches, especially with increasing sequence length, or increasing dataset sizes. Availability and implementation The K2 and K2* approaches are implemented in the R language as a package and is freely available for open access (http://community.wvu.edu/daadjeroh/projects/K2/K2_1.0.tar.gz). Contact yueljiang@163.com. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Jie Lin
- Department of Software engineering, College of Mathematics and
Informatics, Fujian Normal University, Fuzhou, China
| | - Donald A Adjeroh
- Department of Computer Science & Electrical Engineering, West
Virginia University, Morgantown, WV, USA
| | - Bing-Hua Jiang
- Department of Pathology, Carver College of Medicine, The University of
Iowa, Iowa City, IA, USA
| | - Yue Jiang
- Department of Software engineering, College of Mathematics and
Informatics, Fujian Normal University, Fuzhou, China
| |
Collapse
|
7
|
Beal R, Afrin T, Farheen A, Adjeroh D. A new algorithm for "the LCS problem" with application in compressing genome resequencing data. BMC Genomics 2016; 17 Suppl 4:544. [PMID: 27556803 PMCID: PMC5001248 DOI: 10.1186/s12864-016-2793-0] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/25/2022] Open
Abstract
Background The longest common subsequence (LCS) problem is a classical problem in computer science, and forms the basis of the current best-performing reference-based compression schemes for genome resequencing data. Methods First, we present a new algorithm for the LCS problem. Using the generalized suffix tree, we identify the common substrings shared between the two input sequences. Using the maximal common substrings, we construct a directed acyclic graph (DAG), based on which we determine the LCS as the longest path in the DAG. Then, we introduce an LCS-motivated reference-based compression scheme using the components of the LCS, rather than the LCS itself. Results Our basic scheme compressed the Homo sapiens genome (with an original size of 3,080,436,051 bytes) to 15,460,478 bytes. An improvement on the basic method further reduced this to 8,556,708 bytes, or an overall compression ratio of 360. This can be compared to the previous state-of-the-art compression ratios of 157 (Wang and Zhang, 2011) and 171 (Pinho, Pratas, and Garcia, 2011). Conclusion We propose a new algorithm to address the longest common subsequence problem. Motivated by our LCS algorithm, we introduce a new reference-based compression scheme for genome resequencing data. Comparative results against state-of-the-art reference-based compression algorithms demonstrate the performance of the proposed method.
Collapse
Affiliation(s)
- Richard Beal
- Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV, USA.
| | - Tazin Afrin
- Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV, USA
| | - Aliya Farheen
- Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV, USA
| | - Donald Adjeroh
- Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV, USA
| |
Collapse
|