1. Cao B, Zheng Y, Shao Q, Liu Z, Xie L, Zhao Y, Wang B, Zhang Q, Wei X. Efficient data reconstruction: The bottleneck of large-scale application of DNA storage. Cell Rep 2024; 43:113699. [PMID: 38517891 DOI: 10.1016/j.celrep.2024.113699] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2023] [Revised: 11/15/2023] [Accepted: 01/05/2024] [Indexed: 03/24/2024] [Open access]
Abstract
Over the past decade, the rapid development of DNA synthesis and sequencing technologies has enabled preliminary use of DNA molecules for digital data storage, overcoming the capacity and persistence bottlenecks of silicon-based storage media. DNA storage has now been fully accomplished in the laboratory through existing biotechnology, which again demonstrates the viability of carbon-based storage media. However, the high cost and latency of data reconstruction pose challenges that hinder the practical implementation of DNA storage beyond the laboratory. In this article, we review existing advanced DNA storage methods, analyze the characteristics and performance of biotechnological approaches at various stages of data writing and reading, and discuss potential factors influencing DNA storage from the perspective of data reconstruction.
Affiliation(s)
- Ben Cao
  - School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning 116024, China; Centre for Frontier AI Research, Agency for Science, Technology, and Research (A*STAR), 1 Fusionopolis Way, Singapore 138632, Singapore
- Yanfen Zheng
  - School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning 116024, China
- Qi Shao
  - Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Xuefu Street, Dalian, Liaoning 116622, China
- Zhenlu Liu
  - Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Xuefu Street, Dalian, Liaoning 116622, China
- Lei Xie
  - Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Xuefu Street, Dalian, Liaoning 116622, China
- Yunzhu Zhao
  - Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Xuefu Street, Dalian, Liaoning 116622, China
- Bin Wang
  - Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Xuefu Street, Dalian, Liaoning 116622, China
- Qiang Zhang
  - School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning 116024, China
- Xiaopeng Wei
  - School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning 116024, China
2. Wang K, Cao B, Ma T, Zhao Y, Zheng Y, Wang B, Zhou S, Zhang Q. Storing Images in DNA via base128 Encoding. J Chem Inf Model 2024; 64:1719-1729. [PMID: 38385334 DOI: 10.1021/acs.jcim.3c01592] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/23/2024]
Abstract
Current DNA storage schemes lack flexibility and consistency in processing highly redundant and correlated image data, resulting in low sequence stability and image reconstruction rates. Therefore, according to the characteristics of image storage, this paper proposes storing images in DNA via base128 encoding (DNA-base128). In the data writing stage, data segmentation and probability statistics are carried out, and then, the data block frequency and constraint encoding set are associated with achieving encoding. When the image needs to be recovered, DNA-base128 completes internal error correction by threshold setting and drift comparison. Compared with representative work, the DNA-base128 encoding results show that the undesired motifs were reduced by 71.2-90.7% and that the local guanine-cytosine content variance was reduced by 3 times, indicating that DNA-base128 can store images more stably. In addition, the structural similarity index (SSIM) and multiscale structural similarity (MS-SSIM) of image reconstruction using DNA-base128 were improved by 19-102 and 6.6-20.3%, respectively. In summary, DNA-base128 provides image encoding with internal error correction and provides a potential solution for DNA image storage. The data and code are available at the GitHub repository: https://github.com/123456wk/DNA_base128.
Affiliation(s)
- Kun Wang
  - The Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Dalian 116622, China
- Ben Cao
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Tao Ma
  - Brain Function Research Section, China Medical University, Shenyang 110001, China
- Yunzhu Zhao
  - The Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Dalian 116622, China
- Yanfen Zheng
  - School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Bin Wang
  - The Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Dalian 116622, China
- Shihua Zhou
  - The Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Dalian 116622, China
- Qiang Zhang
  - The Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Dalian 116622, China
3. Zheng Y, Cao B, Zhang X, Cui S, Wang B, Zhang Q. DNA-QLC: an efficient and reliable image encoding scheme for DNA storage. BMC Genomics 2024; 25:266. [PMID: 38461245 PMCID: PMC10925009 DOI: 10.1186/s12864-024-10178-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2023] [Accepted: 03/01/2024] [Indexed: 03/11/2024] [Open access]
Abstract
BACKGROUND: DNA storage has the advantages of large capacity, long-term stability, and low power consumption relative to other storage media, making it a promising new storage medium for multimedia information such as images. However, DNA storage has a low coding density and weak error correction ability. RESULTS: To achieve more efficient DNA storage image reconstruction, we propose DNA-QLC (QRes-VAE and Levenshtein code (LC)), which uses the quantized ResNet VAE (QRes-VAE) model and LC for image compression and DNA sequence error correction, thus improving both the coding density and error correction ability. Experimental results show that the DNA-QLC encoding method not only obtains DNA sequences that meet the combinatorial constraints, but also achieves a net information density that is 2.4 times higher than that of DNA Fountain. Furthermore, at a higher error rate (2%), DNA-QLC achieved image reconstruction with an SSIM value of 0.917. CONCLUSIONS: The results indicate that the DNA-QLC encoding scheme guarantees the efficiency and reliability of the DNA storage system and improves the application potential of DNA storage for multimedia information such as images.
Affiliation(s)
- Yanfen Zheng
  - School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning 116024, China
- Ben Cao
  - School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning 116024, China
- Xiaokang Zhang
  - School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning 116024, China
- Shuang Cui
  - School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning 116024, China
- Bin Wang
  - The Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Xuefu Street, Dalian, Liaoning 116622, China
- Qiang Zhang
  - School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning 116024, China
4. Xie R, Zan X, Chu L, Su Y, Xu P, Liu W. Study of the error correction capability of multiple sequence alignment algorithm (MAFFT) in DNA storage. BMC Bioinformatics 2023; 24:111. [PMID: 36959531 PMCID: PMC10037887 DOI: 10.1186/s12859-023-05237-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2022] [Accepted: 03/17/2023] [Indexed: 03/25/2023] [Open access]
Abstract
Synchronization (insertions-deletions) errors are still a major challenge for reliable information retrieval in DNA storage. Unlike traditional error correction codes (ECC) that add redundancy in the stored information, multiple sequence alignment (MSA) solves this problem by searching the conserved subsequences. In this paper, we conduct a comprehensive simulation study on the error correction capability of a typical MSA algorithm, MAFFT. Our results reveal that its capability exhibits a phase transition when there are around 20% errors. Below this critical value, increasing sequencing depth can eventually allow it to approach complete recovery. Otherwise, its performance plateaus at some poor levels. Given a reasonable sequencing depth (≤ 70), MSA could achieve complete recovery in the low error regime, and effectively correct 90% of the errors in the medium error regime. In addition, MSA is robust to imperfect clustering. It could also be combined with other means such as ECC, repeated markers, or any other code constraints. Furthermore, by selecting an appropriate sequencing depth, this strategy could achieve an optimal trade-off between cost and reading speed. MSA could be a competitive alternative for future DNA storage.
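As a rough illustration of how aligned reads can yield a corrected sequence, the sketch below takes rows that have already been aligned (for example by MAFFT, with `-` marking gaps) and applies a column-wise majority vote. This is an assumed, simplified consensus step, not the procedure evaluated in the paper; the function name `consensus_from_alignment` and the sample reads are hypothetical.

```python
from collections import Counter


def consensus_from_alignment(aligned_reads):
    """Column-wise majority vote over pre-aligned reads.

    Assumes every row has the same length and uses '-' for gaps.
    Columns whose most common symbol is a gap are dropped.
    """
    if not aligned_reads:
        raise ValueError("at least one aligned read is required")
    width = len(aligned_reads[0])
    if any(len(read) != width for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    bases = []
    for column in zip(*aligned_reads):
        # most_common(1) returns the symbol with the highest count in this column
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != "-":
            bases.append(symbol)
    return "".join(bases)


# Hypothetical aligned reads: one read has an inserted T, another a deleted G.
reads = ["ACGT-A", "ACGTTA", "AC-T-A"]
print(consensus_from_alignment(reads))
```

With more reads per cluster (higher sequencing depth), each column vote has more evidence, which is one intuition for why recovery improves with depth in the low-error regime.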
Affiliation(s)
- Ranze Xie
  - Institution of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China
- Xiangzhen Zan
  - Institution of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China
- Ling Chu
  - Institution of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China
- Yanqing Su
  - Institution of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China
- Peng Xu
  - Institution of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China
- Wenbin Liu
  - Institution of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China
5. Zhang Q, Xia K, Jiang M, Li Q, Chen W, Han M, Li W, Ke R, Wang F, Zhao Y, Liu Y, Fan C, Gu H. Catalytic DNA-Assisted Mass Production of Arbitrary Single-Stranded DNA. Angew Chem Int Ed Engl 2023; 62:e202212011. [PMID: 36347780 DOI: 10.1002/anie.202212011] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 08/16/2022] [Revised: 10/25/2022] [Accepted: 11/08/2022] [Indexed: 11/11/2022]
Abstract
Synthetic single-stranded (ss) DNA is a cornerstone for life and materials science, yet the purity, quantity, length, and customizability of synthetic DNA are still limiting in various applications. Here, we present PECAN, paired-end cutting assisted by DNAzymes (DNA enzymes or deoxyribozymes), which enables mass production of ssDNA of arbitrary sequence (up to 7000 nucleotides, or nt) with single-base precision. At the core of the PECAN technique are two newly identified classes of DNAzymes, each robustly self-hydrolyzing with minimal sequence requirement up- or down-stream of its cleavage site. Flanking the target ssDNA with a pair of such DNAzymes generates a precursor ssDNA amplifiable by pseudogene-recombinant bacteriophage, which subsequently releases the target ssDNA in large quantities after efficient auto-processing. PECAN produces ssDNA of virtually any terminal bases and compositions with >98.5% purity at the milligram-to-gram scale. We demonstrate the feasibility of using PECAN ssDNA for RNA in situ detection, homology-directed genome editing, and DNA-based data storage.
Affiliation(s)
- Qiao Zhang
  - Fudan University Shanghai Cancer Center, and the Shanghai Key Laboratory of Medical Epigenetics, Institutes of Biomedical Sciences, Shanghai Stomatological Hospital, Fudan University, Shanghai, 200433, China
- Kai Xia
  - Fudan University Shanghai Cancer Center, and the Shanghai Key Laboratory of Medical Epigenetics, Institutes of Biomedical Sciences, Shanghai Stomatological Hospital, Fudan University, Shanghai, 200433, China; Department of Chemical Biology, School of Chemistry and Chemical Engineering, Frontiers Science Center for Transformative Molecules, National Center for Translational Medicine, Shanghai Jiao Tong University, Shanghai, 201108, China; Shanghai Frontier Innovation Research Institute, Shanghai, 201108, China
- Meng Jiang
  - School of Medicine and School of Biomedical Science, Huaqiao University, Fujian, 362021, China
- Qingting Li
  - Fudan University Shanghai Cancer Center, and the Shanghai Key Laboratory of Medical Epigenetics, Institutes of Biomedical Sciences, Shanghai Stomatological Hospital, Fudan University, Shanghai, 200433, China; Department of Chemical Biology, School of Chemistry and Chemical Engineering, Frontiers Science Center for Transformative Molecules, National Center for Translational Medicine, Shanghai Jiao Tong University, Shanghai, 201108, China
- Weigang Chen
  - Frontier Science Center for Synthetic Biology (Ministry of Education), Tianjin University, Tianjin, 300072, China
- Mingzhe Han
  - Frontier Science Center for Synthetic Biology (Ministry of Education), Tianjin University, Tianjin, 300072, China
- Wei Li
  - Fudan University Shanghai Cancer Center, and the Shanghai Key Laboratory of Medical Epigenetics, Institutes of Biomedical Sciences, Shanghai Stomatological Hospital, Fudan University, Shanghai, 200433, China
- Rongqin Ke
  - School of Medicine and School of Biomedical Science, Huaqiao University, Fujian, 362021, China
- Fei Wang
  - Department of Chemical Biology, School of Chemistry and Chemical Engineering, Frontiers Science Center for Transformative Molecules, National Center for Translational Medicine, Shanghai Jiao Tong University, Shanghai, 201108, China
- Yongxing Zhao
  - Department of Pharmaceutics, School of Pharmaceutical Sciences, Key Laboratory of Targeting Therapy and Diagnosis for Critical Diseases, and Key Laboratory of Advanced Drug Preparation Technologies, Zhengzhou University, Henan, 450001, China
- Yuehua Liu
  - Fudan University Shanghai Cancer Center, and the Shanghai Key Laboratory of Medical Epigenetics, Institutes of Biomedical Sciences, Shanghai Stomatological Hospital, Fudan University, Shanghai, 200433, China
- Chunhai Fan
  - Department of Chemical Biology, School of Chemistry and Chemical Engineering, Frontiers Science Center for Transformative Molecules, National Center for Translational Medicine, Shanghai Jiao Tong University, Shanghai, 201108, China
- Hongzhou Gu
  - Fudan University Shanghai Cancer Center, and the Shanghai Key Laboratory of Medical Epigenetics, Institutes of Biomedical Sciences, Shanghai Stomatological Hospital, Fudan University, Shanghai, 200433, China; Department of Chemical Biology, School of Chemistry and Chemical Engineering, Frontiers Science Center for Transformative Molecules, National Center for Translational Medicine, Shanghai Jiao Tong University, Shanghai, 201108, China
6. Zan X, Yao X, Xu P, Chen Z, Xie L, Li S, Liu W. A Hierarchical Error Correction Strategy for Text DNA Storage. Interdiscip Sci 2021; 14:141-150. [PMID: 34463928 DOI: 10.1007/s12539-021-00476-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/10/2021] [Revised: 08/20/2021] [Accepted: 08/22/2021] [Indexed: 12/28/2022]
Abstract
DNA storage has been a thriving interdisciplinary research area because of its high density, low maintenance cost, and long durability for information storage. However, the complexity of errors in DNA sequences, including substitutions, insertions, and deletions, hinders its application for massive data storage. Motivated by the divide-and-conquer algorithm, we propose a hierarchical error correction strategy for text DNA storage. The basic idea is to design robust codes for common characters that can correct one-base errors, including insertions and/or deletions. The errors are gradually corrected by the codes in DNA reads, multiple alignment of character lines, and finally word spelling. On the one hand, the proposed encoding method provides a systematic way to design storage-friendly codes, with properties such as 50% GC content, homopolymers of no more than 2 bases, and robustness against secondary structures. On the other hand, the proposed error correction method not only corrects a single insertion or deletion but also deals with multiple insertions or deletions. Simulation results demonstrate that the proposed method can correct more than 98% of errors when the error rate is less than or equal to 0.05. Thus, it is more powerful and adaptable to complicated DNA storage applications.
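The two sequence constraints named in this abstract (50% GC content and homopolymer runs of at most 2 bases) are easy to check programmatically. The sketch below is only an illustration of such a filter; the function names, the tolerance parameter `gc_tol`, and the sample sequences are assumptions, and the secondary-structure criterion is not modeled.

```python
from itertools import groupby


def gc_content(seq):
    """Fraction of G and C bases in a non-empty sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(base in "GC" for base in seq) / len(seq)


def max_homopolymer(seq):
    """Length of the longest run of identical consecutive bases."""
    if not seq:
        raise ValueError("sequence must not be empty")
    return max(len(list(run)) for _base, run in groupby(seq))


def is_storage_friendly(seq, gc_target=0.5, gc_tol=0.0, max_run=2):
    """True if GC content is within gc_tol of gc_target and no run exceeds max_run."""
    gc_ok = abs(gc_content(seq) - gc_target) <= gc_tol
    run_ok = max_homopolymer(seq) <= max_run
    return gc_ok and run_ok


# Hypothetical candidate codes for individual characters.
for candidate in ["AACCGGTT", "AAACGCGT", "ACGTAT"]:
    print(candidate, is_storage_friendly(candidate))
```

A code designer could use a check like this to prune candidate codewords before testing their error correction properties.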
Affiliation(s)
- Xiangzhen Zan
  - Institution of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China
- Xiangyu Yao
  - Institution of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China
- Peng Xu
  - Institution of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China
- Zhihua Chen
  - Institution of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China
- Lian Xie
  - Institution of Huangpu Research, Guangzhou University, Guangzhou, 510006, China
- Shudong Li
  - Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou, 510006, China
- Wenbin Liu
  - Institution of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China
7. Low-complexity and highly robust barcodes for error-rich single molecular sequencing. 3 Biotech 2021; 11:78. [PMID: 33505833 DOI: 10.1007/s13205-020-02607-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/23/2020] [Accepted: 12/23/2020] [Indexed: 12/28/2022] [Open access]
Abstract
DNA barcodes are frequently corrupted due to insertion, deletion, and substitution errors during DNA synthesis, amplification, and sequencing, resulting in index hopping. In this paper, we propose a new DNA barcode construction scheme that combines a cyclic block code with a predetermined pseudo-random sequence bit by bit to form bit pairs, and then converts the bit pairs to bases, i.e., the DNA barcodes. Then, we present a barcode identification scheme for noisy sequencing reads, which uses a combination of cyclic shifting and traditional dynamic programming to mark the insertion and deletion positions, and then performs erasure-and-error-correction decoding on the corrupted codewords. Furthermore, we verify the identification error rate of barcodes for multiple errors and evaluate the reliability of the barcodes in DNA context. This method can be easily generalized for constructing long barcodes, which may be used in scenarios with serious errors. Simulation results show that the bit error rate after identifying insertions/deletions is greatly reduced using the combination of cyclic shift and dynamic programming compared to using dynamic programming only. This indicates that the proposed method can effectively improve the accuracy of estimating insertion/deletion errors. The overall identification error rate of the proposed method is lower than 10⁻⁵ when the probability of each base mutation is less than 0.1, which is the typical scenario in third-generation sequencing.
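The construction step described here (pairing each codeword bit with a pseudo-random bit and mapping each pair to a base) can be sketched as follows. The pair-to-base table, the function names, and the example bit sequences are assumptions made for illustration; the abstract does not specify the mapping, and the cyclic block encoder and the decoding steps are not shown.

```python
# Assumed mapping from (codeword bit, pseudo-random bit) to a base.
PAIR_TO_BASE = {(0, 0): "A", (0, 1): "C", (1, 0): "G", (1, 1): "T"}
BASE_TO_PAIR = {base: pair for pair, base in PAIR_TO_BASE.items()}


def bits_to_barcode(codeword_bits, pseudo_random_bits):
    """Pair the two bit sequences position by position and map each pair to a base."""
    if len(codeword_bits) != len(pseudo_random_bits):
        raise ValueError("both bit sequences must have the same length")
    return "".join(
        PAIR_TO_BASE[(c, p)] for c, p in zip(codeword_bits, pseudo_random_bits)
    )


def barcode_to_codeword_bits(barcode):
    """Recover the codeword bits (first element of each pair) from an error-free barcode."""
    return [BASE_TO_PAIR[base][0] for base in barcode]


# Hypothetical 7-bit codeword and a fixed pseudo-random sequence of the same length.
codeword = [1, 0, 1, 1, 0, 0, 0]
pseudo_random = [0, 1, 1, 0, 1, 0, 1]
barcode = bits_to_barcode(codeword, pseudo_random)
print(barcode)
print(barcode_to_codeword_bits(barcode))
```

Because the pseudo-random sequence is known in advance, each base carries one unknown bit (from the codeword) and one known bit, which a decoder can use as a reference when locating insertions and deletions.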