1
Yan Z, Zhang H, Lu B, Han T, Tong X, Yuan Y. DNA palette code for time-series archival data storage. Natl Sci Rev 2025; 12:nwae321. [PMID: 39758123 PMCID: PMC11697981 DOI: 10.1093/nsr/nwae321] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2024] [Revised: 08/21/2024] [Accepted: 08/28/2024] [Indexed: 01/07/2025] Open
Abstract
The long-term preservation of large volumes of infrequently accessed cold data poses challenges to the storage community. Deoxyribonucleic acid (DNA) is considered a promising solution due to its inherent physical stability and significant storage density. The information density and decoding sequence coverage are two important metrics that influence the efficiency of DNA data storage. In this study, we propose a novel coding scheme called the DNA palette code, which is suitable for cold data, especially time-series archival datasets. These datasets are not frequently accessed, but require reliable long-term storage for retrospective research. The DNA palette code employs unordered combinations of index-free oligonucleotides to represent binary information. It can achieve high net information density encoding and lossless decoding with low sequencing coverage. When sequencing reads are corrupted, it can still effectively recover partial information, preventing the complete failure of file retrieval. The in vitro testing of clinical brain magnetic resonance imaging (MRI) data storage, as well as simulation validations using large-scale public MRI datasets (10 GB), planetary science datasets and meteorological datasets, demonstrates the advantages of our coding scheme, including high net information density, low decoding sequence coverage and wide applicability.
Affiliation(s)
- Zihui Yan
- Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
- Frontiers Research Institute for Synthetic Biology, Tianjin University, Tianjin 300072, China
- Haoran Zhang
- Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
- Frontiers Research Institute for Synthetic Biology, Tianjin University, Tianjin 300072, China
- Boyuan Lu
- Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
- Frontiers Research Institute for Synthetic Biology, Tianjin University, Tianjin 300072, China
- Tong Han
- Department of Neurosurgery, Huanhu Hospital, Tianjin 300350, China
- Xiaoguang Tong
- Department of Neurosurgery, Huanhu Hospital, Tianjin 300350, China
- Yingjin Yuan
- Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
- Frontiers Research Institute for Synthetic Biology, Tianjin University, Tianjin 300072, China
2
Schwarz PM, Freisleben B. Optimizing fountain codes for DNA data storage. Comput Struct Biotechnol J 2024; 23:3878-3896. [PMID: 39559773 PMCID: PMC11570749 DOI: 10.1016/j.csbj.2024.10.038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2024] [Revised: 10/22/2024] [Accepted: 10/22/2024] [Indexed: 11/20/2024] Open
Abstract
Fountain codes, originally developed for reliable multicasting in communication networks, are effectively applied in various data transmission and storage systems. Their recent use in DNA data storage systems has unique challenges, since the DNA storage channel deviates from the traditional Gaussian white noise erasure model considered in communication networks and has several restrictions as well as special properties. Thus, optimizing fountain codes to address these challenges promises to improve their overall usability in DNA data storage systems. In this article, we present several methods for optimizing fountain codes for DNA data storage. Apart from generally applicable optimizations for fountain codes, we propose optimization algorithms to create tailored distribution functions of fountain codes, which is novel in the context of DNA data storage. We evaluate the proposed methods in terms of various metrics related to the DNA storage channel. Our evaluation shows that optimizing fountain codes for DNA data storage can significantly enhance the reliability and capacity of DNA data storage systems. The developed methods represent a step forward in harnessing the full potential of fountain codes for DNA-based data storage applications. The new coding schemes and all developed methods are available under a free and open-source software license.
Affiliation(s)
- Peter Michael Schwarz
- Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str. 6, D-35043, Marburg, Germany
- Bernd Freisleben
- Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str. 6, D-35043, Marburg, Germany
3
Schwarz PM, Freisleben B. Data recovery methods for DNA storage based on fountain codes. Comput Struct Biotechnol J 2024; 23:1808-1823. [PMID: 38707543 PMCID: PMC11066528 DOI: 10.1016/j.csbj.2024.04.048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2024] [Revised: 04/18/2024] [Accepted: 04/18/2024] [Indexed: 05/07/2024] Open
Abstract
Today's digital data storage systems typically offer advanced data recovery solutions to address the problem of catastrophic data loss, such as software-based disk sector analysis or physical-level data retrieval methods for conventional hard disk drives. However, DNA-based data storage currently relies solely on the inherent error correction properties of the methods used to encode digital data into strands of DNA. Any error that cannot be corrected utilizing the redundancy added by DNA encoding methods results in permanent data loss. To provide data recovery for DNA storage systems, we present a method to automatically reconstruct corrupted or missing data stored in DNA using fountain codes. Our method exploits the relationships between packets encoded with fountain codes to identify and rectify corrupted or lost data. Furthermore, we present file type-specific and content-based data recovery methods for three file types, illustrating how a fusion of fountain encoding-specific redundancy and knowledge about the data can effectively recover information in a corrupted DNA storage system, both in an automatic and in a guided manual manner. To demonstrate our approach, we introduce DR4DNA, a software toolkit that contains all methods presented. We evaluate DR4DNA using both in-silico and in-vitro experiments.
Affiliation(s)
- Peter Michael Schwarz
- Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Straße 6, Marburg, D-35043, Germany
- Bernd Freisleben
- Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Straße 6, Marburg, D-35043, Germany
4
Wang C, Wei D, Wei Z, Yang D, Xing J, Wang Y, Wang X, Wang P, Ma G, Zhang X, Li H, Tang C, Hou P, Wang J, Gao R, Xie G, Li C, Ju Y, Wang P, Yue L, Zhao Y, Sheng Y, Xiao J, Niu H, Xu S, Yang H, Liu D, Duan B, Bu D, Tan G, Chen F. Cost-Effective DNA Storage System with DNA Movable Type. Advanced Science (Weinheim, Baden-Wurttemberg, Germany) 2024:e2411354. [PMID: 39555674 DOI: 10.1002/advs.202411354] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2024] [Revised: 11/06/2024] [Indexed: 11/19/2024]
Abstract
In the face of exponential data growth, DNA-based storage offers a promising solution for preserving big data. However, most existing DNA storage methods, akin to traditional block printing, require costly chemical synthesis for each individual data file, adopting a sequential, one-time-use synthesis approach. To overcome these limitations, a novel, cost-effective "DNA-movable-type storage" system, inspired by movable type printing, is introduced. This system utilizes prefabricated DNA movable types (DNA-MTs), which are short, double-stranded DNA oligonucleotides encoding specific payload, address, and checksum data. These DNA-MTs are enzymatically ligated/assembled into cohesive sequences, termed "DNA movable type blocks," streamlining the assembly process with the automated BISHENG-1 DNA-MT inkjet printer. Using BISHENG-1, 43.7 KB of data files are successfully printed, assembled, stored, and accurately retrieved in diverse formats (text, image, audio, and video) in vitro and in vivo, using only 350 DNA-MTs. Notably, each DNA-MT, synthesized once (2 OD), can be used up to 10000 times, reducing costs to $122/MB and outperforming existing DNA storage methods. This innovation circumvents the need to synthesize entire DNA sequences encoding files from scratch, offering significant cost and efficiency advantages. Furthermore, it has considerable untapped potential to advance a robust DNA storage system, better meeting the extensive data storage demands of the big-data era.
Affiliation(s)
- Chenyang Wang
- China National Center for Bioinformation, Beijing, 100101, China
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- University of Chinese Academy of Sciences, Beijing, 101408, China
- Di Wei
- China National Center for Bioinformation, Beijing, 100101, China
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- Zheng Wei
- SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
- Dongxin Yang
- Western Institute of Computing Technology, Chongqing, 401121, China
- Jing Xing
- SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
- Yunze Wang
- China National Center for Bioinformation, Beijing, 100101, China
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- University of Chinese Academy of Sciences, Beijing, 101408, China
- Xiaotong Wang
- China National Center for Bioinformation, Beijing, 100101, China
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- Pei Wang
- Western Institute of Computing Technology, Chongqing, 401121, China
- Guannan Ma
- China National Center for Bioinformation, Beijing, 100101, China
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- Xinru Zhang
- University of Chinese Academy of Sciences, Beijing, 101408, China
- SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
- Haolan Li
- Western Institute of Computing Technology, Chongqing, 401121, China
- Chuan Tang
- Western Institute of Computing Technology, Chongqing, 401121, China
- Pengfei Hou
- Western Institute of Computing Technology, Chongqing, 401121, China
- Jie Wang
- China National Center for Bioinformation, Beijing, 100101, China
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- University of Chinese Academy of Sciences, Beijing, 101408, China
- Renjun Gao
- Key Laboratory for Molecular Enzymology and Engineering of Ministry of Education, School of Life Sciences, Jilin University, Changchun, 130012, China
- Guiqiu Xie
- Key Laboratory for Molecular Enzymology and Engineering of Ministry of Education, School of Life Sciences, Jilin University, Changchun, 130012, China
- Cuidan Li
- China National Center for Bioinformation, Beijing, 100101, China
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- Yingjiao Ju
- China National Center for Bioinformation, Beijing, 100101, China
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- University of Chinese Academy of Sciences, Beijing, 101408, China
- Peihan Wang
- China National Center for Bioinformation, Beijing, 100101, China
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- University of Chinese Academy of Sciences, Beijing, 101408, China
- Liya Yue
- China National Center for Bioinformation, Beijing, 100101, China
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- Yongliang Zhao
- China National Center for Bioinformation, Beijing, 100101, China
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- Yongjie Sheng
- Key Laboratory for Molecular Enzymology and Engineering of Ministry of Education, School of Life Sciences, Jilin University, Changchun, 130012, China
- Jingfa Xiao
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- University of Chinese Academy of Sciences, Beijing, 101408, China
- National Genomics Data Center, China National Center for Bioinformation, Beijing, 100101, China
- Haitao Niu
- Key Laboratory of Viral Pathogenesis & Infection Prevention and Control (Jinan University), Ministry of Education, Guangzhou, 510632, China
- Sihong Xu
- Division II of In Vitro Diagnostics for Infectious Diseases, Institute for In Vitro Diagnostics Control, National Institutes for Food and Drug Control, Beijing, 100050, China
- Huaiyi Yang
- University of Chinese Academy of Sciences, Beijing, 101408, China
- Department of Microbial Physiological and Metabolic Engineering, State Key Lab of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, 100101, China
- Di Liu
- University of Chinese Academy of Sciences, Beijing, 101408, China
- CAS Key Laboratory of Special Pathogens and Biosafety, Center for Biosafety Mega-Science, Wuhan Institute of Virology, Chinese Academy of Sciences, Wuhan, 430071, China
- Bo Duan
- University of Chinese Academy of Sciences, Beijing, 101408, China
- SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
- Western Institute of Computing Technology, Chongqing, 401121, China
- Dongbo Bu
- University of Chinese Academy of Sciences, Beijing, 101408, China
- SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
- Western Institute of Computing Technology, Chongqing, 401121, China
- Central China Institute for Artificial Intelligence, Zhengzhou, 450046, China
- Guangming Tan
- University of Chinese Academy of Sciences, Beijing, 101408, China
- SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
- Western Institute of Computing Technology, Chongqing, 401121, China
- Fei Chen
- China National Center for Bioinformation, Beijing, 100101, China
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- University of Chinese Academy of Sciences, Beijing, 101408, China
- Key Laboratory of Viral Pathogenesis & Infection Prevention and Control (Jinan University), Ministry of Education, Guangzhou, 510632, China
- State Key Laboratory of Pathogenesis, Prevention and Treatment of High Incidence Diseases in Central Asia, Clinical Medicine Institute, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, Xinjiang, 830054, China
5
Hu Y, Liu Y, Yang Y. Adaptive Arithmetic Coding-Based Encoding Method Toward High-Density DNA Storage. J Comput Biol 2024. [PMID: 39544175 DOI: 10.1089/cmb.2024.0697] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2024] Open
Abstract
With the rapid advancement of big data and artificial intelligence technologies, the limitations inherent in traditional storage media for accommodating vast amounts of data have become increasingly evident. DNA storage is an innovative approach harnessing DNA and other biomolecules as storage mediums, endowed with superior characteristics including expansive capacity, remarkable density, minimal energy requirements, and unparalleled longevity. Central to the efficient DNA storage is the process of DNA coding, whereby digital information is converted into sequences of DNA bases. A novel encoding method based on adaptive arithmetic coding (AAC) has been introduced, delineating the encoding process into three distinct phases: compression, error correction, and mapping. Prediction by Partial Matching (PPM)-based AAC in the compression phase serves to compress data and enhance storage density. Subsequently, the error correction phase relies on octal Hamming code to rectify errors and safeguard data integrity. The mapping phase employs a "3-2 code" mapping relationship to ensure adherence to biochemical constraints. The proposed method was verified by encoding different formats of files such as text, pictures, and audio. The results indicated that the average coding density of bases can be up to 3.25 per nucleotide, the GC content (which includes guanine [G] and cytosine [C]) can be stabilized at 50% and the homopolymer length is restricted to no more than 2. Simulation experimental results corroborate the method's efficacy in preserving data integrity during both reading and writing operations, augmenting storage density, and exhibiting robust error correction capabilities.
Affiliation(s)
- Yingxin Hu
- College of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, China
- Yanjun Liu
- College of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, China
- Yuefei Yang
- College of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, China
6
Bar-Lev D, Sabary O, Yaakobi E. The zettabyte era is in our DNA. Nature Computational Science 2024; 4:813-817. [PMID: 39516373 DOI: 10.1038/s43588-024-00717-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2023] [Accepted: 10/03/2024] [Indexed: 11/16/2024]
Abstract
This Perspective surveys the critical computational challenges associated with in vitro DNA-based data storage. As digital data expand exponentially, traditional storage media are becoming less viable, making DNA a promising solution due to its density and durability. However, numerous obstacles remain, including error correction, data retrieval from large volumes of noisy reads, and scalability. The Perspective also highlights challenges for DNA-based data centers, such as fault tolerance, random access, and data removal, which must be addressed to make DNA-based storage practical.
Affiliation(s)
- Daniella Bar-Lev
- The Henry and Marilyn Taub Faculty of Computer Science, Technion, Israel Institute of Technology, Haifa, Israel.
- Omer Sabary
- The Henry and Marilyn Taub Faculty of Computer Science, Technion, Israel Institute of Technology, Haifa, Israel.
- Eitan Yaakobi
- The Henry and Marilyn Taub Faculty of Computer Science, Technion, Israel Institute of Technology, Haifa, Israel.
7
Zhao X, Li J, Fan Q, Dai J, Long Y, Liu R, Zhai J, Pan Q, Li Y. Composite Hedges Nanopores codec system for rapid and portable DNA data readout with high INDEL-Correction. Nat Commun 2024; 15:9395. [PMID: 39477940 PMCID: PMC11525716 DOI: 10.1038/s41467-024-53455-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2024] [Accepted: 10/11/2024] [Indexed: 11/02/2024] Open
Abstract
Reading digital information from the highly dense but lightweight DNA medium currently relies on time-consuming next-generation sequencing. Nanopore sequencing promises to overcome this efficiency problem, but its high indel error rates mean that a large amount of high-quality data is required for accurate readout. Here we introduce Composite Hedges Nanopores, capable of handling indel rates up to 15.9% and substitution rates up to 7.8%. The overall information density can be doubled from 0.59 to 1.17 by utilizing a degenerated eight-letter alphabet. We demonstrate that sequencing times of 20 and 120 minutes are sufficient for processing representative text and image files, respectively. Moreover, to achieve complete data recovery, it is estimated that text and image data require 4× and 8× physical redundancy of composite strands, respectively. Our codec system excels in both molecular design and equalized dictionary usage, laying a solid foundation for approaching real-time DNA data retrieval and encoding.
Affiliation(s)
- Xuyang Zhao
- School of Microelectronics, MOE Engineering Research Center of Integrated Circuits for Next Generation Communications, Southern University of Science and Technology, Shenzhen, China
- Junyao Li
- School of Microelectronics, MOE Engineering Research Center of Integrated Circuits for Next Generation Communications, Southern University of Science and Technology, Shenzhen, China
- Qingyuan Fan
- School of Microelectronics, MOE Engineering Research Center of Integrated Circuits for Next Generation Communications, Southern University of Science and Technology, Shenzhen, China
- Jing Dai
- School of Microelectronics, MOE Engineering Research Center of Integrated Circuits for Next Generation Communications, Southern University of Science and Technology, Shenzhen, China
- Yanping Long
- Department of Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen, China
- Ronghui Liu
- School of Microelectronics, MOE Engineering Research Center of Integrated Circuits for Next Generation Communications, Southern University of Science and Technology, Shenzhen, China
- Jixian Zhai
- Department of Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen, China
- Qing Pan
- College of Information Engineering, Zhejiang University of Technology, Hangzhou, China.
- Yi Li
- School of Microelectronics, MOE Engineering Research Center of Integrated Circuits for Next Generation Communications, Southern University of Science and Technology, Shenzhen, China.
8
Rasool A, Hong J, Hong Z, Li Y, Zou C, Chen H, Qu Q, Wang Y, Jiang Q, Huang X, Dai J. An Effective DNA-Based File Storage System for Practical Archiving and Retrieval of Medical MRI Data. Small Methods 2024; 8:e2301585. [PMID: 38807543 DOI: 10.1002/smtd.202301585] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Revised: 03/29/2024] [Indexed: 05/30/2024]
Abstract
DNA-based data storage is a new technology in computational and synthetic biology that offers a solution for long-term, high-density data archiving. Given the critical importance of medical data in advancing human health, there is a growing interest in developing an effective medical data storage system based on DNA. Data integrity, accuracy, reliability, and efficient retrieval are all significant concerns. Therefore, this study proposes an Effective DNA Storage (EDS) approach for archiving medical MRI data. The EDS approach incorporates three key components: (i) a novel fraction strategy to address the critical issue of rotating encoding, which often leads to data loss due to single-base error propagation; (ii) a novel rule-based quaternary transcoding method that satisfies bio-constraints and ensures reliable mapping; and (iii) an indexing technique designed to simplify random search and access. The effectiveness of this approach is validated through computer simulations and biological experiments, confirming its practicality. The EDS approach outperforms existing methods, providing superior control over bio-constraints and reducing computational time. The results and code provided in this study open new avenues for practical DNA storage of medical MRI data, offering promising prospects for the future of medical data archiving and retrieval.
Affiliation(s)
- Abdur Rasool
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Shenzhen, 518055, China
- Jingwei Hong
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- College of Mathematics and Information Science, Hebei University, Baoding, 071002, China
- Zhiling Hong
- Quanzhou Development Group Co., Ltd, Quanzhou, 362000, China
- Yuanzhen Li
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- Shenzhen Key Laboratory of Synthetic Genomics, Guangdong Provincial Key Laboratory of Synthetic Genomics, Key Laboratory of Quantitative Synthetic Biology, Shenzhen Institute of Synthetic Biology, Shenzhen, 518055, China
- Chao Zou
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- Hui Chen
- Shenzhen Polytechnic University, Shenzhen, 518055, China
- Qiang Qu
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- Yang Wang
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- Qingshan Jiang
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- Xiaoluo Huang
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- Shenzhen Key Laboratory of Synthetic Genomics, Guangdong Provincial Key Laboratory of Synthetic Genomics, Key Laboratory of Quantitative Synthetic Biology, Shenzhen Institute of Synthetic Biology, Shenzhen, 518055, China
- Junbiao Dai
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, 518055, China
9
Gao Y, No A. Efficient and low-complexity variable-to-variable length coding for DNA storage. BMC Bioinformatics 2024; 25:320. [PMID: 39354338 PMCID: PMC11446080 DOI: 10.1186/s12859-024-05943-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2023] [Accepted: 09/23/2024] [Indexed: 10/03/2024] Open
Abstract
BACKGROUND: Efficient DNA-based storage systems offer substantial capacity and longevity at reduced costs, addressing anticipated data growth. However, encoding data into DNA sequences is limited by two key constraints: 1) a maximum of $h$ consecutive identical bases (homopolymer constraint $h$), and 2) a GC ratio within $[0.5 - c_{GC},\ 0.5 + c_{GC}]$ (GC content constraint $c_{GC}$). Sequencing or synthesis errors tend to increase when these constraints are violated. RESULTS: In this research, we address a pure source coding problem in the context of DNA storage, considering both homopolymer and GC content constraints. We introduce a novel coding technique that adheres to these constraints while maintaining linear complexity for increased block lengths and achieving near-optimal rates. We demonstrate the effectiveness of the proposed method through experiments on both randomly generated data and existing files. For example, when $h = 4$ and $c_{GC} = 0.05$, the rate reached 1.988, close to the theoretical limit of 1.990. The associated code can be accessed at GitHub. CONCLUSION: We propose a variable-to-variable-length encoding method that does not rely on concatenating short predefined sequences, which achieves near-optimal rates.
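The two constraints named in this abstract can be illustrated with a small checker. This is a minimal sketch only, not the authors' encoder: it merely tests whether a given sequence satisfies a homopolymer limit and a GC window, and the function names (`max_homopolymer_run`, `gc_ratio`, `satisfies_constraints`) and parameter names (`h`, `c_gc`) are hypothetical choices made for this illustration.

```python
def max_homopolymer_run(seq: str) -> int:
    """Length of the longest run of identical consecutive bases."""
    longest = 0
    run = 0
    prev = None
    for base in seq:
        # Extend the current run if the base repeats, otherwise start a new run.
        run = run + 1 if base == prev else 1
        longest = max(longest, run)
        prev = base
    return longest


def gc_ratio(seq: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def satisfies_constraints(seq: str, h: int, c_gc: float) -> bool:
    """True if seq has no run longer than h and a GC ratio within 0.5 +/- c_gc.

    Assumption for this sketch: an empty sequence is treated as invalid.
    """
    if not seq:
        return False
    return max_homopolymer_run(seq) <= h and abs(gc_ratio(seq) - 0.5) <= c_gc


# Example with the parameter values quoted in the abstract (h = 4, c_GC = 0.05).
print(satisfies_constraints("ACGTACGT", h=4, c_gc=0.05))  # True
print(satisfies_constraints("AAAAACGT", h=4, c_gc=0.05))  # False: run of 5 A's
```

A checker like this only validates candidate sequences; the coding problem described in the abstract is the harder task of mapping arbitrary data onto sequences that always pass such a check.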
Affiliation(s)
- Yunfei Gao
- SJTU-Ruijin-UIH Institute for Medical Imaging Technology, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, No. 197 Ruijin Second Road, Shanghai, 200025, China
- Albert No
- Department of Artificial Intelligence, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, 03722, South Korea.
10
Dou C, Yang Y, Zhu F, Li B, Duan Y. Explorer: efficient DNA coding by De Bruijn graph toward arbitrary local and global biochemical constraints. Brief Bioinform 2024; 25:bbae363. [PMID: 39073829 DOI: 10.1093/bib/bbae363] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2024] [Revised: 06/25/2024] [Accepted: 07/13/2024] [Indexed: 07/30/2024] Open
Abstract
With the exponential growth of digital data, there is a pressing need for innovative storage media and techniques. DNA molecules, due to their stability, storage capacity, and density, offer a promising solution for information storage. However, DNA storage also faces numerous challenges, such as complex biochemical constraints and encoding efficiency. This paper presents Explorer, a high-efficiency DNA coding algorithm based on the De Bruijn graph, which leverages its capability to characterize local sequences. Explorer enables coding under various biochemical constraints, such as homopolymers, GC content, and undesired motifs. This paper also introduces Codeformer, a fast decoding algorithm based on the transformer architecture, to further enhance decoding efficiency. Numerical experiments indicate that, compared with other advanced algorithms, Explorer not only achieves stable encoding and decoding under various biochemical constraints but also increases the encoding efficiency and bit rate by >10%. Additionally, Codeformer demonstrates the ability to efficiently decode large quantities of DNA sequences. Under different parameter settings, its decoding efficiency exceeds that of traditional algorithms by more than two-fold. When Codeformer is combined with Reed-Solomon code, its decoding accuracy exceeds 99%, making it a good choice for high-speed decoding applications. These advancements are expected to contribute to the development of DNA-based data storage systems and the broader exploration of DNA as a novel information storage medium.
Affiliation(s)
- Chang Dou
- Center for Applied Mathematics, Tianjin University, No. 92, Weijin Road, Nankai District, Tianjin 300072, China
- Yijie Yang
- Center for Applied Mathematics, Tianjin University, No. 92, Weijin Road, Nankai District, Tianjin 300072, China
- Fei Zhu
- Center for Applied Mathematics, Tianjin University, No. 92, Weijin Road, Nankai District, Tianjin 300072, China
- BingZhi Li
- Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, No. 92, Weijin Road, Nankai District, Tianjin 300072, China
- School of Chemical Engineering and Technology, Tianjin University, No. 92, Weijin Road, Nankai District, Tianjin 300072, China
| | - Yuping Duan
- Center for Applied Mathematics, Tianjin University, No. 92, Weijin Road, Nankai District, Tianjin 300072, China
| |
Collapse
|
11
|
Cao B, Wang K, Xie L, Zhang J, Zhao Y, Wang B, Zheng P. PELMI: Realize robust DNA image storage under general errors via parity encoding and local mean iteration. Brief Bioinform 2024; 25:bbae463. [PMID: 39288232 PMCID: PMC11407442 DOI: 10.1093/bib/bbae463] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2024] [Revised: 09/01/2024] [Accepted: 09/04/2024] [Indexed: 09/19/2024] Open
Abstract
DNA molecules as storage media are characterized by high encoding density and low energy consumption, making DNA storage a highly promising storage method. However, DNA storage has shortcomings, especially when storing multimedia data, wherein image reconstruction fails when address errors occur, resulting in complete data loss. Therefore, we propose a parity encoding and local mean iteration (PELMI) scheme to achieve robust DNA storage of images. The proposed parity encoding scheme satisfies the common biochemical constraints of DNA sequences and the undesired motif content. It addresses varying pixel weights at different positions for binary data, thus optimizing the utilization of Reed-Solomon error correction. Then, through lost and erroneous sequences, data supplementation and local mean iteration are employed to enhance the robustness. The encoding results show that the undesired motif content is reduced by 23%-50% compared with the representative schemes, which improves the sequence stability. PELMI achieves image reconstruction under general errors (insertion, deletion, substitution) and enhances DNA sequence quality. In particular, under 1% error, compared with other advanced encoding schemes, the peak signal-to-noise ratio and the multiscale structure similarity address metric were increased by 10%-13% and 46.8%-122%, respectively, and the mean squared error decreased by 113%-127%. This demonstrates that the reconstructed images had better clarity, fidelity, and similarity in structure, texture, and detail. In summary, PELMI ensures robustness and stability of image storage in DNA and achieves relatively high-quality image reconstruction under general errors.
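For readers unfamiliar with the image-quality metrics quoted in this abstract, the standard definitions of mean squared error (MSE) and peak signal-to-noise ratio (PSNR) can be sketched as follows. This is a generic illustration and not code from the PELMI paper: images are assumed to be flat lists of 8-bit pixel values, and the function names are hypothetical.

```python
import math


def mse(original, reconstructed):
    """Mean squared error between two equally sized pixel sequences."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(original, reconstructed)) / len(original)


def psnr(original, reconstructed, peak=255):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    err = mse(original, reconstructed)
    return math.inf if err == 0 else 10 * math.log10(peak ** 2 / err)
```

A lower MSE and a higher PSNR both indicate a reconstruction closer to the original image.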
Affiliation(s)
- Ben Cao
- School of Computer Science and Technology, Dalian University of Technology, No. 2 Linggong Road, Ganjingzi District, Dalian, Liaoning 116024, China
- Kun Wang
- The Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, No. 10 Xuefu Street, Dalian Economic-Technological Development Zone, Dalian, Liaoning 116622, China
- Lei Xie
- The Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, No. 10 Xuefu Street, Dalian Economic-Technological Development Zone, Dalian, Liaoning 116622, China
- Jianxia Zhang
- School of Intelligent Engineering, Henan Institute of Technology, No. 90, East Hualan Avenue, Hongqi District, Xinxiang, Henan 451191, China
- Yunzhu Zhao
- The Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, No. 10 Xuefu Street, Dalian Economic-Technological Development Zone, Dalian, Liaoning 116622, China
- Bin Wang
- The Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, No. 10 Xuefu Street, Dalian Economic-Technological Development Zone, Dalian, Liaoning 116622, China
- Pan Zheng
- Department of Accounting and Information Systems, University of Canterbury, Upper Riccarton, Christchurch 8140, New Zealand
12
Seo S, Tandon A, Lee KW, Lee JH, Park SH. Information Density Enhancement Using Lossy Compression in DNA Data Storage. Adv Mater 2024:e2403071. [PMID: 38779945 DOI: 10.1002/adma.202403071] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Revised: 05/06/2024] [Indexed: 05/25/2024]
Abstract
This study develops two deoxyribonucleic acid (DNA) lossy compression models, Models A and B, to encode grayscale images into DNA sequences, enhance information density, and enable high-fidelity image recovery. These models, distinguished by their handling of pixel domains and interpolation methods, offer a novel approach to data storage for DNA. Model A processes pixels in overlapped domains using linear interpolation (LI), whereas Model B uses non-overlapped domains with nearest-neighbor interpolation (NNI). Through a comparative analysis with Joint Photographic Experts Group (JPEG) compression, the DNA lossy compression models demonstrate competitive advantages in terms of information density and image quality restoration. The application of these models to the Modified National Institute of Standards and Technology (MNIST) dataset reveals their efficiency and the recognizability of decompressed images, which is validated by convolutional neural network (CNN) performance. In particular, Model B2, a version of Model B, emerges as an effective method for balancing high information density (more than 20 times the typical density of two bits per nucleotide) with reasonably good image quality. These findings highlight the potential of DNA-based data storage systems for high-density and efficient compression, indicating a promising future for biological data storage solutions.
Affiliation(s)
- Seongjun Seo
- Department of Physics and Sungkyunkwan Advanced Institute of Nanotechnology (SAINT), Sungkyunkwan University, Suwon, 16419, Republic of Korea
- Anshula Tandon
- Department of Physics and Sungkyunkwan Advanced Institute of Nanotechnology (SAINT), Sungkyunkwan University, Suwon, 16419, Republic of Korea
- Keun Woo Lee
- DNASTech, Industry-Academic Cooperation Center, Sungkyunkwan University, Suwon, 16419, Republic of Korea
- Jee-Hyong Lee
- Department of Artificial Intelligence, Sungkyunkwan University, Suwon, 16419, Republic of Korea
- Sung Ha Park
- Department of Physics and Sungkyunkwan Advanced Institute of Nanotechnology (SAINT), Sungkyunkwan University, Suwon, 16419, Republic of Korea
13
Welzel M, Dreßler H, Heider D. Turbo autoencoders for the DNA data storage channel with Autoturbo-DNA. iScience 2024; 27:109575. [PMID: 38638577 PMCID: PMC11024904 DOI: 10.1016/j.isci.2024.109575] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2023] [Revised: 01/04/2024] [Accepted: 03/25/2024] [Indexed: 04/20/2024] Open
Abstract
DNA, with its high storage density and long-term stability, is a potential candidate for a next-generation storage device. The DNA data storage channel, composed of synthesis, amplification, storage, and sequencing, exhibits error probabilities and error profiles specific to the components of the channel. Here, we present Autoturbo-DNA, a PyTorch framework for training error-correcting, overcomplete autoencoders specifically tailored for the DNA data storage channel. It allows training different architecture combinations and using a wide variety of channel component models for noise generation during training. It further supports training the encoder to generate DNA sequences that adhere to user-defined constraints. Autoturbo-DNA exhibits error-correction capabilities close to non-neural-network state-of-the-art error correction and constrained codes for DNA data storage. Our results indicate that neural-network-based codes can be a viable alternative to traditionally designed codes for the DNA data storage channel.
Affiliation(s)
- Marius Welzel
- Department of Mathematics and Computer Science, University of Marburg, 35043 Marburg, Hesse, Germany
- Hagen Dreßler
- Department of Sustainable Systems Engineering, University of Freiburg, Fahnenbergplatz, 79085 Freiburg im Breisgau, Baden-Württemberg, Germany
- Dominik Heider
- Department of Mathematics and Computer Science, University of Marburg, 35043 Marburg, Hesse, Germany
14
Cao B, Zheng Y, Shao Q, Liu Z, Xie L, Zhao Y, Wang B, Zhang Q, Wei X. Efficient data reconstruction: The bottleneck of large-scale application of DNA storage. Cell Rep 2024; 43:113699. [PMID: 38517891 DOI: 10.1016/j.celrep.2024.113699] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2023] [Revised: 11/15/2023] [Accepted: 01/05/2024] [Indexed: 03/24/2024] Open
Abstract
Over the past decade, the rapid development of DNA synthesis and sequencing technologies has enabled preliminary use of DNA molecules for digital data storage, overcoming the capacity and persistence bottlenecks of silicon-based storage media. DNA storage has now been fully accomplished in the laboratory through existing biotechnology, which again demonstrates the viability of carbon-based storage media. However, the high cost and latency of data reconstruction pose challenges that hinder the practical implementation of DNA storage beyond the laboratory. In this article, we review existing advanced DNA storage methods, analyze the characteristics and performance of biotechnological approaches at various stages of data writing and reading, and discuss potential factors influencing DNA storage from the perspective of data reconstruction.
Affiliation(s)
- Ben Cao
- School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning 116024, China; Centre for Frontier AI Research, Agency for Science, Technology, and Research (A*STAR), 1 Fusionopolis Way, Singapore 138632, Singapore
- Yanfen Zheng
- School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning 116024, China
- Qi Shao
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Xuefu Street, Dalian, Liaoning 116622, China
- Zhenlu Liu
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Xuefu Street, Dalian, Liaoning 116622, China
- Lei Xie
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Xuefu Street, Dalian, Liaoning 116622, China
- Yunzhu Zhao
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Xuefu Street, Dalian, Liaoning 116622, China
- Bin Wang
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Xuefu Street, Dalian, Liaoning 116622, China
- Qiang Zhang
- School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning 116024, China.
- Xiaopeng Wei
- School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning 116024, China
15
Zheng Y, Cao B, Zhang X, Cui S, Wang B, Zhang Q. DNA-QLC: an efficient and reliable image encoding scheme for DNA storage. BMC Genomics 2024; 25:266. [PMID: 38461245 PMCID: PMC10925009 DOI: 10.1186/s12864-024-10178-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2023] [Accepted: 03/01/2024] [Indexed: 03/11/2024] Open
Abstract
BACKGROUND: DNA storage has the advantages of large capacity, long-term stability, and low power consumption relative to other storage mediums, making it a promising new storage medium for multimedia information such as images. However, DNA storage has a low coding density and weak error correction ability. RESULTS: To achieve more efficient DNA storage image reconstruction, we propose DNA-QLC (QRes-VAE and Levenshtein code (LC)), which uses the quantized ResNet VAE (QRes-VAE) model and LC for image compression and DNA sequence error correction, thus improving both the coding density and error correction ability. Experimental results show that the DNA-QLC encoding method can not only obtain DNA sequences that meet the combinatorial constraints, but also have a net information density that is 2.4 times higher than DNA Fountain. Furthermore, at a higher error rate (2%), DNA-QLC achieved image reconstruction with an SSIM value of 0.917. CONCLUSIONS: The results indicate that the DNA-QLC encoding scheme guarantees the efficiency and reliability of the DNA storage system and improves the application potential of DNA storage for multimedia information such as images.
Affiliation(s)
- Yanfen Zheng
- School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning, 116024, China
- Ben Cao
- School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning, 116024, China
- Xiaokang Zhang
- School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning, 116024, China
- Shuang Cui
- School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning, 116024, China
- Bin Wang
- The Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Xuefu Street, Dalian, Liaoning, 116622, China
- Qiang Zhang
- School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning, 116024, China.
16
Zhang X, Qi B, Niu Y. A dual-rule encoding DNA storage system using chaotic mapping to control GC content. Bioinformatics 2024; 40:btae113. [PMID: 38419588 PMCID: PMC10937898 DOI: 10.1093/bioinformatics/btae113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2023] [Revised: 02/21/2024] [Accepted: 02/26/2024] [Indexed: 03/02/2024] Open
Abstract
MOTIVATION: DNA as a novel storage medium is considered an effective solution to the world's growing demand for information due to its high density and long-lasting reliability. However, early coding schemes ignored the biologically constrained nature of DNA sequences in pursuit of high density, leading to DNA synthesis and sequencing difficulties. This article proposes a novel DNA storage coding scheme. The system encodes half of the binary data using each of the two GC-content complementary encoding rules to obtain a DNA sequence. RESULTS: After simulating the encoding of representative document and image file formats, a DNA sequence strictly conforming to biological constraints was obtained, reaching a coding potential of 1.66 bit/nt. In the decoding process, a mechanism to prevent error propagation was introduced. The simulation results demonstrate that by adding Reed-Solomon code, 90% of the data can still be recovered after introducing a 2% error, proving that the proposed DNA storage scheme has high robustness and reliability. AVAILABILITY AND IMPLEMENTATION: The source code for the codec scheme of this paper is available at https://github.com/Mooreniah/DNA-dual-rule-rotary-encoding-storage-system-DRRC.
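Figures such as the 1.66 bit/nt coding potential above are ratios of stored bits to synthesized nucleotides. The arithmetic can be sketched as below; this is an illustrative calculation only, with a hypothetical function name and made-up input values, and it does not reproduce how the cited paper accounts for indexes or error-correction overhead.

```python
def bits_per_nt(payload_bytes: int, n_strands: int, strand_len_nt: int) -> float:
    """Information density: payload bits divided by total nucleotides synthesized."""
    total_nt = n_strands * strand_len_nt
    if total_nt <= 0:
        raise ValueError("total nucleotide count must be positive")
    return payload_bytes * 8 / total_nt
```

For example, 100 bytes stored across 4 strands of 100 nt each gives 800 bits over 400 nt, which is the theoretical maximum of 2 bit/nt for a four-letter alphabet; constraint handling and redundancy push practical values below that.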
Affiliation(s)
- Xuncai Zhang
- College of Electrical Information Engineering, Zhengzhou University of Light Industry, Zhengzhou 450000, Henan, China
- Baonan Qi
- College of Electrical Information Engineering, Zhengzhou University of Light Industry, Zhengzhou 450000, Henan, China
- Ying Niu
- College of Building Environment Engineering, Zhengzhou University of Light Industry, Zhengzhou 450000, Henan, China
17
Yang S, Bögels BWA, Wang F, Xu C, Dou H, Mann S, Fan C, de Greef TFA. DNA as a universal chemical substrate for computing and data storage. Nat Rev Chem 2024; 8:179-194. [PMID: 38337008 DOI: 10.1038/s41570-024-00576-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 01/10/2024] [Indexed: 02/12/2024]
Abstract
DNA computing and DNA data storage are emerging fields that are unlocking new possibilities in information technology and diagnostics. These approaches use DNA molecules as a computing substrate or a storage medium, offering nanoscale compactness and operation in unconventional media (including aqueous solutions, water-in-oil microemulsions and self-assembled membranized compartments) for applications beyond traditional silicon-based computing systems. Building a functional DNA computer that can process and store molecular information necessitates the continued development of strategies for computing and data storage, as well as bridging the gap between these fields. In this Review, we explore how DNA can be leveraged in the context of DNA computing with a focus on neural networks and compartmentalized DNA circuits. We also discuss emerging approaches to the storage of data in DNA and associated topics such as the writing, reading, retrieval and post-synthesis editing of DNA-encoded data. Finally, we provide insights into how DNA computing can be integrated with DNA data storage and explore the use of DNA for near-memory computing for future information technology and health analysis applications.
Affiliation(s)
- Shuo Yang
- State Key Laboratory of Metal Matrix Composites, School of Materials Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
- Zhangjiang Institute for Advanced Study (ZIAS), Shanghai Jiao Tong University, Shanghai, China
- Bas W A Bögels
- Laboratory of Chemical Biology, Department of Biomedical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands
- Institute for Complex Molecular Systems (ICMS), Eindhoven University of Technology, Eindhoven, The Netherlands
- Computational Biology Group, Department of Biomedical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands
- Fei Wang
- School of Chemistry and Chemical Engineering, New Cornerstone Science Laboratory, Frontiers Science Center for Transformative Molecules and National Center for Translational Medicine, Shanghai Jiao Tong University, Shanghai, China
- Can Xu
- State Key Laboratory of Metal Matrix Composites, School of Materials Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
- Zhangjiang Institute for Advanced Study (ZIAS), Shanghai Jiao Tong University, Shanghai, China
- Hongjing Dou
- State Key Laboratory of Metal Matrix Composites, School of Materials Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
- Zhangjiang Institute for Advanced Study (ZIAS), Shanghai Jiao Tong University, Shanghai, China
- Stephen Mann
- State Key Laboratory of Metal Matrix Composites, School of Materials Science and Engineering, Shanghai Jiao Tong University, Shanghai, China.
- Zhangjiang Institute for Advanced Study (ZIAS), Shanghai Jiao Tong University, Shanghai, China.
- Centre for Protolife Research and Centre for Organized Matter Chemistry, School of Chemistry, University of Bristol, Bristol, UK.
- Max Planck-Bristol Centre for Minimal Biology, School of Chemistry, University of Bristol, Bristol, UK.
- Chunhai Fan
- School of Chemistry and Chemical Engineering, New Cornerstone Science Laboratory, Frontiers Science Center for Transformative Molecules and National Center for Translational Medicine, Shanghai Jiao Tong University, Shanghai, China.
- Institute of Molecular Medicine, Shanghai Key Laboratory for Nucleic Acids Chemistry and Nanomedicine, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China.
- Tom F A de Greef
- Laboratory of Chemical Biology, Department of Biomedical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands.
- Institute for Complex Molecular Systems (ICMS), Eindhoven University of Technology, Eindhoven, The Netherlands.
- Computational Biology Group, Department of Biomedical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands.
- Institute for Molecules and Materials, Radboud University, Nijmegen, The Netherlands.
- Center for Living Technologies, Eindhoven-Wageningen-Utrecht Alliance, Utrecht, The Netherlands.
18
Lin W, Chu L, Su Y, Xie R, Yao X, Zan X, Xu P, Liu W. Limit and screen sequences with high degree of secondary structures in DNA storage by deep learning method. Comput Biol Med 2023; 166:107548. [PMID: 37801922 DOI: 10.1016/j.compbiomed.2023.107548] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2023] [Revised: 08/24/2023] [Accepted: 09/28/2023] [Indexed: 10/08/2023]
Abstract
BACKGROUND: In single-stranded DNAs/RNAs, secondary structures are very common, especially in long sequences. It has been recognized that the high degree of secondary structures in DNA sequences could interfere with the correct writing and reading of information in DNA storage. However, how to circumvent this side effect is seldom studied. METHOD: As the degree of secondary structures of DNA sequences is closely related to the magnitude of the free energy released in the complicated folding process, we first investigate the free-energy distribution at different encoding lengths based on randomly generated DNA sequences. Then, we construct a bidirectional long short-term memory (BiLSTM)-attention deep learning model to predict the free energy of sequences. RESULTS: Our simulation results indicate that the free energy of DNA sequences at a specific length follows a right-skewed distribution and the mean increases as the length increases. Given a tolerable free energy threshold of 20 kcal/mol, we could control the ratio of serious secondary structures in the encoding sequences to within 1% of the significant level through selecting a feasible encoding length of 100 nt. Compared with traditional deep learning models, the proposed model could achieve a better prediction performance both in the mean relative error (MRE) and the coefficient of determination (R²). It achieved MRE = 0.109 and R² = 0.918 respectively in the simulation experiment. The combination of the BiLSTM and attention module can handle the long-term dependencies and capture the feature of base pairing. Further, the prediction has a linear time complexity which is suitable for detecting sequences with severe secondary structures in future large-scale applications. Finally, 70 of 94 predicted free energy can be screened out on a real dataset. This demonstrates that the proposed model could screen out some highly suspicious sequences which are prone to produce more errors and low sequencing copies.
Affiliation(s)
- Wanmin Lin
- Institute of Computing Science and Technology, Guangzhou University, Guangzhou, Guangdong, China
- Ling Chu
- Institute of Computing Science and Technology, Guangzhou University, Guangzhou, Guangdong, China
- Yanqing Su
- Institute of Computing Science and Technology, Guangzhou University, Guangzhou, Guangdong, China
- Ranze Xie
- Institute of Computing Science and Technology, Guangzhou University, Guangzhou, Guangdong, China
- Xiangyu Yao
- Institute of Computing Science and Technology, Guangzhou University, Guangzhou, Guangdong, China
- Xiangzhen Zan
- Institute of Computing Science and Technology, Guangzhou University, Guangzhou, Guangdong, China
- Peng Xu
- Institute of Computing Science and Technology, Guangzhou University, Guangzhou, Guangdong, China; School of Computer Science of Information Technology, Qiannan Normal University for Nationalities, Duyun, Guizhou, China; Guangdong Provincial Key Laboratory of Artificial Intelligence in Medical Image Analysis and Application, Guangzhou, Guangdong, China.
- Wenbin Liu
- Institute of Computing Science and Technology, Guangzhou University, Guangzhou, Guangdong, China; Guangdong Provincial Key Laboratory of Artificial Intelligence in Medical Image Analysis and Application, Guangzhou, Guangdong, China.
19
Rasool A, Hong J, Jiang Q, Chen H, Qu Q. BO-DNA: Biologically optimized encoding model for a highly-reliable DNA data storage. Comput Biol Med 2023; 165:107404. [PMID: 37666064 DOI: 10.1016/j.compbiomed.2023.107404] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 03/21/2023] [Revised: 08/13/2023] [Accepted: 08/26/2023] [Indexed: 09/06/2023]
Abstract
DNA data storage is a promising technology that utilizes computer simulation and synthetic biology, offering high-density and reliable digital information storage. It is challenging to store massive data in a small amount of DNA without losing the original data since nonspecific hybridization errors occur frequently and severely affect the reliability of stored data. This study proposes a novel biologically optimized encoding model for DNA data storage (BO-DNA) to overcome the reliability problem. The BO-DNA model is developed using a new rule-based mapping method to avoid data drop during the transcoding of binary data to premier nucleotides. A customized optimization algorithm based on a tent chaotic map is applied to maximize the lower bounds that help to minimize the nonspecific hybridization errors. The robustness of BO-DNA is computed by four bio-constraints to confirm the reliability of newly generated DNA sequences. Experimentally, different medical images are encoded and decoded successfully with 12%-59% improved lower bounds and optimally constrained-based DNA sequences reported with 1.77 bit/nt average density. BO-DNA's results demonstrate substantial advantages in constructing reliable DNA data storage.
Affiliation(s)
- Abdur Rasool
- Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China; Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Beijing, 100049, China.
- Jingwei Hong
- Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China; College of Mathematics and Information Science, Hebei University, Baoding, 071002, China
- Qingshan Jiang
- Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China.
- Hui Chen
- Shenzhen Polytechnic University, Shenzhen, 518055, Guangdong, China
- Qiang Qu
- Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China.
20
Gimpel AL, Stark WJ, Heckel R, Grass RN. A digital twin for DNA data storage based on comprehensive quantification of errors and biases. Nat Commun 2023; 14:6026. [PMID: 37758710 PMCID: PMC10533828 DOI: 10.1038/s41467-023-41729-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2023] [Accepted: 09/18/2023] [Indexed: 09/29/2023] Open
Abstract
Archiving data in synthetic DNA offers unprecedented storage density and longevity. Handling and storage introduce errors and biases into DNA-based storage systems, necessitating the use of Error Correction Coding (ECC) which comes at the cost of added redundancy. However, insufficient data on these errors and biases, as well as a lack of modeling tools, limit data-driven ECC development and experimental design. In this study, we present a comprehensive characterisation of the error sources and biases present in the most common DNA data storage workflows, including commercial DNA synthesis, PCR, decay by accelerated aging, and sequencing-by-synthesis. Using the data from 40 sequencing experiments, we build a digital twin of the DNA data storage process, capable of simulating state-of-the-art workflows and reproducing their experimental results. We showcase the digital twin's ability to replace experiments and rationalize the design of redundancy in two case studies, highlighting opportunities for tangible cost savings and data-driven ECC development.
Affiliation(s)
- Andreas L Gimpel
- Department of Chemistry and Applied Biosciences, ETH Zürich, Vladimir-Prelog-Weg 1-5, 8093, Zürich, Switzerland
- Wendelin J Stark
- Department of Chemistry and Applied Biosciences, ETH Zürich, Vladimir-Prelog-Weg 1-5, 8093, Zürich, Switzerland
- Reinhard Heckel
- Department of Computer Engineering, Technical University of Munich, Arcistrasse 21, 80333, Munich, Germany
- Robert N Grass
- Department of Chemistry and Applied Biosciences, ETH Zürich, Vladimir-Prelog-Weg 1-5, 8093, Zürich, Switzerland.
21
Zhao Y, Cao B, Wang P, Wang K, Wang B. DBTRG: De Bruijn Trim rotation graph encoding for reliable DNA storage. Comput Struct Biotechnol J 2023; 21:4469-4477. [PMID: 37736298 PMCID: PMC10510065 DOI: 10.1016/j.csbj.2023.09.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2023] [Revised: 09/04/2023] [Accepted: 09/05/2023] [Indexed: 09/23/2023] Open
Abstract
DNA is a high-density, long-term stable, and scalable storage medium that can meet the increased demands on storage media resulting from the exponential growth of data. The existing DNA storage encoding schemes tend to achieve high-density storage but do not fully consider the local and global stability of DNA sequences and the read and write accuracy of the stored information. To address these problems, this article presents a graph-based De Bruijn Trim Rotation Graph (DBTRG) encoding scheme. Through XOR between the proposed dynamic binary sequence and the original binary sequence, k-mers can be divided into the De Bruijn Trim graph, and the stored information can be compressed according to the overlapping relationship. The simulated experimental results show that DBTRG ensures base balance and diversity, reduces the likelihood of undesired motifs, and improves the stability of DNA storage and data recovery. Furthermore, DBTRG maintains an encoding rate of 1.92 while storing 510 KB images and introduces novel approaches and concepts for DNA storage encoding methods.
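The k-mer overlap relationship that a De Bruijn graph captures can be sketched in a few lines. This is a generic textbook construction, not the DBTRG scheme itself (its trimming, rotation, and XOR steps are not reproduced here); the function name and the choice of k are assumptions for illustration.

```python
from collections import defaultdict


def de_bruijn_edges(seq: str, k: int) -> dict:
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it in seq."""
    graph = defaultdict(list)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Each k-mer is an edge from its first k-1 bases to its last k-1 bases.
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)
```

For `de_bruijn_edges("ACGTAC", 3)` the k-mers are ACG, CGT, GTA, and TAC, so each node has one outgoing edge; consecutive k-mers share k-1 bases, which is the overlap that graph-based encoders exploit.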
Collapse
Affiliation(s)
- Yunzhu Zhao
- The Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Dalian, Liaoning 116622, China
- Ben Cao
- School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
- Penghao Wang
- The Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Dalian, Liaoning 116622, China
- Kun Wang
- The Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Dalian, Liaoning 116622, China
- Bin Wang
- The Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Dalian, Liaoning 116622, China
|
22
|
Schwarz PM, Welzel M, Heider D, Freisleben B. RepairNatrix: a Snakemake workflow for processing DNA sequencing data for DNA storage. BIOINFORMATICS ADVANCES 2023; 3:vbad117. [PMID: 38496344 PMCID: PMC10941317 DOI: 10.1093/bioadv/vbad117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2023] [Revised: 08/17/2023] [Accepted: 08/24/2023] [Indexed: 03/19/2024]
Abstract
Motivation: There has been rapid progress in the development of error-correcting and constrained codes for DNA storage systems in recent years. However, improving the steps for processing raw sequencing data for DNA storage has a lot of untapped potential for further progress. In particular, constraints can be used as prior information to improve the processing of DNA sequencing data. Furthermore, a workflow tailored to DNA storage codes enables fair comparisons between different approaches while leading to reproducible results. Results: We present RepairNatrix, a read-processing workflow for DNA storage. RepairNatrix supports preprocessing of raw sequencing data for DNA storage applications and can be used to flag and heuristically repair constraint-violating sequences to further increase the recoverability of encoded data in the presence of errors. Compared to a preprocessing strategy without repair functionality, RepairNatrix reduced the number of raw reads required for the successful, error-free decoding of the input files by a factor of 25-35 across different datasets. Availability and implementation: RepairNatrix is available on GitHub: https://github.com/umr-ds/repairnatrix.
Affiliation(s)
- Peter Michael Schwarz
- Department of Mathematics and Computer Science, University of Marburg, Marburg 35032, Germany
- Marius Welzel
- Department of Mathematics and Computer Science, University of Marburg, Marburg 35032, Germany
- Dominik Heider
- Department of Mathematics and Computer Science, University of Marburg, Marburg 35032, Germany
- Bernd Freisleben
- Department of Mathematics and Computer Science, University of Marburg, Marburg 35032, Germany
|