Asymptotically Optimal Codes for Correcting Burst Deletions and Insertions in Labeled DNA Sequences

Wenhao Liu, Zhengyi Jiang, Zhongyi Huang, Hanxu Hou

arXiv:2606.14573·cs.IT·Published 2026-06-12

Fluorescent labeling is a cornerstone of DNA visualization and a key enabler of random access in DNA-based data storage. However, the stochastic nature of biochemical processes, including synthesis, hybridization, and optical readout, induces \emph{burst} synchronization errors within the resulting labeling sequences. To address this critical challenge, we formally introduce \emph{burst $t$-deletion/insertion $\mathcal{A}$-labeling codes,} designed to correct a single burst of $t$ deletions or insertions in the label domain. Our contributions are threefold. \begin{itemize} \item \textbf{Fundamental limit.} We establish an information-theoretic lower bound of $\log_4 n + \mathcal{O}(1)$ on the redundancy of any such code for all $t \ge 1$ with $t \mid n$. To the best of our knowledge, this resolves the first information-theoretic lower bound even for the single-error case \(t=1\). \item \textbf{Explicit construction.} For $t \ge 2$, $t \mid n$, and $n \ge 7t + 3$, we propose explicit encoding and decoding algorithms, both running in $\mathcal{O}(n^2)$ time. A novel generalized Run-Length Limited (RLL) constraint is introduced to bridge the structural mismatch between the DNA encoding domain and the label error domain. \item \textbf{Asymptotic optimality.} The proposed scheme achieves redundancy $\log_4 n + (t-1)\log_4 \log_{8/3} n + \mathcal{O}(1)$, matching the dominant term of the lower bound up to a small $\mathcal{O}(\log\log n)$ overhead, rendering the construction asymptotically optimal for fixed $t$. \end{itemize}

TopicsProtein & Biomolecules

arXiv categoriescs.IT

arXiv abstract pagePDF