FCT-RP1: Practical Data Storage and Computation in DNA Molecules

Principal Investigator: Associate Professor Wong Weng-Fai, SoC

Our digital universe is expanding exponentially, driving a constant need for denser and more efficient storage and processing of data. At the same time, the sharp slowdown in technology scaling observed across a number of widely used compute and storage technologies (CMOS transistors, DRAM, hard disks, solid state drives) reduces our ability to even preserve the data we generate, let alone perform computation on it. The widening gap between the demand for and the supply of data storage and computation can be bridged by a new and radical technology that uses synthetic DNA as a chemical medium for both data storage and computation and offers a number of important and unique advantages:

• Unparalleled Density. To illustrate, all the data stored in Facebook’s datacenter in Oregon, which is entirely dedicated to storage of high-density archived data, could fit into the size of a sugar cube when stored in DNA, whereas our entire digital universe could fit into several bottles of DNA.
• Unmatched Durability. Depending on the method of preservation, data stored in the DNA format could last for hundreds of thousands of years. This is in stark contrast to conventional storage technologies that retain data for a few years or decades, requiring perpetual acquisition of new hardware and data transfer.
• Eternally Relevant and Advancing Interfaces. While the read/write interfaces of all storage devices eventually become obsolete, humans will always have an existential interest in reading and writing DNA. Furthermore, during the past three decades, the performance and efficiency of DNA read/write interfaces have been improving at a much faster rate than Moore’s law, and the recent pandemic outbreak will only accelerate such trends.
• Efficient Random Access. One of the most fundamental reactions in biochemistry (polymerase chain reaction, or PCR) allows us to selectively extract and read only an object of interest among petabytes of data. The key implication is that both the cost and the latency of read operations are nearly constant, regardless of the amount of data stored in the system. PCR is also a major mechanism used in medical diagnostics (e.g., PCR is the underlying mechanism of COVID-19 tests); consequently, improvements in PCR efficiency are an important common goal for both synthetic DNA technology and medicine.
• Efficient Data Manipulation. A number of important data-intensive operations, such as content-based and similarity search (part of many machine learning pipelines), intelligent queries, and copying of vast amounts of data, can be conveniently performed in the molecular domain at constant latency.
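
To make the random-access idea above concrete, the following is a minimal in-silico sketch, not the project's actual design. It assumes a naive encoding of 2 bits per nucleotide and a hypothetical scheme in which each stored object is prefixed with a unique primer sequence; the function names and primer strings are illustrative only. Note that this software model scans the pool linearly, whereas real PCR amplifies all matching strands in parallel, which is where the near-constant read latency comes from.

```python
# Toy model of primer-addressed DNA storage (illustrative assumptions only).
BASES = "ACGT"  # assumed mapping: 00->A, 01->C, 10->G, 11->T


def encode(data: bytes) -> str:
    """Map each byte to four nucleotides, most significant bit pair first."""
    return "".join(
        BASES[(byte >> shift) & 0b11] for byte in data for shift in (6, 4, 2, 0)
    )


def decode(strand: str) -> bytes:
    """Inverse of encode: every four nucleotides become one byte."""
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


def write(pool: list, primer: str, data: bytes) -> None:
    """Add a strand to the pool, tagged with a (hypothetical) primer address."""
    pool.append(primer + encode(data))


def read(pool: list, primer: str) -> list:
    """Select only strands carrying the primer, mimicking PCR selection."""
    return [decode(s[len(primer):]) for s in pool if s.startswith(primer)]


pool = []
write(pool, "ACGTTGCA", b"cat")  # primers are made-up example addresses
write(pool, "TTGACCGA", b"dog")
print(read(pool, "ACGTTGCA"))
```

A practical system would additionally need error-correcting codes, constraints on the sequences (for example, avoiding long runs of the same base), and primers designed not to collide with payload data; none of these are modeled here.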

While a strong case for using DNA as a medium for data storage and computation can be easily made, the prohibitive cost of reading and writing DNA, high overheads of accessing smaller and variably-sized pieces of data, high susceptibility to unusual types of errors, and the lack of practical support for updates and other computational operators represent some of the major challenges for the adoption of the technology. The goal of this project is to define a practical system architecture that allows for cost-efficient, reliable, durable, and rewritable storage of unprecedented amounts of digital data in DNA molecules and efficiently supports extremely parallel computational operators that would not be feasible with conventional compute technologies or even quantum computers.