The 144 words of Robert Frost’s seminal poem “The Road Not Taken” fit neatly onto a single printed page. Or in a 1-kilobyte data file. Or, in Hyunjun Park’s hands, in a few drops of water at the bottom of a pink Eppendorf tube. Well, really what’s inside the water: invisible floating strands of DNA.
Scientists have long touted DNA’s potential as an ideal storage medium; it’s dense, easy to replicate, and stable over millennia. And in the past few years, researchers have encoded all kinds of things in those strings of As, Ts, Cs, and Gs: War and Peace, Deep Purple’s “Smoke on the Water,” a galloping horse GIF. But in order to replace existing silicon-chip or magnetic-tape storage technologies, DNA will have to get a lot cheaper to predictably read, write, and package.
That’s where scientists like Park come in. He and the other cofounders of Catalog, an MIT DNA-storage spinoff emerging out of stealth on Tuesday, have come a long way since encoding their first poetic kilobyte by hand a year and a half ago. Now they’re building a machine that will write a terabyte of data a day, using 500 trillion molecules of DNA. They plan to launch industrial-scale storage services for IT companies, the entertainment industry, and the federal government within the next few years—joining several much larger tech companies, like Microsoft, Intel, and Micron, which are funding their own DNA storage projects.
If successful, DNA storage could be the answer to a uniquely 21st-century problem: information overload. Five years ago humans had produced 4.4 zettabytes of data; that's set to explode to 160 zettabytes (each year!) by 2025. Current infrastructure can handle only a fraction of the coming data deluge, which is expected to consume all the world's microchip-grade silicon by 2040.
Most digital archives—from music to satellite images to research files—are currently saved on magnetic tape. Tape is cheap. But it takes up space. And it has to be replaced roughly every 10 years. “Today’s technology is already close to the physical limits of scaling,” says Victor Zhirnov, chief scientist of the Semiconductor Research Corporation. “DNA has an information-storage density several orders of magnitude higher than any other known storage technology.”
How dense exactly? Imagine encoding every movie ever made in DNA; the result would be smaller than a sugar cube. And it would last for 10,000 years.
The trouble, of course, is cost. Sequencing, or reading, DNA has gotten far less expensive in the last few years. But the economics of writing DNA remain problematic if it’s going to become a standard archiving technology. DNA-synthesis companies like Twist Bioscience charge between 7 and 9 cents per base. That means a single minute of high-quality stereo sound could be stored for just under $100,000.
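To see how per-base pricing adds up, here is a rough back-of-the-envelope sketch in Python. It assumes a naive encoding of 2 bits per base (four bases can distinguish four values) with no error correction, addressing, or redundancy, and it takes a 1-kilobyte file to be 1,000 bytes. The function name is made up for illustration, and real encoding schemes typically need more bases than this.

```python
BITS_PER_BASE = 2  # assumption: naive mapping, 4 bases -> 2 bits each


def synthesis_cost_usd(num_bytes, cents_per_base):
    """Estimate the cost of synthesizing enough bases to hold num_bytes."""
    bases = num_bytes * 8 // BITS_PER_BASE
    return bases * cents_per_base / 100


# A 1-kilobyte file (taken here as 1,000 bytes) at the quoted 7 to 9 cents per base
low = synthesis_cost_usd(1_000, 7)
high = synthesis_cost_usd(1_000, 9)
print(f"${low:.2f} to ${high:.2f}")  # $280.00 to $360.00
```

Under these assumptions, even a poem-sized kilobyte costs hundreds of dollars to write, and the cost grows linearly with file size.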
Catalog thinks it can rewrite those cost curves by decoupling the process of writing DNA from the process of encoding it. Traditional methods map the sequence of bits (0s and 1s) onto a sequence of DNA’s four bases. In 2016, when Microsoft set a record by storing 200 megabytes of data in nucleotide strands, the company used 13,448,372 unique pieces of DNA. What Catalog does instead is cheaply generate large quantities of just a few different DNA molecules, none longer than 30 base pairs. Then it uses billions of enzymatic reactions to encode information into the recombination patterns of those prefab pieces of DNA. Rather than being mapped directly onto bases, bits are arranged in multidimensional matrices, and sets of molecules represent their locations in each matrix.
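The difference between the two approaches can be sketched in Python. This is a toy illustration, not Catalog's actual scheme: the 2-bits-per-base table, the 4 x 4 x 4 matrix shape, and component names such as `X0` or `Z2` are all assumptions invented for the example. The first function writes a new base sequence for every message; the second describes each 1 bit by a set of prefabricated components, one per matrix dimension, that together give the bit's coordinates.

```python
from itertools import product

# Direct mapping (assumed table): every 2 bits become one freshly synthesized base.
BASE_FOR_BITS = {"00": "A", "01": "C", "10": "G", "11": "T"}


def encode_direct(bits):
    """Map a bit string of even length onto a base sequence."""
    return "".join(BASE_FOR_BITS[bits[i:i + 2]] for i in range(0, len(bits), 2))


# Library approach (hypothetical): a small, fixed set of premade components,
# one group per matrix dimension. 3 dimensions x 4 components = 12 molecules
# can address 4 * 4 * 4 = 64 bit positions.
SHAPE = (4, 4, 4)
LIBRARY = [[f"{axis}{i}" for i in range(size)] for axis, size in zip("XYZ", SHAPE)]
COORDS = list(product(*(range(size) for size in SHAPE)))


def encode_library(bits):
    """Return one component set per 1 bit; 0 bits are simply left out."""
    return [
        tuple(LIBRARY[dim][c] for dim, c in enumerate(COORDS[pos]))
        for pos, bit in enumerate(bits)
        if bit == "1"
    ]


def decode_library(component_sets, length):
    """Rebuild the bit string from the component sets that are present."""
    index_of = {coord: i for i, coord in enumerate(COORDS)}
    bits = ["0"] * length
    for comps in component_sets:
        coord = tuple(int(name[1:]) for name in comps)
        bits[index_of[coord]] = "1"
    return "".join(bits)


message = "0110100001101001"  # ASCII "hi"
print(encode_direct(message))        # CGGACGGC
print(encode_library(message)[:2])   # [('X0', 'Y0', 'Z1'), ('X0', 'Y0', 'Z2')]
assert decode_library(encode_library(message), len(message)) == message
```

The 12 components in `LIBRARY` never change from one message to the next; only the combinations do. That mirrors the idea of reusing a few cheaply produced molecules rather than synthesizing new DNA for each file, though how the real system assembles and reads those combinations is not something this sketch attempts to model.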
“If you think of information as a book, you can record that information by copying it down by hand,” says Park. But instead of transcribing letter for letter, Catalog is in effect creating a printing press, where each typeface is represented by a small molecule of DNA. “By rearranging these premade molecules in different ways, we can organize all the different words into the original order of the book.”
Devin Leake, who recently left his role as head of DNA synthesis at Ginkgo Bioworks to be Catalog’s chief science officer, says this approach should make the company’s costs competitive with tape storage within a few years, once it scales up automation. Zhirnov says that might be feasible with Catalog’s “library approach,” because it won’t have to synthesize new DNA for every new piece of stored information; instead the company can just remix its pre-fabricated DNA molecules.
If it achieves those economies of scale, Catalog could move beyond what most people have identified as early applications of the technology, namely storing data that needs to be archived for legal or regulatory reasons—like rarely accessed surveillance video, medical records, or historical government documents. According to Leake and Park, the company will start commercial pilots early next year, focusing on intelligence and space agencies within the federal government, as well as the IT sector and Hollywood.
Molecular data storage has become something of a pet project for the Defense Advanced Research Projects Agency. Last year it dropped $15.3 million in grants to discover new biochemical ways to store binary data. And big tech companies have begun piloting their own projects as well. Microsoft plans to have an operational prototype storage system based on DNA working inside one of its data centers by 2020.
According to Doug Carmean, a partner architect at Microsoft Research, the system will initially be offered to “boutique” customers with data needs in the gigabyte to petabyte range. The long-term goal, though, is much more ambitious. “We’re going after totally replacing tape drives as an archival storage,” says Carmean. With the technology riding the enormous waves of interest in consumer genetics and synthetic biology, he thinks that could actually happen sooner rather than later. “As people get better access to their own DNA, why not also give them the ability to read any kind of data written in DNA?” Data storage just might be a modern-day problem looking for a 3.8-billion-year-old solution.