Microsoft and University of Washington researchers set record for DNA storage
UW Associate Professor Luis Henrique Ceze, in blue, and research scientist Lee Organick prepare DNA containing digital data for sequencing, which allows them to “read” and retrieve the original files. Photo by Tara Brown Photography/University of Washington
Posted July 7, 2016 By Mike Brunker
Researchers at Microsoft and the University of Washington have reached an early but important milestone in DNA storage by storing a record 200 megabytes of data on the molecular strands.
The impressive part is not just how much data they were able to encode onto synthetic DNA and then decode. It’s also the space they were able to store it in.
Once encoded, the data occupied a spot in a test tube “much smaller than the tip of a pencil,” said Douglas Carmean, the partner architect at Microsoft overseeing the project.
Think of the amount of data in a big data center compressed into a few sugar cubes. Or all the publicly accessible data on the Internet slipped into a shoebox. That is the promise of DNA storage – once scientists are able to scale the technology and overcome a series of technical hurdles.
Digital data from more than 600 basic smartphones can be stored in the faint pink smear of DNA at the end of this test tube. Photo by Tara Brown Photography/University of Washington.
The Microsoft-UW team stored digital versions of works of art (including a high-definition video by the band OK Go!), the Universal Declaration of Human Rights in more than 100 languages, the top 100 books of Project Gutenberg and the nonprofit Crop Trust’s seed database on DNA strands.
Demand for data storage is growing exponentially, and the capacity of existing storage media is not keeping pace. That’s making it hard for organizations that need to store a lot of data – such as hospitals with vast databases of patient data or companies with lots of video footage – to keep up. And it means information is being lost, and the problem will only worsen without a new solution.
DNA could be the answer.
It has several advantages as a storage medium. It’s compact, durable – capable of lasting for a very long time if kept in good conditions (DNA from woolly mammoths was recovered several thousand years after they went extinct, for instance) – and will always be current, the researchers believe.
“As long as there is DNA-based life on the planet, we’ll be interested in reading it,” said Karin Strauss, the principal Microsoft researcher on the project. “So it’s eternally relevant.”
This explains why the Microsoft-UW team is just one of a number of research groups around the globe pursuing the potential of DNA as a vast digital attic.
The researchers acknowledge they have a long way to go.
Luis Henrique Ceze, a UW associate professor of computer science and engineering and the university’s principal researcher on the project, said the biotechnology industry has made big advances in both “synthesizing” (encoding) and “sequencing” (decoding) data in recent years. Even so, he said, the team still has a long way to go to make DNA storage viable as an archival technology.
But the researchers are upbeat.
They note that their diverse team of computer scientists, computer architects and molecular biologists already has increased storage capacity a thousand times in the last year. And they believe they can make big advances in speed by applying computer science principles like error correction to the process.
Carmean, who was involved in development of Intel’s microprocessor architecture beginning in 1989, puts it this way:
“It’s one of those serendipitous partnerships where a strong understanding of processors and computation married with molecular biology experts has the potential of producing major breakthroughs.”
To get an idea of how the Microsoft-UW team does its work, flash back to high school biology and recall that DNA – or deoxyribonucleic acid – is a molecule that contains the biological instructions used in the growth, development, functioning and reproduction of all known living organisms.
“DNA is an amazing information storage molecule that encodes data about how a living system works. We’re repurposing that capacity to store digital data — pictures, videos, documents,” said Ceze, who is conducting research in the team’s Molecular Information Systems Lab (MISL), which is housed in a basement on the University of Washington campus. “This is one important example of the potential of borrowing from nature to build better computer systems.”
Storing digital data on DNA works like this:
First the data is translated from 1s and 0s into the “letters” of the four nucleotide bases of a DNA strand — (A)denine, (C)ytosine, (G)uanine and (T)hymine.
Karin Strauss. Photo by Scott Eklund/Red Box Pictures
Then they have vendor Twist Bioscience “translate those letters, which are still in electronic form, into the molecules themselves, and send them back,” Strauss said. “It’s essentially a test tube and you can barely see what’s in it. It looks like a little bit of salt was dried in the bottom.”
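The translation step can be sketched in a few lines of Python. The article does not say which mapping the team uses, so the two-bits-per-base table below (`BITS_TO_BASE`) and the function name `bytes_to_dna` are assumptions made purely for illustration.

```python
# Hypothetical mapping (not from the article): each pair of bits becomes one base.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}

def bytes_to_dna(data: bytes) -> str:
    """Translate bytes into a string of A/C/G/T, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

print(bytes_to_dna(b"OK"))  # CATTCAGT
```

The resulting string of letters is still electronic at this point; it is what would be handed to a synthesis vendor to be turned into molecules.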
Reading the data uses a biotech tweak to random access memory (RAM), another concept borrowed from computer science. The team uses polymerase chain reaction (PCR), a technique that molecular biologists use routinely to manipulate DNA, to multiply or “amplify” the strands it wants to recover. Once they’ve sharply increased the concentration of the desired snippets, they take a sample, sequence or decode the DNA and then run error correction computations.
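As a rough software analogy for that read path, the sketch below tags each stored strand with a short address prefix, selects only the strands carrying the requested prefix (standing in for PCR amplification) and then takes a per-position majority vote across the reads (standing in for error correction). The prefixes, the sample strands and the majority-vote rule are all assumptions; the article does not describe the team's actual primers or error-correcting code.

```python
from collections import Counter

def select_by_primer(pool, primer):
    """Stand-in for PCR: keep strands that start with the primer, then strip it."""
    return [s[len(primer):] for s in pool if s.startswith(primer)]

def majority_vote(reads):
    """Stand-in for error correction: most common base at each position."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))

# Hypothetical pool: two "files", each tagged with its own made-up primer.
pool = [
    "ACGT" + "GATTACA",
    "ACGT" + "GATTACA",
    "ACGT" + "GATCACA",  # simulated sequencing error in the fourth base
    "TGCA" + "CCGGTTA",
]

reads = select_by_primer(pool, "ACGT")
print(majority_vote(reads))  # GATTACA
```

The point of the analogy is that the reader never has to sequence the whole pool in order to recover one file; it only needs the address of the file it wants.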
The lab tour complete, one question needed asking: Why an OK Go video?
“We like that a lot because there are many parallels with the work,” Strauss said with a laugh. “They’re very innovative and are bringing different things from different areas into their field and we feel we are doing something very similar.”
Mike Brunker is a freelance writer and editor.
How DNA data storage works, as scientists create the first DNA ‘RAM’
DNA data storage is a big deal. Partly, it’s because we’re based on DNA, and any research into manipulation of that molecule will pay dividends for medicine and biology in general — but in part, it’s also because the world’s most wealthy and powerful corporations are getting discouraged at cost estimates for data storage in the future. Facebook, Apple, Google, the US government, and more are all making astounding investments in storage (“exabyte” is the buzzword now). But even these mega-projects can only put off the inevitable for so long; we are simply producing too much data for magnetic storage to keep up, without a major unforeseen shift in the technology.
That’s why a company like Microsoft recently decided to invest in the prospect of storing information with a totally different sort of tech: biotech. It might seem off-brand for the software giant, but teaming up with academics to take on molecular biology has produced stunning results: The team was able to store and perfectly recall digital data with incredible storage density. According to an accompanying blog post, they managed to pack about 200 megabytes of data into just a fraction of a drop of liquid, including a compressed music video from the band OK Go. Even more impressive, that data was stored in a quickly and easily accessible form, making it more akin to computer RAM than computer storage.
So how did they accomplish this incredible feat?
First, they had to convert the digital code of 1’s and 0’s to a genetic code of A’s, C’s, T’s, and G’s, then take this lowly text file and manually construct the molecule it represents. Each of these is a feat in and of itself. DNA storage requires cutting-edge techniques in data compression and security to design a sequence both info-dense enough to realize DNA’s potential and redundant enough to allow robust error-checking to improve the accuracy of information retrieved down the line.
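To make the density-versus-redundancy trade-off concrete, here is a minimal sketch of one simple redundancy scheme: store two payloads plus their byte-wise XOR, so that any one of the three can be rebuilt from the other two. This is an illustrative assumption, not a description of the team's encoding, and the payloads and the helper `xor_bytes` are made up for the example.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length payloads."""
    if len(a) != len(b):
        raise ValueError("payloads must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))

payload_a = b"DNA "
payload_b = b"data"
parity = xor_bytes(payload_a, payload_b)  # would be stored as a third strand

# Suppose the strand carrying payload_b is lost or unreadable:
recovered_b = xor_bytes(payload_a, parity)
print(recovered_b)  # b'data'
```

Three strands now carry two strands' worth of information, so this particular scheme gives up a third of the raw density in exchange for surviving the loss of any single strand.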
Very little of the technology on display here is new, since the most important parts of the system have existed much longer than mankind itself. But if all the data necessary to code for Albert Einstein was contained within the nucleus of every single cell of Albert Einstein’s body, as it was, then this classical approach to data storage must have something going for it. Researchers in this field set out to understand and harness that something, and they’re getting better at it seemingly every couple of months.
At the end of the day, DNA’s key attribute is data storage density: how much information can DNA fit into a given unit volume? The NSA’s largest, most notorious data center is an enormous, sprawling complex full of networked racks of magnetic storage drives, but according to some estimates, DNA could take the volume of data contained in about a hundred industrial data centers and store it in a space roughly the size of a shoebox.
DNA achieves this in two ways. One, the coding units are very small, less than half a nanometer to a side, whereas the transistors of a modern, advanced computer storage drive struggle to beat the 10-nanometer mark. But the increase in storage capacity isn’t just tenfold or a hundredfold; it is thousands-fold. That differential arises from the second big advantage of DNA: it has no problem packing three-dimensionally.
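A back-of-the-envelope calculation shows where a thousands-fold figure can come from. The sketch below treats both the DNA coding unit and the transistor as idealized cubes using the two sizes quoted above; that is a strong simplification that ignores packaging, wiring and everything else around the storage element.

```python
dna_unit_nm = 0.5      # "less than half a nanometer to a side", taken as an upper bound
transistor_nm = 10.0   # "the 10 nanometer mark"

linear_ratio = transistor_nm / dna_unit_nm

print(linear_ratio)       # 20.0   advantage along one dimension
print(linear_ratio ** 2)  # 400.0  advantage on a flat plane
print(linear_ratio ** 3)  # 8000.0 advantage when packing in three dimensions
```

On a flat plane the size difference alone buys a few hundredfold; it is the third dimension that pushes the idealized ratio into the thousands.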
Sequencing has gotten much faster and cheaper over time — and that’s good, because we need to sequence DNA data to read it!
See, transistors are generally aligned on a flat plane, meaning their ability to fully use a given space is pretty low. We can of course stack many such flat boards one atop another, but at that point a new and totally debilitating problem arises: heat. One of the most challenging parts of designing new transistor-based technologies, whether they’re processors or storage devices, is heat. The more tightly you pack silicon transistors, the more heat you’ll create, and the harder it will be to ferry that heat away from the device. This both limits the maximum density, and requires that we supplement the cost of the drives themselves with expensive cooling systems.
With its super-efficient packing structure, the DNA double helix offers a great solution. Chromatin, the DNA-protein system that makes up chromosomes, is essentially a very complex mechanism designed to allow an inherently sticky molecule like DNA to roll up really tight, yet still unroll quickly and easily later on, when certain patches of DNA are needed by the body.
Here’s a simplified look at how DNA packs so tightly into three-dimensional space.
This at-hand nature of the chromatin system, which allows any gene to be “called” from any part of the genome with roughly equal efficiency, has led the researchers to dub their storage system a DNA version of a computer’s random access memory, or RAM. Like RAM, the physical location of a piece of data within the drive isn’t important to the computer’s ability to access that information.
However, storing information in DNA differs from computer RAM in some pretty significant ways. Most notable is speed; part of what makes RAM RAM is that its easy-access system is also a quick access system, allowing it to hold data the computer might need at an instant’s notice, and make it available on those timescales. On the other hand, DNA is significantly harder and slower to read than conventional computer transistors, meaning in terms of access speed it’s actually less RAM-like than your average computer SSD or spinning magnetic hard-drive.
That’s because the incredible abilities of evolution’s data storage solution were tailored to evolution’s unique needs, and those needs don’t necessarily include performing thousands of “reads” per second. Regular, cellular DNA data storage has to untangle the complex chromatin structure of stable DNA, then unwind the DNA double helix itself, make a copy of the sequence of interest, then zip everything right back up the way it was — it takes a while.
For our purposes, we must then add the extra step of reading the DNA. In this case, that’s achieved by using an age-old technique in biotech labs called the polymerase chain reaction (PCR) to amplify, or repeatedly duplicate, the sequence we want to read. The whole sample is then sequenced, and everything but the many-many-many-times repeated sequence we amplified is discarded. What remains is our sequence of interest. These stretches of DNA are marked with little target sequences that allow the PCR proteins to bind, and the replication process to begin.
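The reason amplification makes the target easy to pick out is simple doubling. The sketch below assumes an idealized PCR in which the target doubles perfectly every cycle while nothing else is copied; the pool size and the function name `target_fraction` are hypothetical.

```python
def target_fraction(target_copies, other_copies, cycles):
    """Fraction of the pool that is the target after idealized PCR doubling."""
    amplified = target_copies * 2 ** cycles
    return amplified / (amplified + other_copies)

# Hypothetical pool: one target strand among a million unrelated strands.
for cycles in (0, 10, 20, 30):
    print(cycles, target_fraction(1, 1_000_000, cycles))
```

Under these assumptions the target goes from one strand in a million to more than half the pool after 20 cycles and more than 99.9 percent after 30, which is why everything else can be discarded after sequencing.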
In cells, genes are turned “on” and “off” largely by changing the availability of these target sequences to the always-waiting machinery of DNA replication. This can be done via the winding and unwinding of chromatin, the direct addition or removal of a blocker protein, or even interaction with other areas of the genome to promote or preclude transcription. In a man-made data storage system, we could theoretically make something better suited to our needs, stronger or more efficient or less wasteful on forms of security we don’t need for this purpose, but that would require a level of sophistication in protein engineering that still seems a ways out.
Microsoft Sets DNA Data-Storage Record: 200 Megabytes
By Megan Scudellari
Three weeks ago, we reported on a meeting of the minds in Virginia, where experts discussed the plausibility and requirements of using DNA as a hard drive. As the demand for data storage steadily grows, especially in biomedicine, the dense nucleic acid provides a promising new data depot.
At the time of the meeting, a source told us the attendees concluded that the “ambitious goal” of a prototype DNA storage machine was “possible” within five to seven years. Now, it seems they were a bit conservative with that timeline: Yesterday, researchers at Microsoft and the University of Washington announced a new record for the amount of data stored in (and read back out of) synthetic DNA strands.
In a dab of DNA smaller than the tip of a pencil, the team stored 200 megabytes of data—a thousand times more DNA storage capacity than was possible a year ago, they say. “Think of the amount of data in a big data center compressed into a few sugar cubes,” Microsoft proclaimed in a press release. “Or all the publicly accessible data on the Internet slipped into a shoebox.”
In those helical threads of biological material, the team encoded an eclectic range of information: A high-definition version of the popular Rube Goldberg Machine video by the band OK Go!, other forms of digital art, the Universal Declaration of Human Rights in more than 100 languages, the top 100 books of Project Gutenberg, and a seed database from the nonprofit Crop Trust.
At this rate, someday soon we’ll all be reaching over our thumb drives to grab a test tube.
IEEE Spectrum’s Eliza Strickland explains how DNA data storage works here.
Tech Companies Mull Storing Data in DNA
As conventional storage technologies struggle to keep up with big data, interest grows in a biological alternative
Photo: Getty Images
Test Tube Bits: Biology’s data-storage method, DNA, might work for our data, too.
It was the looming sense of crisis that brought them together. In late April, technologists from IBM, Intel, and Microsoft joined an intimate gathering of computer scientists and geneticists to discuss the big problem with big data: Our data storage requirements are rapidly exceeding the capacity of today’s best storage technologies: magnetic tape, disk drives, and flash memory.
The closed-door meeting in Arlington, Va., was convened to explore the potential of a new storage technology that is actually as old as life itself. The experts came together to weigh the merits of DNA data storage, which makes use of the marvelously compact and durable DNA molecules that encode genetic information inside living things. By converting digital files into biological material, warehouse-size storage facilities could theoretically be replaced by diminutive test tubes.
While this idea has been kicking around for many years, meeting attendee Victor Zhirnov says tech companies are now starting to consider DNA data storage as a real possibility. Zhirnov, director of cross-disciplinary research and special projects for the Semiconductor Research Corp. (which cosponsored the meeting), says he was encouraged by the presence of “luminaries” from industry and academia who took active part in the two-day workshop. “The question was, can we demonstrate a prototype DNA storage machine within five to seven years?” explains Zhirnov. “It is a very ambitious goal, but we concluded that it is possible.”
Here’s how DNA data storage works. First you take any digital data that would normally be stored in a binary code of 0s and 1s and translate it into the genetic code of As, Cs, Gs, and Ts that represent the chemical building blocks of DNA. Then you give that DNA code (for example, GATTACA) to a synthetic biology company, which manufactures strings of DNA to your specifications. Next you stash the test tube in cold storage and walk away. When you want to retrieve the information, you take out the test tube and use a standard DNA sequencing machine to decode the material inside. That gives you once again the DNA sequence GATTACA, which you can translate back into binary to read your original file.
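To see the last step in action, the GATTACA example can be translated to binary and back. The article does not specify which base stands for which bits, so the table below (A=00, C=01, G=10, T=11) is an assumed mapping chosen only for illustration.

```python
# Assumed mapping, not taken from the article.
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}

def dna_to_bits(dna: str) -> str:
    """Translate a DNA sequence into a string of 0s and 1s."""
    return "".join(BASE_TO_BITS[base] for base in dna)

def bits_to_dna(bits: str) -> str:
    """Translate a string of 0s and 1s (even length) into a DNA sequence."""
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

bits = dna_to_bits("GATTACA")
print(bits)               # 10001111000100
print(bits_to_dna(bits))  # GATTACA
```

With four letters to choose from, each base carries two bits, so the seven letters of GATTACA hold 14 bits.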
Digits to DNA and Back Again
1. Any digital file—a movie, medical records, the Encyclopedia Britannica—can be converted to a “genetic file” and stored as strands of DNA. First the digital file’s binary code is translated into the four-letter genetic code, composed of the As, Cs, Gs, and Ts that represent the chemical building blocks of DNA strands.
2. Then a synthetic-biology company manufactures the strands to the customer’s specifications.
3. A test tube containing the genetic file can be stashed away in cold storage until someone wants to retrieve the information.
4. A standard DNA sequencing machine reads out the genetic code.
5. The code is then translated back into binary.
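The five steps above can be simulated end to end in software. One detail the list glosses over is that a test tube keeps no order, so a file split across many strands needs some way to be put back together. The sketch below handles that with a one-byte index in front of each chunk; the index scheme, the chunk size and all function names are assumptions for illustration, not the method used by any group named here.

```python
import random

BASES = "ACGT"  # assumed mapping: A=0, C=1, G=2, T=3 (two bits per base)

def to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bits first."""
    return "".join(BASES[(byte >> shift) & 0b11] for byte in data for shift in (6, 4, 2, 0))

def from_dna(dna: str) -> bytes:
    """Decode groups of four bases back into bytes."""
    out = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)

def write_strands(data: bytes, chunk_size: int = 4):
    """Split data into chunks and prefix each with a 1-byte index (max 256 chunks)."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return [to_dna(bytes([n]) + chunk) for n, chunk in enumerate(chunks)]

def read_strands(strands):
    """Decode every strand, sort by the leading index byte, and drop the index."""
    decoded = sorted(from_dna(s) for s in strands)
    return b"".join(d[1:] for d in decoded)

strands = write_strands(b"Encyclopedia Britannica")
random.shuffle(strands)       # strands in a tube come back in no particular order
print(read_strands(strands))  # b'Encyclopedia Britannica'
```

The index costs four extra bases per strand here, a small example of the overhead that any practical addressing scheme has to pay.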
DNA is the densest storage medium in existence, able to store almost a zettabyte of data in a single gram of material. It’s also extremely long lasting, as demonstrated by remarkable feats of paleontological derring-do. In 2013, for example, a team reconstructed the entire genome of an early horse species using DNA from a bone that was buried in the Arctic permafrost for some 700,000 years.
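To get a feel for that density figure, consider how little DNA a 200-megabyte archive would need at the theoretical limit. The calculation below takes "almost a zettabyte" per gram as a round upper bound and assumes a single copy of every strand with no redundancy, which no real sample would have, so the true mass would be much larger.

```python
ZETTABYTE = 10 ** 21                  # bytes, decimal definition
density_bytes_per_gram = ZETTABYTE    # "almost a zettabyte" per gram, used as an upper bound

archive_bytes = 200 * 10 ** 6         # a 200-megabyte archive
grams = archive_bytes / density_bytes_per_gram
print(grams)                          # 2e-13
```

That is a fraction of a picogram, far too little to see, which is consistent with the idea that warehouse-size facilities could in theory shrink to test tubes.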
If DNA archives become a plausible method of data storage, it will be thanks to rapid advances in genetic technologies. The sequencing machines that “read out” DNA code have already become exponentially faster and cheaper; the National Institutes of Health shows costs for sequencing a 3-billion-letter genome plummeting from US $100 million in 2001 to a mere $1,000 today. However, DNA synthesis technologies required to “write” the code are much newer and less mature. Synthetic-biology companies like San Francisco’s Twist Bioscience have begun manufacturing DNA to customers’ specifications only in the last few years, primarily serving biotechnology companies that are tweaking the genomes of microbes to trick them into making some desirable product. Manufacturing DNA for data storage could be a profitable new market, says Twist CEO Emily Leproust.
Twist sent a representative to the April meeting, and the company is also working with Microsoft on a separate experiment in DNA storage, in which it synthesized 10 million strands of DNA to encode Microsoft’s test file. Leproust says Microsoft and the other tech companies are currently trying to determine “what kind of R&D has to be done to make a viable commercial product.” To make a product that’s competitive with magnetic tape for long-term storage, Leproust estimates that the cost of DNA synthesis must fall to 1/10,000 of today’s price. “That is hard,” she says mildly. But, she adds, her industry can take inspiration from semiconductor manufacturing, where costs have dropped far more dramatically. And just last month, an influential group of geneticists proposed an international effort to reduce the cost of DNA synthesis, suggesting that $100 million could launch the project nicely.
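How hard is a 10,000-fold drop? One rough way to frame it is to ask how long it would take if synthesis costs fell as fast as sequencing costs did. The sketch below uses the sequencing figures cited in this article ($100 million in 2001, $1,000 "today") and assumes "today" means 2016 and that the decline was a steady yearly percentage. Both are simplifying assumptions, and nothing guarantees synthesis will follow the same curve.

```python
import math

seq_cost_2001 = 100_000_000  # US dollars per genome, as cited in the article
seq_cost_2016 = 1_000        # "today", assumed to mean 2016
years = 2016 - 2001

seq_drop = seq_cost_2001 / seq_cost_2016   # total fold reduction
annual_factor = seq_drop ** (1 / years)    # steady yearly fold reduction

synth_drop_needed = 10_000                 # Leproust's estimate
years_needed = math.log(synth_drop_needed) / math.log(annual_factor)

print(round(seq_drop), round(annual_factor, 2), round(years_needed, 1))
# 100000 2.15 12.0
```

Under those assumptions sequencing got roughly 2.15 times cheaper every year, and matching that pace would bring synthesis to the target price in about 12 years.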
The U.S. Intelligence Advanced Research Projects Activity (IARPA) cosponsored the meeting and may fund a research program to create a prototype “DNA hard drive,” but Zhirnov says that hasn’t been confirmed. “If such a program can be established, the teams are ready,” he says. Research would likely focus first on the most obvious application for a DNA hard drive: using it for archival storage, in which the data remain unchanged until the entire file is retrieved for readout. However, an IARPA program could also fund researchers who have recently demonstrated that DNA can be used for random-access memory and can even be made rewritable.
Biotech consultant Rob Carlson attended the meeting, and says he expects that the intelligence agencies of several countries will fund work on DNA data storage to grapple with the onslaught of information now being gathered by surveillance technologies. “They’re scratching their heads,” he says, “and there’s nothing else in the offing that can meet their storage needs.” Carlson has written skeptically about the commercial market for synthetic DNA, yet he says that DNA data storage may be the application that makes the young industry viable. “We can imagine storing massive amounts of data in a very small volume, and we already know how to read and write it,” he says. “Now the question is, can we read and write it at high throughput and at low cost?”
This article appears in the July 2016 print issue as “Tech Companies Mull Archiving Data in DNA.”