Tag Archives: ddos

Can you Compress AND Dedupe? It Depends

Posted on by

My recent post about Compression vs Dedupe, which was sparked by Vaughn’s blog post about NetApp’s new compression feature, got me thinking more about the use of de-duplication and compression at the same time.  Can they work together?  What is the resulting effect on storage space savings?  What if we throw encryption of data into the mix as well?

What is Data De-Duplication?

De-duplication in the data storage context is a technology that finds duplicate patterns of data in chunks of blocks (sized from 4-128KB or so depending on implementation), stores each unique pattern only once, and uses reference pointers in order to reconstruct the original data when needed.  The net effect is a reduction in the amount of physical disk space consumed.

What is Data Compression?

Compression finds very small patterns in data (down to just a couple bytes or even bits at a time in some cases) and replaces those patterns with representative patterns that consume fewer bytes than the original pattern.  An extremely simple example would be replacing 1000 x “0”s with “0-1000”, reducing 1000 bytes to only 6.

Compression works on a more micro level, where de-duplication takes a slighty more macro view of the data.

What is Data Encryption?

In a very basic sense, encryption is a more advanced version of compression.  Rather than compare the original data to itself, encryption uses an input (a key) to compute new patterns from the original patterns, making the data impossible to understand if it is read without the matching key.

Encryption and Compression break De-Duplication

One of the interesting things about most compression and encryption algorithms is that if you run the same source data through an algorithm multiple times, the resulting encrypted/compressed data will be different each time.  This means that even if the source data has repeating patterns, the compressed and/or encrypted version of that data most likely does not.  So if you are using a technology that looks for repeating patterns of bytes in fairly large chunks 4-128KB, such as data de-duplication, compression and encryption both reduce the space savings significantly if not completely.

I see this problem a lot in backup environments with DataDomain customers.  When a customer encrypts or compresses the backup data before it gets through the backup application and into the DataDomain appliance, the space savings drops and many times the customer becomes frustrated by what they perceive as a failing technology.  A really common example is using Oracle RMAN or using SQL LightSpeed to compress database dumps prior to backing up with a traditional backup product (such as NetWorker or NetBackup).

Sure LightSpeed will compress the dump 95%, but every subsequent dump of the same database is unique data to a de-duplication engine and you will get little if any benefit from de-duplication.   If you leave the dump uncompressed, the de-duplication engine will find common patterns across multiple dumps and will usually achieve higher overall savings.  This gets even more important when you are trying to replicate backups over the WAN, since de-duplication also reduces replication traffic.

It all depends on the order

The truth is you CAN use de-duplication with compression, and even encryption.  They key is the order in which the data is processed by each algorithm.  Essentially, de-duplication must come first.  After data is processed by de-duplication, there is enough data in the resulting 4-128KB blocks to be compressed, and the resulting compressed data can be encrypted.  Similar to de-duplication, compression will have lackluster results with encrypted data, so encrypt last.

Original Data -> De-Dupe -> Compress -> Encrypt -> Store

There are good examples of this already;

EMC DataDomain – After incoming data has been de-duplicated, the DataDomain appliance compresses the blocks using a standard algorithm.  If you look at statistics on an average DDR appliance you’ll see 1.5-2X compression on top of the de-duplication savings.  DataDomain also offers an encryption option that encrypts the filesystem and does not affect the de-duplication or compression ratios achieved.

EMC Celerra NAS – Celerra De-Duplication combines single instance store with file level compression.  First, the Celerra hashes the files to find any duplicates, then removes the duplicates, replacing them with a pointer.  Then the remaining files are compressed.  If Celerra compressed the files first, the hash process would not be able to find duplicate files.

So what’s up with NetApp’s numbers?

Back to my earlier post on Dedupe vs. Compression; what is the deal with NetApp’s dedupe+compression numbers being mostly the same as with compression alone?  Well, I don’t know all of the details about the implementation of compression in ONTAP 8.0.1, but based on what I’ve been able to find, compression could be happening before de-duplication.  This would easily explain the storage savings graph that Vaughn provided in his blog.  Also, NetApp claims that ONTAP compression is inline, and we already know that ONTAP de-duplication is a post-process technology.  This suggests that compression is occurring during the initial writes, while de-duplication is coming along after the fact looking for duplicate 4KB blocks.  Maybe the de-duplication engine in ONTAP uncompresses the 4KB block before checking for duplicates but that would seem to increase CPU overhead on the filer unnecessarily.

Encryption before or after de-duplication/compression – What about compliance?

I make a recommendation here to encrypt data last, ie: after all data-reduction technologies have been applied.  However, the caveat is that for some customers, with some data, this is simply not possible.  If you must encrypt data end-to-end for compliance or business/national security reasons, then by all means, do it.  The unfortunate byproduct of that requirement is that you may get very little space savings on that data from de-duplication both in primary storage and in a backup environment.  This also affects WAN bandwidth when replicating since encrypted data is difficult to compress and accelerate as well.