Not All Deduplication Ratios Are Made The Same

So, you or your customer are about to shell out for a new deduplicating storage system. What will be the main metric by which you choose your new system? Why, the deduplication ratio, of course!

What if I told you that the least important number about a deduplicating storage system was its deduplication ratio? Ok, there may be a little hyperbole in that statement, but it’s not far from the truth.

So, why isn’t the deduplication ratio important and what are these other important numbers?

The deduplication ratio isn’t so much unimportant as potentially wildly misleading. Consider these scenarios:

1. A backup system takes a full backup every single day. This system will have almost 100% correlation of data between successive backups, and will therefore achieve dedupe ratios of 100s:1.

2. A backup system takes a full backup every week, with differentials between the fulls and a four-week retention period. At any one time the system holds four full backups which are substantially similar, plus a series of differential backups which are substantially dissimilar from one another. A deduplication ratio in the region of 20-30:1 is likely with this configuration.

3. Now consider a four-week retention period on a progressive incremental-forever backup. The store holds enough data to make up a single full backup image, with each day adding only the incremental change from the previous day. Because the data is carefully selected at the client, far less of it is a candidate for deduplication, so deduplication ratios in the region of 3-5:1 can be expected, maybe a little higher.

In each scenario the same data are backed up; only the selection criteria and the amount of data moved from the client differ. All this before we even get into the different types of data, which can also affect dedupe ratios. Even so, we can already see that the method by which data are selected for backup is a major factor in the deduplication ratio that can be expected. It is therefore a good idea to take quoted deduplication ratios with a pinch of salt unless detailed test methodologies are provided.
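To make the point concrete, here is a minimal back-of-envelope sketch of the three policies above. Every number in it is an illustrative assumption, not a measurement: a hypothetical 10 TB client, a guessed 2% daily change rate, 28 days of retention, and an assumed 2x factor for redundancy the deduplicator finds within the data itself.

```python
# Back-of-envelope model of how backup policy alone changes the dedupe ratio.
# All figures are illustrative assumptions, not measurements.

PRIMARY_TB = 10.0          # size of the client's data set (assumed)
DAILY_CHANGE = 0.02        # fraction of blocks that change each day (assumed)
RETENTION_DAYS = 28        # four-week retention window
INTRINSIC_DEDUPE = 2.0     # assumed redundancy within the data itself

# Unique data the store actually has to keep is roughly the same in every
# scenario: the baseline plus every changed block over the retention window,
# reduced by whatever intrinsic redundancy the deduplicator can find.
unique_tb = PRIMARY_TB * (1 + RETENTION_DAYS * DAILY_CHANGE) / INTRINSIC_DEDUPE

# What differs is the logical data each policy sends to the store.
daily_fulls = RETENTION_DAYS * PRIMARY_TB

weekly_diffs = 0.0
for week in range(4):
    weekly_diffs += PRIMARY_TB                           # the weekly full
    for day in range(1, 7):                              # six differentials per week
        weekly_diffs += day * DAILY_CHANGE * PRIMARY_TB  # grows each day since the full

incremental_forever = PRIMARY_TB + (RETENTION_DAYS - 1) * DAILY_CHANGE * PRIMARY_TB

for name, logical_tb in [
    ("daily fulls", daily_fulls),
    ("weekly full + differentials", weekly_diffs),
    ("incremental forever", incremental_forever),
]:
    print(f"{name:30s} logical {logical_tb:6.1f} TB  "
          f"stored {unique_tb:5.1f} TB  ratio {logical_tb / unique_tb:4.1f}:1")
```

The absolute ratios this prints depend entirely on the assumed change rate and intrinsic redundancy, so they won’t match any particular vendor’s quote; the point is how far apart the three policies land for identical source data.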

Ok, so that’s dedupe ratios ruined. What about these other numbers you were on about?

There are many numbers to look at: power, cooling, the ability to scale out indefinitely, licensing costs, TCO. But when it comes down to it, there is a single metric more important than any of them:

Real-world restore speed.

How fast does the system restore? I’m not talking about going to a test centre and running a few trivial bandwidth tests for individual restores. How does it really go?

When you’re restoring a rack of failed servers, when those servers were all doing their own deduplication processing, and while the rest of the datacentre is still backing up, how fast does your system go then? What you’ve now got is a system that was happily backing up data and now faces hugely increased CPU, database and disk subsystem demand, because it has to re-hydrate the outgoing restore data as well as move it to the clients. This will inevitably have an impact on both backup and restore bandwidth, and it may cripple the system. You need to specify the performance of your system to match your worst-case scenario, not something just about good enough to get the backups done.
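That sizing exercise is simple arithmetic once you have an honest number for restore throughput under load. The sketch below uses entirely hypothetical figures (server count, data per server, recovery-time objective, and a measured-while-backing-up restore rate); swap in your own.

```python
# Rough worst-case restore sizing sketch. Every number is a placeholder
# assumption: use your own server counts, data sizes, recovery-time
# objective, and the restore rate you have actually measured under load
# (while backups are still running), not the brochure figure.

SERVERS_LOST = 40            # a full rack of failed servers (assumed)
TB_PER_SERVER = 2.0          # data to rehydrate and push back to each client (assumed)
RTO_HOURS = 12               # how long the business gives you to recover (assumed)

# Restore rate measured while the rest of the estate is still backing up,
# i.e. with rehydration competing for CPU, database and disk I/O.
MEASURED_RESTORE_TB_PER_HOUR = 4.0

data_to_restore_tb = SERVERS_LOST * TB_PER_SERVER
required_tb_per_hour = data_to_restore_tb / RTO_HOURS
estimated_hours = data_to_restore_tb / MEASURED_RESTORE_TB_PER_HOUR

print(f"Data to restore:      {data_to_restore_tb:.0f} TB")
print(f"Throughput needed:    {required_tb_per_hour:.1f} TB/h to meet a {RTO_HOURS} h RTO")
print(f"At the measured rate: {estimated_hours:.1f} h "
      f"({'meets' if estimated_hours <= RTO_HOURS else 'misses'} the RTO)")
```

With these made-up numbers a comfortable-sounding restore rate still blows a 12-hour RTO by a wide margin, which is exactly the gap between brochure performance and worst-case performance that the exercise is meant to expose.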