Data reduction explained


What is Data Reduction?

Before we get into the benefits for your end users or the application workloads best suited to data reduction, let’s talk technology. Data reduction is the process of removing redundant blocks of data to reduce the amount of storage required; some vendors’ solutions also compress the data after the redundant blocks are removed. Inline data reduction processes incoming data on ingest and then writes it to primary storage (usually all-flash arrays). Post-process data reduction first writes the data to a workspace, then processes it to remove redundant blocks (and, in some solutions, to compress it), and finally writes the result to primary storage. Think of data reduction as a five (sometimes six)-step process:

1) Ingest the data and write to workspace

2) Break the data into chunks for processing

3) Create a hash code from the chunks for faster processing

4) Compare the hash codes: Unique data gets written; duplicate data only gets a pointer to the original data.

5) For post-process: add a step to write the data, and then rewrite it to a workspace.

6) For both inline and post-process, some vendors then compress the data that has been reduced.
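The chunk, hash, compare and compress steps above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor’s implementation: the fixed 4 KiB chunk size, the use of SHA-256 as the hash code, zlib as the compressor, and the names `reduce_data` and `restore` are all assumptions made for the example. It does not model the workspace, so it says nothing about the difference between inline and post-process designs.

```python
import hashlib
import zlib
from typing import Dict, List

CHUNK_SIZE = 4096  # assumed fixed chunk size in bytes (hypothetical)


def reduce_data(data: bytes, store: Dict[bytes, bytes],
                compress: bool = True) -> List[bytes]:
    """Deduplicate `data` into `store` and return one pointer per chunk.

    `store` maps a chunk's hash to its (optionally compressed) contents.
    """
    pointers = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]      # step 2: chunk
        digest = hashlib.sha256(chunk).digest()       # step 3: hash code
        if digest not in store:                       # step 4: compare
            # Unique data gets written (step 6: optionally compressed).
            store[digest] = zlib.compress(chunk) if compress else chunk
        # Unique and duplicate chunks alike are recorded as a pointer.
        pointers.append(digest)
    return pointers


def restore(pointers: List[bytes], store: Dict[bytes, bytes],
            compressed: bool = True) -> bytes:
    """Rebuild the original data by following the pointers."""
    chunks = (store[p] for p in pointers)
    if compressed:
        chunks = (zlib.decompress(c) for c in chunks)
    return b"".join(chunks)


if __name__ == "__main__":
    # Three identical chunks plus one different chunk.
    data = b"A" * CHUNK_SIZE * 3 + b"B" * CHUNK_SIZE
    store: Dict[bytes, bytes] = {}
    pointers = reduce_data(data, store)
    print(len(pointers), "chunks ingested,", len(store), "stored")
    assert restore(pointers, store) == data
```

In this example four chunks are ingested but only two are stored, because the three identical chunks share one entry and the other two copies are kept only as pointers.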

Both inline and post-process data reduction may take either of two approaches: data reduction for all application workloads, or data reduction only for appropriate application workloads.

The benefit of data reduction varies

Application workloads that can benefit from data reduction, such as virtual desktop infrastructure (VDI) or virtual server infrastructure (VSI), make up approximately 14% of most data centre application workloads. When your customers run such a workload on storage with data reduction, they can take advantage of the reduced writes and save money by using less storage. In some cases, your customers may be able to reduce the amount of space required by 90% (a 10:1 data reduction).
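The link between a reduction ratio and the space saved is simple arithmetic: a ratio of N:1 means the data occupies 1/N of its original size, so the saving is 1 - 1/N. The short sketch below shows the conversion in both directions; the function names are made up for this illustration.

```python
def space_saved(ratio: float) -> float:
    """Fraction of space saved for a reduction ratio of `ratio`:1."""
    if ratio < 1:
        raise ValueError("a reduction ratio cannot be below 1:1")
    return 1 - 1 / ratio


def ratio_for_saving(saved: float) -> float:
    """Reduction ratio needed to save the given fraction of space."""
    if not 0 <= saved < 1:
        raise ValueError("saving must be at least 0 and below 1")
    return 1 / (1 - saved)


if __name__ == "__main__":
    for ratio in (2, 5, 10):
        print(f"{ratio}:1 saves {space_saved(ratio):.0%}")
```

A 10:1 ratio corresponds to the 90% saving quoted above, while 2:1 saves 50% and 5:1 saves 80%, so each step up in ratio buys a smaller extra saving.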

While VDI and VSI benefit from data reduction technology, many other application workloads do not: OLTP, big data applications and data warehouses, to name just a few.

Workloads that don’t benefit from data reduction

Databases are just one example of an application workload that should not be used with data reduction technology. Relational databases use tables to improve performance and manage operations. A relational database such as Oracle has no duplicate data blocks, because each block in a tablespace (the logical container in which tables and indexes are stored) contains a unique key at the start and a checksum containing part of that key at the end. As a result, most shops are going to see little space saving, while paying the price of increased latency as the hardware pointlessly attempts to find matching blocks.
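To see why a unique key in every block defeats block-level deduplication, consider the toy model below. The block layout is invented for the illustration (an 8-byte block ID at the start and a trailer that repeats part of it at the end) and is not Oracle’s actual on-disk format; the 8 KiB block size and the function names are likewise assumptions.

```python
import hashlib
from typing import Iterable

BLOCK_SIZE = 8192  # assumed database block size in bytes (hypothetical)


def make_block(block_id: int, payload: bytes) -> bytes:
    """Build a toy block: unique header, padded payload, matching trailer."""
    header = block_id.to_bytes(8, "big")   # unique key at the start
    trailer = header[-4:]                  # part of the key at the end
    body_size = BLOCK_SIZE - len(header) - len(trailer)
    body = payload[:body_size].ljust(body_size, b"\0")
    return header + body + trailer


def unique_block_count(blocks: Iterable[bytes]) -> int:
    """Number of distinct blocks a hash-based deduplicator would store."""
    return len({hashlib.sha256(b).digest() for b in blocks})


if __name__ == "__main__":
    payload = b"identical row data " * 100
    db_blocks = [make_block(i, payload) for i in range(100)]
    raw_blocks = [payload.ljust(BLOCK_SIZE, b"\0") for _ in range(100)]
    print("database-style blocks stored:", unique_block_count(db_blocks))
    print("raw identical blocks stored:", unique_block_count(raw_blocks))
```

All 100 database-style blocks hash differently even though their payloads are identical, so nothing is deduplicated; the 100 raw blocks collapse to a single stored block.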

Another workload that does not make sense for data reduction is encrypted data.

Encrypted data is, by design, a unique data stream with no repeating blocks to find, so data reduction only adds latency. Credit card numbers are a common example of data that gets encrypted, and they have a correspondingly low affinity for data reduction. If there is a need to deduplicate encrypted data, the storage system must have access to the unencrypted data so that it can identify duplicates. This implies that encryption can’t be performed within the application if your customer wants to deduplicate that data. Any storage processing of encrypted data needs to be architected very carefully to preserve the security of the data.

Can your customers turn data reduction features on and off per workload?

When your customer has an application workload that will see little or no benefit from data reduction (for instance, databases or encrypted data), well-designed data reduction solutions let them turn data reduction off. This approach gives your customers the ability to decide on a LUN-by-LUN or share-by-share basis whether the application workload will benefit from the data reduction technology.

Many of the solutions on the market today leave the data reduction technology always on, which can degrade the performance of some application workloads. In short, the best solutions for your customers are those that give them granular control over when to apply data reduction technology and when not to, based on the benefit to each application workload.

There are some solutions on the market that will provide your customers with a data reduction dashboard that displays the effective data reduction rate. If data reduction is right for a specific application workload, your customer will know to go looking for similar workloads to deduplicate. If the application workload is clearly not a deduplication candidate, wouldn’t it be great to give your customer the flexibility to throw it out and make room for something that will actually benefit from the technology?
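As a rough sketch of how a dashboard figure could feed that decision, the example below computes an effective reduction ratio per LUN from the bytes written by the application and the bytes actually stored, then flags LUNs that fall below a cut-off. The `VolumeStats` class, the function name and the 1.2:1 threshold are hypothetical and not taken from any product.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class VolumeStats:
    name: str
    logical_bytes: int   # bytes written by the application
    physical_bytes: int  # bytes actually stored after data reduction

    @property
    def reduction_ratio(self) -> float:
        """Effective data reduction ratio (logical : physical)."""
        if self.physical_bytes == 0:
            # Nothing stored: treat an empty volume as 1:1.
            return float("inf") if self.logical_bytes else 1.0
        return self.logical_bytes / self.physical_bytes


def candidates_to_disable(volumes: Iterable[VolumeStats],
                          min_ratio: float = 1.2) -> List[str]:
    """Names of volumes whose ratio is below the assumed cut-off."""
    return [v.name for v in volumes if v.reduction_ratio < min_ratio]


if __name__ == "__main__":
    volumes = [
        VolumeStats("vdi-pool", logical_bytes=10_000, physical_bytes=1_000),
        VolumeStats("oracle-prod", logical_bytes=10_000, physical_bytes=9_500),
    ]
    for v in volumes:
        print(f"{v.name}: {v.reduction_ratio:.2f}:1")
    print("consider turning data reduction off for:",
          candidates_to_disable(volumes))
```

With these made-up figures the VDI volume shows a 10:1 ratio and stays on, while the database volume, at roughly 1.05:1, is flagged as a candidate for switching data reduction off.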

Christian Putz is Violin Memory’s VP, EMEA channel sales