About the speaker
Marc brings over 15 years of software and hardware experience in the high technology sector to ExaGrid, where he is part of the team that drives product strategy and execution and is responsible for managing product operations.
Data deduplication in virtualized environments
In this session you will learn:
- What is Deduplication?
- Why Use Deduplication in Backup and Recovery?
- Challenges of Deduplication in Virtualized Environments
- Deduplication approaches (two camps)
- Summary ‒ Deduplication’s Role in Data Protection and Disaster Recovery
Hello, and welcome to this program of Backup Academy. The topic today is Data Deduplication in Virtualized Environments. My name is Marc Crespi, and I am the Vice President of Product Management at ExaGrid Systems.
Before we begin today's content, let me just tell you a little bit about myself. I have over 20 years of software and hardware experience in the high technology sector. I am part of the team here at ExaGrid that drives product strategy and execution, and I am responsible for managing our product operations. Prior to joining ExaGrid in 2006, I was the Director of Product Management for Security Management Products at Altiris.
The objective of this program is first and foremost to explain what data deduplication is. We're going to talk about why you should use deduplication in your backup and recovery operations, and about some of the specific challenges found in virtualized environments. We'll go over the major components of a successful backup and recovery infrastructure, including data deduplication. We'll talk about the various approaches to deduplication and the pros and cons of each approach, and finally we'll summarize the role of data deduplication in data protection and disaster recovery.
First, what is data deduplication? Data deduplication, in its simplest form, is a means to reduce the amount of data we have to store down to just the unique bytes of data that are changing. When used in backup and recovery, data deduplication often has a dramatic effect because backups, by their very nature, are redundant. We typically back up the same environmental components, the same virtual machines, the same systems over and over again, and as we do this, very little of the data is actually changing. Data deduplication is the technology that recognizes this fact and stores only the changing data, dramatically reducing the amount of disk we would need to store our backups, as well as the amount of bandwidth we would require if we wanted to transmit those backups off-site for disaster recovery protection.
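To make that idea concrete, here is a minimal, purely illustrative Python sketch of block-level deduplication. The class and method names are hypothetical and not taken from any particular product: data is split into fixed-size blocks, each block is identified by a hash, and only blocks that have not been seen before consume storage.

```python
# Illustrative sketch only: fixed-size block deduplication keyed by SHA-256.
# Names (BlockStore, add_backup) are hypothetical, not any vendor's API.
import hashlib

BLOCK_SIZE = 4096  # bytes per block; real products tune or vary this

class BlockStore:
    def __init__(self):
        self.blocks = {}    # hash -> block bytes (each unique block stored once)
        self.backups = {}   # backup name -> ordered list of block hashes

    def add_backup(self, name, data: bytes):
        recipe = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:       # only new, unique blocks consume disk
                self.blocks[digest] = block
            recipe.append(digest)
        self.backups[name] = recipe

    def restore(self, name) -> bytes:
        return b"".join(self.blocks[h] for h in self.backups[name])

store = BlockStore()
monday = b"A" * 8192 + b"B" * 4096
tuesday = b"A" * 8192 + b"C" * 4096           # only the last block changed
store.add_backup("monday", monday)
store.add_backup("tuesday", tuesday)
print(len(store.blocks))                      # 3 unique blocks stored, not 6
assert store.restore("tuesday") == tuesday
```

Real products typically use variable-size (content-defined) chunking and keep the block index on disk rather than in memory, but the principle of storing one copy of each unique block is the same.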
So this means that deduplication, when used in your backup and recovery operations, can help with enhanced speed and performance. It can deliver faster backup times, as there is a lower volume of data to be backed up. It can provide dramatic savings in disk costs: reduction rates can average as high as 20 to 1, reducing the amount of disk space required to store your backups. It makes backup scalable, because we can back up ever-increasing data volumes while maintaining the same backup window, and it allows us to implement disk-based off-site disaster recovery through efficient use of bandwidth.
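As a rough, hypothetical illustration of what a 20 to 1 ratio means: retaining 30 daily full backups of a 1 terabyte environment would consume about 30 terabytes of raw disk, while at 20 to 1 the same retention fits in roughly 1.5 terabytes, and only the deduplicated changes need to cross the WAN if you replicate off-site.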
So what exactly does deduplication do? It eliminates the redundancies in your virtual server backups. Without data deduplication, each of your virtual servers gets backed up in its entirety over and over again, and we're not only backing up the data associated with those virtual machines, but the virtual machines themselves, which include the guest operating systems and the other virtual components found in those machines, much of which is highly redundant and repetitive. By using data deduplication, we deduplicate these backups down to just the changed bytes. This delivers dramatic savings in disk and bandwidth, and provides a path to integrated replication for off-site disaster recovery. As a result, we get a reduced storage footprint. In some instances, you may see reduction ratios as high as 1000 to 1, due to the high degree of redundancy found in virtual backups. Deduplication will allow you to store only the bytes that change in your VM servers, eliminate the redundancy found in most typical VMware backups, and, in some implementations, restore quickly from your most recent VMware backup.
What are some of the specific challenges we find in backup and recovery in virtualized environments? Obviously, as the number of virtual machines continues to increase, management of backups becomes a bigger and bigger challenge, and this is driven by the traditional backup method of putting an agent on every virtual machine. We also have increasing data volumes, and handling the volume of backup data efficiently is very difficult in a virtual environment. Take a simple example of someone backing up 10 operating system instances of 50 gigabytes each. This would drive 500 gigabytes of backup image data daily. All of these changes and challenges are driving a need for better tools that allow us to more reliably and easily back up and restore our virtual machines.
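To extend that hypothetical example: keeping those 500 gigabytes of daily backup images under a 30-day retention policy would consume roughly 15 terabytes of backup storage without deduplication, even though most of those images contain the same operating system and application bytes day after day.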
Now where would you deploy data deduplication within your backup and recovery infrastructure? There are two locations within the network where data deduplication can be performed. One is referred to as source-based data reduction. Here, some or all of the data redundancy is removed prior to transmitting the data over the network. The pros are that it can reduce the impact on the virtual machines if the technique is implemented correctly. It can shorten your backup window, as less data needs to be transmitted. If bandwidth between the virtual machine and the backup target is an issue, it reduces the bandwidth needed to send data to the backup target, and it also reduces storage usage. The cons are that it can be slower for large amounts of data, and if the implementation requires an agent in every single one of your guest operating systems, it can actually increase the workload on your servers.
Target-based data reduction usually means that you transmit all of the data to a target-side disk-based backup appliance, and all of the data deduplication is done on the appliance. The pros are that this will also often shorten your backup window, because now you're backing up to a high-speed target that is tuned and optimized for backup and recovery. If you are replicating from this target to an off-site location, it reduces the replication bandwidth and allows you to use disk-based backup for your off-site disaster recovery, and it delivers a dramatic reduction in storage usage, as the data deduplication reduces the amount of disk required in the appliance. The cons of this approach are that you must transfer the entire data set to the device, and therefore there is no bandwidth reduction between the client and the local backup target.
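To illustrate the trade-off just described, here is a small, hypothetical Python sketch contrasting the two approaches. The "target" is just an in-memory dictionary rather than any real appliance or API; the point is simply where the redundancy is removed relative to the network.

```python
# Hypothetical sketch contrasting source-based and target-based deduplication.
import hashlib

def blocks_of(data, size=4096):
    return [data[i:i + size] for i in range(0, len(data), size)]

def source_based_backup(data, target_blocks):
    """Deduplicate at the source: check which hashes the target already
    holds and transmit only the missing blocks."""
    sent = 0
    for block in blocks_of(data):
        h = hashlib.sha256(block).hexdigest()
        if h not in target_blocks:        # redundancy removed before the network
            target_blocks[h] = block
            sent += len(block)
    return sent                           # bytes that crossed the network

def target_based_backup(data, target_blocks):
    """Send everything; the appliance deduplicates after ingest."""
    sent = 0
    for block in blocks_of(data):
        sent += len(block)                # full data set crosses the network
        h = hashlib.sha256(block).hexdigest()
        target_blocks[h] = block          # but each unique block is stored once
    return sent

store = {}
friday = b"A" * 8192 + b"B" * 4096
saturday = b"A" * 8192 + b"C" * 4096
source_based_backup(friday, store)
print(source_based_backup(saturday, store))   # only the changed 4096 bytes sent
print(target_based_backup(saturday, store))   # all 12288 bytes sent, same storage
```

In the source-based path only the changed blocks cross the wire; in the target-based path the full backup crosses the wire but still lands on the same deduplicated store.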
So what is really the right answer? The right answer is actually to use a combination of techniques, provided that both the source-based data reduction and the target-based data reduction are the right implementations.
When you combine products that do things correctly, using both data deduplication techniques provides tremendous complementary benefits. What you want to look for is a source-based data reduction technique that leverages the built-in infrastructure of your virtual machine environment to reduce the data, changed block tracking being one example. By removing the redundant data at the source, we deliver a very short backup window. By then sending that reduced data to a disk-based backup appliance that does further deduplication, you can achieve an additional 80 percent data reduction. Between the source-based data reduction and the target-based data reduction, you can see total data reduction as high as 98 percent. This delivers a further reduction in bandwidth, a further reduction in storage usage, and a further reduction in your backup window, and it provides you with integrated replication of your virtual servers. So again, the combination of source-based data reduction and target-based data reduction, each with the right implementation, can yield all of these benefits.
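To see how those figures can compound (the individual percentages here are illustrative), note that the reductions multiply rather than add. If the source-based step removes 90 percent of the data, only 10 percent is sent to the target; if the target-side deduplication then removes a further 80 percent of that, you are left storing 10% x 20% = 2% of the original, which is the 98 percent total reduction cited above.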
What are some of the architectural considerations you should look at when evaluating appliances? There are many appliances on the market today that, though they're sold to customers as backup and recovery appliances, really look like and have primary storage architectures. They typically have a single controller, and their expansion is achieved by adding shelves of disks. This means that the controller is the only part of the equation that ingests data, the only part that processes and deduplicates the data, and the only part that can replicate the data. Often in these types of architectures, once you install them, your only route to expansion is to add disks, increasing the workload on the system as your data grows. However, the performance of the system does not increase with that increased workload. As a result, your backup window will begin to grow over time, and eventually the controller will be completely outgrown and will need to be replaced through an expensive forklift upgrade.
Another approach is to use small, fixed appliances. In this case, once you outgrow a small, fixed appliance, you add a second small, fixed appliance, and you now have two totally separate deduplication silos to manage.
A better approach is a grid or clustered architecture where, with each significant increase in workload, there is a significant increase in the performance of the system, and this is accomplished by making expansion include not only disks, but network ports, processors, memory, and disks together. In this paradigm, as your data grows, your backup window does not grow. Your deduplication times do not grow. Your replication times do not grow, because each time your data grows, the system's performance grows right along with it. This delivers linear performance throughout your data growth and a stable backup window. The system is simple to manage, as capacity is virtualized across nodes. It can be managed through a single user interface, and you avoid costly forklift upgrades because there is no component to outgrow.
As important a consideration as where to do the data deduplication, be it source or target (and as we concluded, a complementary approach using the right type of source-based data reduction with the right type of target-based data reduction provides the best outcome), is the type of architecture that a target-side appliance should have.

Many of the appliances in the industry, though they're sold and marketed to customers with backup and recovery challenges, actually have more primary storage-like architectures. What I mean by a primary storage-like architecture is one that typically has a single controller, which is a single server that holds all of the processing power for the entire system. It has the network ports, the CPU, the memory, and maybe some disks, but the controller is literally the only component in the architecture that has any kind of processing power. You then typically expand these systems by adding disk trays or disk shelves to increase the capacity of the system, but at no time do you really increase the performance of the system, because none of the disk shelves has the typical elements of performance, especially CPU and network ports; they typically just have disks. As a result, what you're really doing is increasing the workload of the system over time without increasing its performance. So as you can see on the left here, as your data grows, the ingest rate and the processing rate of the system do not grow along with your data. By definition, then, your backup window will grow as your data doubles, triples, and quadruples. If no more performance is added to the system, it will naturally take longer to land, deduplicate, and replicate your data.
One of the other types of architecture you see in the market is the small, fixed appliance. Here you have relatively low-capacity appliances that have all of the elements of processing in them – CPU, memory, network ports, and disks – and when you outgrow one, you simply add another small, fixed appliance. This creates silos of data deduplication that are extremely difficult to manage and make growth very hard to predict. Because the data is deduplicated, it's very difficult for you to divide up the data between the multiple appliances and, on an ongoing basis, to watch and monitor your data growth.
So really, the right type of architecture for backup and recovery is based on the grid or clustered architectures we now see in the storage world. What these grid or clustered architectures bring to the table is linear performance, cost-effective scalability, and the avoidance of technology obsolescence and forklift upgrades, because a grid or clustered architecture is based on the premise that as your data grows, not only should the capacity of the system grow, but the performance profile of the system should grow as well. So as your data doubles, triples, and quadruples, you bring more processing power to the equation.
In this grid-based example here on the right, as the data has grown, the throughput of the system has grown in lockstep. The result is that you have linear performance as your data grows and the ability to maintain a stable backup window: if you achieve a six-hour backup window at 20 terabytes of data, the system will accommodate a six-hour backup window at 40 terabytes of data and at 60 terabytes of data. However, it's important that grid-based architectures provide a simple management layer. Capacity should be virtualized across the nodes so you don't have separate pools to manage. Deduplication should be shared across nodes, ensuring that all of the backups are deduplicated across their entire history. All of this should be managed through a single user interface. This allows the system to be right-sized to your current needs, allows you to plan expansion very cost-effectively, avoids forklift upgrades, and avoids technology obsolescence.
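As a back-of-the-envelope sketch of why the window stays stable (the per-node throughput figure below is hypothetical, chosen only so that 20 terabytes lands in roughly six hours), the backup window is simply data volume divided by system throughput:

```python
# Illustrative model only: backup window = data volume / system throughput.
# The per-node throughput figure is hypothetical, not a measurement of any product.

def backup_window_hours(data_tb, throughput_tb_per_hour):
    return data_tb / throughput_tb_per_hour

PER_NODE_TB_PER_HOUR = 3.3   # hypothetical: roughly 6 hours for 20 TB on one node

# Scale-up (single controller): capacity grows by adding disk shelves,
# but ingest throughput stays fixed, so the window grows with the data.
for data in (20, 40, 60):
    print(f"scale-up: {data} TB -> {backup_window_hours(data, PER_NODE_TB_PER_HOUR):.1f} h")

# Scale-out (grid/cluster): each added node brings CPU, memory, network
# ports, and disk, so throughput grows with the data and the window stays flat.
for nodes, data in ((1, 20), (2, 40), (3, 60)):
    print(f"grid ({nodes} nodes): {data} TB -> "
          f"{backup_window_hours(data, PER_NODE_TB_PER_HOUR * nodes):.1f} h")
```

With a fixed controller the window stretches from about six hours to twelve and then eighteen as the data doubles and triples, while the scale-out path holds at roughly six hours because throughput grows with capacity.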
Let's look at one example of this type of grid or clustered approach. I'm simply going to use the ExaGrid product architecture as an example of what a grid-based architecture can do for you. In the ExaGrid approach, we provide a landing zone within each node, which is a high-speed cache that lands your backups in their entirety, delivering the shortest possible backup window. As I mentioned, it's important that the storage be virtualized, so we virtualize all of the long-term storage of backups. The days, weeks, months, or even years of retention that customers want to keep are stored in a single, virtualized repository which can span all of the nodes in a grid. This ensures full utilization of all capacity in the grid through automated load balancing of the data. All of this is wrapped with a simplified management layer that allows the grid-based architecture to be managed from a single user interface. So this is just a simple example of how you can implement a grid-based architecture, achieve the scalability we talked about on the previous slide, and still have a simplified management layer that avoids complexity.
So let's summarize what we've covered. We talked about what data deduplication is and why specifically you should use it in your backup and recovery operations. We talked about the specific challenges of deduplication in virtualized environments. We talked about the two approaches to data deduplication, and we concluded that a combination of source-based data reduction and target-based data reduction is a powerful way to deliver results for your environment. Finally, we talked about the overall role that deduplication can play in your data protection and disaster recovery operations.
I want to thank you for spending time with me today. Again, my name is Marc Crespi, Vice President of Product Management at ExaGrid Systems.