About the speaker
Core technologies used for virtual machine backup
In this session you will learn:
- Virtual Environment Backup Methods
- Virtual Machine Snapshots
- Disk-to-Disk Backups
- Volume Shadow Copy Service (VSS)
- vStorage APIs
- Data Deduplication
- Data Compression
Hi, my name is Eric Siebert. I’ll be your instructor for this part of Backup Academy. This lesson will be covering core technologies that are used for virtual machine backup. Before we begin, a little bit about me. I’m a 25-year IT industry veteran. I’ve been focusing on virtualization for the last five years. I’m the author of two books on virtualization: "Maximum vSphere" and "VMware VI3 Implementation and Administration." I’m the proprietor of vSphere-land.com, a VMware information website. I’m a regular contributor on many TechTarget websites, including SearchServerVirtualization.com and SearchVMware.com. I presented at VMworld twice, in 2008 and 2010, and I’ve been recognized as a vExpert by VMware in 2009 and 2010.
Here’s our agenda today. We’re going to cover first the different methods that you’re going to use when you backup virtual environments. Next we’re going to cover virtual machine snapshots. We’re going to cover disk-to-disk backup, which is a common method of backing up data in a virtual environment. We’re going to cover the Microsoft Volume Shadow Copy Service, otherwise known as VSS, a key component of getting proper backups. We’re going to cover the vStorage APIs, which are some APIs that VMware introduced that are helpful for storage and data protection related functions. Then we’re going to cover data deduplication and data compression.
First we’re going to give you just a general overview of virtual environment backup methods. Virtualization technology is pretty unique stuff. When you insert that virtualization layer between the hardware of a server and the operating system, it gives you a lot more options and greater flexibility for doing backup and recovery on your servers. The architecture of a virtual environment is pretty dramatically different. The hypervisor controls all the hardware and the resources on the host. You can use traditional methods to backup your virtual machine, like installing an agent inside of the operating system. But if you do it that way, it can create bottlenecks. It’ll cause performance problems. Your backup just won’t be efficient. It will take a lot longer to complete.
So, when you’re in a virtual environment, you need to think outside of the box and use methods that are developed specifically for backing up virtual machines. If you continue to use those traditional methods, then you’re not going to have the efficiencies and take advantage of the virtualization architecture. So you want to stop using those and go to the methods that are specifically developed for backing up virtual machines that not only increase the efficiencies of the virtual machines, but also have minimal impact on the virtual machines. We’ll talk about that more in a little bit.
When I talked about how you need to do things differently in a virtual environment, the way that backups are done in virtual environments is at the image level. The image level means that you’re going in and you’re backing up the whole image of the virtual machine’s disk file. You’re not going through an agent anymore, through the operating system, to backup data. Doing it this way is just a lot more efficient. Instead of going through the guest OS, where you have another layer to go through, you’re going through the virtualization layer to get to the guest OS. Why do that? Why go through that extra overhead of going through the guest OS, when you can just go to the virtualization layer instead, because that’s where the disk is anyway?
So, like I said, with image level backups, you’re backing up the whole image of the virtual machine’s disk file. That VMDK file, that’s the virtual disk of a VM, that resides on a host data store, when a backup server goes to backup a VM, it goes to that file and backs it up raw and doesn’t go inside the file at all. This is done at the disk block level, instead of the traditional methods, which are done at the file level, where you’re actually going through the operating system and backing up individual files. Image level backups will go at the disk block level. They can’t see inside the operating system, so they can’t backup individual files. So they’re going at the VMDK, at the virtual disk, and backing up each individual disk block.
When you’re doing it this way, you can create some inefficiencies, because you don’t know what’s inside that virtual disk. There may be tons of deleted data, empty data, disk blocks that aren’t used anymore. If you’re going outside of it, and aren’t aware of what’s inside the operating system, then how do you know what to back up? The way backup applications doing image-level backups get around this, and become more efficient without having to go inside the OS, is they look at all the disk blocks that come through. If they see empty disk blocks, they’ll just ignore them. Say you have a 20 GB virtual disk, but only 2 GB of space is in use. The backup application is looking at each disk block when it’s doing a full backup. So it’s going to look through each disk block, see all these empty ones, and just discard them: I’m not going to back those up, because they’re empty. I’m only going to back up disk blocks that have data on them.
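That empty-block filtering can be sketched in a few lines. This is a minimal illustration of the idea, not how any particular backup product actually works; the block size, file layout, and the all-zero test for "empty" are assumptions made for the example.

```python
BLOCK_SIZE = 1024 * 1024  # assume 1 MB blocks purely for illustration

def full_backup(vmdk_path, repo_path):
    """Copy only non-empty blocks of a virtual disk file to a repository.

    Hypothetical sketch: reads the disk image block by block, discards
    all-zero (empty) blocks, and records each kept block's index so it
    could be restored to the right place later.
    """
    saved = 0
    with open(vmdk_path, "rb") as disk, open(repo_path, "wb") as repo:
        index = 0
        while True:
            block = disk.read(BLOCK_SIZE)
            if not block:
                break
            # An all-zero block carries no data, so skip it rather than
            # writing it into the backup repository.
            if block.count(0) != len(block):
                repo.write(index.to_bytes(8, "big"))  # which block this is
                repo.write(block)
                saved += 1
            index += 1
    return saved
```

The key point the sketch shows: the backup application never asks the guest OS which files exist; it just inspects raw disk blocks and decides, block by block, what is worth keeping.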
You also need to track the blocks that have changed since the last backup for incremental backups. Once you do a full backup, you need to know what’s changed. Again, since you’re not at the traditional file level, you don’t have archive bits that you can set on files so you know when they’re modified. The backup application has to go through and check each disk block, when it does an incremental backup, to see: have you changed since the last backup that I’ve taken of you? There are also methods, which we’ll talk about in a little bit, that allow the backup application in a virtual environment to track the blocks that have changed since the last backup, so it can do efficient incremental backups.
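The change-tracking idea can be sketched as a simple bitmap kept at the virtualization layer. This is a hedged illustration of the concept, not VMware's actual Changed Block Tracking implementation; the class and method names are made up for the example.

```python
class ChangedBlockTracker:
    """Toy model of changed-block tracking for incremental backups."""

    def __init__(self, total_blocks):
        self.total_blocks = total_blocks
        self.changed = set()  # block indices written since the last backup

    def record_write(self, block_index):
        # Called for every guest write that passes through the
        # virtualization layer while tracking is enabled.
        self.changed.add(block_index)

    def blocks_for_incremental(self):
        # The backup application asks only for blocks changed since the
        # last run, then the tracking set resets for the next cycle.
        to_copy = sorted(self.changed)
        self.changed.clear()
        return to_copy
```

Because the hypervisor already sees every write, it can answer "what changed?" instantly, instead of the backup application re-reading every block and comparing.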
Now we’re going to cover virtual machine snapshots, which are the core technology that almost all backup applications use. They’re not to be confused with other types of snapshots, like storage array snapshots and operating system snapshots that are taken inside the OS. A virtual machine snapshot is taken at the virtualization layer. It’s basically a point-in-time picture of a VM that preserves the virtual disk itself and, optionally, the system memory of a VM at a certain point in time. You can take multiple snapshots as well, if you want to have multiple pictures of that VM, for having multiple recovery points.
Backup applications take advantage of these virtual machine snapshots, because when you take a snapshot, the VM disk files become read-only. No more writes can occur to that virtual disk. Now, of course, the operating system isn’t aware that this is happening, because it’s done at the virtualization layer. So the operating system is going to continue to try to write to that virtual disk file. What happens is the hypervisor basically creates a new delta disk. When the operating system tries to write to that virtual disk file, the write gets deflected: it’s not allowed to go to the original VMDK file, and it happens in that delta file instead.
Each virtual disk of a virtual machine gets its own delta file; if a virtual machine has two disk files, each disk will have its own delta file. All the changes that occur while the snapshot is active reside in that delta file. Once you delete a snapshot because you no longer need it, let’s say the backup is finished and you don’t need that snapshot anymore, what happens at that point is those delta files are merged back into the original VMDK file and then deleted. One by one, those delta files will all be merged back. Then the delta file will be deleted, and you’ll only have the original VM disk file, and it won’t be read-only anymore. It will be unlocked and become read-write.
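The snapshot mechanics just described can be modeled with a small toy class: the base disk is frozen, writes land in a delta, reads check the delta first, and deleting the snapshot merges the delta back. Dicts stand in for the VMDK and delta files; this is an illustration only, not the actual VMFS format.

```python
class SnapshotDisk:
    """Toy model of a VM disk with an active snapshot."""

    def __init__(self, base_blocks):
        self.base = dict(base_blocks)  # read-only while snapshot is active
        self.delta = {}                # guest writes are deflected here

    def write(self, index, data):
        # The hypervisor deflects the write away from the base VMDK
        # into the delta file.
        self.delta[index] = data

    def read(self, index):
        # Newest data wins: check the delta first, fall back to the base.
        return self.delta.get(index, self.base.get(index))

    def delete_snapshot(self):
        # Committing the snapshot merges the delta blocks back into the
        # base disk; the delta file is then discarded and the base
        # becomes read-write again.
        self.base.update(self.delta)
        self.delta = {}
```

Notice that the base blocks are untouched while the snapshot exists, which is exactly what lets a backup application read a stable, point-in-time image.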
We talked about deleting snapshots and how that delta file is rolled back into the original disk file. If you have multiple snapshots active, the process for deleting those multiple snapshots has actually changed across the different vSphere versions. With all versions, what happens is a special helper file, a helper snapshot file, another delta file, is created. It’s a little container to hold any disk writes that occur while the snapshots are being written to the original disk. As each snapshot is being committed back to the original disk, that helper file will contain all the writes that are occurring while the whole process is running. Then, eventually, once all the delta files are all written to the disk, that helper file will also be committed to the original disk as well. That helper file usually isn’t active for too long. It’s only active while the snapshot’s being committed. So it usually doesn’t get too large in size. It usually stays pretty small.
For old versions of vSphere, the way the process worked was newer snapshots were copied to each of the older snapshots in order, and then finally the helper file was copied to the original disk file. While this worked, it just wasn’t as efficient, because what happened was, because each snapshot data was copied to the next snapshot in succession, before it gets back to the original, each snapshot would grow because it’s taking the data of the previous snapshot and putting it inside of it. What happens then is each snapshot grows in size. So you need a lot more disk space on your data stores to be able to delete those snapshots, because all of those snapshots are going to grow while they’re getting rolled in one to the other. So, while that worked, it wasn’t as efficient. We’ll talk about how that was improved in later versions of vSphere.
Starting in vSphere 4.1, and also in later 4.0 versions, the way that process worked was changed, because VMware recognized that doing it the way they were doing it just wasn’t as efficient. It took up more disk space, which in some cases, if you’re running low on disk space, it becomes a problem because you don’t have that extra disk space that you need to commit snapshots. What they did was change the method that snapshots were merged back directly into the original disk file.
As you can see here on this slide, this disk file has three snapshots running. What happens now when you merge those is the first snapshot is merged back into the original disk. Once that completes, the second snapshot is merged back into the original disk. Finally, the third snapshot is merged back into the original disk. So they’re not rolled into each other anymore. They’re all rolled directly back into the original disk in turn, one after another. Then finally, once that’s all complete, the helper snapshot that was created while the snapshot process was running is rolled back into the original disk file, and all the delta files, the snapshot files, are deleted.
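The consolidation order described above can be sketched in a few lines: each delta is merged straight into the base disk, oldest first, so no delta ever has to absorb another delta's data and grow. Dicts stand in for delta files; this is an illustration of the ordering, not the real commit mechanism.

```python
def consolidate(base, deltas):
    """Merge each snapshot delta (oldest first) straight into the base disk.

    Toy sketch of the vSphere 4.1-style process: deltas go directly into
    the base one at a time, instead of being copied into each other.
    """
    for delta in deltas:       # snapshot 1, then snapshot 2, then 3...
        base.update(delta)     # newer writes overwrite older block data
        delta.clear()          # the committed delta file is then deleted
    return base
```

Because each delta only ever moves into the base, the temporary extra disk space needed is bounded by the deltas themselves, which is the efficiency gain over the older roll-into-each-other method.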
Now we’re going to talk about virtual machine snapshot sizes. When you first create a snapshot, that snapshot file starts out small, with an initial size of 16 MB. Then it grows in 16 MB increments as writes are made by the operating system to that virtual machine disk file. As each 16 MB allocation or increment gets filled up, all the blocks get written to, it then extends that snapshot by another 16 MB, and will continue to do that as more and more data is written to that snapshot. A single snapshot file can never exceed the size of the original disk file. The reason is that if a disk block was written to once inside the snapshot file, and it’s written to again, it doesn’t create another disk block for that particular block. What it does is update the existing disk block in the delta file. As a result, if all of the blocks inside the virtual disk got written to while that snapshot was active, it would equal the exact same size as the original virtual disk. If you have a 20 GB virtual disk and every single block was written to by the operating system while the snapshot was active, that snapshot would equal the size of the virtual disk and not exceed it.
If you have multiple snapshots running, the combined space they take up could exceed the size of the original disk file, because each individual snapshot could grow up to the size of the original disk file. If you have multiple of those, they could all get large enough that, if you added up all their disk space, it could exceed the size of the original disk file.
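The sizing rules in the last two paragraphs reduce to a small back-of-the-envelope formula, sketched below. This is an approximation for illustration (it ignores snapshot metadata overhead); the function name and units are my own.

```python
INCREMENT = 16  # MB: the allocation unit a snapshot grows by

def snapshot_size_mb(unique_data_written_mb, base_disk_mb):
    """Estimate one snapshot's size: grows in 16 MB steps, capped at the disk size.

    Rewriting an already-written block updates it in place rather than
    allocating a new one, which is why the cap holds.
    """
    # Round the unique changed data up to the next 16 MB increment...
    increments = -(-unique_data_written_mb // INCREMENT)  # ceiling division
    size = max(INCREMENT, increments * INCREMENT)         # starts at 16 MB
    # ...but cap at the base disk size.
    return min(size, base_disk_mb)
```

For multiple snapshots, each one is capped individually, so the combined total can still exceed the base disk size, as the text notes.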
When it comes to how these snapshots grow, and like I said, they grow in 16 MB increments, the rate of growth will basically be determined by how much disk write activity occurs on your server while the snapshot is active. If you have a simple server, maybe it’s an application server or a web server, not a lot of disk I/O activity is occurring; maybe it’s a lot of reads, but not a lot of writes. In that case the snapshot would probably stay pretty small, because while that snapshot’s active, a lot of those disk blocks aren’t getting changed. Let’s say we had that snapshot active while a backup was running for an hour. Since there’s not a whole lot of activity, not a lot of disk writes happening, that snapshot, it’s going to vary based on every single VM, but it will probably stay under a GB. Now, if you have something that has very heavy I/O, maybe it’s a database server or an Exchange server that you’re backing up, well, that snapshot could get pretty big pretty quick, because there are a lot of disk writes happening on that VM. As a result, during that short time period, you could have a lot of I/O activity that would make that snapshot grow pretty quick.
That’s why it’s important, if you have a snapshot active on a VM, don’t do things like a defrag inside the operating system or something that is really disk intensive, because it’s really going to cause that snapshot to grow really quick and get large at a real rapid rate which you don’t want to happen. So, whenever you have virtual machine snapshots running, don’t do anything that could cause a lot of disk IO, or write IO, on that VM, which will make the snapshot grow pretty large in size.
Let’s talk a little bit about virtual machine snapshot usage, when you should use snapshots. Snapshots, I’ll tell you right now when you shouldn’t use them. They should not be considered a primary method for backing up your VMs. I could say that over and over, because there are some people that do that, and you can’t rely on them to be a primary backup method. They’re useful for things like short-term backups, where you have to do an ad hoc or on-the-fly backup of a VM to preserve it. Let’s say you’re going to patch that VM or upgrade an application on that VM, and it’s useful to be able to roll back to a certain point in time. Maybe that upgrade had problems and you need to go back to the previous version. So, they’re great for things like that, but don’t use them as a primary method for backing up VMs.
When snapshots are running, they slightly degrade the VM performance. As they grow and that delta file continually increases in size, they consume extra disk space on your data stores. When you commit and delete them, it also takes a lot of resource usage on the host. Because you’re extending that virtual disk into multiple files, it adds complexity to that virtual disk, and in some situations the snapshot chain can run into problems. Also, some features don’t work if you have a snapshot running. There are certain features in vSphere that you can’t use if the VM has a snapshot running on it. So you should always just use snapshots for ad hoc, on-the-fly backups and not really use them as a primary backup mechanism.
Now, having said that, snapshots are a primary enabler for performing image level backups. What happens is, when I’m backing up the VM at the VMDK level, or the image of that VM, what I need to do is I need to pause that disk so the backup application can read it without any changes that occur to a disk. A snapshot is great for that, because what it does is it makes the VM’s disk read-only, so the backup app can mount it and have exclusive access to that disk. It doesn’t have to worry about the operating system writing to that disk while the backup is being completed. So, snapshots are really used by almost all backup applications that do virtual machine backup at the virtualization level. Once that snapshot is taken of the VM, then what happens is, when the backup’s completed, the backup application automatically deletes that snapshot. The changes that have occurred while that backup was running are rolled back into the original disk file. It just removes that snapshot like it wasn’t even there.
When it comes to backing up your VMs, typically in the old days, you stored that data on a tape target. You would take those tapes, put them off-site, or somewhere safe, and you always have them there in case you need to ever do a restore. Today, a lot of the backup applications, instead of using tape, they’re using disk targets. How that works is, instead of writing to the tape, data is written to a special backup repository that’s created on a storage device. Typically that’s somewhere on a network. It could be a local storage device, but typically it’s on a network where you have a dedicated storage device where you can dump all that data and put it into a backup repository.
A lot of the backup apps now support writing directly to disk targets. Before, to be able to do that, they had a special thing called a virtual tape library, a VTL. What that did was emulate a tape drive. Say you have NetBackup and it needs a tape target. You would install the VTL, which presented itself as a tape drive, but it had a storage backend, so it actually stored the data on a storage device instead of on a tape. Today, most of the applications don’t need that anymore. The VTL technology is used less and less. Most of the apps can write directly to a disk target.
That backup repository that sits on that disk target is typically de-duped and compressed. It’s basically the virtual disk files of the VMs, all crunched together into this large repository. It stores all these backups as files on whatever storage device you’re using for the target. That target storage can be local or remote. You can have a local disk on your backup server where it’s storing all that data, or it could be remote, where the backup server is accessing it via an NFS share, a CIFS share, or FTP. The backup server then takes all the data from the VMs that it backs up and puts it in the backup repository that’s located on that target storage device, whether it be local or remote.
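The de-dupe-and-compress idea behind such a repository can be sketched simply: hash each block, store each unique block once (compressed), and let duplicate blocks from any VM just reference the existing copy. This is a minimal illustration of the concept, not any vendor's repository format; the class design is made up for the example.

```python
import hashlib
import zlib

class BackupRepository:
    """Toy deduplicating, compressing backup repository."""

    def __init__(self):
        self.store = {}    # block hash -> compressed block, stored once
        self.catalog = []  # ordered list of block hashes for restore

    def ingest(self, block):
        digest = hashlib.sha256(block).hexdigest()
        if digest not in self.store:
            # Deduplication: only unique blocks are compressed and stored.
            self.store[digest] = zlib.compress(block)
        self.catalog.append(digest)
        return digest

    def restore(self, digest):
        # Decompress the stored block on the way back out.
        return zlib.decompress(self.store[digest])
```

Since many VMs share identical OS and application blocks, deduplication like this is what keeps image-level disk repositories far smaller than the raw VMDK sizes would suggest.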
Once that data’s written to a backup repository, it’s got some advantages, where you can actually take that data and then replicate it to an offsite location, either via storage replication, if you have an array that supports replication, or via products that can do that at the virtualization layer as well. Those backups are great for DR purposes, where as soon as that data’s backed up, it’s automatically replicated off-site. You don’t have to do anything with tape media and putting it off-site. All that data can be transferred automatically to another location.
If you have that disk repository, there’s still a need for tape in a lot of cases. So, what do you do? How do you get that data to tape? A lot of the backup applications that support doing disk-to-disk backup also support writing that data to tape as well. It’s called disk to disk to tape. Basically, you’re writing to a disk target first, and then you’re writing to a tape target afterwards. You may have a tape system that would come through then and backup that repository. All it is, is files that reside on a storage device somewhere. So that tape drive can come through and access and backup that data that resides in the repository. That way, if you want to put that off-site, or lock it up, you actually have another copy of your data, and maybe a longer term archive that you can store to tape. You might use the disk for shorter term storage, where you can keep a month’s worth of data in there, and then the tape would be used for longer retention periods.
Backing up to disk makes restoring the data much easier and quicker, because it’s already there on your network. You don’t have to go finding tapes or recalling tapes. You can access that repository and pull back whatever you need really quickly. With tapes, it could take a while, because you may no longer have that tape on site. It could take days to get that back, but with disk you can get that data back almost right away.
The total cost of ownership can be a little bit higher with disk-to-disk backups, compared to traditional tape backups, because typically these storage systems are a lot more expensive than buying tapes. You continually have to add storage, and you typically need a higher-end storage device if you’re going to do replication and things like that. The price of that disk system can add up compared to the price of implementing just a simple tape library with tapes. Tapes are definitely a lower-cost solution with more capacity compared to disk.
Another key core technology that’s used in backups pretty frequently is called the Volume Shadow Copy Service, otherwise known as VSS. This is a Microsoft-only technology. It’s a mechanism for creating consistent point-in-time copies of data, also known as shadow copies. It’s basically a Windows service that runs. It was first introduced in Windows XP, and all the versions of Windows since then have that service.
It works at the block level of the file system. It doesn’t work at the file level; it works at the disk block level. It ensures data cannot be changed during backups. You know how we talked earlier about virtual machine snapshots, where you would freeze the virtual disk of a VM? This is the equivalent inside the operating system, where the VSS service does pretty much the same thing. It creates a snapshot and ensures that the data cannot be changed during the backup. It also helps avoid problems with file locking by creating that snapshot: an application that might be writing while that backup is running is writing to a different area, and isn’t writing back to the original disk file. If you go into the services on any Windows machine that has this service, you’ll see it. It’s called Volume Shadow Copy. Its startup type is manual, but as long as it’s not disabled, it will run automatically. So it doesn’t actually have to be running or started; backup applications that use it can start it on their own when it’s set to manual.
The Volume Shadow Copy service works with both file system and applications. It pauses them and puts them in a proper state for snapshots and backups to occur. When a request is made, it’s going to tell the application or the OS to pause for a second. Hold on a second. I need to prepare my data. I might flush data to disk, out of memory, and things like that, before I do the backup, so the data’s in a consistent state to be backed up.
VSS is made up of several core components. First is the requestor. What the requestor does is it initiates the request to the VSS service that’s running in the operating system. Typically, the requestor is a backup application or some type of application that needs to do something inside the OS to back it up or whatever it’s going to do. Requestors then work with the writers to collect information about the data to be backed up. That requestor is going to query the writer and say, "I need to know information about everything on the disk so I can prepare it in the proper state to be backed up."
The writer piece, then, is a part of the applications or services that are designed to work with VSS. A writer prepares the application to quiesce its data, to ensure that data isn’t written until the shadow copy is created. It does the actual work here, where it pauses the application and quiesces it and says, "Don’t write any data while I’m doing this, because I’ve got to get this data written to make sure it’s in a proper state." So, the writer does all the hard work there. Doing that basically ensures that the backup will be in a consistent state, and any data sitting in memory that needs to be written to disk gets flushed, so the data is in a consistent state.
Finally, the provider. This does the actual work of creating the shadow copy. Once the writer, like a traffic cop, has ensured that the apps are all quiesced properly, the provider goes out and creates and maintains the shadow copies until they’re no longer needed. It’s kind of like a snapshot. You can consider it a shadow copy, where disk writes occur in a different area while this is active.
We talked about providers. There are different types of providers that can do the work of creating the shadow copies. There are hardware providers, which can offload that shadow copy to a hardware storage device. This is typically used on things like shared storage or maybe a special array card in a server, where it’s more efficient at doing this than the actual operating system is. So, by offloading that whole process to a hardware device, it makes that whole process a lot more efficient.
Then there are software providers. These work at the software layer. They intercept and process IO requests. Anytime a request, just like with the VM snapshot, comes through, it’s going to say, "Hey, you can’t write this to the original spot, because this shadow copy is active." So, what you’re going to do is write them to a special area. It can be any type of storage device. The system providers are built into the Windows OS and write to an NTFS volume on the system.
You can see there on the bottom a little depiction of how this works. The shadow copy service runs on the operating system. A requestor, like a backup application, comes in, makes the request, and the service is what directs all the different components. It will go to the writer, and the writer will work with the applications and the operating system to prepare them to be quiesced properly so they’re in a proper state. Then, from that point, it will turn it over to the providers that work at the disk level to do things like intercepting all the IO requests, the writes that occur, and moving them to a different area while that shadow copy is active. Then, again, the process is kind of the same as the VM snapshot. Once that process is over, the data is simply taken and written back to the original location.
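The requestor/writer/provider sequence depicted on the slide can be sketched schematically. To be clear, the real VSS API is a Windows COM interface; the class and method names below are illustrative stand-ins that just show the ordering of freeze, snapshot, and thaw.

```python
class Writer:
    """Stand-in for a VSS writer attached to one application."""

    def __init__(self, app):
        self.app = app
        self.log = []

    def freeze(self):
        # Flush in-memory data to disk and pause writes so the data is
        # in a consistent state for the shadow copy.
        self.log.append(f"{self.app}: flushed and frozen")

    def thaw(self):
        self.log.append(f"{self.app}: resumed")

def create_shadow_copy(writers, provider):
    """Requestor's view of the VSS sequence: freeze, snapshot, thaw."""
    for w in writers:
        w.freeze()          # writers quiesce their applications first
    snapshot = provider()   # provider takes the point-in-time copy
    for w in writers:
        w.thaw()            # writes resume; the backup reads the snapshot
    return snapshot
```

The ordering is the whole point: the snapshot is taken only in the narrow window when every writer has confirmed its data is flushed and frozen, which is what makes the copy application-consistent rather than merely crash-consistent.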
When it came to backups, VMware recognized that we can’t do things the way we traditionally did. We can’t use agents inside the operating system. We need to leverage virtualization to make backups a lot more efficient. What they came up with, in VI3, was the VMware Consolidated Backup framework, also known as VCB, which essentially was a proxy server that acted as a middleman. A backup server, instead of going directly to that VM to back it up, would go through the proxy server. The proxy server would then mount the virtual disk of that VM onto the VCB server and not involve the VM at all. The backup server would back up the disk that’s attached to the VCB server. So all the resources and all the workload are on that VCB server. This basically shifted the backup overhead off the VM, where if you backed up inside the OS, you have overhead there: you have an agent consuming CPU, memory, and disk resources. It shifted that load from the VM and the host itself, because the VM is running on the host and using host resources, to the backup proxy server instead. This initial method was more efficient, because you no longer had to go directly to that VM and involve the VM and the host. The guest OS on the VM didn’t even know it was being backed up, because the proxy server would just mount the virtual disk of that VM, and the backup server would contact the proxy server instead.
An even more efficient way of doing backups came with the vStorage APIs, which VMware introduced in vSphere, and which basically eliminated VCB and that proxy server. Instead of using that method, it leveraged a combination of APIs, SDKs, and also the Virtual Disk Development Kit (VDDK), which allows applications to interface directly with virtual disks. What these vStorage APIs did was allow third-party applications to directly connect to virtual storage datastores, instead of going through that proxy server. Why would we want to go through the proxy server, which is another hop, when we can go there directly? The vStorage APIs allow this, which improved efficiency and management, because you can go direct to those VMs and not involve external components to do that.
vStorage APIs is basically just a marketing term. They are a collection of the various ways in which APIs integrate with the different areas of storage in vSphere. There are four categories of these APIs that we’re going to cover in the next few slides.
The first category of the vStorage APIs is the vStorage APIs for Array Integration, also known as VAAI. These were not available initially; it took some time to develop them, and VMware actually had to work with specific storage vendors to enable storage-array-based capabilities directly from within vSphere itself. This enabled vSphere to act more efficiently for certain storage-heavy operations. The storage array does this stuff a lot more efficiently. Instead of trying to do this stuff in vSphere, it makes more sense to offload it to the storage array, and vSphere just tells the storage array, "Go do this for me." The storage array can do it much quicker than vSphere can.
This included certain operations, like array-based snapshots, where you could actually call an array-based snapshot directly from within vSphere. Copy offload, where if you’re going to clone a VM, or something like that, the copy process happens a lot quicker at the storage array, instead of trying to do it through vSphere itself. Essentially, when we try to clone something or create a copy, vSphere tells the storage array, "Hey, do this for me," and the array goes ahead and takes care of it much quicker. The same offloading applies to things like zeroing blocks: if you have a lot of disk blocks to zero on a virtual disk, let the storage array do it. It does it a lot quicker. Hardware-assisted locking, instead of SCSI reservations, which can cause performance problems: let the array handle locking of the virtual disk files. It can do it a lot more efficiently. And things like storage provisioning integration, where I can actually go through vSphere now to provision storage. So I don’t need two separate consoles. I can do all the storage-related tasks directly from within vSphere.
The second category is the vStorage APIs for Multipathing, also known as VAMP. These were specifically designed for third-party storage vendors to leverage multi-pathing functionality in arrays through specific plug-ins that each storage vendor develops. What this allows is for vendors to more intelligently utilize multi-pathing. Typically, you have multiple paths to your storage device. If you have a back-end SAN, you’ll probably have at least two Fibre Channel controllers inside your host. Because you have multiple paths available, the vStorage APIs can work with the storage array to get the best possible multi-pathing, so you can get the best storage I/O throughput and path failover for each specific storage array. Each storage array is different, and they all handle multi-pathing in a little bit different way. To get these to work, each vendor must certify their multi-pathing extension modules with VMware. Currently this only supports iSCSI and Fibre Channel storage. A lot of things in the vStorage APIs don’t support NFS yet, but NFS support will most likely be added in the future.
Here are the other two categories of vStorage APIs. First, there's a vStorage API for Site Recovery Manager. These APIs work with VMware's Site Recovery Manager product, which allows you to do site-to-site replication of virtual machines for business continuity or disaster recovery purposes. This product relies on array-based replication to do all the replication at the array level; it can't do any of the replication at the VM level. Basically, vendors develop their own site recovery adapters that plug into Site Recovery Manager for their specific storage subsystems. That integration then allows the Site Recovery Manager product to control the replication and keep on top of everything that happens, keeping each site updated.
The final one is the vStorage APIs for data protection. This is the big one for backups. It's the successor to VCB, and it makes up for a lot of VCB's shortcomings. A lot of people were unhappy with VCB; it was a real pain to set up and use. Where VCB was a separate standalone framework, the vStorage APIs are built directly into vSphere, and they don't require any additional software to be installed like VCB did. They include all the VCB functionality, but they also add a lot of new functionality that we'll talk about. The vStorage APIs for data protection are basically targeted at third-party backup and data protection applications, providing those applications with better integration and greater flexibility for backing up virtual machines and recovering data when needed.
The vStorage APIs for data protection introduced some more efficient methods for backing up virtual machines. Remember, before we had to go through a proxy server using that VCB to get to the virtual disk of a VM. We no longer have to do that with the vStorage APIs. The vStorage APIs allow backup applications and servers to directly mount a virtual disk of a VM.
There are two ways it can do that. If your backup server is a physical server, it can directly access the VM's datastore and mount the virtual disk so it can back it up. The other way is if your backup server is a virtual appliance running on a host. In that case, it can hot-add the target VM's disk to the backup appliance, so that it appears as just another local disk on that appliance, and it can back it up directly that way as well. Both of these methods allow for much more efficient backups of your virtual disks compared to involving the VM or going through a proxy server like you had to do with VCB.
The change block tracking feature is probably one of the most significant and standout features of the vStorage APIs. For incremental backups and replication, you typically have to know what blocks have changed since the last backup or replication cycle occurred. So applications normally have to compute that on their own: they have to look at all the disk blocks and figure out exactly what has changed since the last backup or replication cycle, so that only that data gets backed up this time.
CBT allows third-party apps to simply query the VMkernel for that information. The VMkernel now keeps track of all the disk blocks that have changed since a particular point in time, like the last backup operation or replication cycle. The application can just ask the VMkernel, "Hey, what do I need to back up this time?" and the VMkernel gives it the answer right away. So you no longer have all the overhead of an application trying to figure that out on its own; it can simply query the VMkernel and find out instantly. This allows for much faster incremental backups. It also allows for near-continuous data protection, because your replication cycles can be a lot quicker now. You no longer have to compute everything that's changed; the VMkernel tells you right away.
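To make that idea concrete, here's a minimal sketch in Python of how a CBT-style query saves the backup application work. This is a simulation only, not the real vSphere interface: the `ChangeTracker` class and its method names are hypothetical stand-ins for the VMkernel's changed-block bookkeeping.

```python
# A minimal, hypothetical sketch of the change block tracking (CBT) idea.
# The VMkernel role is simulated by an in-memory tracker; nothing here is
# part of the real vSphere API.

class ChangeTracker:
    """Tracks which virtual-disk blocks have changed since a change ID."""

    def __init__(self):
        self._sequence = 0         # monotonically increasing change ID
        self._block_versions = {}  # block number -> sequence of last write

    def record_write(self, block):
        """Called by the (simulated) VMkernel whenever a block is written."""
        self._sequence += 1
        self._block_versions[block] = self._sequence

    def current_change_id(self):
        """Saved by the backup app when it finishes a backup cycle."""
        return self._sequence

    def query_changed_blocks(self, since_change_id):
        """What a backup app asks instead of scanning the whole disk."""
        return sorted(block for block, seq in self._block_versions.items()
                      if seq > since_change_id)

# Full backup: back everything up, then remember the change ID.
tracker = ChangeTracker()
for block in (0, 1, 2, 3):
    tracker.record_write(block)
checkpoint = tracker.current_change_id()

# The guest writes a couple of blocks; the next incremental backup only
# needs to ask which blocks changed since the checkpoint.
tracker.record_write(2)
tracker.record_write(5)
print(tracker.query_changed_blocks(checkpoint))  # → [2, 5]
```

The key point the sketch shows: the incremental backup never touches the unchanged blocks 0, 1, and 3; it gets the changed-block list from a single query instead of reading and comparing the whole disk.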
Change block tracking supports any type of storage except physical-mode raw device mappings (RDMs). It supports virtual-mode RDMs, but not physical mode. It also works with both thin and thick disks, and they can be on any type of storage, whether it's NFS, iSCSI, Fibre Channel, or local storage.
How does change block tracking work? CBT stores all that changed-block information in a special file: if you look at a VM's home directory, you'll see a file with a -ctk.vmdk extension, created for each virtual disk of the VM. This is a fixed-length file. It doesn't grow like a snapshot file does, because it's essentially just a map of the VM's virtual disk blocks. The size of this file varies with the size of the virtual disk, but it's not really that big: about 0.5 MB for every 10 GB of virtual disk. So these files are relatively small and won't take up a lot of room on your datastores. The state of each block is stored using sequence numbers that tell applications whether a particular block has changed or not. CBT can be enabled automatically by a backup application; typically each backup application has a setting for whether you want to use change block tracking. Or, if you want to enable it yourself, there's a virtual machine advanced setting that will turn the feature on as well.
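As a quick sanity check of that sizing rule, here's the arithmetic worked out in Python. The function name is purely illustrative, and the 0.5 MB per 10 GB figure is the approximation quoted above, not an exact formula.

```python
# Worked arithmetic for the approximate -ctk file sizing rule quoted in
# the text: roughly 0.5 MB of change-tracking map per 10 GB of virtual
# disk. The function name is just illustrative.

def ctk_file_size_mb(disk_size_gb):
    """Approximate CBT map file size using the 0.5 MB per 10 GB rule."""
    return disk_size_gb * 0.5 / 10

for disk_gb in (10, 100, 2000):
    print(f"{disk_gb:>5} GB virtual disk -> ~{ctk_file_size_mb(disk_gb):g} MB -ctk file")
# prints ~0.5 MB, ~5 MB, and ~100 MB respectively
```

Even a 2 TB virtual disk only carries about 100 MB of tracking overhead, which is why these files are negligible next to the disks they describe.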
Another common technology used in backups, whether in a physical or virtual environment, is data deduplication. Your backup repositories, where all the images of your VMs are stored, can grow pretty large over time. So what do you do when you run out of space on those backup repositories? Well, you could purchase more storage. That means either upgrading your current storage device by adding more storage to it, or buying a new storage device. Either option can be pretty expensive. You could also limit the number of backups that you store in the repository, but by doing that you limit your ability to quickly restore data from it.
Since neither solution is desirable, instead of increasing the amount of storage you use for your backups, it's better to reduce the size of the data stored in your backup repositories instead. Data deduplication eliminates duplicate blocks of data from being stored in the backup repository. It looks at every single block of data that gets sent to the repository and picks out the ones that are already stored there, to avoid storing them repeatedly.
There are different ways of doing data deduplication. There's inline versus post-processing: inline, you're doing it as the backup is occurring; post-processing, you're doing it after the backup is finished, going to the target storage device and deduping the data there.
There's also source versus target, where source deduplication happens at the source, typically through the operating system, and target deduplication happens on the target storage device. And there are different chunking methods and hash sizes. When you're deduping data, you look at the data in chunks of a certain size, maybe 16 KB or 64 KB; you take all the data, break it into those chunks, and then compare the chunks to see which ones are duplicates. There are a lot of different approaches, and each vendor has its own preferred methods of doing data deduplication that you'll find in different backup products.
Out of the different methods you can use for data deduplication, the inline method is the most common. With this method, the hash calculations are done before the blocks are stored in the backup repository. The backup server looks at every single block that comes through it and does a hash calculation on each one. If that hash matches a block that's already stored in the repository, it doesn't store the block again; it simply references the existing one. Doing all these extra hash calculations adds overhead, because the backup server is no longer just moving data from point A to point B. It's actually having to do math as well, calculating hashes on all the blocks that pass through it. You can tweak the hash calculations to get either maximum deduplication or better performance. Smaller hash block sizes mean maximum deduplication: the blocks are smaller, so the chances of duplicate blocks occurring in your repository are greater. But you also get slower backups as a result, because the server has to process a lot more blocks of data to do all those hash calculations.
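Here's a minimal sketch of that inline, hash-based approach in Python, assuming fixed-size chunking. The `DedupStore` class and its method names are hypothetical and not taken from any particular backup product; real products add persistence, collision handling, and much smarter chunking.

```python
import hashlib

# A hypothetical, minimal sketch of inline hash-based deduplication with
# fixed-size chunks. Each incoming chunk is hashed; chunks whose hash is
# already in the store are referenced rather than stored again.

class DedupStore:
    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.chunks = {}   # sha256 digest -> chunk bytes (stored once)
        self.recipes = []  # each backup is just a list of digests

    def backup(self, data):
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Only store the chunk if this hash hasn't been seen before.
            self.chunks.setdefault(digest, chunk)
            recipe.append(digest)
        self.recipes.append(recipe)
        return recipe

    def restore(self, recipe):
        """Rebuild the original data from its chunk references."""
        return b"".join(self.chunks[d] for d in recipe)

store = DedupStore(chunk_size=4)
first = store.backup(b"AAAABBBBAAAACCCC")  # "AAAA" appears twice
second = store.backup(b"AAAABBBBDDDD")     # only "DDDD" is new data

print(len(store.chunks))                            # → 4 (not 7)
print(store.restore(first) == b"AAAABBBBAAAACCCC")  # → True
```

Seven chunks pass through the backup server, but only four unique ones land in the repository; the second backup costs almost nothing because most of its chunks already exist.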
If you use larger hash block sizes, you'll get less deduplication, because the blocks are much larger, so there's less chance of a block matching one already stored in the repository. You still get some level of deduplication, but you also get the best backup job performance. So you've got to find a balance that works in your environment and meets your backup windows. If you have a lot of time to do backups, maybe go for maximum deduplication; that way you have less data stored, but your backups will be slower. If your windows are shorter and you have to get backups done quicker, then use a larger block size to get the best performance possible.
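This block-size trade-off can be demonstrated with a small standalone experiment. The sample data and the `stored_bytes` helper below are made up for illustration; the sketch compares chunk contents directly rather than hashing them, which is equivalent for counting unique chunks.

```python
# A made-up experiment showing why smaller dedup block sizes recover more
# duplicates: a small change "poisons" only the small chunk it sits in,
# while with large chunks it makes the whole surrounding chunk unique.

# Simulated disk contents: a repeating 16-byte unit with three scattered
# modified units, like mostly-identical blocks on a VM disk.
unit = b"0123456789ABCDEF"
units = [unit] * 1024
for i in (100, 400, 700):                      # three changed 16-byte units
    units[i] = f"changed-unit-{i:03d}".encode()
data = b"".join(units)

def stored_bytes(data, chunk_size):
    """Bytes actually stored after fixed-size-chunk deduplication."""
    unique = {data[i:i + chunk_size] for i in range(0, len(data), chunk_size)}
    return len(unique) * chunk_size

for size in (16, 256, 1024):
    print(f"{size:>4}-byte chunks store {stored_bytes(data, size):>5} "
          f"of {len(data)} bytes")
# → 16-byte chunks store 64 bytes, 256-byte chunks store 1024 bytes,
#   1024-byte chunks store 4096 bytes
```

The flip side, which the sketch doesn't time, is that the 16-byte case has to examine 64 times as many chunks as the 1024-byte case, which is exactly the hashing overhead the text describes.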
Not all backup products support deduplication natively or include it. A lot of times you have to buy it as an extra, or do it not within the application itself but on the storage device end. So always check your backup product to see what its level of support is for data deduplication.
Data compression is another technology that can reduce the space data takes up in your backup repositories. Data compression essentially squeezes the data so that it takes up less space, using algorithms to reduce the number of bits the data would normally occupy. Compression is very CPU-intensive, and again, your backup server isn't just moving data; it's having to do a lot of math to get those backups done. Compression can really increase backup times, because all that data has to be compressed before it can be stored in your backup repositories. Because it's so CPU-intensive, you really can't skimp on the resources on your backup server if you want to use compression. You're going to need a pretty beefy backup server to handle all the extra math that occurs while the backup is running; typically, eight CPU cores on the backup server would be a good amount if you're going to use maximum compression. Like dedupe, there are also multiple compression levels that vary the amount of compression that's done. So if you have a big backup window with a lot of time to shrink the data down, you can use maximum compression. If your window is smaller and you don't want to spend as much time compressing the data, you can vary the level to whatever meets your requirements.
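The idea of tunable compression levels can be seen with Python's standard zlib module, where level 1 is fastest and level 9 compresses hardest. The sample data below is made up, and the ratios it shows are specific to that repetitive input, not what you'd see on real VM images.

```python
import zlib

# A small demonstration of compression levels using the standard zlib
# module: level 1 favors speed, level 9 favors maximum compression.
# The sample data is artificial and compresses far better than real data.

data = b"backup repository block " * 4000  # highly repetitive sample data

for level in (1, 6, 9):
    compressed = zlib.compress(data, level)
    ratio = len(data) / len(compressed)
    print(f"level {level}: {len(compressed):>6} bytes "
          f"({ratio:.0f}x smaller than {len(data)} bytes)")

# Compression is lossless: decompressing returns the original data.
assert zlib.decompress(zlib.compress(data, 9)) == data
```

Just as the text describes for backup products, picking a level is a window-versus-space decision: higher levels spend more CPU time per byte in exchange for a smaller repository.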
Between data dedupe and compression, you can dramatically shrink your backup repositories, which results in you being able to store more data for a longer period of time without having to look at alternate methods, like increasing your repository size or taking data off of there.
Now let's review what we've covered regarding the core technologies for virtual machine backups. We took a look at the backup methods used in virtual environments, and specifically at technologies such as virtual machine snapshots. We then focused on disk-to-disk backups, and moved on to the Volume Shadow Copy Service, or VSS, which is a way to take an application-consistent backup. We then talked about the vStorage APIs, and specifically change block tracking, which enables the fastest virtual machine backups. Finally, we talked about data deduplication, which can lead to storage savings on your backup targets, and data compression.