About the speaker
VMware backup integrity tools
In this session you will learn:
- Why backup integrity is critical
- Traditional backup verification methods
- How virtualization makes verification easier
- Why quiescing is critical to consistency
- The role of VMware Tools
More sessions by Eric Siebert
Hi, my name is Eric Siebert, and I’ll be your instructor for this part of Backup Academy. In this lesson, we’ll be covering VMware backup integrity.
Before we begin, a little bit about me. I’m a 25-year IT industry veteran, and I’ve been focusing on virtualization for the last five years. I’m the author of two books on virtualization: “Maximum vSphere” and “VMware VI3 Implementation and Administration.” I’m the proprietor of vSphereland.com, a VMware information website, and a regular contributor to many TechTarget websites, including SearchServerVirtualization.com and SearchVMware.com. I’ve presented at VMworld twice, in 2008 and 2010, and I’ve been recognized as a vExpert by VMware in 2009 and 2010.
So here’s our agenda for this lesson. First we’re going to cover why backup integrity is so critical and why you need a good backup to be able to do a good restore. Next we’ll cover some of the traditional backup verification methods that you use in physical server environments. Then we’ll cover how virtualization is different and how the unique architecture of virtualization makes verification a lot easier. Then we’ll cover quiescing and why that’s critical to having consistent data. Then we’ll cover how VMware Tools plays into the picture and why that’s important also.
So why is backup integrity so critical? Having good backups is important, but being able to do good restores is even more important. What’s the point of backing up data if you can’t restore it? The integrity of your backups is absolutely critical to being able to do a proper restore when needed; backups are essentially worthless if you can’t restore files when necessary. If you spend weeks, months, and years backing up data only to find out, when you go to restore it, that there was something wrong with the integrity of that data and you can’t actually do a restore, then your backups are essentially worthless.
You shouldn’t assume that because your backup software doesn’t report any errors, everything is okay. As Ronald Reagan was fond of saying, “Trust, but verify.” You trust that your backup software is working, but you need to verify that it’s actually backing up data properly. That means regularly testing your backups by actually doing a restore of the data, to ensure that it can be read properly and that everything works when you need to use it.
So you really need to ask yourself a question: “How important is your data to you?” In almost all cases, your data is critical. It’s all the information that you need to be able to back up and restore if something happens to it.
Okay. Let’s go over why verifying your backups can be a real challenge. Verification means more than just confirming that your backup software reported the job as successful and that the media is error free, with no physical errors on whatever media you use to store your backups. The reason you back up your data is so that you can properly restore files, applications, databases, and servers whenever needed. The problem is that verifying backups can be a complicated and time-consuming process. Most backup software has verification built in, but all that really does is confirm that the data was written correctly. It doesn’t verify that the data on the tape is in a state where it can actually be restored when needed.
Restoring files may be simple enough; you typically just restore a file back to some location. But restoring applications or a whole server can be challenging, because you have to make sure that the application, or the whole guest operating system, comes back in a working state. Restoring multi-tier applications is especially difficult. If you have an application that spans multiple servers, say a database server and an application server that work together, how do you know that the application as a whole is working, and how do you verify that what you have backed up is in a good state to be restored if needed?
So let’s now go over some of the traditional backup verification methods that you would typically use in a physical server environment to verify that your backup data is in a state where it can be restored when needed. With verification, you can’t really impact your production systems. You don’t want to overwrite what’s already on your production systems unless you’re willing to take that whole server down, and then you’re going to run into a lot more problems.
So typically you need one or more unused physical servers to perform the restore on. If you want to test more than simple file restores, things like applications or a whole operating system from a bare metal backup, you need extra server hardware lying around to do it on. Testing bare metal restores can be especially difficult because you typically need similar hardware. If you try to restore to dissimilar hardware, you’re going to run into problems with things like drivers and disk partitions that don’t match up; the operating system could even blue screen because it has the wrong drivers, and it’s just really difficult to verify that the bare metal restore worked properly. That problem is multiplied in the data center, where you have many models and generations of servers. If you’re trying to verify five, six, seven different models of servers, you need similar hardware for each of them to verify that the bare metal restore can actually be done.
So backup verification in the traditional physical server environment is a time-consuming, tedious, and difficult process. It’s not easy. It’s something you have to do, but in physical server environments, it just takes a lot of effort to verify that your backups are in a state where they can be restored if needed.
So when we actually verify our data, we need to do it at different levels. We talked in another lesson about the different levels we back up data at: the file level; the application level, where we’re backing up objects inside a database or similar container; and the server level, where we’re doing an image level or bare metal backup. Since we’re typically doing different types of backups in our environment, we need to verify data at each of those levels to make sure that all of our data and all of our servers can be restored when needed.
So for files, how do you typically verify them? You restore them to an alternate location. In most cases, you don’t want to overwrite what’s already there because it’s production data that you want to leave alone. So to verify files, you restore them to an alternate location, then open them and verify that you can get into them.
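One simple way to automate that file-level check is to record checksums at backup time and compare them against the copies restored to the alternate location. This is a minimal sketch in Python using only the standard library; the manifest format and function names here are my own illustration, not part of any particular backup product.

```python
import hashlib
from pathlib import Path

def checksum(path: Path) -> str:
    """Return the SHA-256 digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_restored_files(manifest: dict[str, str], restore_dir: Path) -> list[str]:
    """Compare restored copies against checksums recorded at backup time.

    manifest maps relative file paths to their original SHA-256 digests.
    Returns the list of paths that are missing or don't match.
    """
    failures = []
    for rel_path, expected in manifest.items():
        restored = restore_dir / rel_path
        if not restored.is_file() or checksum(restored) != expected:
            failures.append(rel_path)
    return failures
```

An empty failure list tells you the restored files are byte-for-byte intact; it doesn’t replace actually opening application files, but it catches silent corruption cheaply.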
For applications, it becomes a little more difficult, because you need to restore the data files, whether that’s a database or whatever container the data is in, and then open them in the application to ensure the data can be read. So you typically need to restore the application as well, to be able to check the data inside those files: your email files, your database files, or whatever transactional data is stored inside that container. You need to ensure that you can go in and read that data and that it was backed up in a good state.
Then for servers, you need to actually boot the server. You need to make sure that, in the state it was backed up in, the server can actually boot and the operating system comes up without issues and doesn’t blue screen. If you’re doing bare metal backups, you really need to ensure that when you restore that server somewhere, it’s in a good state where it will boot without problems.
So these are all the different levels that you really need to go through and verify. In a lot of cases you can’t just verify at one level, because you need to ensure that all of your data, across your whole environment, is in a good enough state to be restored if needed.
So we talked about how verification in traditional environments is difficult because you typically need extra hardware to verify things like applications or whole bare metal backups. Virtualization can make this verification process a lot easier. Since the backup is typically at the image level, you can easily restore that whole VM image to a host and power it on. You don’t need any extra hardware; as soon as you restore it, you power it on and you’re good to go. You can log into the restored server, open up its console, and make sure the operating system is okay and the applications inside it are running okay.
You can use network isolation to prevent your restored VMs from impacting your production VMs. With traditional physical servers, it’s more difficult to isolate hardware on the network because you have to go to the network layer; you typically have to involve your network group to isolate those servers and maybe put them on their own separate segments that don’t touch your production servers. With virtualization, it’s a lot easier: with vSwitches that have no physical NICs attached, you can create your own little sandbox environment and not impact your production VMs.
You can also restore multiple VMs. This is for applications that have dependencies on other servers. If you have a web server with an app server component and a database, you can restore all of those VMs into that isolated environment, so the application has every server it depends on available, and you can test the application as a whole and make sure it’s running properly.
So with virtualization you don’t need any extra physical hardware like you do in a traditional server environment. In most virtual environments we have plenty of spare host resources, so we can just restore a VM to any host that has sufficient resources for it to power up on, and we’re good to go.
So verifying backups with virtualization is easier from a process level because, as I said, you don’t need the extra physical hardware, and the VMs are already backed up at the image level. But it can still be time consuming, because there are still a lot of manual steps you have to do to actually go out and verify that everything, from your files to your applications to your bare metal images, is working and that all your data can be restored when needed.
So when you’re performing backups, it’s all about consistency. Consistency is really the key to good backups, because your backups are only as good as the data you’re reading from your servers. If that data isn’t in a good state, if it hasn’t been properly prepared to be backed up, then your backups aren’t consistent, and when you try to restore them, you may find out that the data you backed up isn’t good enough to restore from.
So the operating system, the applications, and the data all need to be put into a proper state and prepared for when the backup occurs. This is especially critical for transaction-sensitive applications like Active Directory, Exchange, or SQL Server, which typically have databases or some other container that they’re constantly reading from and writing to. Those applications really need to be put into a proper state before they’re backed up to make sure you aren’t missing any data and there’s no corruption. So the server must be prepared. When you kick off a backup, a VM snapshot occurs, which pauses that virtual disk and makes it read only. But before that can happen, the virtual disk needs to be put into a proper state: any pending transactions, or data sitting in memory on that server that hasn’t been written yet, must be finished and written to disk before you take the VM snapshot and begin your backup. If you don’t do this, you could have missing data, which can cause corruption or incomplete backups. In a lot of cases, it can make your backups worthless.
So to achieve the consistency we need, to ensure that our server, applications, and data are in a proper state to be backed up, we need to quiesce them. Quiescing is basically a fancy term for pausing a computer. You’re telling the operating system and all the applications running on that computer, “Hold on a second, give me a minute to do some things.” It pauses everything running on the computer, and any outstanding writes, say data that still needs to be written to put a database into a good state, get written to disk. If there are things in memory that need to be written to disk as well, those get flushed too, so nothing outstanding is left that could make the backup incomplete or leave it with missing data.
Once that quiesce operation completes, the VM snapshot can be taken, and then the OS and the applications proceed as usual. Typically the quiesce process doesn’t take too long, maybe a second or two, depending on the operating system, the application, and how much pending data needs to be written back. Once it completes, we’re in a safe spot: our data is in a consistent state, and a VM snapshot can be taken. The read-only disk that results from the VM snapshot is in a good state, and we don’t have to worry about missing or corrupt data. The applications then proceed as usual while the backup runs from that point on.
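To make the ordering concrete, here is a toy model of why the quiesce step matters. This is not VMware code, just a sketch: pending writes sit in memory (like an OS write cache), quiescing flushes them to "disk", and a snapshot only ever captures what is on disk at that instant.

```python
class ToyGuest:
    """Toy model of the quiesce-then-snapshot sequence."""

    def __init__(self):
        self.pending = []   # writes still sitting in memory
        self.disk = []      # writes safely flushed to disk

    def write(self, record):
        # New writes land in memory first, not on disk.
        self.pending.append(record)

    def quiesce(self):
        # Flush all outstanding writes to disk, as VSS or the
        # sync driver would before a backup.
        self.disk.extend(self.pending)
        self.pending.clear()

    def snapshot(self):
        # A snapshot captures only what is on disk at this instant.
        return list(self.disk)

guest = ToyGuest()
guest.write("txn-1")
crash_consistent = guest.snapshot()   # taken without quiescing: misses txn-1
guest.quiesce()
consistent = guest.snapshot()         # taken after quiescing: complete
```

The snapshot taken before quiescing comes back empty, which is exactly the crash-consistent problem: the transaction existed, but only in memory.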
For Windows VMs, quiescing is done through the Microsoft Volume Shadow Copy Service, which we talked about in another lesson. That service is responsible for doing the quiescing at the operating system level, pausing the operating system and applications and having them write out their data. Linux VMs don’t have an equivalent of the Volume Shadow Copy Service, so VMware Tools includes a component that can do that for Linux VMs.
The Windows operating system supports quiescing, but not all applications do. A lot of common applications, like Microsoft SQL Server and Exchange and other transactional applications, support it through the Windows VSS service, but not everything does. For applications that don’t support quiescing, the alternative is to use native tools that you call via a pre-backup script, which can pause the application, flush its data, or do whatever else needs to be done to prepare the data to be backed up.
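The pre-backup-script approach can be sketched as a pair of hooks wrapped around the backup. This is only an illustration under my own naming; the `pause` and `resume` callables stand in for whatever native commands a given application provides (a dump-and-lock utility, for example). The important detail is that the application is resumed even if the backup fails midway.

```python
from contextlib import contextmanager

@contextmanager
def app_quiesce(pause, resume):
    """Wrap a backup in pre/post hooks for an application with no VSS
    writer: `pause` flushes and suspends writes, `resume` releases the
    application afterwards, even if the backup step raises an error."""
    pause()
    try:
        yield
    finally:
        resume()

# Demo with stand-in hooks that just record the order of events.
events = []
with app_quiesce(lambda: events.append("paused"),
                 lambda: events.append("resumed")):
    events.append("backup")
```

After the `with` block, `events` shows the required ordering: the application is paused before the backup runs and resumed afterwards.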
So when it comes to consistency and backup states, there are several states a server can be in, depending on how well prepared it is when the snapshot is taken. The first state is called crash consistent, and it’s the worst state. It’s basically the same as a VM having its power turned off without being properly shut down, like yanking the power cord on a computer. The operating system and the applications running on that server haven’t had any chance to prepare themselves to be backed up; any data in memory is simply gone, never written to disk. So in a crash-consistent state, any pending transactions or data in memory are lost. In a lot of cases that’s okay. You may lose some data from open files, and for basic backups a crash-consistent state can be acceptable, but for more sensitive applications it can cause lost and corrupt data.
The next state is file system consistent, which is better than crash consistent. Here the operating system is quiesced, which allows any pending data to be written to disk before the snapshot is taken. This basically tells the operating system, “Hey, OS, prepare yourself to be backed up.” The OS prepares what it knows about, the operating system itself, flushing any data in memory that it’s aware of so it’s in the proper state for the backup. But it doesn’t take into account any running applications that might need additional steps to properly write their data to disk. This state is better than crash consistent, but it’s not the best state.
The best state, where all your data is in the most consistent condition, is the application consistent state, where both your OS and your applications are quiesced so everything on the server can be properly restored. You don’t have to worry about anything being corrupt or missing, because application consistency quiesces both the operating system and any applications that support quiescing. So this is the best state for preparing your server for a backup. Again, it only works with applications that specifically support quiescing; for apps that don’t, you effectively get file system consistency instead, because those applications aren’t aware of how to prepare themselves to be backed up.
So if you’re doing backups, you always want application consistent. It’s the absolute best state: everything that can be prepared is prepared, you have the least chance of lost or corrupt data, and your server as a whole is in a proper state to be backed up.
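The three states described above form a simple ordering, and which one you actually get depends on what supports quiescing. Here is a small sketch of that logic (the names are my own, not VMware terminology from an API):

```python
from enum import IntEnum

class Consistency(IntEnum):
    """Backup consistency states, ordered worst to best."""
    CRASH = 1          # no quiescing at all; like pulling the power cord
    FILE_SYSTEM = 2    # OS quiesced, applications not
    APPLICATION = 3    # OS and applications both quiesced

def best_achievable(os_quiesce: bool, app_quiesce: bool) -> Consistency:
    """Best consistency state reachable for a given server.

    Application consistency needs both OS and application support;
    an application with no quiescing support falls back to
    file-system consistency, as described in the lesson."""
    if os_quiesce and app_quiesce:
        return Consistency.APPLICATION
    if os_quiesce:
        return Consistency.FILE_SYSTEM
    return Consistency.CRASH
```

This mirrors the fallback rule from the lesson: a VSS-aware app on Windows gets application consistency, an app without a VSS writer drops to file-system consistency, and no quiescing at all leaves you crash consistent.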
Okay. Now we’re going to cover VMware Tools: what it is, what it does, and what role it plays in preparing your servers to be backed up in a good state. VMware Tools is essentially a package that you install onto a virtual machine. It contains drivers and applications that help optimize the guest operating system to run on a VMware host. A normal operating system has generic drivers for the virtual hardware it sees, but it doesn’t have drivers optimized for that hardware; VMware Tools contains optimized drivers that are specifically designed to work better with the VMware host. Among the things included in VMware Tools, there’s a guest service that’s used for controlling communications between the guest and the host. This provides a data conduit between the VM and the host so the host can communicate with the guest operating system. It’s also used for things like time synchronization, where we want to synchronize the VM’s clock with the clock of the host.
There’s also a VMware toolbox for Linux and a VMware tray icon for Windows for controlling various VMware Tools settings; basically a little application where you can interact with VMware Tools. Then there’s the memory balloon driver. This is an important one for memory management: the driver inflates and deflates to put pressure on the guest operating system, forcing it to invoke its own memory management controls so memory can be conserved on the host. Then there are things like the paravirtualized network driver, which is a high-performance network driver that gets you better throughput.
Then there’s the sync driver, the one that relates to backups, which is used for freezing and thawing file systems. It’s essentially the same kind of thing that the Volume Shadow Copy Service does.
VMware Tools is bundled with every host. The installer for your VMs is included on the host, kept as a little image that gets mounted to the VM; when you go to install VMware Tools, it pulls the installer from that image on the host.
There are different versions of VMware Tools, specific to different operating system types, because we want the right drivers for the right operating system. They’re packaged into separate bundles, and depending on which operating system is running in the guest VM, the appropriate version gets pulled and installed on that VM.
So VMware Tools isn’t required, but it’s highly recommended that you install it. You get nothing but benefits from installing it. Sure, your guest will run fine without it, and you typically won’t experience any problems, but you won’t get the performance and the benefits you’d get with VMware Tools. So always make sure you install VMware Tools on your VMs to get the best performance and the most interaction between the guest OS and the VMware host.
So we’ve covered what VMware Tools is, but we haven’t really covered how it relates to backups, verification, and consistency. One feature of VMware Tools is that it can interact with the Volume Shadow Copy Service running in the guest operating system. Image-level backup applications operate at the virtualization layer and don’t go inside the guest operating system, so they typically can’t trigger the quiescing process themselves; they’re only talking at the virtualization layer, not at the guest operating system layer. VMware Tools, because it runs inside the guest operating system, can quiesce the operating system via the Volume Shadow Copy Service.
VMware Tools can be slow to support new operating systems, though. This feature is there and works fine in a lot of cases, but when a new operating system comes out, Windows Server 2008 for example, it took quite a while for VMware Tools to support it, along with some versions of Windows Server 2003. Those operating systems are supported now, but a lot of backup applications don’t want to wait: they want to do that quiescing themselves rather than rely on VMware to update VMware Tools to support new operating systems.
So what a lot of backup applications have done is provide their own agent that runs inside the guest OS. It’s basically a small agent that acts as a conduit: the backup application contacts the agent and quiesces the operating system through it, without having to rely on VMware Tools. In most cases VMware Tools works fine, and that’s what the majority of backup applications use. But in some cases, if the operating system you’re running isn’t yet supported by VMware Tools, you may want to look at an agent provided by your backup application to do the quiescing inside the operating system.
For Linux VMs, VMware Tools includes a special sync driver that provides the same functionality as the Volume Shadow Copy Service. This driver can do essentially the equivalent of quiescing the disk for Linux VMs. So it’s important to have VMware Tools on your Linux VMs too, because that sync driver is what lets you get your backups into a consistent state on those VMs.
So here’s the whole process. The backup server starts the backup job. It then contacts VMware Tools, or the VMware sync driver on Linux, to quiesce the VM and prepare it to be backed up. On Windows VMs, the Microsoft VSS service actually does the quiescing. Once that happens and the operating system and applications are in a good state, the backup server creates the VM snapshot, via the standard APIs built into VMware, to freeze the disk; we’ve got a good, consistent disk, and we don’t want any more writes going to it. Now that the disk is read only, the backup server begins the image-level backup, reading the virtual disk and backing it up. Once the backup is finished, the backup server simply deletes the VM snapshot, and your backup is complete.
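The sequence above can be sketched as a small orchestration function. This is not real vSphere API code; the quiesce, snapshot, copy, and delete steps are injected as plain callables so the ordering can be shown without a host connection (a real implementation would make the corresponding vSphere and VMware Tools calls). One detail worth modeling: the snapshot should be deleted even if reading the image fails, so snapshots don’t pile up on the datastore.

```python
def run_image_backup(quiesce, create_snapshot, copy_image, delete_snapshot):
    """Sketch of the image-level backup sequence, with each vSphere
    step injected as a callable. The snapshot is always cleaned up,
    even if copying the image fails partway through."""
    quiesce()                      # VSS / sync driver flushes the guest
    snap = create_snapshot()       # disk is now frozen, read only
    try:
        copy_image(snap)           # image-level backup reads the frozen disk
    finally:
        delete_snapshot(snap)      # merge the snapshot back; backup complete

# Demo with stand-in callables that just record the order of steps.
log = []
run_image_backup(
    quiesce=lambda: log.append("quiesce"),
    create_snapshot=lambda: (log.append("snapshot"), "snap-1")[1],
    copy_image=lambda s: log.append(f"copy:{s}"),
    delete_snapshot=lambda s: log.append(f"delete:{s}"),
)
```

The `try`/`finally` is the design point: a failed copy still ends with the snapshot deleted, mirroring how a well-behaved backup application shouldn’t leave orphaned snapshots behind.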
So to summarize: earlier we told you to trust but verify, but in reality you shouldn’t trust at all. You should always verify that your backups completed successfully and that the data written is in a good enough state to be restored. Your data is the most critical thing your company owns, and you want to avoid that “Oh, crap” moment where you go to restore data and find out that all those backups you’ve been doing have something wrong with them, that your data wasn’t in a good state to be restored. So verification is key, and it needs to be done on a regular basis. You can’t just do it once; things can change over time. Whether it’s daily, weekly, or monthly, you need to go through and verify not just one server but multiple servers, multiple applications, and so on, to really make sure that all the data you’re backing up is in a good state.
So you can leverage virtualization to do this verification, and you can develop your own verification process, maybe writing scripts that go through and check that the restored data is good, creating your own methods to verify it. Automation can make your life a lot easier; leverage your backup application to automate it and take the hassle of manual verification away. The end result is that when you verify all that data, you know you have good backups. It lets you sleep well at night, and it definitely avoids those moments where you go to restore data and find out it’s no good.
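One way to script the "verify on a regular basis, multiple servers" advice is to rotate through your server list a few at a time, so every server gets restore-tested over a cycle of runs rather than testing everything (or the same one server) each time. A minimal sketch, with names of my own choosing:

```python
def verification_batch(servers, batch_size, run_number):
    """Pick which servers to restore-test on a given verification run,
    rotating through the full list so every server is eventually covered.

    servers:     full list of server names to verify over time
    batch_size:  how many to restore-test per run
    run_number:  how many verification runs have happened so far
    """
    if not servers:
        return []
    start = (run_number * batch_size) % len(servers)
    # Walk a doubled copy of the list so batches wrap around the end.
    doubled = servers + servers
    return doubled[start:start + batch_size]
```

For example, with five servers and a batch size of two, runs 0 through 4 cycle through every server; feeding `run_number` from a counter your scheduler persists (a file, a database row) keeps the rotation going across daily or weekly jobs.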
So don’t trust. Verify and always make sure that you have good backups.