About the speaker
Rick Vanover (vExpert, MCITP, VCP) is a product strategy specialist for Veeam Software based in Columbus, Ohio. Rick is a popular blogger, podcaster and active member of the virtualization community.
Best practices for VMware backups
In this session you will learn:
- Backup requirements and how-to
- Difference-making technologies
- Considerations of image-based backups
- Virtual machine design and provisioning
- Organization of backups
- Ensuring application consistency
- Testing recoverability
- Considerations for multiple OSes and applications
- Retention, availability and integrity
Welcome to this Backup Academy lesson, "Best Practices for VMware Backups". I'm Rick Vanover, and I'm going to be presenting this material.
Before we start this lesson, I'd like to take a moment and tell you a little bit about myself. I do work for Veeam Software as a product strategy specialist, but as is the case with all the Backup Academy sessions, this material is vendor-neutral and not product specific. In addition to working for Veeam, I'm also a blogger for a number of current IT publications. I blog at the Veeam blog at Veeam.com/blog, I write the Everyday Virtualization blog at Virtualization Review, and I blog on the Servers and Storage blog at TechRepublic.com. Beyond those blogging endeavors, I'm a VMware Certified Professional on both version 3 and version 4, and I'm a VMware vExpert two years running, for 2010 and 2011.
Okay. Let's take a look at the agenda for this lesson. We have a lot of material to cover, and it's a very broadly applicable set of information. The first thing I'm going to do is talk a little bit about VMware vSphere as a platform that can make a big difference in how we approach our virtual machine backups. I'll then focus on the requirements for virtual machine backup and how to meet those requirements. Next, I'll highlight some of the difference-making technologies, mentioned in other lessons as well, that are critical to backing up virtual machines on VMware vSphere. I'll talk about the differences among the types of backups, such as an image-based backup, an agent-based backup, or a file-level backup. Then I'll talk about how a virtual machine's design is a critical indicator of how its backups are going to go.
The next thing I'm going to focus on will be the separation between a virtual machine's source and its backup target. Then we're going to talk about how to organize backup jobs in the most efficient way. The next topic is a very important one around application consistency within a virtual machine backup. Next, we're going to talk about verifying the recoverability of a virtual machine backup using any number of tools. Then we're going to focus on some of the considerations that go around backing up virtual machines with multiple operating systems and complicated applications. Towards the end of the material, we're going to focus on retention, availability, and integrity considerations that go along with backing up VMware vSphere virtual machines. Lastly, we're going to talk about job notifications, alerts, and management, which are critical to the ongoing success of a virtual machine backup solution.
Okay. The first thing we're going to do is talk a little bit about VMware vSphere. VMware vSphere is a very popular virtualization platform, with a number of robust features that make it a leading choice for virtualization today. VMware vSphere 5, the current version, as well as VMware vSphere 4 and the earlier Virtual Infrastructure 3 (VI3), were revolutionary technologies that changed the landscape of data centers worldwide. Among the features available in a VMware vSphere environment is the Distributed Resource Scheduler, a breakthrough technology that allows any number of measurable events to be managed centrally through the management server, VMware vCenter (formerly VirtualCenter). This includes migration of virtual machines and resource allocation. Further, a number of protection solutions have been brought about through VMware vSphere, including VMware HA, which can protect virtual machines against host failure.
One particular technology that came out with VMware vSphere and the earlier VI3 was the Virtual Machine File System, or VMFS. I want to focus on this for just a moment. For many people, the VMFS file system was their first introduction to shared storage. A lot of environments did not previously use shared storage, such as a Fibre Channel SAN, an iSCSI SAN, or an NFS storage network, to provide storage across a number of virtual machines. Many people, including myself, first got exposed to shared storage in the course of learning virtualization. It's a central technology for doing a lot of different things with virtualization.
The basic arrangement of a typical VMware vSphere environment is shown below. There are a number of virtual machines, and each has an individual icon to indicate that all virtual machines are not created equal. For example, there might be a database, a directory server, or an application server, such as an email environment, all housed within virtual machines running one or more different operating systems. These virtual machines live on ESX or ESXi hosts, which may or may not be connected to a shared storage resource, such as a VMFS datastore. Local storage is supported and used in a number of configurations, as are network-attached storage resources using NFS. We're not really going to dig into architecture, but this general design will be used as a reference throughout this material.
One of the first and most important topics when we go about determining how to back up VMware virtual machines is to focus on the key requirements of the backup. Two metrics are central to all data protection strategies: the RTO and the RPO. These are discussed in a couple of other lessons as well, but as a recap, the RTO is the recovery time objective: the amount of time allowed, or allocated, for the virtualization infrastructure to get back up and running. The RPO is the recovery point objective: the amount of data loss that's tolerable during an incident.
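The relationship between these two metrics and a backup design can be sketched very simply. This is a minimal illustration, not any product's API; the function names and sample values are assumptions made up for this example.

```python
from datetime import timedelta

# Toy RPO/RTO checks: the worst-case data loss is the gap between backups,
# and the restore effort (including boot and validation) must fit the RTO.

def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst-case data loss equals the time between successive backups."""
    return backup_interval <= rpo

def meets_rto(estimated_restore_time: timedelta, rto: timedelta) -> bool:
    """The full restore must complete within the agreed recovery window."""
    return estimated_restore_time <= rto

# Example: nightly backups against a 24-hour RPO and a 4-hour RTO.
print(meets_rpo(timedelta(hours=24), rpo=timedelta(hours=24)))  # True
print(meets_rto(timedelta(hours=6), rto=timedelta(hours=4)))    # False
```

The second check failing is exactly the kind of gap to surface in the stakeholder discussion: the schedule satisfies the RPO, but the restore process as designed cannot meet the RTO.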
We want to identify these requirements, apply them to the technology design, and then ensure that the resulting configuration is agreed to by all the stakeholders involved. This sometimes becomes a business discussion, a financial discussion, and, most importantly, a technology configuration discussion. We need to ensure that all of the pieces, from the hypervisor to the storage to the training of the individual admins, are on par with these levels. Further, we need to ensure that all of these solutions are configured in a way that lets the virtual machine administrator deliver the RTO and RPO the organization has agreed upon.
Once we've identified the requirements of our VMware backup solution, we then need to take the time to identify how these requirements are going to be met. First of all, we need to make sure that we leverage the platform to deliver the best possible solution. In terms of VMware vSphere, there are a number of technologies that we should really consider enabling with our virtual machine backup approach. This is one of those things that is materially different than how we may protect operating systems and applications in the physical world. Simply put, the VMware vSphere environment gives us a number of APIs and logic from those APIs that can greatly enhance our backup approach.
Some of those include the vStorage APIs for Data Protection (VADP), which give us changed block tracking. That in particular is a very important feature: it allows us to back up virtual machines by looking only at the regions of a VMDK, the virtual disk file associated with a virtual machine, that have changed since the last backup. The platform, VMware vSphere, tracks that for us and can save us a lot of work in our backups. Further, we need to make sure that we're performing the backups in the fastest way. Additional APIs let us do that by communicating directly with the storage processor rather than moving data over the network or through an ESX/ESXi host or vCenter.
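To see why changed block tracking matters, here is a toy model of a full versus an incremental pass. This is only a sketch of the concept; the block list, contents, and function names are invented for illustration and are not the real VADP interface.

```python
# Toy model of changed block tracking (CBT): the platform records which
# regions of a VMDK changed since the last backup, so an incremental pass
# copies only those regions instead of the whole disk.

def full_backup(disk_blocks):
    """A full backup copies every block of the virtual disk."""
    return dict(enumerate(disk_blocks))

def incremental_backup(disk_blocks, changed_block_ids):
    """An incremental copies only the blocks the platform flagged as changed."""
    return {i: disk_blocks[i] for i in changed_block_ids}

disk = ["A", "B", "C", "D", "E", "F", "G", "H"]  # 8-block toy VMDK
base = full_backup(disk)                          # copies all 8 blocks

disk[2] = "C2"                                    # guest writes to block 2
disk[5] = "F2"                                    # ...and to block 5
delta = incremental_backup(disk, changed_block_ids={2, 5})

print(len(base), len(delta))                      # 8 2
```

Two blocks moved instead of eight: on a multi-terabyte datastore, that difference is what turns an all-night backup window into minutes.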
Lastly, we want to ask: do we want to use agents? Are agents really the right way to go today? If we can meet our requirements and do so in the fastest way possible, we should definitely consider leveraging the platform itself to do our backups. And the bottom line is that we need to have confidence in our restorability. That comes from ensuring application consistency, verifying that our backups are in good shape, and reducing the amount of time it takes to perform our backups.
All of this needs to be done in a way that keeps our virtual machine backup solution in clear alignment with the technology direction. That can mean agentless on the host, agentless in the guest, and full support for all of the hypervisor offerings today. In terms of hypervisor, that direction is clearly ESXi: vSphere 4.1 was announced as the last full edition of ESX, the version with a service console for things such as a host agent. We can take additional steps to improve the quality of our backups by leveraging these capabilities to reduce traffic on our network with a LAN-free backup, or to move backup traffic outside of the guest operating system.
All of these technologies, when we line them up, are truly difference-making compared to how we would protect physical machines. In the graphic below, we have our typical VMware vSphere environment with a mix of different virtual machines; if our backup and restore processes are fully aligned to the virtualization platform, we can significantly change how we back up, not just the mode, but also the speed and the features that go along with it. VMware backups, generally speaking, are used to back up any number of operating systems. While most operating systems in the typical VMware vSphere data center are Windows Server operating systems, there also needs to be support for Linux and other virtual machine types. The backup solution we select needs to fully support everything that's supported within the VMware vSphere platform.
There are three main ways to back up a VMware vSphere virtual machine. The first is an image-based backup, which is the way I'd recommend. Simply put, it can encapsulate the entire virtual machine within an image and then transport it, reproduce the virtual machine, or enable other features that may go along with our data protection strategy.
The agent-based backup approach was popular in the physical server world; it's the same way backups have always been done. It is possible to do these backups within a virtual machine using agent software, but there are a number of additional headaches, or considerations, whichever way you look at it, that may go along with that. This includes having to update and manage the agent software within each virtual machine, which may also incur licensing costs and compatibility concerns. The agents may also bog down the Ethernet network. That may not be a problem in the physical world, but in the virtual world, where everything is consolidated, we may have network contention if the network is flooded from a single host with a number of virtual machines performing their backups at the same time. We can throw more ports at it, or we can perform the backup more efficiently.
The last style of backup I've identified here is a file-level backup. While this may be good for certain things, like a file server, it's not really the best for all situations. It only moves files and may not account for application consistency, or for any applications at all.
Each of these styles are different, and in the end we want to select an approach that best meets our requirements. I recommend an image-based backup simply because we can encapsulate the entire virtual machine and centralize all of our communication to something like VMware vCenter. This centralized management component of vSphere will allow us to coordinate all of the backup jobs centrally versus managing individual configuration items within each virtual machine.
Within each virtual machine, there are a number of design considerations that can impact how we approach our virtual machine backups. One is the size of the virtual machine: larger virtual machines are harder to work with, and they may also be harder to back up. The operating system may be a factor as well. We may be able to use all the smart features of something like a Windows Server virtual machine, which has capabilities such as the Volume Shadow Copy Service, but we may not be able to do the same on a Linux virtual machine, for example. Our backups may also be impacted by the virtual machine environment. This includes the source storage as well as the target storage where our virtual machine backups are going to go.
Simply put, a virtual machine is a collection of virtual resources: CPU, disk, memory, and network interfaces. On top of those resides an operating system and applications, and those all sit within a VMware virtual machine running on a physical ESX or ESXi server. These virtual machine characteristics can vary widely across different configurations and environments. Another way to put it is that all virtual machines are not created equal. All of these factors are part of our strategy to back up VMware virtual machines.
Some additional considerations include how the storage is provisioned. For example, one of the features VMware allows is a raw device mapping, or RDM. There are two types, a physical mode and a virtual mode, and we have to ensure that our backup strategy can accommodate that type of configuration. Further, a virtual machine may have an iSCSI initiator configured. The iSCSI initiator, available in a number of operating systems, presents an iSCSI storage resource directly to the virtual machine over the Ethernet network; the guest connects directly to the iSCSI storage processor, where a specific target is made available to that virtual machine. If we have data on that iSCSI volume that we need to protect, we need to make sure our backup solution can address it, or go about provisioning storage a different way.
The next thing I want to talk about is the separation between source and target within the virtual machine backup strategy. Simply put, the source is the shared storage or direct-attached storage where the virtual machines live; this can be called primary storage or production storage, depending on how the environment is named. What we want to do is put the backups, of course, on a different storage resource, which I may refer to as a target or backup repository. We want to put as many domains of failure between these two environments as possible. Simply put, it doesn't protect us very well if our backups are on the same VMFS volume as the virtual machine in the event of a SAN failure. By putting the backup repository on a different storage resource, our data protection strategy can accommodate a failure, or other inaccessibility, of the shared storage resource where the virtual machines live.
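A quick way to reason about this separation is to compare the failure-domain attributes of the source datastore and the backup repository. The sketch below is purely illustrative; the attribute names and storage labels are made up for this example, not taken from any real inventory.

```python
# Toy failure-domain check: if the backup repository shares an array,
# fabric, or site with the VM's datastore, one failure can take out
# both the production data and its backups.

def shared_failure_domains(source, target):
    """Return the attributes (array, fabric, site) both storages share."""
    return {k for k in source if source[k] == target.get(k)}

vm_datastore = {"array": "san-01", "fabric": "fc-a",    "site": "hq"}
repo_bad     = {"array": "san-01", "fabric": "fc-a",    "site": "hq"}  # same SAN!
repo_good    = {"array": "nas-02", "fabric": "iscsi-b", "site": "dr"}

print(shared_failure_domains(vm_datastore, repo_bad))   # all three shared
print(shared_failure_domains(vm_datastore, repo_good))  # set() -- fully separated
```

An empty result means every layer is separated; anything left in the set is a single point of failure shared between production and backups.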
This configuration can exist in a number of different ways. Basically, the way I approach it is to put in as many domains of failure, as many levels of separation, as possible. In this graphic, we have a generic four-host cluster with a number of virtual machines in it. The shared storage resources, presumably VMFS, would be on a storage processor that's zoned to those ESXi hosts. On the right, we have our backup software, and the green volumes are the backup repository where our virtual machine backups go. For the best level of separation, this would be on a separate storage fabric as well as a separate storage resource.
So the production environment may be a Fibre Channel SAN, with VMFS volumes zoned to the hosts and the virtual machines living within them, while the backup software accesses a separate iSCSI storage network that receives the backups on a separate volume. This separation could even go so far as to be in another building, or as much separation as is otherwise possible. In this configuration, the backup data is protected from a failure within the VMware vSphere environment. This is critically important. It's too easy to just set up a job, run the backups to a certain place, and then discover that we've overlapped our domains of failure so that one failure can knock out both our production and our backup environments. So don't underestimate the value of this separation. Like I said, this can include additional storage resources.
If additional storage resources aren't an option, one approach is to separate within drive trays. On modular storage, for example, the primary virtual machines could live on a tier-one storage resource, like a pool of SAS drives, while the next drive shelf holds the lower-tier storage, maybe a SATA drive pool, where the backups go. That is one level of separation where full separation is impossible. We also want to make sure the storage resource supports attaching those drive shelves to a foreign controller should the primary controller fail; that way the data is preserved and accessible by another storage processor if needed.
There are a lot of design considerations that go into this, and it's very difficult to make a broad recommendation. But my goal is to really ensure that we have the options identified and that we just ensure that we remember to include as many separation layers as possible between the primary storage or production storage environments and our backup targets.
One of the best practices for VMware backups is how we organize our jobs. Every tool has a backup job that backs up the VMware vSphere virtual machines to the backup repository. The terms may change, but basically, the fundamental unit is a backup job. That hasn't changed across generations of products, and VMware virtualization hasn't materially changed that either. One way that I like to do this is to have the source, the backup software, and the backup target all on the same page. What I mean by that is to have an intuitive flow of these jobs from the source to the target.
This can leverage a number of different organizational attributes. Basically, we want to have the same name for everything, and on top of that, we want to use self-documenting nomenclature. This way, we can avoid a number of different problems. It doesn't really help us to look through the backup software and see something called "Job 1". If we called it "prod-G1-Job1," then we know that's the group 1 job for production virtual machines.
Further, we can leverage things like a VMware vSphere folder that can function as a container for our backup jobs. We don't necessarily escape the granularity of individual VMs being backed up, but we can organize more appropriately that way. Further, depending on how the backup software works, organic growth of new virtual machines being added to the environment can be backed up if we simply backup the folder. A good example is if all virtual machines in a production class that need to be backed up are placed in the designated folder, as that backup job progresses, the net new virtual machines will be backed up according to the schedule. This approach can trickle down through the backup software and the backup target.
I mentioned the VMware vSphere folder, and here in this screenshot, we see a "prod-G1" folder, and I have a lot of different virtual machines lined up in there. And then there's a "prod-G2". That might mean group 1, group 2. But effectively, the folder is the organizational unit that will contain these virtual machines. This view is accessed by going to the VMs and templates view within the vSphere client. I like to refer to the VMware vSphere folder as one of the best organizational tactics that aren't really used that often within VMware vSphere. The folder can have permissions assigned to it also, so be careful when assigning complicated folder structures. I recommend folders over DRS resource pools, because resource pools aren't really an organizational metric. Resource pools are a resource governor. Don't find yourself using resource pools as folders. Use folders as folders and use resource pools as resource governors.
Further, folders can be nested, so we can have additional granularity in these virtual machine organizational tactics. And if the virtual machine backup solution can read these folders, it will back up the virtual machines within these, including the net new. So, a good way to think about it as part of the virtual machine deployment process is to put it in the right folder that would get backed up. We can tweak these, move them around, all that kind of good stuff, as long as the backup software is aware of the folder and the objects within it.
This same logic can apply to the backup target. Depending on how we address our backups, we want to make it simple yet separated. So if we have a job named whatever, "prod-G1", we want to have a backup share with the same name. That way we're self-documenting that this job, "prod-G1", which stands for production group 1, is backed up with this VMware vSphere folder backup job and is backed up to this target maybe by the same name. By doing that, we self-document our backups from source, the VMware vSphere environment, to the backup software to the share or the target that our backups go. And depending on our configuration, it could be a network resource. It could be a local disk. It could be a virtual machine with a dedicated VMDK. In all cases, we want to make sure that we have our backups clearly documented in terms of their name so that the application flow is very clear from source to backup software as a target.
And the concept of a container applies here as well. We've used the concept of a container with the VMware vSphere folder. That same process can be applied in the backup target. By having a share or a folder name that is the same name as the backup job, the same name as the VMware vSphere folder, the backup share can additionally function as a target. In the same sense that we have the folder on the VMware vSphere environment functioning as a container for the virtual machines, which then leads to the backup job with the same name, we can then make the backup target have a share or folder name of that name to make sure the backups follow that same logic.
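The folder-as-container flow described above can be sketched in a few lines. This is a conceptual model only: the folder names, VM names, and helper function are invented for illustration and do not represent a real vSphere or backup-product API.

```python
# Sketch of folder-driven job organization: the backup job targets a vSphere
# folder rather than a fixed VM list, and the job, folder, and target share
# all carry the same self-documenting name ("prod-G1").

folders = {
    "prod-G1": ["db01", "mail01", "dc01"],
    "prod-G2": ["web01", "web02"],
}

def resolve_job_members(folder_name):
    """Expand the job's folder container into the current VM list at run time."""
    return list(folders.get(folder_name, []))

# One name flows from source folder to job to target share.
job = {"name": "prod-G1", "source_folder": "prod-G1", "target_share": "prod-G1"}

print(resolve_job_members(job["source_folder"]))  # ['db01', 'mail01', 'dc01']

# A newly provisioned VM dropped into the folder is picked up automatically
# on the next run -- the organic growth the lesson describes.
folders["prod-G1"].append("app02")
print(resolve_job_members(job["source_folder"]))  # now includes 'app02'
```

Because membership is resolved from the container at run time rather than baked into the job, provisioning a VM into the right folder is all it takes to get it protected.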
Another critical design element of a virtual machine backup solution for VMware virtual machines is to ensure that we have application consistency. Simply put, application consistency is the confirmation that the application is in good shape for our backups. The opposite of this is a crash consistent backup. Crash consistent sounds like a good situation, simply because it has the word consistent in it, but it is actually a bad backup state. It's effectively the same as pulling the power out of a physical server while it was running. Application consistency, on the other hand, will ensure that the application is properly prepared for the backup. Sometimes that's referred to as quiescing the application.
This is most commonly done with something like volume shadow copy service, which is explained in detail in one of the other Backup Academy lessons. The takeaway is that the applications on a virtual machine are backed up in a correct state with application consistency. When application consistency isn't available from a tool, such as volume shadow copy service, the virtualization administrator can do a number of tricks to ensure that we get that application consistency. This could be scripting options, scheduled tasks, pre and post steps, stuff like that, that can ensure that the application is in a proper state to be backed up.
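The pre and post script pattern mentioned above can be captured in a small wrapper. The hooks below are stand-ins, an assumed freeze, snapshot, and thaw step, not real application commands; a real script would call the database's or application's own flush and resume mechanisms.

```python
# Sketch of a pre/post quiesce wrapper for when VSS-style quiescing isn't
# available: freeze/flush the application, take the backup, then always
# thaw -- even if the backup step fails.

log = []

def run_with_quiesce(pre, backup, post):
    """Run the pre-freeze hook, back up, then always run the post-thaw hook."""
    pre()
    try:
        return backup()
    finally:
        post()  # resume the application no matter what happened

result = run_with_quiesce(
    pre=lambda: log.append("freeze: flush app buffers"),
    backup=lambda: (log.append("snapshot + copy"), "ok")[1],
    post=lambda: log.append("thaw: resume app"),
)
print(result, log)
```

The `finally` clause is the important part of the design: a failed backup must never leave the application frozen.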
One of the other critical best practices for backing up VMware virtual machines is to ensure that the virtual machine is recoverable. This is different than backup verification. Simply put, the virtual machine may be backed up, yet not in a good state. A couple of different situations can make this happen. Basically, if a virtual machine has high uptime, some errors may not manifest until it's rebooted, things like a registry issue that only appears after a restart. And if that uptime outlasts the retention policy, every restore point we still hold may contain the latent problem.
Further, we want to make sure that the applications are backed up correctly. The difference between recoverability and verification of a backup is usability. Simply put, we may be able to back something up, yet it's not usable. And we may be able to verify that it's backed up, but we don't know that it's actually usable until we test that recovery. This can mean doing a test restore in an isolated environment, booting up the virtual machine, and making sure everything's okay, or we could put some automation around that and do it a little bit smarter. This can be done a number of different ways, and when we select a backup tool, we want to ensure that recoverability is central to how virtual machines are backed up.
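That automated restore-and-probe idea can be sketched as a small loop. Everything here is hypothetical, the fake backup record, the boot function, and the probes stand in for a real isolated restore, a boot check, and an application query; no real product API is implied.

```python
# Sketch of an automated recoverability test: restore to an isolated sandbox,
# boot, probe that the OS and the application actually respond, then tear down.

def verify_recoverability(backup, boot, probes):
    """Return True only if the restored VM boots and every probe passes."""
    vm = boot(backup)                 # restore + power on in isolation
    try:
        return all(probe(vm) for probe in probes)
    finally:
        vm["running"] = False         # always tear the sandbox down

fake_backup = {"name": "db01", "bootable": True, "services": {"sql": True}}
boot = lambda b: {"running": b["bootable"], **b}
probes = [
    lambda vm: vm["running"],               # did the OS come up?
    lambda vm: vm["services"].get("sql"),   # does the application respond?
]
print(verify_recoverability(fake_backup, boot, probes))  # True
```

The key point the sketch makes is the distinction in the text: verification confirms the backup file exists and is intact, while the probes confirm the restored machine is actually usable.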
In today's virtualization environments, we have a number of different applications and operating systems to deal with. We need to make sure that when we organize our backup jobs and we arrange our backup solutions, we ensure that all of these different applications and operating systems are accounted for. Simply put, things are a lot easier when it's only Windows that we're dealing with. But chances are we have some amount of Linux or other operating systems that may need to be protected to the same level as the Windows systems.
There are a number of tools that can help us accomplish this goal. Of course, in a Windows world, we like the volume shadow copy service to ensure application consistency. But when we have other platforms or if the application on the Windows virtual machine doesn't support VSS, we'll have to alter the geometry a bit to get our virtual machine protected to our requirements. This can include things like pre and post scripts or organizing backup jobs by different operating system type or recovery type or backup requirement time frame. For example, if one virtual machine only needs to be backed up once a week, we may put it in a separate job protected to that level. In summary, we have to make sure that each individual workload or class of workload is protected to the best way within the platform and its capabilities.
Another critical design element in how we address our requirements is to ensure critical aspects such as retention, availability, and integrity of our virtual machine backups are met. Simply put, we have to define specifically how our virtual machine data protection strategy is to be delivered. A lot of people, bloggers like myself and backup vendors, are trying to move away from tape. Simply put, a lot of people don't like tape, though a lot of people still do. Tape is very portable and has an inexpensive acquisition cost per unit of media, but it comes with a number of issues that many people don't like.
Disk-based backups are quicker, but we need to ask ourselves: What is the retention media requirement? Is it that it's portable? Is it that it's off-site? Is it that it's read-only? Whatever these requirements are, we need to make sure that we address it in the best way. Tape is still a good option for some environments due to bandwidth considerations or the easiest way to pick up a backup and move it, whatever the reason may be. But we need to ensure that our backup infrastructure is designed in a way that meets these requirements.
Some different ways of meeting these goals include data replication. What I mean by that is this: take our backup files. The backup repository identified earlier holds a set of files that contain our virtual machine backups and our points in time. That data, the raw backup files, may be replicated through a storage solution or any number of products that can move data from one site to another. That may be an acceptable retention, availability, and integrity solution. Another approach is individual virtual machine replication. This is a little different from the previous solution in that we have runnable, inventoried virtual machines located on an ESX or ESXi server, presumably at another site. We need to ensure that these requirements are very clearly identified when we address these parts of our virtual machine backup strategy.
This is different than the RTO and RPO discussion that we had earlier in that those are focused on recoverability, whereas these are focused more so on the larger protection scheme. Recovery is important as well, don't get me wrong, but we may have additional requirements, depending on the organization, for how our backups are to be protected in this sense. So this quickly becomes a business discussion with the stakeholders of the virtual machine infrastructure. This shouldn't be omitted from the discussions and should be addressed early on, because design principles of the data protection solution are a lot easier to do at the start of a project rather than midstream or as an afterthought.
Our last core topic of today's lesson revolves around job notifications, alerts, and ongoing management. This is really important to know that the backups are operating as planned. Any solution can have interruptions that can affect the ongoing success of a process. Backups are no different in a virtual machine world. So, a number of different solutions can be employed. Ideally, things like an SNMP feed or an email alert that the job is successful or has failed would be a good idea. But we also want to round out the stack to make sure that we have everything addressed.
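A notification layer like the one described can be modeled very simply. The notifier below just collects messages into a list as a stand-in; a real implementation would send email or an SNMP trap, and the threshold value is an assumption for the example.

```python
# Sketch of job-result notifications: after each run, emit an alert so a
# failed job -- or a "successful" one that ran abnormally long -- gets noticed.

alerts = []

def notify(job_name, status, duration_min, expected_max_min=60):
    """Alert on failure, and also when a successful job ran suspiciously long."""
    if status != "success":
        alerts.append(f"{job_name}: FAILED ({status})")
    elif duration_min > expected_max_min:
        alerts.append(f"{job_name}: success, but took {duration_min} min")

notify("prod-G1", "success", duration_min=42)           # quiet: a normal run
notify("prod-G2", "failed: snapshot error", 5)          # alert: failure
notify("prod-G1", "success", duration_min=95)           # alert: too slow
print(alerts)
```

Note the second alert case: knowing what "normal" looks like for each job, as the lesson stresses, is what lets us flag a run that succeeded but took far longer than it should have.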
This could include things like updates, meaning the backup tool identifies that a virtual machine has a bad configuration and needs to be updated. A good example is an old version of VMware Tools or an obsolete virtual hardware version; that may mean we're missing out on an important feature like changed block tracking. These types of interactive notes and log messages are really important for ensuring that our backups operate as planned.
An additional option could also be a virtual machine attribute. That's a nice little visual indicator of a virtual machine status that's visible within the vSphere client. Depending on how we organize our virtual machine backups, we may be able to quickly see within the vSphere client that this virtual machine has been backed up at this time with this job, all that kind of good stuff. The objective of an attribute display of the virtual machine backup status is to allow a very quick, easy to see status of a virtual machine backup. While we can go into the tool, it would be nice also if we have a quick display as an ongoing confirmation as we navigate through the vSphere client. Further, we can do search and that kind of filtering and reporting or whatever based on the attribute status so that we can also have another view into the success of the backups.
Specifically, within our backup tools, we want to make sure that our backup console has a quick and easy view of the status of a backup job. We need to be familiar with the normal operation of a backup job so that we can troubleshoot correctly. It doesn't do us any good if we see that the backup job has failed but we don't really know what normal is. Or maybe the backup job is taking too long or that type of thing. So, we want to make sure that each different type of workload gives us a predictable and expected behavior for each type of job.
So let's recap a bit. In this lesson, we covered a lot of material. We took a minute to talk about VMware vSphere and how it's fundamentally changed the typical data center with virtualization. We looked at the different requirements and how to go about getting those requirements met with different backup strategies. We focused on the difference-making technologies for virtual machine backups. Then we talked about the considerations of the different approaches, mainly image-based backups versus file-level and agent-based backups. We then talked about how the design, configuration, and provisioning of a virtual machine can make a difference in how that virtual machine is backed up.
We then talked about the separation of the virtual machine primary storage, production storage, or source from that of the backup repository or target. This is very important to ensure resiliency and increase the domains of failure between the two. We then talked about the organization of backup jobs and how that's important to have a very intuitive flow from hypervisor to the backup software to the backup target, so things are self-documenting and organic growth is accommodated. We then talked about the importance of application consistency as well as the ability to test the recoverability of a virtual machine.
We then talked about multiple operating systems and applications, and some tips and tricks for backing up these virtual machines when help from the platform isn't available. We then talked about retention, availability, and integrity considerations that go along with our virtual machine backup strategy and aren't necessarily addressed by the RTO and RPO conversations. Lastly, we talked about the job notifications, alerts, and management that go along with ongoing administration of our virtual machine backup solution.