About the speaker
David Davis is Virtualization Review's "How-To Guy" and VMware Evangelist. As a video training author for TrainSignal.com, he has produced more than 10 video training courses, including the popular VMware vSphere video training series.
Disaster recovery of VMware workloads
In this session you will learn:
- Preparing for Disaster
- Types of Disasters
- RTO vs. RPO
- VMware High Availability (VMHA)
- VMware Fault Tolerance (FT)
- VMware Site Recovery Manager (SRM)
- SRM Requirements
Hello and welcome to Disaster Recovery of VMware Workloads. My name is David Davis, and I'll be your instructor for this lesson. You can reach me over on Twitter @DavidMDavis, as well as over on my blog at VMwareVideos.com.
Before we get started on this lesson, first let me tell you a little bit about myself. I'm a previous VMware customer. I've served as an IT manager and a server and network admin with over 18 years of experience in the IT industry. I was also awarded the vExpert award by VMware two years running for my evangelism of VMware virtualization. I'm a VCP4. I've got the new VCAP-DCA certification as well. And then in the past, I've done a lot of work with Cisco where I achieved the CCIE. I've authored hundreds of articles on virtualization around the web, including on websites like Virtualization Review as well as in the print edition of Virtualization Review. I've served as a VMworld speaker and a judge, and I'm best known for my vSphere video training that you'll find over at Trainsignal.com.
In this lesson, we'll talk about why you should prepare for disaster in your virtual infrastructure as well as the different types of disaster that might occur. I'll talk about the difference between RTO and RPO, because they're very critical terms when it comes to planning for disaster recovery. We'll cover the differences between VMware high availability, fault tolerance, and site recovery manager, and I'll go into the requirements that you'll need to put in place if you're going to use site recovery manager. I think those are important to understand, because they vary drastically from, let's say, using a software-based disaster recovery solution. So with that, let's get started.
Let's start this lesson off by talking about preparing for disaster in your VMware virtual infrastructure. Now for very small organizations, many of them, they just struggle even to get their backups done, much less to plan for a total disaster. But there's a lot more to it than just getting the backups completed. You also need to get those backups off site. Now for a small company, that might mean putting those backups on an external USB drive and then physically taking them off site, taking them home, taking them to another company location. For a handful of virtual machines, that may be all that you really need to do. That may be all that the company requires.
But at most companies, from the small businesses moving on up to medium and large enterprises, there's a lot more to disaster recovery of the virtual infrastructure than just making backups. Just making backups is the absolute bare minimum. That backup data needs to be moved off site. You have to get the virtual machine backups off site in case, let's say, there's a hurricane or a tornado that knocks down the entire data center. You also have to prepare to get those backups loaded and running in the time frame that the business requires. Just having them off site, if it's going to take you a week to get everything back up and running, you might lose your job before then, because they'll find somebody else who can do it faster.
You also want to make sure that the backups are as current as the business needs to eliminate the likelihood of lost data. In other words, if you just take backups once a day, but there are a lot of critical transactions that happen during that day, that may just be plain unacceptable for your company. That data may be worth so much money that you need to do something much more frequent. You need to replicate virtual machines, let's say those critical production virtual machines with frequently changing data much more frequently than every 24 hours.
So there are many different types of disaster that you need to prepare for. You could have a single virtual machine that becomes corrupt. You could have a network outage. Let's say you could have a data center power or cooling outage where the whole data center goes down. Or you could have that catastrophic disaster where an earthquake, let's say, hits the data center. And you need to be prepared for all those different types of disasters. But honestly, most disasters are caused by humans, and they're localized to a single site. Maybe a virtual machine is accidentally deleted or it's accidentally corrupted. Unfortunately, those are some of the most difficult disasters to prepare for, because we just don't ever expect those to happen. But localized human-caused disasters still cause companies millions of dollars of losses every year.
Fortunately, VMware admins, like you and me, we have virtualization to step in and help us make the job of backup and disaster recovery just tremendously more simple than it ever was before with physical servers. You now have hardware independence, and you have that complete portability of your virtual machines. You have image-based backup, so you can back up an entire virtual machine and restore that entire virtual machine without having to worry about, let's say, restoring the operating system, restoring a backup agent, then restoring files, and so on and so forth that you had to worry about with physical servers. You also have load-balanced resource pools that can help you to ensure that these virtual machines get the resources they need.
So let's say you have a disaster, you go to your disaster recovery site. You have a couple of ESXi servers. Maybe you can just get some of your most critical servers up and running. But very quickly, you add a few more ESXi servers to that DRS load-balanced resource pool and throw some more virtual machines in there, and very quickly and very easily, DRS load balances those resources and provides you the resources that you need at the disaster recovery site. So, my point is these load-balanced resource pools make the most out of the physical server resources you have, and sometimes when it comes to disaster recovery, those physical resources from the servers are extremely limited. You also have automated replication and disaster recovery tools, like Veeam replication that can replicate individual virtual machines across a wide area network to a disaster recovery site on a set schedule, or even provide you near continuous data replication.
Then at a higher level, you have Site Recovery Manager from VMware. We'll talk about both of these products in this lesson on disaster recovery of VMware workloads.
I already touched on some of the different types of disasters that could occur in any data center, whether you're using a virtual infrastructure or not. But let's now match up some of the specific types of disasters along with the specific products that might help to mitigate those types of disasters. So first off, let's say that you have a single physical server failure. That server, let's say, has a bad CPU. It just crashes, blue screens. It won't come back up. You also could have a physical server that just has one power supply. Power supply goes out. Physical server won't come back up. In that case, products like VMware High Availability and Fault Tolerance would help to mitigate that sort of disaster.
With VM HA, the virtual machines running on that physical server would have to be rebooted on other servers. So it's going to take some time, and there's a potential there for data loss or data corruption. On the other hand, at a much higher scale, VMware Fault Tolerance can provide real-time replication of that virtual machine's memory across two physical servers, so that if one physical server fails, the virtual machine never hiccups. It never loses a beat. And Fault Tolerance is compatible with any operating system supported by vSphere. So it's a very cool feature. There are also a number of limitations around using Fault Tolerance. For example, you have to have specific hardware, and you can only have so many virtual machines being protected by Fault Tolerance at one time.
Then the second type of disaster I have on the list here is virtual machine corruption. So let's say that you try to do a Windows software update. You reboot a virtual machine, and then the virtual machine blue screens, it won't come back up. The operating system is corrupt. It's a real pain but also a common occurrence. In that case, Veeam Backup could simply restore that virtual machine back to its previous state, the state that it was backed up at during the last backup. So you would just run a restore job and get that virtual machine back up and going again.
And then finally, the third type of disaster here would be a data center failure. So a total data center, maybe the data center loses power, it loses cooling, it has to be shut down, or there's a catastrophic event, like an earthquake, that hits the data center. In that case, Veeam replication would be put in place to replicate virtual machines from one data center to another. And then at a higher scale, you've got VMware Site Recovery Manager, which would be much more expensive, thanks to the requirement to have dedicated SAN hardware doing the replication. But it could restore very quickly an entire virtual infrastructure at the backup or secondary data center site. Your disaster recovery site could bring the entire virtual infrastructure back up and running in a matter of minutes.
So those are some of the different types of disasters that could occur in a VMware virtual infrastructure, and you need to be prepared with a variety of different products. Being a superhero, a super VMware admin here, you need to have a variety of tools in your tool belt to help you prepare and your company prepare for any sort of disaster that happens in the virtual infrastructure.
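The matching of disaster types to products described above can be written down as a simple lookup table. This is only a sketch in Python: the dictionary keys and the function name are made up for illustration, and the product lists simply repeat the pairings given in this lesson.

```python
# Disaster types and the products this lesson pairs with them.
# The keys are hypothetical labels, not VMware or Veeam terminology.
MITIGATIONS = {
    "physical_server_failure": ["VMware High Availability", "VMware Fault Tolerance"],
    "vm_corruption": ["Veeam Backup restore"],
    "data_center_failure": ["Veeam replication", "VMware Site Recovery Manager"],
}


def tools_for(disaster_type):
    """Return the products the lesson associates with a disaster type."""
    try:
        # Return a copy so callers cannot modify the table by accident.
        return list(MITIGATIONS[disaster_type])
    except KeyError:
        raise ValueError(f"unknown disaster type: {disaster_type!r}") from None


if __name__ == "__main__":
    for kind in MITIGATIONS:
        print(kind, "->", ", ".join(tools_for(kind)))
```

A table like this is no substitute for a disaster recovery plan, but it is a reminder that no single product covers all three cases.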
When disaster occurs in a data center, whether or not it's a localized disaster or a large-scale catastrophic disaster, and whether or not it's a virtual infrastructure or a traditional physical server data center, there are two metrics that are used to measure recovery, and they are RTO and RPO. So RTO is the recovery time objective, and this is the amount of elapsed time allowed by the business or a service level agreement to recover the data. In other words, how long do you have to get that virtual machine back up and running at the disaster recovery site? Is it 1 minute, 10 minutes, an hour? Do you have 24 hours or even more to get that virtual machine back up and running and available to the end users? The second metric here is the recovery point objective or the RPO. The RPO is the point in time to which the data must be recovered.
So let's say you took a backup at midnight last night, and then the disaster occurred at 10:00 the following morning. You don't have any backups since midnight, so is it going to be acceptable to just restore the backup that was taken at the last backup time, which in this case was midnight? In other words, you're losing 10 hours’ worth of data. That could be 10 hours’ worth of critical transactions, 10 hours of e-mail for thousands of users. Is it acceptable to get that server back up and running and lose those 10 hours’ worth of critical company data?
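To put numbers on that midnight example, here is a minimal Python sketch. The function names and the calendar date are assumptions made up for illustration; only the midnight backup and the 10:00 disaster come from the example above.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup, disaster):
    """Time between the last good backup and the disaster.

    Anything written during this window is lost when you restore.
    """
    if disaster < last_backup:
        raise ValueError("disaster time must not precede the last backup")
    return disaster - last_backup


def meets_rpo(last_backup, disaster, rpo):
    """True if the data lost is within the agreed recovery point objective."""
    return data_loss_window(last_backup, disaster) <= rpo


if __name__ == "__main__":
    # Hypothetical date; the times match the example in the lesson.
    last_backup = datetime(2011, 1, 10, 0, 0)   # midnight backup
    disaster = datetime(2011, 1, 10, 10, 0)     # disaster at 10:00

    print(data_loss_window(last_backup, disaster))                   # 10:00:00
    print(meets_rpo(last_backup, disaster, timedelta(hours=24)))     # True
    print(meets_rpo(last_backup, disaster, timedelta(hours=1)))      # False
```

With a 24-hour RPO the nightly backup is good enough. With a 1-hour RPO it is not, and that is the point where you would have to look at replication instead of daily backups.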
The recovery point objective and the recovery time objective both need to be defined and negotiated with the business. Once you define those and they're part of your disaster recovery plan, it would be considered a service level agreement between the IT department and the business units that helped you to define those objectives. And so different servers, different applications, they can all have different RPOs and RTOs, and this is just something you have to work out, because of course, the less downtime that your organization can tolerate, the more money it's going to cost your organization.
So of course, if you ask, let's say the marketing group or the sales group, "How long is it okay to have the website down for," they’re going to say, "It's never okay to have the website down." You say, "What if we had an earthquake and the data center was destroyed?" And they say, “Well, you know, it could be down for about five minutes." And you could say, "Okay, that's fine, but that's going to cost you $500,000, because we have to buy two SANs, and we have to buy Site Recovery Manager, and all those things." And then they might back off and say, "Well, in that case, then it would be okay if the website was down for 12 hours or 24 hours." And then you could say, "Okay. Well, we could do that for just $100,000, let's say." So these need to be negotiated with whoever the application owners are for your business critical applications, and you need to document them along with your disaster recovery plan.
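That negotiation boils down to picking the cheapest option that still meets the agreed RTO. Here is a rough Python sketch of that idea. The two dollar figures and recovery times are the illustrative ones from the conversation above; the class, the function, and the name of the second option are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional


@dataclass(frozen=True)
class RecoveryOption:
    name: str
    rto: timedelta      # how quickly this option can bring the service back
    cost_usd: int       # rough cost quoted to the business


# Figures taken from the example conversation; names are illustrative only.
OPTIONS = [
    RecoveryOption("Two SANs plus Site Recovery Manager", timedelta(minutes=5), 500_000),
    RecoveryOption("Slower recovery at the DR site", timedelta(hours=12), 100_000),
]


def cheapest_option(required_rto: timedelta,
                    options: List[RecoveryOption] = OPTIONS) -> Optional[RecoveryOption]:
    """Return the lowest-cost option whose RTO is within the requirement."""
    candidates = [o for o in options if o.rto <= required_rto]
    return min(candidates, key=lambda o: o.cost_usd, default=None)


if __name__ == "__main__":
    for rto in (timedelta(minutes=5), timedelta(hours=24), timedelta(minutes=1)):
        choice = cheapest_option(rto)
        print(rto, "->", choice.name if choice else "nothing on the list is fast enough")
```

Relaxing the requirement from 5 minutes to 24 hours moves the answer from the $500,000 option to the $100,000 one, which is exactly the trade-off the application owners need to see.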
Now I want to talk about some of the different products that people commonly associate with virtual infrastructure disaster recovery and high availability. And the first one is VMware High Availability. My point in explaining these different products is so that you understand the purpose of these products, because they have very specific, unique purposes, and they're not just general disaster recovery products. You need to know which tool to use to fulfill the needs of your company, because just because it says VMware High Availability doesn't mean it's just going to keep the servers up and running no matter what sort of disaster occurs. VMware High Availability actually is for a localized physical server outage.
So in the graphic here, you see you've got three servers, you've got a failed server. They're all running VMware ESX. You've fortunately implemented those three servers, put them in a VMware HA enabled cluster. That cluster has a resource pool of different resources, and all the virtual machines are inside that resource pool. So the server in the middle, it failed, and thanks to VMware High Availability, the other two servers are made aware that that server has failed. And then all these virtual machines are stored on the storage area network, and all the ESX servers in the cluster have access to those virtual machines on the SAN. So when the middle server there, in this case, fails, the other two servers are selected to take over running those virtual machines for the failed server.
Those virtual machines are mounted or added to the other servers in the cluster and powered on. And because those physical servers can access those virtual machines on the SAN, the SAN never moved, the SAN never failed, the virtual machines are restarted on other physical servers, and perhaps the end users using those virtual machines experienced one or two minutes of downtime. Just the downtime that it took for the virtual machines to be rebooted. Basically the Windows operating system boot process, it took that long for the virtual machines to come back up. So that's what VMware High Availability does, and it's one of the more low-end disaster recovery HA features available in VMware vSphere.
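The failover just described can be modeled in a few lines of Python. This is a toy model only: it is not how VMware HA actually chooses a host, and the host names, virtual machine names, and "fewest virtual machines wins" rule are all assumptions for illustration. It works only because, as in the lesson, every host can see the same virtual machines on the shared SAN.

```python
from typing import Dict, List


def restart_on_survivors(cluster: Dict[str, List[str]], failed_host: str) -> Dict[str, List[str]]:
    """Return a new host-to-VM layout after one host fails.

    Each VM from the failed host is restarted (rebooted) on whichever
    surviving host currently runs the fewest VMs.
    """
    survivors = {host: list(vms) for host, vms in cluster.items() if host != failed_host}
    if not survivors:
        raise RuntimeError("no surviving hosts to restart virtual machines on")
    for vm in cluster[failed_host]:
        target = min(survivors, key=lambda host: len(survivors[host]))
        survivors[target].append(vm)
    return survivors


if __name__ == "__main__":
    # Three hosts, and the middle one fails, as in the graphic.
    cluster = {
        "esx1": ["web01", "db01"],
        "esx2": ["mail01", "app01"],
        "esx3": ["file01"],
    }
    print(restart_on_survivors(cluster, "esx2"))
    # {'esx1': ['web01', 'db01', 'app01'], 'esx3': ['file01', 'mail01']}
```

Note that the model says nothing about the minute or two each restarted virtual machine spends booting. That downtime is the difference between High Availability and Fault Tolerance.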
So moving from VM HA, the next feature up is VMware Fault Tolerance. You'll notice that the diagram here for VMware Fault Tolerance looks very similar to the diagram for VMware High Availability, and it works very much the same. But there is one huge difference, and that is that instead of the virtual machines protected by Fault Tolerance having to be rebooted, like they have to be rebooted with VMware High Availability, those virtual machines actually never miss a beat. The end users can keep using those virtual machines the whole time, even though they were running on a physical server and that physical server has completely failed. Someone could just walk up and pull the power cord on it, and those virtual machines would just keep on running. That's because those virtual machines were protected with Fault Tolerance.
Fault Tolerance is a much more high-end disaster recovery or high availability service from VMware. Actually, it's a feature of the higher-end editions of vSphere. And what it's doing is you select the virtual machines that you want to be protected. Most likely, you're not going to be selecting every virtual machine. The lesser priority virtual machines would be protected with VMware High Availability, but the most critical virtual machines would be protected with VMware Fault Tolerance. So VMware Fault Tolerance is actually replicating the memory of the virtual machine from one physical server to another, so that if one of those physical servers fails, the other physical server can just keep on chugging, keep on processing that virtual machine, and the virtual machine never loses a beat.
There are very specific hardware requirements for Fault Tolerance. You have to have certain CPUs, and it's not just the CPUs that are on the VMware hardware compatibility list. You have to make sure that yes, they're on the hardware compatibility list, but they also have to be compatible with VMware Fault Tolerance. For example, I have a server that will run vSphere, but it won't work with Fault Tolerance. So I purchased two Dell T610s that have Xeon 5500 series processors. They are on the Fault Tolerance hardware compatibility list, and I'm able to test and demonstrate Fault Tolerance with those Dell T610 servers. Of course, there are many other brands of servers on that compatibility list, many other brands of servers that work with Fault Tolerance. Also, like I said, you're not going to protect every virtual machine with Fault Tolerance. It's just for very specific, very high-end, business critical virtual machines.
So we just talked about VMware High Availability and VMware Fault Tolerance. Both of those services are for localized physical server failure. So they're high availability solutions. You might consider that a disaster if a physical server fails, but that's all that they're good for to be honest with you. It's a great function. It's a very necessary function in the virtual infrastructure, but that's all that they're going to do. They're not going to work across a wide area network. They're not going to work to fulfill restoring an entire virtual infrastructure at another data center when there's a total catastrophic disaster. They're just for individual physical server failure protection.
So now we can move on and talk about VMware Site Recovery Manager, or SRM, which is an additional software application outside of the typical VMware vSphere feature set. So VMware Site Recovery Manager, or SRM, is a software application you would load, along with vSphere and vCenter, and it would require SAN-based replication. So you would have two data centers at minimum. You've got a primary data center and a backup data center, or a primary data center and a disaster recovery site, let's say. And each of these two data centers has their own completely independent, fully functional virtual infrastructure that includes physical servers, storage area network, VMware vSphere, vCenter, Site Recovery Manager, and at the primary data center, you've got your virtual machines. So those are your production business critical virtual machines.
Then you use VMware Site Recovery Manager to protect those virtual machines should a catastrophic failure occur at the primary data center. But don't misunderstand. Site Recovery Manager isn't going to replicate any data. It isn't going to copy the virtual machines from site A to site B. It's going to depend on the storage area network, the SAN-based replication that you're going to have to purchase to replicate those virtual machines. It's just going to basically run the recovery plan that you create. So, the storage area network's going to have to replicate that data across a wide area network. You're going to have to have the bandwidth to do it. It's going to replicate individual LUNs regardless of which virtual machines are on those LUNs. You just specify the LUNs.
Then, Site Recovery Manager is going to be used to specify which virtual machines from those LUNs should be brought up at the disaster recovery site when a disaster occurs and in what order those virtual machines would be brought up. So that's VMware Site Recovery Manager, and it's one of the most high-end disaster recovery products you can purchase. There still will be some downtime, don't get me wrong. There still will be some downtime for those virtual machines, because they're going to have to be restarted.
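The split of work described here, where the SAN replicates whole LUNs and the recovery plan picks the virtual machines and their startup order, can be sketched in Python. This is a toy model and not the Site Recovery Manager API: the function name, LUN names, virtual machine names, and numeric priorities are all hypothetical.

```python
from typing import Dict, List


def build_power_on_order(replicated_luns: Dict[str, List[str]], plan: Dict[str, int]) -> List[str]:
    """Return the VMs in the plan, ordered by priority (1 starts first).

    replicated_luns maps each replicated LUN to every VM stored on it.
    plan maps only the VMs you want recovered to their priority.
    """
    available = {vm for vms in replicated_luns.values() for vm in vms}
    missing = sorted(vm for vm in plan if vm not in available)
    if missing:
        # A VM that is not on a replicated LUN cannot be recovered at the DR site.
        raise ValueError(f"VMs not on any replicated LUN: {missing}")
    # Sort by priority, then by name so the order is predictable.
    return sorted(plan, key=lambda vm: (plan[vm], vm))


if __name__ == "__main__":
    luns = {"lun01": ["dc01", "db01", "test01"], "lun02": ["web01"]}
    plan = {"web01": 3, "dc01": 1, "db01": 2}
    print(build_power_on_order(luns, plan))  # ['dc01', 'db01', 'web01']
```

In this example test01 is replicated because it happens to live on lun01, but it is never powered on at the disaster recovery site because it is not in the plan.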
Of course, there are even higher availability solutions for virtual machines so that they have no downtime, like VMware High Availability Stretch clusters. That's something you can go and Google. Scott Lowe has a great presentation on VM HA stretch clusters. It's not something I'm going to get into in this lesson, because it’s just a very extreme case, and it's not necessarily the best practice. But it is the highest availability disaster recovery solution you can get.
VMware Site Recovery Manager is probably the best solution for most enterprises. Site Recovery Manager, like I said, isn't going to be for the SMB, because you have this requirement to have SAN-based replication. Speaking of those requirements, let me briefly touch on some of the Site Recovery Manager requirements that you need to ensure you have if you plan to use VMware SRM. VMware Site Recovery Manager being one of the more high-end solutions for virtual infrastructure disaster recovery has a number of very strict requirements that can also be quite expensive.
So first off, you must have storage area network hardware, or SAN hardware, at each site with the replication license that allows you to replicate or do hardware replication from one SAN to another SAN across a wide area network or a secure VPN tunnel. You also must have, like I said, that secure network between the two DR sites, and that network must have enough bandwidth to perform real-time synchronization of the changes that are made to the SAN LUNs where the virtual machines are stored. So that's another critical piece of this is if you don't have the bandwidth available, or if you have too much latency between the sites to handle the amount of transactions or megabytes or gigabytes or terabytes even that you'll be replicating, then you're going to run into problems, and you're going to have to spend more money to make that possible.
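A quick back-of-the-envelope check for that bandwidth requirement might look like the following Python sketch. The function names, the 70 percent usable-link figure, and the sample change rate are assumptions for illustration. The sketch uses decimal units (1 GB = 1000 MB) and ignores latency and compression, so treat the result as a first estimate and not as a sizing tool.

```python
def required_mbps(changed_gb_per_hour: float) -> float:
    """Average megabits per second needed to ship the changed data as it is written."""
    # GB -> gigabits (x8) -> megabits (x1000), spread over 3600 seconds.
    return changed_gb_per_hour * 8 * 1000 / 3600


def link_can_keep_up(changed_gb_per_hour: float, link_mbps: float,
                     usable_fraction: float = 0.7) -> bool:
    """True if the usable share of the WAN link covers the average change rate.

    usable_fraction is an assumed allowance for protocol overhead and
    other traffic sharing the link.
    """
    return required_mbps(changed_gb_per_hour) <= link_mbps * usable_fraction


if __name__ == "__main__":
    change_rate = 45  # GB of changed LUN data per hour (hypothetical)
    print(required_mbps(change_rate))            # 100.0
    print(link_can_keep_up(change_rate, 100))    # False
    print(link_can_keep_up(change_rate, 1000))   # True
```

If the check fails, you are in the situation described above: either the replication falls behind, or you spend more money on bandwidth.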
Your SAN hardware must meet the vSphere requirements even just to work with VMware ESX or ESXi. And then it also has to meet the VMware SRM compatibility matrix that you'll find at the VMware support website. Again, it's very critical to ensure that you have the pieces you need. Many things like, let's say virtual storage appliances, just aren't going to be compatible with VMware Site Recovery Manager, and you'll need physical storage area network hardware and licenses to make this happen. Finally, you'll need vSphere 4 and vCenter 4 on your physical servers, and then you also must purchase VMware Site Recovery Manager Version 4 from VMware.
I talked about how VMware Site Recovery Manager requires that you use SAN-based replication, but you can actually use SAN-based replication on its own, of course without Site Recovery Manager. You just don't have an automated disaster recovery plan for your virtual infrastructure. SAN-based replication is hardware-based replication that's inside your storage array, and it may require a separate license other than the software licenses you have now for your SAN. Many times it's called CDP, or continuous data protection, when you use SAN-based replication. It requires expensive hardware and the CDP licenses at both sites. Keep that in mind. You can't just replicate to some low-end SMB storage array from your high-end storage area network at the primary site. Usually, you'll have to have the same storage area network at each site with the same license at each site to make SAN-based replication possible.
Also, SAN-based replication may require you to replicate the entire SCSI LUN, not just the individual virtual machines. So, you might have to replicate things that you don't want or don't need, and those things can take up a lot of bandwidth. Also, you'll want to consider how often you're replicating your virtual machines across the wide area network, because the most frequent replication possible, of course, will also be the most expensive replication possible. VMware's Site Recovery Manager, like I said, requires SAN-based replication, but you need to make sure that your SAN and your SAN-based replication are even going to be compatible with VMware's SRM.
Now let's review what we've learned in this lesson. We started off by talking about the importance of preparing for disaster in your VMware virtual infrastructure. I discussed how there's a lot more to it than just taking backups of virtual machines. You also need to ensure that those backups are taken off site, and replication is a great option to do that. Replication is also used for more critical virtual machines to ensure that they get back up and running quickly after a disaster. From there, we discussed how there are various types of disasters that you could be hit with in the virtual infrastructure. Those range all the way from a single physical server failure to a total catastrophic loss of your entire data center.
There are different products available for each of those different types of disasters, including VMware High Availability, VMware Fault Tolerance, VMware Site Recovery Manager, and software-based replication solutions. From there, we compared the difference between the recovery time objectives, or the RTO, and the recovery point objectives, or the RPO. Both objectives need to be defined between IT and the business owners of your company, and this will happen through a negotiation where you weigh the needs of the company versus the cost of the recovery solution. Based on the RTO and the RPO that are chosen in those negotiations, that's going to dictate the solution that you choose to prepare for disaster in the virtual infrastructure. In other words, that's going to help you to choose the right tool for the job.
After that, we moved on and discussed how VM HA and VMware Fault Tolerance are local server failure solutions that don't solve disasters across an entire site and how Site Recovery Manager and replication would be the solutions that you would choose in that case. VMware Site Recovery Manager is an excellent disaster recovery solution, but it has its own set of requirements and likely a rather high cost. Those requirements focus around SAN-based replication or hardware-based replication of your data. That hardware-based replication will likely only be able to replicate entire SAN LUNs, not individual virtual machines. And the cost of that SAN at both the primary and the secondary data centers, the replication licenses, VMware Site Recovery Manager license, and the bandwidth for real-time data replication of your data is going to be very high.
I hope that you learned a lot about disaster recovery of your VMware workloads in this lesson. Thanks for watching.