About the speaker

MVP, vExpert
Senior Partner at Concentrated Technology

Greg Shields is an independent author, speaker, and IT consultant, as well as a Partner and Principal Technologist with Concentrated Technology.

Multi-site clustering for Hyper-V disaster recovery

Short overview

In this session you will learn:

  • What makes a disaster
  • Multi-Site Hyper-V vs Single-Site Hyper-V
  • Constructing Site-Proof Hyper-V
  • Replication Processing Location
  • Target Servers and a Cluster
  • Multi-Site Cluster and Quorums
  • Multi-Site Cluster Tips/Tricks

Transcription

Hello, and welcome to another learning experience here with Backup Academy, this time on multi-site clustering for Hyper-V disaster recovery.

My name is Greg Shields, and I'm an MVP and a vExpert. I'm also a senior partner with Concentrated Technology, that's concentratedtech.com. I've been asked today to come and just sit down with you for the next 45 minutes or an hour or so, and talk about this whole notion of getting Hyper-V servers brought together into a cluster.

If you work with clustering, you know that clustering all by itself can be a challenging activity. When you take the basics of clustering and you expand them into this whole notion of multi-site clustering, you add that much more complexity over the top.

I think a lot of people out there are really confused about how Hyper-V can be constructed in a cluster, first and foremost, and then secondly how you can take that cluster and extend it to a second site.

Before we get there however, a little about me. As I said, my name is Greg Shields, and I'm with Concentrated Technology, that's concentratedtech.com. I've been doing IT for 15 years or more, across the breadth and depth of IT.

Literally: administration, consultancy, and lately I've been doing a lot of analysis, speaking, column writing, and those sorts of things. I've had an opportunity to see the entirety of the IT landscape, both from the vendor's side and from the analyst's side, even from the administration side, sitting in and getting real work done in IT.

What we're here for is to really come to grips with how we might implement Hyper-V today in such a way that might protect us from a disaster.

Now, let's step back for just a moment. What makes a disaster?

Let's have a little fun. Which of the following would you consider to be a disaster? A naturally occurring event like a tornado, a flood or a hurricane that impacts your data center, causes damage? That might probably be a disaster.

What about a widespread incident like water leakage or a long-term power outage that interrupts your data center functionality? That might be one as well. Maybe a problem with the virtual host. You get a blue screen and it ceases the processing on that server. That's not real great.

Or a bad piece of code causing problems with a service, or a power connection problem, that you lose a server, an entire rack. All of these are very obviously, well, they're not that great, but not all of these are really the classic disaster in terms of how you would think of the more traditional disasters.

The ones on here on the top, we have a natural, big widescale event like a tornado, flood, hurricane, that's a big deal. So are these widespread events that are very data center centric and data center wide impacting.

What's important to differentiate here is that some of these events are indeed disasters, and some of them are just a bad day. Two of them affect a host or a machine, as opposed to the entirety of your data center.

This is an important differentiation, because you're going to make different decisions and you're going to use very different technologies to protect you against the bad day than you are against the wholesale disaster.

Now, I say this because your decision to declare a disaster and to move to disaster operations is really a major one. The things that you put into place, the protections, the technologies, even the procedures, are going to be different for "a server goes down" or "a rack goes down" as opposed to "the data center is underwater."

The technologies you use for DR are very different, and they're more complex. They tend to be more expensive, and they require more moving parts than the ones you use for high availability.

Most importantly also is the fact that the fail over and very specifically the fail back processes involve a lot more thought. With many of the solutions today, particularly in the Hyper-V space, you might not be able to fail back with just a click of a button.

You have to really determine whether or not you're going to declare that disaster before you do, because once you do, you set in motion a huge chain of events, and it may be difficult to get yourself back into normal operations.

What's particularly interesting about Hyper-V and the single site configuration as opposed to the multi-site configuration is that they both kind of really look the same. To Hyper-V, the multi-site configuration doesn't look all that much different from a single site configuration.

Microsoft does not do a really good job of explaining this fact. You've got to have some hosts, and you've got to have some network and some storage, and then your VMs just migrate around. In some ways, multi-site Hyper-V is a natural extension of single site Hyper-V.

There are some major differences, though. You have the ability to live migrate across sites, but you do have to have two different data stores. You have to have storage on both sides, and you have to have some replication mechanism in place in order to keep those sites replicated.

Today, with Windows Server 2008 R2, we can also have sites that are on different subnets, too. While that first and foremost might seem like a great thing, it introduces some interesting issues that you have to be aware of.

Not the least of which is that when your servers fail over to a secondary site, they're going to change IP addresses. Your clients need to know where your servers are going to go.

All of these things you have to prep for whenever you're prepping for your DR plan.

I will argue that if you pore through all the Microsoft documentation and all the blogs and whatnot, really constructing a site-proof Hyper-V requires three really simple things.

You have to have storage. You have to have replication, so some way to take the storage that's at your main site and get it over to your backup site. Then you have to have a set of target servers on a cluster on the other side to receive those virtual machines and their data.

That's really it. Once you have these three things, layering Hyper-V on top of them is relatively easy. You'll find in this presentation I'm not going to spend too much time talking about the actual button pushing inside of Hyper-V, because if you know how to create a Hyper-V cluster, you've got most of what you need in order to create a multi-site Hyper-V cluster.

There are a lot of architectural level things that you need to really understand before you really start clicking those buttons.

Here's a picture of just what I talked about. Over on the left, we have our primary Hyper-V hosts, and they are attached to some sort of storage device. There are some storage processors there.

Then we also have this back-up site over here with some target servers, and another storage device and replication mechanism.

What you do is you essentially create an environment upon which your virtual machines can fail over to, should you lose that primary site. Really, that's all that multi-site Hyper-V really is.

With this in mind, let's take a look at these three things. The storage mechanism, the replication mechanism, then also the target servers, with a little bit more detail. Because there are some kind of unique characteristics that you have to plan for, because different solutions are going to approach this architecture in a much different way.

Let's start with the storage mechanism. When we talk about storage, we're typically talking SANs, and we're typically talking SANs in two different locations. They can be Fibre Channel, iSCSI, Fibre Channel over Ethernet, an array of USB drives, whatever kind of SAN makes sense for you.

You're going to need to have two of them obviously, one on the primary site and one on the secondary site. These tend to be similar models and manufacturers. The requirement for them to be exactly the same is going away these days, depending on the manufacturer and the products you select.

Some similarity is generally needed, even if an exact match is not required, so that the replication can occur appropriately. I'll talk about replication and some of the things you have to be aware of, including the different layers at which replication can occur, in just a little bit.

That backup SAN doesn't necessarily need to be the same size or even the same speed as the primary SAN. I put down here at the bottom, DR environments are typically where old SANs go to die.

You've brought in this brand new shiny SAN, and you've got to do something with the old one, let's go stick the other one out in Poughkeepsie or wherever the backup site is.

It's a little slower, it's got a little bit less storage, but it can serve as the disaster recovery site in case we lose the primary site.

Now storage is one thing obviously, but the next piece of this puzzle is the replication mechanism, the way in which you get the data transferred from one site to the other. Replication has to occur between those two SANs.

What's interesting and is probably not very well known when it comes to Hyper-V is this is not something that Microsoft does. You can throw out the different Microsoft technologies for replication, like DFS, that are commonly known, but these are not replication mechanisms that will work for a multi-site cluster.

You have to have some sort of third party solution in order to be able to do this. These solutions typically have one of two different types of modes. One mode is synchronous replication. The change gets made, and then it gets acknowledged at the other side.

The other mode is asynchronous replication, where changes are made, queued up, and then sent in a batch over to the other side.

You may think that this doesn't have really much to do with what you're doing with your multi-site construction, but the determination of which one you're going to use here is really, really important.

Let's look at synchronous replication. Here are those two storage devices, there's the one at the primary site on the left, on the right is the backup site. You can see here I've got an ordered list of activities that occur whenever a change occurs.

It starts out there with the change committed at the primary site. Once that change is committed, the primary site then talks to the secondary site and says, 'Hey, I just made a change.' Then the change has to be committed at the secondary site, and the secondary site has to holler back and say, 'I got it.'

Now, this acknowledgment from the backup SAN is really important in synchronous replication. This ensures that you have no loss of data, none. Because the next change will not occur on the primary site until the previous change is acknowledged from the backup site.

Very obvious impacts there. If your backup site takes a period of time in order to announce that change has occurred, then your primary site is sitting waiting for that. Synchronous replication, great for assurance of data, maybe not so great for performance. Particularly as the SANs get separated by large distances.

Now contrast this with asynchronous replication. With asynchronous replication, we don't have to have that acknowledgment with every single change that occurs. That allows us to queue up those individual changes as they occur on the primary site and, when it makes sense, replicate them over to the secondary site.

Obviously if I lose my primary site, if I get the nuclear bomb that hits, the tornado that hits, the tsunami that impacts my data center, whatever those queued up changes are, they're going to get lost.

I lose a little bit of data with asynchronous replication, but what I gain is I'm not bound by the performance of that acknowledgment from the backup site.
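To make the trade-off concrete, here is a minimal Python sketch of the two modes. It is a toy model only, not any vendor's replication engine: the `ReplicatedVolume` class, its method names, and the block labels are all hypothetical, and the network round trip is reduced to a simple list append.

```python
from collections import deque


class ReplicatedVolume:
    """Toy model of a primary volume replicating writes to a backup volume."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []        # writes committed at the primary site
        self.backup = []         # writes committed at the backup site
        self.pending = deque()   # async only: writes queued for the next batch

    def write(self, block: str) -> None:
        self.primary.append(block)
        if self.synchronous:
            # The write does not complete until the backup site has
            # committed it and acknowledged, so both sites always match.
            self.backup.append(block)
        else:
            # The write completes immediately; replication happens later.
            self.pending.append(block)

    def flush(self) -> None:
        """Ship the queued batch to the backup site (async mode)."""
        while self.pending:
            self.backup.append(self.pending.popleft())

    def lost_if_primary_fails(self) -> list:
        """Writes that currently exist only at the primary site."""
        return list(self.pending)


sync_vol = ReplicatedVolume(synchronous=True)
async_vol = ReplicatedVolume(synchronous=False)
for vol in (sync_vol, async_vol):
    for block in ("a", "b", "c"):
        vol.write(block)

print("sync at risk: ", sync_vol.lost_if_primary_fails())
print("async at risk:", async_vol.lost_if_primary_fails())
```

In the synchronous volume nothing is ever at risk, at the cost of waiting on the backup site for every write. In the asynchronous volume, whatever has not yet been flushed is what a site loss would take with it.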

Again, you're going to typically choose one of these two types of replication, and they may be called different things depending on who your vendor is that supports that. You have to have one of these two generally in your architecture.

This obviously brings up some very interesting food for thought. Which would you choose, and why would you choose that?

Let's think about that. Let's think about the pros and cons associated with both synchronous and asynchronous replication.

On the synchronous side, I get the assurance of no loss of data. In order to do that, I really have to have a high bandwidth, low latency connection. That low latency means that I really have to have shorter geographic distances between those storage devices.

If I have to replicate this SAN right here with another SAN on Mars, the replication latency is going to take a really long period of time. I'm not going to get great performance.
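A rough back-of-the-envelope calculation shows why distance matters so much for synchronous replication. The sketch below assumes light travels through optical fiber at roughly 200,000 km/s and ignores switching, routing, and array processing time, so the figures are a best-case floor rather than what a real link would deliver. The function names and sample distances are illustrative only.

```python
# Assumption: signal speed in optical fiber is roughly 200,000 km/s
# (about two thirds of the speed of light in a vacuum). Real links add
# equipment and protocol delays on top of this theoretical floor.
FIBER_SPEED_KM_PER_S = 200_000


def round_trip_ms(distance_km: float) -> float:
    """Minimum time for a change to reach the backup site and be acknowledged."""
    return 2 * distance_km / FIBER_SPEED_KM_PER_S * 1000


def max_serial_writes_per_s(distance_km: float) -> float:
    """Upper bound on writes per second when each write waits for its ack."""
    return 1000 / round_trip_ms(distance_km)


for km in (10, 100, 1000):
    print(f"{km:>5} km: {round_trip_ms(km):.2f} ms round trip, "
          f"at most {max_serial_writes_per_s(km):,.0f} serialized writes/s")
```

Every tenfold increase in distance cuts the ceiling on strictly serialized writes by a factor of ten, which is the performance cost described above.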

Synchronous is great when I've got two SANs that are sitting very close together. Asynchronous works a lot better when I don't care about losing a little bit of data, and it's a very small amount of data, usually.

I don't care about losing data. I could use a smaller bandwidth connection, more tolerant of latency, no performance impact, and I can stretch it across essentially much larger distances.

Really, what your sensitivity is to the amount of data you can lose is what really makes this decision for you. If you're the Dow Jones, you can't afford to lose a single piece of data. If you're Bob's Biscuit Factory, losing a couple of bits of data or a couple of block changes is probably not going to impact you dramatically.

Making the right decision here in synchronous vs. asynchronous replication is really, really important, because as I said, with Hyper-V, the replication process is not anything that's handled by Microsoft. You're going to have to find some solution to do this for you.

Now, I kind of lied. There really are three and a half things you have to be aware of. The half thing, associated with number two, is this whole notion of the location where replication processing actually occurs.

That can really be in one of two places. Down at the storage layer, right on your storage itself, some storage contains the ability to process the replication all by itself. This takes all of the effort off of the servers, but you have to be very conscious about crash consistency.

A lot of these SANs have to have some sort of agent or something that's installed onto the hosts to ensure that the data is being transferred in a crash consistent way.

When you implement these things, they tend to be a little bit easier to set up. A lot fewer moving parts, because it's this SAN, that SAN, OK, turn on and go.

There's a real play there for simplicity and scalability, too. If your storage doesn't support it, or you're not interested in replicating it that way or you're concerned about the crash consistency, there is another layer at which replication can occur.

That is at the OS or the application layer. This is typically when the replication processing is handled by some sort of software or agent that's in the virtual machine OS. The VM's got some sort of agent, and that agent takes care of looking at the blocks that change and then transferring them over to somewhere else.

You lose a little bit of the scalability there, and your costs also tend to be more linear. A lot of the applications that do this have you pay per replicated virtual machine.

They're also more challenging to set up, simply because there's more to do. You have more agents to deal with, you've got more agents to monitor, and more moving parts.

You typically tend to have a little bit less concern about crash consistency. You benefit from, and you pay for, each of these approaches to replication, and again, you've just got to come to the determination of which of these makes the most sense for your environment.

Another impact here is, how many VMs do you intend to replicate over? If you're just talking a couple, you might not want to replicate the entire storage layer over. If you plan on doing everything, then maybe you do.

Again, your use case will make a lot of the determination here.

Once we've got all this infrastructure in place, we've got the networking and we've got the servers and the replications all set up, we have to have some target servers and their cluster that's set up in the backup site.

It's the same cluster. You're creating one cluster, but you have to have cluster members that are sitting in that backup site so that the cluster can go, 'Look, we've got a problem. We can move those virtual machines over to the backup site so that they can continue processing with very little downtime.'

That's really all there is as it relates to this whole notion of a cluster. I have my servers here on my left, I have some network switches, I have the storage. I have all these different interconnects there for production networking and the storage networking, and I have my vMotion networking, and I have my management networking.

I've got my storage replication, and then poof, I have a cluster that now spans multiple sites.

That's the architecture behind clustering. Those are the pieces, the hardware, the infrastructure that has to get put in place in order for you to even start installing the failover clustering role or feature onto any of your Windows Server 2008 R2 computers.

I can't talk about clustering unless I give at least a nod to the somewhat sordid history associated with clustering. Let me ask you this, raise your hand if the clustering itself has been a black mark on your personal record. I'm raising my hand here.

In fact, I've been dealing with clustering since Windows NT 4, if not before, back when cluster services was called Wolfpack. There was a point in time where we actually went through and implemented a failover cluster for a file server, a huge, enormous file server.

Some of the early Windows clustering technologies were not very well implemented, to say it nicely. We found that a clustered file server in many ways actually had less uptime than the regular file server did.

Back then there was a quote that we had written up on our quote board that says, 'As the corporate expert in Windows clustering, I recommend you don't use Windows clustering.'

Thankfully the technologies have improved quite a bit since the Windows NT 4 and Wolfpack days. With Windows 2000, we added in some greater availability and scalability, but things were still very, very painful.

Windows 2000 was still not a great operating system on which to cluster. 2003 is where things just barely started to get more interesting, not the least of which was adding iSCSI storage alongside traditional Fibre Channel as a mechanism to actually connect the servers with the storage.

Remember, with the cluster you have to have single storage that everybody points to.

The problem with Windows Server 2003 was that it still used a technique called SCSI resets, a last-resort SCSI mechanism for determining which of the hosts actually owned the LUN that the virtual machines, or whatever it was, were connected into.

Those SCSI resets were very painful. They could actually create some problems that made Windows Server 2003 itself still a little challenging of a tool for clustering.

I'll put my stake in the ground and say that Windows Server 2008 was really the first real operating system where at least I myself felt like I could stand behind Windows clustering as a real solution for adding uptime, as opposed to taking it away.

Windows Server 2008 eliminated the use of SCSI resets, so you didn't have that very painful reset problem anymore. It also eliminated the requirement that the entire solution was on the Windows HCL.

Back in those days, you would have to buy an entire solution, and the solution was on Microsoft's Hardware Compatibility List. With 2008, the individual pieces had to be on the HCL, but not the entire solution.

That brought the costs of these cluster tools and solutions down dramatically. For a lot of people, 2008 was the first version where Microsoft really said, 'Look. We're trying to create clustering that mere mortals can actually do.'

They did that by adding the cluster validation wizard, some very nice pre-clustering tests, and then also this ability for clusters to now span subnets.

This is an important point for the multi-site clustering story, because prior to 2008, the only way that you could stretch a cluster across two different sites was to also stretch that same subnet across those two different sites.

If you know anything about networking, that is very hard to do. Being able to span subnets made the cluster much more able to work in a multi-site scenario.

2008 R2 added improvements to the validation wizard and the migration wizards for moving cluster services. Then it added in live migration for Hyper-V and also Cluster Shared Volumes, both of which are great, great technologies that make Hyper-V much easier and more fun to work with.

What really is a cluster? When you look at a cluster in terms of all the different pieces that make up said cluster, what really is that cluster? A cluster is two or more servers, it's server one there and server two, and they all attach to some sense of shared storage.

You can't have a cluster without having some sense of shared storage. They also share a network.

Now, the shared storage is really the part that I think trips up a lot of people, because you have to have these special connections that connect into a single LUN, so multiple servers are connected into the same LUN at the same time, which completely invalidates all the laws of NTFS.

When clustering is enabled, clustering allows these servers to adjudicate which of them actually is allowed to access which resource at what time. Those are the resources and applications that are allowed.

This is very important, because Windows clustering is what we call a 'shared nothing' cluster infrastructure. Even though there is shared storage, there is not actually the sharing of those resources. They can all talk to the same storage, but only one of them can talk to a specific piece of storage at any particular period of time.

If we dig in just a little bit deeper here, we can see that here again is just a two-node cluster. I've got some storage here in the middle, I've got my networking. The storage is where the Hyper-V virtual machines are going to go.
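The 'shared nothing' idea can be sketched in a few lines of Python. This is only an illustration of the ownership rule described above, not how the Windows cluster service is implemented: the `SharedNothingCluster` class, its methods, and the node and resource names are all hypothetical.

```python
class SharedNothingCluster:
    """Toy model: every node can reach the storage, but each resource
    has exactly one owning node at any moment."""

    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.owner = {}  # resource name -> owning node

    def bring_online(self, resource: str, node: str) -> None:
        if node not in self.nodes:
            raise ValueError(f"{node} is not a cluster member")
        current = self.owner.get(resource)
        if current is not None and current != node:
            # Another node already owns it, so this node may not touch it.
            raise PermissionError(f"{resource} is owned by {current}")
        self.owner[resource] = node

    def move(self, resource: str, new_node: str) -> None:
        """Hand ownership to another node, as in a fail over."""
        if new_node not in self.nodes:
            raise ValueError(f"{new_node} is not a cluster member")
        self.owner[resource] = new_node

    def can_access(self, resource: str, node: str) -> bool:
        return self.owner.get(resource) == node


cluster = SharedNothingCluster(["node1", "node2"])
cluster.bring_online("vm-storage", "node1")
print(cluster.can_access("vm-storage", "node1"))
print(cluster.can_access("vm-storage", "node2"))
```

The point of the model is that ownership, not connectivity, decides who may use a piece of storage. Moving the resource simply changes the owner.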

You're going to have to create, at a minimum, probably two LUNs. One is a very small LUN, 512MB, that supports the quorum drive. We'll talk about quorum in just a minute, but the quorum drive is essentially a mechanism to determine if the cluster is still a cluster or not.

You'll also have to have another LUN where your Hyper-V VMs will be stored. They're all there in that shared storage.

Microsoft recommends at least right now today, a minimum of three different networks for Windows clustering. One for production network, your typical outbound, Outlook is talking to Exchange kind of thing.

A second one for the storage network. If it's iSCSI for example, that would be another TCP/IP network. If it's Fibre Channel, it would be a bit of a different thing.

Then a third network called the private net that is used only for communication between one host and the other. This private net is how the different servers talk to each other to see if they're all still there or not.

You have to remember that the whole job of a cluster, the reason why a cluster exists, is because it's always preparing for a failure. Its whole job is to be there in case of a failure. What the cluster is always doing is making sure that everybody's still there, the failure hasn't yet occurred.

This is really, really important.

If we take this single site cluster thing, and now we add to it the notion of the multi-site clusters we talked about just earlier, you can see here the picture gets somewhat more complicated.

Here's my Hyper-V server there on the left. I have my switches, I have my storage, I'm replicating my storage, I have my servers on the other side. Then I have this thing called a witness server there that's up at the top.

Just hold onto that thought for a minute, because we're going to talk about that witness server and the witness site here in just a minute. Just think of the witness server as your third party umpire, adjudicator, referee that helps the cluster determine if indeed the cluster still exists as a cluster.

I say this because we can't talk about the witness site really in context until we understand this whole notion of quorum. I mentioned before, the whole point of a cluster is that it is a thing that is there just waiting for something awful to happen.

It's constantly preparing for some piece of itself to go down. The way that Microsoft has implemented their clustering is using a mechanism called quorum. Quorum is essentially a voting process.

I always ask the same question whenever we get to this part in the conversation. Ever been to a Kiwanis meeting, or a Rotary meeting, the student council, or any kind of meeting at all where they're following the whole notion of parliamentary procedure?

'All those in favor signify by saying aye,' seconds, abstentions, that kind of stuff. When you go to a Kiwanis meeting, the meeting has to have a vote, or at least a count, of the number of people that are actually there.

Before they actually can vote on business of the day, they have to make sure that they have a quorum. This is important. Let's say you're a Kiwanis meeting of 100 people, and at the meeting the only people that show up are the president and the secretary.

The president says look, it's just the two of us. All those in favor of taking all the money in the account and sending ourselves to Disneyworld signify by saying aye. They both say aye.

Well, we don't really want that to happen. We don't want the president of the club to be able to just take all the money and go to Disneyworld. The club's bylaws strictly say that they have to have quorum in place before important business can actually be decided.

Typically quorum is 50% plus one. So 50% of the people plus one is quorum.

Different Kiwanis clubs actually have different rules for what constitutes quorum. Maybe one club says that if there's 20 people in the room, that's a quorum. Maybe another club has to have everybody in the room for a quorum to occur.

Different clubs are going to have different rules for what they consider to be enough people for them to be a club, so they can make decisions.

That's exactly how Windows clustering works. Different clusters have different quorum models by which they determine whether or not they are still a cluster.

If a cluster loses its quorum, then a cluster can no longer consider itself a functioning cluster anymore, and the whole thing shuts down and ceases to exist. This happens until it gains the quorum back.

This is very much different from any sort of resource fail over. A functioning cluster, I lose a host, well the rest of the cluster nodes, there are enough votes that they can say, I'll take that host's resources and move it onto other hosts.

If we lose too many of these hosts, there's just not enough to go around and the cluster cannot operate, and so it actually has to cease to exist.

Now, Microsoft actually implemented multiple quorum models for fail over clustering. There are four: node majority, node and disk majority, node and file share majority, and no majority: disk only. The fourth of which, this no majority: disk only, is not really a model.

If the disk exists, then we have a cluster. Up to you whether you believe that or not. Most people don't use no majority.

Back in the old days, and with very simple single site clusters, it's common to see node and disk majority as the model chosen. Because each node and the disk gets a vote. If there are four nodes and the disk, there are five votes, and as long as two nodes and the disk stay up, or three nodes stay up, then we're good.
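The vote counting in the four-node example can be checked with a short Python sketch. It assumes the simple rule described here, that quorum means holding more than half of all votes; the `has_quorum` helper is a hypothetical name for illustration, not a cluster API.

```python
def has_quorum(votes_online: int, total_votes: int) -> bool:
    """Quorum is held when strictly more than half of all votes are present."""
    return votes_online > total_votes // 2


# Four nodes plus the quorum disk: five votes in total.
TOTAL_VOTES = 4 + 1

scenarios = {
    "two nodes + disk": 2 + 1,
    "three nodes, disk lost": 3,
    "two nodes, disk lost": 2,
    "one node + disk": 1 + 1,
}

for name, votes in scenarios.items():
    print(f"{name}: {votes}/{TOTAL_VOTES} votes, quorum={has_quorum(votes, TOTAL_VOTES)}")
```

With five votes, any combination that reaches three keeps the cluster running, and anything at two or below does not.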

This is OK until we get to the idea of a multi-site cluster. With a multi-site cluster, there's more than one disk. There are actually two disks. There's one at the primary site and then there's one at the secondary site.

Node and disk majority is not really a great quorum model for situations where I have multiple disks. What we have in its place is this concept of a node and file share majority.

In the node and file share majority model, replacing the disk is a file share out on a server somewhere that contains just a little bit of data. All of the hosts talk to it, essentially just reading that data, to verify whether the cluster exists.

It provides the best protection for a full site outage, when it's configured appropriately.

This is where things get a little bit interesting. To protect against that full site outage, you have to have a file share witness in a third geographic location from where the two other sites are.

Back to the witness server. You see how that witness server's blue and it says 'witness site'? Remember that each of these servers counts as a vote, and an entire site might go down.

To support the outage of a full site, I have to have this third site where the witness server exists, so that if one of the sites goes down, I have enough votes still remaining in the cluster to maintain quorum and maintain the fact of the cluster's existence.

This is one of the big gotchas for creating fully functional, fully realized multi-site clusters.

Most people get to this point and they go, 'Really. I need a third site, seriously?' Like I said, this is where the ridiculous quorum notion gets complicated, and some would argue unnecessarily.

Let's do some math here, we'll hold fingers in the air. In my left hand, I've got three fingers, and on my right hand I've got three fingers. What happens when I put the quorum's file share on the primary site?

I've got three fingers and three fingers. When I put the file share on the primary site, now I have four fingers and three fingers.

That means that if the primary site fails, I've lost four votes. The secondary site might not automatically come back online, because the three votes left in that secondary site are fewer than the four the primary site held, and three is not a majority of the seven total votes. Oops.

Let's reverse the situation and put the file share in the secondary site. Now the secondary site holds four votes and the primary site holds three. A failure in the secondary site could cause the primary site to go down, because the primary site's three votes are no longer a majority.

This gets really, really challenging. Having that third site that has nothing to do with the other two, is somehow connected on the network, that both can talk to via TCP/IP, is handy for ensuring that that cluster still is allowed to exist.
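The finger math above can be worked through in Python for all three witness placements. This sketch assumes three nodes per site, one vote per node, one vote for the file share witness, and that a witness in a third site stays reachable from whichever site survives. The function and site names are hypothetical.

```python
def has_quorum(votes_online: int, total_votes: int) -> bool:
    """Quorum is held when strictly more than half of all votes are present."""
    return votes_online > total_votes // 2


def survives_site_loss(primary_nodes: int, secondary_nodes: int, witness_site: str) -> dict:
    """For each site that could fail, report whether the rest keeps quorum."""
    votes = {"primary": primary_nodes, "secondary": secondary_nodes, "third": 0}
    votes[witness_site] += 1  # the file share witness adds one vote where it lives
    total = sum(votes.values())
    return {
        failed: has_quorum(total - votes[failed], total)
        for failed in ("primary", "secondary")
    }


for placement in ("primary", "secondary", "third"):
    print(f"witness at {placement} site:", survives_site_loss(3, 3, placement))
```

Only the third-site placement keeps the cluster alive no matter which of the two main sites is lost, which is the reason for the third-site recommendation.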

I will argue further, this box here, that this whole problem of the third site gets weirder as time passes and the number of servers changes at each site. Suddenly I go from three and three to three and four, or three and three to four and four, four and five.

As the number of servers changes at each site, the number of votes changes at each site. Things just start to get a little odder all the time.

Recommendation? If you're going to create a multi-site cluster, just do it the right way and find some third site out there that's got a file server somewhere and put your witness fileshare on that server and be done with it.

You've got yourself well protected against pretty much all of the big problems that are out there.

This is the architecture. This is how you actually go about constructing the whole thing. Once we have the thing in place, once we have this multi-site architecture in place, the next step is we're going to run the thing.

Let me try to help you with some of these tips and tricks for not making any big mistakes whenever you start running your multi-site cluster and operations.

The first of which is, how Microsoft actually identifies where something might go. If you've ever played around with Windows clustering, I have here a virtual machine cluster resource, VM1. VM1, I'm looking at its properties, and you can see it says down there, 'Preferred owners.'

That is an ordered list of the preferred owners for that virtual machine resource. As you can imagine, when I lose a host, let's say I've got servers 1, 2, and 3 in the primary site and 4, 5, and 6 in the secondary, if I don't do a good job of managing what the ordered list is for my preferred owners, I can have a failure at the primary site, and that VM can actually fail over to the backup site.

I don't want that, because now I'm on a different subnet. I essentially invoked the DR plan at this point.

You have to be really conscious of those preferred owners, because you don't want virtual machines to end up failing all the way over to the backup site if you simply have a host that goes down in your primary site.

This gets a little bit easier, a little bit better, when you start adding in things like VMM 2012, which is on its way out. This is specific to the failover clustering management tool.

You're going to have to be very conscious with how you apply these preferred owners, put them in the right order, check the ones that you want to be preferred, and then keep an eye on those, especially as you add additional resources.
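
To see why the order matters, here is a minimal sketch. It assumes a deliberately simplified rule, that the VM moves to the first host in the ordered list that is still up; the real cluster service weighs more than this list, and the server names and site mapping are hypothetical.

```python
# Simplified model of an ordered preferred-owners list.
# Assumption: the VM lands on the first listed host that is still running.

def pick_failover_target(preferred_owners, hosts_up):
    """Return the first preferred owner that is up, or None if none are."""
    for host in preferred_owners:
        if host in hosts_up:
            return host
    return None


# Hypothetical hosts: three at the primary site, three at the secondary.
site_of = {
    "server1": "primary", "server2": "primary", "server3": "primary",
    "server4": "secondary", "server5": "secondary", "server6": "secondary",
}

# A badly ordered list puts a secondary-site host right after server1.
badly_ordered = ["server1", "server4", "server2", "server3", "server5", "server6"]
# A well ordered list exhausts the primary site before crossing sites.
well_ordered = ["server1", "server2", "server3", "server4", "server5", "server6"]

hosts_up = set(site_of) - {"server1"}  # a single host fails at the primary site

bad_target = pick_failover_target(badly_ordered, hosts_up)
good_target = pick_failover_target(well_ordered, hosts_up)
print(bad_target, site_of[bad_target])
print(good_target, site_of[good_target])
```

With the badly ordered list, one lost host sends the VM across sites; with the well ordered list, it stays in the primary site.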

There are as you can see here two tabs associated with properties of every cluster resource. The second tab is for fail over. You have to really be conscious of the effects of fail back.

Some disagree with me in terms of this fail back, but I've had this fail back problem occur many times myself. Down here on the bottom, you can see once the primary site, or once the server that went down is no longer down, the VM could potentially fail back to its original home.

This could be a good thing, right? Server comes back online, yay, the virtual machines can move back to their original location automatically and everything's happy.

Fail back can also be a problem causer as well. Think about it. Let's say we allow fail back immediately, and you get a server that comes on and maybe is not quite ready, so it comes on, then off and on, then off and on and off.

Suddenly you've got a virtual machine that's trying to bounce back and forth between primary and secondary automatically when it would just be better left set in its secondary site until you've got that primary site back online.

These effects are particularly pronounced in multi-site clusters. My recommendation, and what I tell pretty much everybody is just leave fail back off and manage the relocation of your virtual machines, your resources, manually.
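
The bouncing problem can be illustrated with a toy simulation. It is a sketch only, assuming the VM fails over as soon as the primary host is down and, with automatic fail back enabled, returns the moment the host reports up again; the timeline and function name are made up for illustration.

```python
# Toy simulation of a flapping primary host.
# Assumption: fail over is immediate, and automatic fail back (if enabled)
# happens as soon as the primary host reports up again.

def count_moves(primary_up_timeline, auto_failback):
    """Count how many times the VM relocates over the timeline."""
    location = "primary"
    moves = 0
    for primary_up in primary_up_timeline:
        if location == "primary" and not primary_up:
            location = "secondary"  # fail over
            moves += 1
        elif location == "secondary" and primary_up and auto_failback:
            location = "primary"    # automatic fail back
            moves += 1
    return moves


# A host that comes on, then off and on, then off and on and off.
timeline = [True, False, True, False, True, False, True]

moves_with_failback = count_moves(timeline, auto_failback=True)
moves_without_failback = count_moves(timeline, auto_failback=False)
print(moves_with_failback, moves_without_failback)
```

With fail back off, the VM moves once and then waits at the secondary site until someone relocates it by hand.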

Again, VMM 2012 improves upon this for Hyper-V, but keep the automated fail back out of your whole environment. You're already in a disaster anyway. Let's not create any more automated itchiness. Turn it off until you're absolutely ready to turn it on.

That's a really important realization there.

Here are three more tips and tricks in terms of multi-site clustering operations that you should be aware of. The first is a recommendation from Microsoft themselves, and that is to just simply resist creating clusters that support other services.

A Hyper-V cluster is a Hyper-V cluster is a Hyper-V cluster. This is Microsoft's recommendation. It's a service isolation reason. Create a cluster that's for Hyper-V, and then you don't have to deal with the other vagaries of the different services that may be on there.

You think about it, if you're trying to create a file server cluster or DNS cluster, or any other kind of cluster that's out there, the rules for how and why you might fail over those servers are going to be a little different than the rules you might use for virtual machines.

It's good advice to resist creating clusters that do anything else other than Hyper-V. Make them on their own cluster.

You can also use disk dependencies as an affinity rule. Hyper-V right now today does not have a really nice way to create affinity rules.

If you're familiar with the concept of affinity, affinity is, sometimes as these virtual machines are moving around across these different hosts, sometimes it makes sense for the two machines to always be on the same host together.

If I have an application server that talks to the database, and they need to have a lot of chatter between the two of them to do their jobs, running that data over the system bus or the virtual machine bus, as opposed to over the network, can improve performance. That's an affinity rule.

There are other servers that you never, never, never want to occupy the same host together. A perfect example is domain controllers.

If, as my virtual machines are moving around and failing over and rebalancing, I get two domain controllers that exist on the same host, and they happen to be my only two domain controllers, you can see that if the host goes down I've lost both my DCs, and that becomes a very bad day.

There's not a really good, elegant way to affinitize, but you can use disk dependencies, one to the other, to work around how you want machines to move around. I believe there's also some PowerShell exposure that can do that as well, although it's buried strangely deep last I checked.
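
Until the tooling makes this easy, one option is simply to audit placement yourself. The sketch below does not configure anything in the cluster; it only flags hosts that hold more than one VM from a keep-apart group. The VM names, host names, and function are hypothetical.

```python
# Audit sketch: find hosts carrying more than one VM from a keep-apart group.
from collections import defaultdict


def anti_affinity_violations(placement, keep_apart):
    """Map each offending host to the keep-apart VMs it currently holds."""
    by_host = defaultdict(list)
    for vm, host in placement.items():
        if vm in keep_apart:
            by_host[host].append(vm)
    return {host: sorted(vms) for host, vms in by_host.items() if len(vms) > 1}


# Hypothetical placement after some failing over and rebalancing.
placement = {"dc1": "host1", "dc2": "host1", "app1": "host2", "sql1": "host2"}
domain_controllers = {"dc1", "dc2"}

violations = anti_affinity_violations(placement, domain_controllers)
print(violations)

# Moving one domain controller to another host clears the violation.
fixed = dict(placement, dc2="host3")
print(anti_affinity_violations(fixed, domain_controllers))
```

Here both domain controllers share host1, so losing that one host would take out both DCs.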

The other one, and this is just based off of personal recommendation. You won't find this in any Microsoft documentation. I tend to add servers in pairs.

Whether you're a single-site or a multi-site cluster, adding servers in pairs just seems to help with keeping the cluster balanced. Server loss is not going to cause site split brain. If you have a fileshare witness you don't have to worry about that.

Adding servers in pairs keeps everything even, and I mean even as opposed to odd, and tends to make a lot of these calculations much, much simpler as the cluster goes about determining what its quorum is.

I will say this, whether you are a Hyper-V all by yourself or whether you're part of a cluster, multi-site cluster, you've got to segregate your traffic.

These days, it doesn't necessarily mean you have to have separate NICs on each host, but what it does mean is you have to have at least different VLANs for the different types of traffic that are coming out of your virtual host.

Traffic to your SAN, a bonded connection to your iSCSI or private channel SAN, a bonded connection to your production network that is separate from that SAN, a single connection for live migration, another connection for management.

All of these are important for maintaining good performance, assuring that the traffic is at different trust levels, especially when it comes to that connection from management. It protects live migration traffic from sniffing or being impacted by other congestion on the wire.

It separates out your storage and your production from impacting each other. This gets a little bit less important as we start moving towards converged networks and 10GbE and the 40 and 100GbE coming up.

At least different subnets is really important, because it makes things simpler when it comes time for actual troubleshooting.
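
A traffic plan like this is easy to sanity-check before you build it. The sketch below assumes a hypothetical plan with one VLAN and one subnet per traffic type; the VLAN IDs and address ranges are invented for illustration and should be replaced with your own.

```python
# Sanity check for a per-traffic-type VLAN and subnet plan.
import ipaddress
from itertools import combinations

# Hypothetical plan; every VLAN ID and subnet here is an assumption.
traffic_plan = {
    "storage": {"vlan": 10, "subnet": "10.0.10.0/24"},
    "production": {"vlan": 20, "subnet": "10.0.20.0/24"},
    "live_migration": {"vlan": 30, "subnet": "10.0.30.0/24"},
    "management": {"vlan": 40, "subnet": "10.0.40.0/24"},
}


def plan_problems(plan):
    """List every pair of traffic types sharing a VLAN or overlapping subnets."""
    problems = []
    for (name_a, a), (name_b, b) in combinations(plan.items(), 2):
        if a["vlan"] == b["vlan"]:
            problems.append(f"{name_a} and {name_b} share VLAN {a['vlan']}")
        net_a = ipaddress.ip_network(a["subnet"])
        net_b = ipaddress.ip_network(b["subnet"])
        if net_a.overlaps(net_b):
            problems.append(f"{name_a} and {name_b} have overlapping subnets")
    return problems


print(plan_problems(traffic_plan))

# A plan that folds live migration into the production network gets flagged.
shared_plan = dict(traffic_plan, live_migration={"vlan": 20, "subnet": "10.0.20.0/24"})
print(plan_problems(shared_plan))
```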

Lastly and most importantly is just the recognition that you are now spreading the location in which a server can operate across multiple subnets, in multiple geographic locations, too.

Remember when I told you that way back in 2003, multi-site clustering was supported, but not across subnets? One of the design reasons, at least I think, for why this was the case is, think about it, right?

A VM is in my primary network here and it's on the 192.168.2 net, and then it fails over to the remote site in, I don't know, Denver, and at that remote site it's on the 192.168.3 net.

This can cause a problem. The server that used to be on the two net's now on the three net. I've got to go change its IP address, probably the subnet mask, the gateway, all of that networking stuff at the new site, for the server to actually continue to operate as a member of the network community at that site.

You can do this manually, but that's a really bad idea. You can also do it automatically by using DHCP and dynamic DNS.

You're probably saying, 'Greg, [??], DHCP,' admittedly yes. For most server infrastructures, DHCP is not a good idea. It's not so much the fact that they're DHCP address assigned, it's the fact that they may get a different address.

Whereas you will probably want to use DHCP for assigning your addresses, you'll probably want to use reservations to ensure that as that machine moves, it gets a recognized, known address at either side.

Be very conscious of that. Make sure DNS updates, and make sure you use DHCP and reservations so that you get the right IP address and you're not getting whatever address is available whenever you make that move.
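
One way to keep yourself honest about those reservations is to check them before you need them. This sketch assumes /24 subnets modeled on the two-net and three-net example, plus hypothetical VM names and reserved addresses; it only validates a table and does not talk to any DHCP server.

```python
# Check that every VM has a reservation at every site, inside that site's subnet.
import ipaddress

# Assumed subnets and reservations, for illustration only.
site_subnets = {"primary": "192.168.2.0/24", "secondary": "192.168.3.0/24"}
reservations = {
    "vm1": {"primary": "192.168.2.50", "secondary": "192.168.3.50"},
    "vm2": {"primary": "192.168.2.51"},  # no reservation at the secondary site
}


def reservation_problems(reservations, site_subnets):
    """List VMs with a missing or out-of-subnet reservation at any site."""
    problems = []
    for vm, by_site in reservations.items():
        for site, subnet in site_subnets.items():
            address = by_site.get(site)
            if address is None:
                problems.append(f"{vm}: no reservation at {site}")
            elif ipaddress.ip_address(address) not in ipaddress.ip_network(subnet):
                problems.append(f"{vm}: {address} is outside {subnet} at {site}")
    return problems


problems = reservation_problems(reservations, site_subnets)
print(problems)
```

A VM flagged here is one that would pick up whatever address happens to be available when it lands at the other site.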

DNS replication can also be a problem. More specifically, the local cache time out for any of the clients that are attaching to these servers that have failed over.

You can configure that cache. I don't know how many minutes the cache is before it actually clears, but you know this problem. You go in and you ping a server, and you ping the wrong one, and then you change the IP address and it still thinks it's bad, because the local cache is holding that negative address response in its local cache.

Clearing the cache, or reducing the time to live for DNS entries, is one way that you can protect yourself against the local cache being a problem.

That's Microsoft's recommendation. My recommendation, honestly, if the tornado has hit your primary data center, I think it's probably OK for you to tell your users to just reboot their computers so that the cache gets cleared.

Maybe that's not you, but in a lot of situations it's probably going to be OK.

Notwithstanding, consider reducing that TTL. You can send a PowerShell command out to everybody to clear their client cache, ipconfig /flushdns, or just reboot computers and they should grab the new DNS entries once those machines have failed over to the alternate site.
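
The effect of the TTL on a client is easy to model. The sketch below is a toy resolver cache, not how the Windows DNS client is implemented; the record name, addresses, and TTL values are assumptions chosen for illustration.

```python
# Toy client-side DNS cache: an answer is reused until its TTL expires.

class ClientDnsCache:
    def __init__(self):
        self._entries = {}  # name -> (address, expires_at)

    def resolve(self, name, now, authoritative, ttl):
        """Return the cached address if still fresh, else ask the server."""
        cached = self._entries.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]
        address = authoritative[name]
        self._entries[name] = (address, now + ttl)
        return address

    def flush(self):
        """Model of clearing the local cache (or rebooting the client)."""
        self._entries.clear()


# Hypothetical record: vm1 moves from the primary to the secondary subnet.
dns = {"vm1": "192.168.2.50"}
long_ttl_client = ClientDnsCache()
short_ttl_client = ClientDnsCache()

before = long_ttl_client.resolve("vm1", now=0, authoritative=dns, ttl=1200)
short_ttl_client.resolve("vm1", now=0, authoritative=dns, ttl=60)

dns["vm1"] = "192.168.3.50"  # fail over and DNS update happen at t=60

stale = long_ttl_client.resolve("vm1", now=120, authoritative=dns, ttl=1200)
fresh = short_ttl_client.resolve("vm1", now=120, authoritative=dns, ttl=60)

long_ttl_client.flush()
after_flush = long_ttl_client.resolve("vm1", now=120, authoritative=dns, ttl=1200)
print(before, stale, fresh, after_flush)
```

The client with the long TTL keeps reaching for the old address until its entry expires or the cache is flushed, while the client with the short TTL picks up the new address on its next lookup.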

This has been sort of an architectural level explanation of the whole notion of multi-site clustering for Hyper-V.

How you can take an existing Hyper-V cluster, this is the cool part, that you've already built, you've already put your blood, sweat and tears into building this Hyper-V cluster, and now you can take it and add a couple of additional servers at some other site, and extend that cluster into another site.

If you've got the storage in place, you've got the replication in place, and you've got those servers where they need to be, at the end of the day, you can extend single site clustering into multi-site clustering with just a little bit of effort.

Again, my name is Greg Shields with Concentrated Technology. I want to thank Backup Academy for giving me the opportunity to talk with you today.

Good luck with your Hyper-V multi-site clusters. Have a great day.
