In recent projects, the use of stretched cluster solutions has been a recurring topic. But why does it keep coming up, and what are the benefits and drawbacks?

The primary benefit of a stretched cluster solution is that it enables active-active, workload-balanced data centers. Virtual machines can be migrated between sites, which gives you cross-site mobility of workloads. To be more specific, stretched cluster solutions offer the following benefits:

  • Workload mobility
  • Cross-site automated load balancing
  • Enhanced downtime avoidance
  • Disaster avoidance

The first two benefits are almost self-explanatory, so I want to dive a little deeper into the disaster avoidance aspects. Most customers see a stretched cluster solution as a disaster recovery solution, but this may not be entirely true.

Disaster Avoidance vs Disaster Recovery

Although they are often treated as the same thing, disaster avoidance and disaster recovery are two different things.

Disaster avoidance focuses on keeping the service available while avoiding an impending disaster. This can be done through the use of availability features in the application or infrastructure, such as vSphere vMotion.

Disaster recovery focuses on restoring the service in a timely and orderly fashion after a disaster has occurred. This can be done through the use of availability features in the application or infrastructure, such as vSphere HA or Site Recovery Manager (SRM).

But how do these apply to a failure or disaster?

Host Level

  • Disaster avoidance = vMotion to avoid disaster and outage (non-disruptive)
  • Disaster recovery = vSphere HA restarts virtual machines (disruptive)

Site Level

  • Disaster avoidance = vMotion over distance to avoid disaster and outage (non-disruptive)
  • Disaster recovery = VMware SRM or scripted register/power-on of VMs at recovery site (disruptive)

Service Level Agreement

When talking with customers about disaster avoidance and recovery, the first question I always ask is: “Do you have an SLA with the business?” Sadly, most of the time the answer is “uhh…”.

Without an SLA, the business's expectations of the services can be very different from what can actually be delivered. For example, the business wants an application to be available all the time (100% availability), but the infrastructure and application only provide an availability of 95%.
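To make that gap concrete, an availability percentage can be translated into the downtime it allows per year. The snippet below is a minimal sketch, assuming a 365-day year and a 24x7 service window; the function name and the extra 99.9% data point are my own illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 24x7 service window, non-leap year


def allowed_downtime_hours(availability_pct: float) -> float:
    """Return the yearly downtime (in hours) implied by an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


for pct in (100, 99.9, 95):
    hours = allowed_downtime_hours(pct)
    print(f"{pct}% availability -> {hours:.2f} hours of downtime per year")
```

Under these assumptions, 95% works out to 438 hours, a little over 18 days of downtime per year, which is a long way from what a business expecting 100% has in mind.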

Why this brief outline about SLAs? The SLA is an important piece of information that helps you decide whether a stretched cluster solution will help meet the agreed-upon service level objectives.

Application or Infrastructure?

Let’s say you have an SLA or at least some RPO and RTO values to which you need to design your environment. The next question is “Where in the environment do you want to ensure the application meets the availability requirements?”

In other words, who is responsible for the availability of the application? Is it the application itself, or does the infrastructure provide the availability?

With the rise of cloud infrastructures (AWS, Azure) and cloud native apps, there is a shift in who is responsible for the availability. Applications need to be able to run anywhere with no dependencies on the underlying infrastructure. For example, does the application need to survive a data center failure? Then the application needs to make sure that its data is synchronized to another location. In this scenario, the application itself is responsible for the availability.

Availability features in more traditional applications include, for example, Exchange Database Availability Groups (DAG) and SQL Always On Availability Groups (AAG). These features do not rely on the infrastructure to provide availability for the application.

It may not be possible or even desirable to provide availability in the application due to constraints such as network latency/bandwidth or even organizational constraints such as budget or security policies.

Solutions

There are different options to provide availability through the infrastructure. In my opinion, all of these options can be grouped into a few types of solutions. Each solution has its own pros and cons, and should be matched against the availability requirements of the business.
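Matching a solution against the requirements can be as simple as comparing the RPO and RTO it can achieve with the RPO and RTO the business asks for. Here is a rough sketch of that comparison; the class names, the generic "option A" and "option B" candidates, and all the figures are hypothetical placeholders, not measurements of any real product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    rpo_minutes: float  # maximum tolerable data loss
    rto_minutes: float  # maximum tolerable time to restore service


@dataclass(frozen=True)
class Solution:
    name: str
    rpo_minutes: float  # data loss the solution can achieve
    rto_minutes: float  # restore time the solution can achieve


def meets(solution: Solution, requirement: Requirement) -> bool:
    """A solution qualifies only if it is at least as good on both RPO and RTO."""
    return (solution.rpo_minutes <= requirement.rpo_minutes
            and solution.rto_minutes <= requirement.rto_minutes)


# Placeholder figures for illustration only; real values depend on the design.
candidates = [
    Solution("option A", rpo_minutes=0, rto_minutes=15),
    Solution("option B", rpo_minutes=15, rto_minutes=240),
]
requirement = Requirement(rpo_minutes=5, rto_minutes=60)

suitable = [s.name for s in candidates if meets(s, requirement)]
print(suitable)
```

RPO and RTO are only part of the picture, of course. Cost, operational complexity and organizational constraints still have to be weighed separately.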

To stretch or not to stretch?

While a stretched cluster may be a solution that fits your organization, it may not fit another. Each organization has its own requirements, and they may or may not match a stretched cluster solution. The decision to use a stretched cluster solution should be based not only on the current application landscape and requirements but also on the future ones. What is the vision for the application landscape? Are you going to use public cloud infrastructures? Does your organization develop cloud native apps? Where do you want the responsibility for the availability of the application to lie? All these questions and more should help you in choosing the correct solution. And maybe one type of solution is not enough, and you will need to combine different solutions. It is all possible, but keep in mind that you are doing it to meet the requirements of the business.

One thought on “To stretch or not to stretch?”

  1. Great blog Erik! Good breakdown of different factors that influence the decision for a stretched cluster solution or a site local solution with something like SRM. In my experience the isolation of failure domains, testing and auditing of the BCDR solution and a controlled human decision to initiate a failover to a recovery site are also factors often taken into account. Good stuff 😉
