To stretch or not to stretch?

In recent projects the use of stretched cluster solutions has been a recurring topic. But why is it a recurring topic, and what are the benefits and drawbacks?

The primary benefit of a stretched cluster solution is to enable active-active and workload balanced data centers. The solution has the ability to migrate virtual machines between sites, which enables cross-site mobility of workloads. To be more specific, stretched cluster solutions offer the following benefits:

  • Workload mobility
  • Cross-site automated load balancing
  • Enhanced downtime avoidance
  • Disaster avoidance

While the first two benefits are almost self-explanatory, I want to dive a little deeper into the disaster avoidance aspects. While most customers see a stretched cluster solution as a disaster recovery solution, this may not be entirely true.

Disaster Avoidance vs Disaster Recovery

While generally explained as the same, disaster avoidance and disaster recovery are two different things.

Disaster avoidance focuses on keeping the service available while avoiding an impending disaster. This can be done through the use of availability features in the application or infrastructure, such as vSphere vMotion.

Disaster recovery focuses on making the service available again, in a timely and orderly fashion, after a disaster has occurred. This can be done through the use of availability features in the application or infrastructure, such as vSphere HA or SRM.

But how do these reflect on a failure/disaster?

Host Level

  • Disaster avoidance = vMotion to avoid disaster and outage (non-disruptive)
  • Disaster recovery = vSphere HA restarts virtual machines (disruptive)

Site Level

  • Disaster avoidance = vMotion over distance to avoid disaster and outage (non-disruptive)
  • Disaster recovery = VMware SRM or scripted register/power-on of VMs at recovery site (disruptive)

Service Level Agreement

While talking with customers about disaster avoidance/recovery the first question I always ask is: “Do you have an SLA with the business?”. Sadly, the answer is most of the time “uhh…”.

Without an SLA, the business expectations of the services delivered can be very different from what can be delivered. For example, the business wants an application to be available all the time (100% availability), but the infrastructure and application only provide an availability of 95%.
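To make the gap between 100% and 95% concrete, the availability percentage can be converted into the downtime it allows per year. The snippet below is a minimal sketch of that arithmetic; the function name and the non-leap-year assumption are mine and not part of any SLA standard.

```python
# Hours in a year, ignoring leap years (an assumption for simplicity).
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(availability_pct):
    """Return the yearly downtime (in hours) implied by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


if __name__ == "__main__":
    for pct in (95.0, 99.0, 99.9):
        hours = allowed_downtime_hours(pct)
        print(f"{pct}% availability allows about {hours:.1f} hours of downtime per year")
```

At 95% this works out to roughly 438 hours, or a little over 18 days per year, which is usually far from what the business has in mind.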

Why this brief outline about SLAs? The SLA is an important piece of information that helps decide whether a stretched cluster solution will help to meet the agreed-upon service level objectives.

Application or Infrastructure?

Let’s say you have an SLA or at least some RPO and RTO values to which you need to design your environment. The next question is “Where in the environment do you want to ensure the application meets the availability requirements?”

In other words, who is responsible for the availability of the application? Is it the application itself, or does the infrastructure provide the availability?

With the rise of cloud infrastructures (AWS, Azure), and cloud native apps, there is a shift in who is responsible for the availability. Applications need to be able to run anywhere with no dependencies on the underlying infrastructure. For example, does it need to survive a data center failure? The application needs to make sure that data is synchronized to another location. In this scenario, the application itself is responsible for the availability.

Examples of availability features in more traditional applications are Exchange Database Availability Groups (DAG) and SQL Always On Availability Groups (AAG). These features do not rely on the infrastructure to provide availability for the application.

It may not be possible or even desirable to provide availability in the application due to constraints such as network latency/bandwidth or even organizational constraints such as budget or security policies.

Solutions

There are different options to provide availability through the infrastructure. In my opinion, all the different options can be categorized into a couple of solutions. Each solution has its own pros and cons, and should be matched against the availability requirements of the business.

To stretch or not to stretch?

While a stretched cluster may be a solution that fits your organization, it may not fit another. Each organization has its own requirements, and they may or may not match a stretched cluster solution. The decision to use a stretched cluster solution should be based not only on the current application landscape and requirements but also on the future ones. What is the vision for the application landscape? Are you going to use public cloud infrastructures? Does your organization develop cloud native apps? Where do you want the responsibility for the availability of the application to lie? All these questions and more should help you in choosing the correct solution. And maybe one type of solution is not enough, and you will need to combine different solutions. It is all possible, but keep in mind that you are doing it to meet the requirements of the business.

vROPS RBAC and upgrades

Last week I was at a customer doing some vROPS magic, which included updating the current vROPS 6.4 cluster to 6.5 to get the improved vRLI integration. Upgrading both the vRLI and vROPS clusters went perfectly, but the vRLI integration items, like the Log Insight icon and the Logs tab, were not visible in vROPS after the configuration.

We followed the configuration steps as described in the documentation and verified each of them. After concluding that the configuration was correct, we tried logging in with the default admin credentials instead of the customer's Active Directory credentials. With the admin account, the Log Insight icon and Logs tab were visible!

When you upgrade vROPS to a version with new features, these new features come with new privileges that need to be assigned. When you use the default roles, these privileges are assigned automatically to the applicable roles, for instance the administrator role. But if you created your own role, either by cloning an existing role or by creating a new one, these privileges need to be assigned manually.

After assigning the new privileges to the created role, the Log Insight page and Logs tab were finally visible for the customer account.

TL;DR: When upgrading vROPS to a higher version with new features, make sure to check the privileges for these new features on your roles. User-created roles do not automatically receive newly added privileges.
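The check itself boils down to a set difference between a default role and a custom role after the upgrade. The sketch below only illustrates that comparison. The privilege names are hypothetical placeholders, not real vROPS privilege keys, and in practice you would have to obtain the privilege lists yourself from the Access Control pages or the API.

```python
def missing_privileges(reference_role, custom_role):
    """Return the privileges present in the reference role but absent from the custom role."""
    return set(reference_role) - set(custom_role)


# Hypothetical privilege names, used for illustration only.
admin_after_upgrade = {"dashboards.view", "alerts.manage", "logs.view"}
customer_role = {"dashboards.view", "alerts.manage"}

if __name__ == "__main__":
    for privilege in sorted(missing_privileges(admin_after_upgrade, customer_role)):
        print(f"Custom role is missing: {privilege}")
```

Running a comparison like this after every upgrade makes it easy to spot privileges that were added to the default roles but never reached the custom ones.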

vROPS 6.5 Node Specifications

VMware has released version 6.5 of vROPS which includes a new node configuration type for additional monitoring capabilities. From the release notes:

Additional monitoring capabilities

  • Adds ability to increase memory and increase the scope of monitoring within the same environment.
  • Enables you to monitor larger environments with the same footprint through platform optimization.
  • XL size node enables you to monitor more objects and it processes more metrics.

The previous largest configuration for an analytics cluster node was a 16 vCPU, 48 GB memory node. This node configuration was for environments larger than 4,000 VMs.

Apparently this was not enough 😉

The new largest node configuration (XL) is a little bit bigger: 24 vCPU and 128 GB memory! This node configuration is for environments between 12,000 and 40,000 VMs.

The VMware documentation has not been updated with this configuration yet. I’ve downloaded the OVA, extracted it, and opened the OVF to get these specifications.

vrops_xl_config

The new specifications for all the node configurations are:

Node Size      vCPU   Memory
Extra small    2      8 GB
Small          4      16 GB
Medium         8      32 GB
Large          16     48 GB
Extra Large    24     128 GB

–Update–

Thanks to @pheldoorn for pointing out the sizing guidelines in a comment. The maximum cluster size with Extra Large nodes is 4 nodes, not 16 nodes as with Medium and Large nodes.
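Combining the node specifications with these cluster limits gives a quick feel for the total footprint of a cluster. The snippet below is a small sketch that simply multiplies the numbers from the table above; the function and dictionary names are my own.

```python
# (vCPU, memory in GB) per node size, taken from the table in this post.
NODE_SIZES = {
    "Extra small": (2, 8),
    "Small": (4, 16),
    "Medium": (8, 32),
    "Large": (16, 48),
    "Extra Large": (24, 128),
}


def cluster_footprint(node_size, node_count):
    """Return the total (vCPU, memory GB) for a cluster of identical nodes."""
    vcpu, memory_gb = NODE_SIZES[node_size]
    return vcpu * node_count, memory_gb * node_count


if __name__ == "__main__":
    # Maximum cluster sizes mentioned in the update: 4 XL nodes, 16 Large nodes.
    for size, count in (("Extra Large", 4), ("Large", 16)):
        vcpu, memory_gb = cluster_footprint(size, count)
        print(f"{count} x {size}: {vcpu} vCPU, {memory_gb} GB memory")
```

Note that a maximum-size Large cluster still has more total vCPU and memory than a maximum-size Extra Large cluster, so the XL node is about bigger nodes, not necessarily a bigger cluster.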

 

vROPS 6.4 – Metric Config Picker

With vRealize Operations Manager you can create a metric configuration file. This is an XML file that contains predefined metrics that can be used in different widgets. With the XML file you can skip the process of manually picking the required metrics and attributes over and over again. Another advantage is that you can reuse this XML file in different widgets for different objects in your environment.

The metric configuration file contains 3 important tags:

<AdapterKind>

The AdapterKind tag determines which adapter is required for the metrics you want to include.

<ResourceKind>

The ResourceKind tag determines the resource type, e.g. cluster, datastore, virtual machines, host.

<Metric attrkey>

The Metric attrkey tag determines which attributes and/or metrics are included in the configuration.
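To show how the three tags nest, here is a minimal sketch that generates such a file with Python's standard xml.etree.ElementTree module. The tag names come from this post. The AdapterKinds root element, the adapterKindKey, resourceKindKey and label attributes, and the example metric keys are assumptions based on files I have seen, so verify them against a file exported from your own vROPS instance.

```python
import xml.etree.ElementTree as ET


def build_metric_config(adapter_kind, resource_kind, metrics):
    """Build a metric configuration XML string.

    metrics is a list of (attrkey, label) tuples.
    The root element and attribute names are assumptions, see the text above.
    """
    root = ET.Element("AdapterKinds")
    adapter = ET.SubElement(root, "AdapterKind", adapterKindKey=adapter_kind)
    resource = ET.SubElement(adapter, "ResourceKind", resourceKindKey=resource_kind)
    for attrkey, label in metrics:
        ET.SubElement(resource, "Metric", attrkey=attrkey, label=label)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Example values for illustration only.
    print(build_metric_config(
        "VMWARE",
        "VirtualMachine",
        [("cpu|usage_average", "CPU Usage"), ("mem|usage_average", "Memory Usage")],
    ))
```

The point of the sketch is the nesting: one adapter kind contains one or more resource kinds, and each resource kind contains the metrics you want to show in the widget.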

In previous versions of vROPS, if you wanted to create a metric configuration file, you needed to know the true adapter kind name, resource kind name and metric value name in vROPS instead of the display names. You could get these names using the API browser of vROPS. An excellent blog post on how to get these names can be found here.

In vROPS 6.4 they made life a bit easier: it is now possible to use a metric picker to get the right metric into your metric configuration file. Depending on which line you are on in the metric configuration file, the applicable metric picker will be available.

First create the metric configuration file and select the applicable adapter kind. In my case I only have the vCenter Server solution configured.

vrops_adapterkind_picker_menu

Next select the resource kind of which you want to use the metrics.

vrops_resourcekind_picker_menu

And finally, select the metrics you want to add to your metric configuration file.

vrops_metric_picker_menu

The metric configuration file is now configured and can be used in widgets. More information on how to use the metric configuration file in a widget can be found here.

I personally think this is a very welcome addition to the vROPS GUI. Creating a metric configuration file used to take a lot of time spent looking up the name of the metric you wanted to use. The metric picker will save you time and maybe even some frustration from finding out afterwards that the metric you selected was not the right one.

Update: Forgot to credit my coworker Johan van Amersfoort for discovering this new feature. Visit his blog on vhojan.nl for EUC awesomeness!

 

 

VCSA Migration – Syslog and Dump Collector

In preparation for an upcoming project it was time to try out the vCenter Server Migration Tool for 6.0 (vCenter Server 6 U2M). I started with reading the vSphere Migration Guide and Release Notes to find out what the prerequisites are and if any known issues are reported.

vSphere Migration Guide

Release Notes

In the release notes I found the following known issue.

Release Notes Known Issue - Syslog and Dump Collector

In vCenter Server 5.5 for Windows, the Syslog and Dump Collector are extra components that you can deploy, but in the vCenter Server Appliance 6.x the Syslog and Dump Collector are integrated in the appliance.

The pictures below show the difference in the vSphere Web Client. The left picture shows the vCenter Server Windows 5.5 with Syslog and Dump Collector and the right picture shows the vCenter Server Appliance 6.0 with the integrated Syslog and Dump Collector.

As you can see there are no more items for the Syslog and Dump Collector in the vSphere Web Client.

When you migrate from vCenter Server 5.5 for Windows to vCenter Server Appliance 6.0 and you have the Syslog and/or Dump Collector installed and configured as integrated with vCenter Server, these items will still exist in the Web Client and will show an error, even after configuring the services.

syslog_errordump_error

According to the release notes there is no workaround. The only way to avoid this issue is to uninstall the Syslog and/or Dump Collector before migrating to the vCenter Server Appliance.

As mentioned earlier in this post, this only happens when the Syslog and Dump Collector are configured as integrated with vCenter Server. If you have deployed them as standalone instances this issue does not occur.

 

VCSA Upgrade and VM Monitoring

With the release of vCenter Server 6.0 Update 2a and ESXi 6.0 Patch 4, I decided it was time to update the lab environment. Both releases contain a lot of fixes, some specifically for VSAN.

Release notes

vCenter Server 6.0 Update 2a

ESXi 6.0 Patch 4

The update went perfectly on the Platform Services Controllers, but then disaster struck. During the update of the vCenter Server the VM was reset, and it would not boot anymore, showing the error message:

“Error 15: Could not find file”

When using the Embedded Host Client to check the status of the VM, I found out that the reset of the VM had been initiated by the vpxuser. At first I suspected somebody else had given the VM a restart, but this was not the case.

Because I did not make a VM snapshot before the update (it’s a lab environment, so hey...) I could not recover the VM to a point before the update.

Luckily we had vSphere Replication configured for the vCenter Server to a different lab environment, with multiple point-in-time (PIT) instances, so I could recover the VM to an earlier state. After recovering the vCenter Server and logging on to the Web Client, the cause of the reset became clear.

Virtual Machine Monitoring was enabled for this cluster. Apparently no VMware Tools heartbeats had been received for 120 seconds (low sensitivity) and no storage or network traffic had occurred for a period of 120 seconds (the default). This triggered a reset of the VM, which broke it.
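The decision described here can be summarized as two conditions that both have to be true before the VM is reset. The snippet below is a simplified model of that logic for illustration, not how vSphere HA is actually implemented; the function and parameter names are my own, and only the two 120 second values come from this post.

```python
def should_reset_vm(seconds_since_heartbeat, seconds_since_io,
                    failure_interval=120, io_stats_interval=120):
    """Simplified model of the VM Monitoring reset decision.

    The VM is only reset when VMware Tools heartbeats are missing for the
    failure interval AND no storage or network I/O was seen during the
    I/O stats interval.
    """
    heartbeat_lost = seconds_since_heartbeat >= failure_interval
    io_idle = seconds_since_io >= io_stats_interval
    return heartbeat_lost and io_idle


if __name__ == "__main__":
    # Heartbeats missing, but the VM is still doing I/O: no reset.
    print(should_reset_vm(seconds_since_heartbeat=130, seconds_since_io=10))
    # Heartbeats missing and no I/O, as during this update: reset.
    print(should_reset_vm(seconds_since_heartbeat=130, seconds_since_io=130))
```

During an appliance update both conditions can apparently be met at the same time, which is why the I/O check did not protect the VM in this case.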

This vCenter Server has been upgraded several times and I never had any issue with VM Monitoring. I have no idea why this happened this time, but to be sure I recommend disabling VM Monitoring for the vCenter Server during an update. And of course, always have a backup of the vCenter Server in case of a failure.

 

 

vROPS 6.4 – New Dashboards

VMware has released vROPS 6.4 which contains several new dashboards to display status and identify problems. The new dashboards can be divided into several categories:

  • Environment and capacity overview dashboards to get a summary of your environments.
  • VM troubleshooting dashboard that helps you diagnose problems in a VM and start solving them.
  • Infrastructure capacity and performance dashboards to view status and see problems across your datacenter.
  • VM and infrastructure configuration dashboards to highlight inconsistencies and violations of VMware best practices in your environment.

In this post I will highlight some of these dashboards. If you want to know more about all the other dashboards I suggest you download and install or upgrade to vROPS 6.4 🙂

Operations Overview

This dashboard provides a general overview of your vSphere environment, such as the number of VMs, clusters, hosts, and datastores. The dashboard also provides top-N lists of virtual machines with CPU contention, memory contention or disk latency.

vROPS Dashboard - Operations Overview

Capacity Overview

This dashboard provides an overview of the capacity of your vSphere environment, such as total CPU cores, memory and storage capacity. The dashboard also provides graphs for the utilization of the different resources, containing realtime and trend/forecast data.

vROPS Dashboard - Capacity Overview

Troubleshoot a VM

This dashboard provides general troubleshooting information for a virtual machine such as critical alerts and possible contention.

This is maybe the dashboard most requested by customers, and I am excited that it is now available by default in vROPS!

vROPS Dashboard - Troubleshoot a VM

Heavy Hitter VMs

As the name suggests, this dashboard provides information about the heaviest virtual machines in your vSphere environment, such as the VMs with the highest IOPS and network throughput.

vROPS Dashboard - Heavy Hitter VMs

Cluster Performance

This dashboard provides general performance information for clusters such as critical alerts and possible contention.

vROPS Dashboard - Cluster Performance

ESXi Configuration

This dashboard provides general information about the hardware of the vSphere hosts in your environment such as hardware model, ESXi version and power management setting.

Not in the picture below but the dashboard also provides an overview of the configuration of all the vSphere hosts. This overview contains information such as CPU sockets, NICs, Power State, CPU Model, etc.

vROPS Dashboard - ESXi Configuration

VM Usage

This dashboard provides general information about virtual machines in your environment such as general virtual machine configuration and graphs about CPU, memory and IOPS demand.

vROPS Dashboard - VM Usage

The addition of these dashboards is very welcome and I think they make vROPS even better to use. Here are some links to vROPS resources.

Release notes

http://pubs.vmware.com/Release_Notes/en/vrops/64/vrops-64-release-notes.html

vROPS Sizing Guidelines

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2147780

Download

https://my.vmware.com/en/web/vmware/info/slug/infrastructure_operations_management/vmware_vrealize_operations/6_4