vROPS 6.3 – vSphere Hardening Guide 6.0

As described in my previous post, I upgraded my lab vROPS cluster to vROPS 6.3. After a couple of days I finally had time to look at the updated vROPS policies. One of the things I was most interested in was support for the vSphere 6.0 hardening guide.

With vROPS 6.3 it is possible to generate alerts when a host or vCenter violates rules found in the vSphere 6.0 hardening guide. In previous releases only the vSphere 5.5 hardening guide could be used.

To enable alerts for the vSphere hardening guide you need to perform the following actions:

  • Enable vSphere hardening guide alerts in the VMware vSphere solution
  • Customize a policy to enable the vSphere hardening guide alerts

To enable vSphere hardening guide alerts in the VMware vSphere solution, define the monitoring goals: Administration -> Solutions -> VMware vSphere -> Configure -> Define Monitoring Goals.

vrops_enable_hardening_alerts

After this you need to customize your policy to enable the alerts. At this step I ran into a problem with the vSphere hardening guide alerts. Because I performed an upgrade and did not want to lose any customizations on default objects, I chose not to reset the out-of-the-box content during the upgrade.

This resulted in the policies not being updated with the new vSphere hardening guide alerts.

vrops_hardening_guide_55

After some digging I found a VMware KB article explaining that the policies were not updated because of my choice not to reset the out-of-the-box content. The only solution is to reset the default content in the VMware vSphere solution. You can do this via Administration -> Solutions -> VMware vSphere -> Reset Default Content.

Keep in mind that this removes all your customizations on default objects such as alert definitions, symptoms, policy definitions and dashboards.

vrops_reset_default_content

A common best practice is not to customize the out-of-the-box content but to clone it or create new objects such as dashboards and policies.

After resetting the default content I could enable the vSphere 6.0 hardening guide alerts in the policy I had created, and alerts were generated for the hosts.

vrops_hardening_guide_60

vrops_alert_hardening_guide


vROPS 6.x Blank Dashboard

A few weeks ago I upgraded the vROPS cluster in the lab environment to vROPS 6.3. The cluster is a 3-node cluster with a master, a replica and a remote collector. The cluster sits behind an NSX load balancer to provide a single FQDN to connect to.

The upgrade went smoothly and all nodes were upgraded without a problem. One important thing to remember is to always update the virtual appliance OS first before upgrading vROPS! If you do not do this, you will break your vROPS instance!

But the problems started when I logged on to the vROPS instance. Some dashboards were working fine, but other dashboards would not show any content. The faulty dashboards varied from completely blank to showing only some widgets.

vROPS Blank Dashboard

At first I thought the problems appeared only on custom dashboards I had made, but after looking at some more dashboards it turned out that it also happened on the out-of-the-box dashboards.

At this stage I was thinking I had broken the vROPS cluster and needed to redeploy it. But before I was sure I really needed to do this, I asked a colleague if he experienced the same problems. He could view the dashboards without any problem.

Because of this I tried to open the dashboards from another browser, in this case Internet Explorer instead of Chrome. The dashboards were working fine with Internet Explorer, so my first suspicion went to Chrome. But my colleague had opened the dashboards in Chrome and they worked fine for him.

Eventually my colleague suggested that I clear the Chrome browser cache so I would not have any old references to the dashboards. And lo and behold, the dashboards were working fine after this!

vROPS Dashboard

TL;DR: if you have any issues with content on vROPS dashboards, clear your browser cache first 🙂

vSphere 6.5 Storage – What’s new

At VMworld EU 2016 VMware announced the long-awaited vSphere 6.5. This blog post focuses on the new and enhanced storage features in vSphere 6.5.

VMFS-6

A new version of the VMFS file system is introduced, providing an all-round performance improvement including faster file creation, device discovery and device rescanning. Maybe the biggest change is that VMFS-6 is 4K aligned, which paves the way for 4K native drives once they become supported.

There will be no upgrade path to VMFS-6 because of the number of on-disk changes. Moving to VMFS-6 should be considered a migration using Storage vMotion.

Limits Increase

There are two major limit increases in vSphere 6.5. First off, ESXi hosts running version 6.5 can now support up to 2,000 paths in total. Second, ESXi hosts running version 6.5 can now support up to 512 devices. This is a 2x increase from previous versions of ESXi where the number of devices supported was limited to 256.
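If you are curious how close a host already is to these limits, you can get a rough count from the ESXi shell. This is just a quick sketch; it assumes the "Devfs Path" and "Runtime Name" fields that appear in the esxcli output of recent ESXi builds.

# Number of storage devices the host currently sees (one "Devfs Path" line per device)
esxcli storage core device list | grep -c "Devfs Path"
# Number of paths to those devices (one "Runtime Name" line per path)
esxcli storage core path list | grep -c "Runtime Name"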

NFS 4.1 Improvements

The major improvement for NFS 4.1 is support for hardware acceleration, which allows certain operations to be offloaded to the array. Other improvements, with a Kerberos mount sketch after the list, are:

  • NFS 4.1 will now be fully supported with IPv6
  • NFS 4.1 Kerberos AES encryption support (AES256-CTS-HMAC-SHA1-96 and AES128-CTS-HMAC-SHA1-96)
  • NFS 4.1 Kerberos integrity checking support (SEC-KRB5I)
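As a sketch of what an NFS 4.1 mount with Kerberos could look like from the ESXi shell (server, share and volume names below are placeholders, and the host is assumed to be joined to Active Directory with Kerberos credentials configured):

# Mount an NFS 4.1 datastore with Kerberos authentication; use SEC_KRB5I for integrity checking
esxcli storage nfs41 add -H nfs01.lab.local -s /vol/datastore01 -v nfs41-ds01 -a SEC_KRB5
# List the mounted NFS 4.1 datastores to verify
esxcli storage nfs41 list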

iSCSI Improvements

iSCSI routed connections

The first improvement is that iSCSI routed connections are now supported. Another improvement is that it is now possible to use different gateway settings per VMkernel interface. This means that port binding can be used to reach targets in different subnets.
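A hedged sketch of what a per-interface gateway could look like on an ESXi 6.5 host (addresses are placeholders, and the gateway option on this command is assumed to be the one introduced in 6.5):

# Give the iSCSI vmkernel interface its own gateway instead of the default TCP/IP stack gateway
esxcli network ip interface ipv4 set -i vmk2 -I 192.168.20.11 -N 255.255.255.0 -t static -g 192.168.20.1
# Verify the interface configuration
esxcli network ip interface ipv4 get -i vmk2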

UEFI boot

It is also now possible to use UEFI iSCSI boot. With this you can boot an ESXi host from an iSCSI LUN using UEFI settings in the host firmware.

SIOC v2

Storage I/O Control will be policy driven via I/O Filters. This allows you to expand Storage Policies with SIOC settings such as Limits, Reservations and Shares. By default there will be three configuration options for these settings, called Low, Normal and High. It is possible to customize the options to your liking.

SIOC Storage policy integration

In the initial release of SIOC v2 there will be no support for VSAN or VVOLs. SIOC v2 is only supported with virtual machines that run on VMFS or NFS backed datastores.

VSAN 6.5

VSAN 6.5 is included in vSphere 6.5 and adds a few new features and a different licensing setup.

iSCSI service

The VSAN iSCSI service allows you to create iSCSI targets and LUNs on top of the VSAN datastore. These LUNs are VSAN objects and have a Storage Policy assigned to them. This feature is aimed at physical workloads such as Microsoft clustering with shared disks. It is not intended for connecting to other vSphere clusters. It is possible to create a maximum of 1024 LUNs and 128 iSCSI targets per cluster. The LUN capacity limit is 62TB.

2-Node Direct Connect

VSAN 2-Node Direct Connect allows you to create a VSAN ROBO configuration without a switch by simply connecting the two hosts with cross-connect cables. This can make a huge difference in total cost of ownership because it is no longer necessary to purchase 10 Gbit switches to connect the hosts.

VSAN 2-Node Direct Connect

Furthermore, in this type of configuration it is possible to tag a VMkernel interface for witness traffic so that this traffic can be separated from the VSAN data traffic.
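Witness traffic tagging is done per host on the VMkernel interface you want to use, for example from the ESXi shell (the vmk number below is a placeholder):

# Tag vmk1 for VSAN witness traffic so it is sent over the routed network instead of the direct-connect links
esxcli vsan network ip add -i vmk1 -T witness
# Review which interfaces carry VSAN and witness traffic
esxcli vsan network list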

Licenses

The different VSAN licenses have been changed and an All-Flash configuration is now possible with the VSAN standard license. This means all VSAN licenses now support an All-Flash configuration. If you want to use data services like deduplication, compression or erasure coding you still have to buy the VSAN Advanced license. For a quick overview of the different licensing options visit the VMware website at http://www.vmware.com/products/virtual-san.html

Hardware support

VSAN 6.5 also introduces support for 512e drives, which will enable larger capacities.

VVOLs 2.0

Array-based Replication

VVols 2.0 adds support for array-based replication. Unlike traditional array-based replication such as NetApp MetroCluster, which replicates the entire datastore, VVol replication gives you fine-grained control over virtual machine replication. This means you have the flexibility to replicate a group of virtual machines, or even an individual virtual machine, instead of everything on the datastore.

VVol Array-based Replication

DR API

vSphere 6.5 also offers public APIs for triggering DR operations, as well as PowerCLI cmdlets for administrator-level orchestration and automation.

  • Replication Discovery – VVol disaster recovery discovers the current replication relationships between two fault domains.
  • Sync Replication Group – Synchronizes the data between source and replica.
  • Test Failover – To ensure that the recovered workloads will be functional after a failover, administrators periodically run the test-failover workflow. After a test, administrators can optionally move the devices from test to production when ready for a real failover.
  • Disaster Recovery and Planned Migration – For planned migration, on-demand sync can be initiated at the recovery site.
  • Setting up protection after a DR event – After the recovery on a peer site, administrators can set protection in the reverse direction.

Oracle RAC

VVols is also now validated to support Oracle RAC workloads.

VASA 3.0

VASA 3.0 introduces a new concept called ‘line of service’. A line of service is a group of related capabilities with a specific purpose, such as inspection, compression, encryption, replication, caching, or persistence. Now, in addition to configuring replication at the individual Storage Policy level, it is possible to create a line of service for replication and assign it to multiple Storage Policies.

As an example, imagine you have three Storage Policies: Gold, Silver and Bronze. While these three categories have very different storage capabilities assigned, it is possible to manage replication once with a replication line of service instead of configuring replication on each individual Storage Policy.

Cross vCenter vMotion – Cannot connect to host

21 November – VMware engineering has provided a fix for this issue. The results are posted at the end of this post.

17 October – Currently this issue is under investigation by VMware and the SR has been referred to VMware engineering. The workaround at this time is to not use the provisioning VMkernel interface and TCP/IP stack.

Part of a solution for a customer was the ability to perform migrations of VMs between vCenter Servers. Furthermore, company policy dictates that the management network is used only for management purposes.

By default, data for VM cold migration, cloning, and snapshots is transferred through the management network. This traffic is called provisioning traffic. On a host, you can dedicate a separate VMkernel interface to the provisioning traffic, for example, to isolate this traffic on another VLAN.

To comply with the policy the design used the provisioning VMkernel interface and TCP/IP stack to isolate the provisioning traffic to another VLAN.
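For reference, a sketch of what such a configuration looks like from the ESXi shell on the source host (the port group name is a placeholder; 192.168.13.10 is the provisioning address that shows up later in this post):

# Make sure the provisioning TCP/IP stack instance exists on the host
esxcli network ip netstack add -N vSphereProvisioning
# Create a vmkernel interface on that stack and give it an address on the provisioning VLAN
esxcli network ip interface add -i vmk2 -p Provisioning-PG -N vSphereProvisioning
esxcli network ip interface ipv4 set -i vmk2 -I 192.168.13.10 -N 255.255.255.0 -t static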

During validation of the design we ran into an issue with Cross vCenter vMotion. If the VM was powered on, the x-vCenter vMotion completed successfully, but when the VM was powered off the x-vCenter vMotion failed with the error ‘Cannot connect to host’.

web_client_error

Because the failure happened when the VM was powered off, we immediately suspected the provisioning vmkernel interface and TCP/IP stack. We double-checked the configuration and verified that the vmkernel interfaces could reach each other.

esxa01_vmkernel_config esxb01_vmkernel_config

esxa01_firewall esxa01_firewall_out

esxa01_ping_esxb01 esxb01_ping_esxa01

After this we examined the vpxa log on the source host and found connection errors between the provisioning vmkernels on the source and destination hosts.

esxa01_vpxa_log_error esxa01_vpxa_log_error_2

Because pinging between the vmkernel interfaces worked, we wanted to verify whether the provisioning network packets actually reached the vmkernel interface. To do this we used the pktcap-uw tool, which is included by default in ESXi 5.5 and later versions. pktcap-uw is an enhanced packet capture and analysis tool that can be used in place of the legacy tcpdump-uw tool.

With the pktcap-uw tool we generated receive and transmit traffic captures on the provisioning vmkernel interfaces of both the source and destination hosts. The capture files were analyzed with Wireshark to verify whether the provisioning network packets were exchanged between the hosts.
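The captures were generated roughly like this (the vmk number is a placeholder for the provisioning vmkernel interface on each host):

# Capture received traffic on the provisioning vmkernel interface
pktcap-uw --vmk vmk2 --dir 0 -o /tmp/prov-rx.pcap
# Capture transmitted traffic on the same interface
pktcap-uw --vmk vmk2 --dir 1 -o /tmp/prov-tx.pcap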

esxb01_pcap_receive

The picture above is taken from the receive packet capture on the destination host. As you can see, packets are received from the IP address 192.168.13.10 on port 902. This is the IP address of the provisioning vmkernel interface on the source host, and port 902 is used for the provisioning (NFC) traffic.

Because traffic was flowing between the vmkernel interfaces, we checked whether the NFC service was listening for connections on the provisioning vmkernel interfaces.
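A simple way to run this kind of check from the ESXi shell is with the netcat binary that ships with ESXi; the destination addresses below are placeholders for the other host's provisioning and management vmkernel IPs:

# Test the NFC port (902) on the destination host's provisioning vmkernel interface
nc -z 192.168.13.20 902
# The same test against the destination host's management vmkernel interface
nc -z 192.168.1.20 902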

esxa01_ssh_esxb01_prov esxb01_ssh_esxa01_prov

The NFC service was not listening on the provisioning vmkernel interfaces of either host. To verify whether the NFC service was listening at all, we performed the same test on the management vmkernel interfaces.

esxa01_ssh_esxb01_mgmt esxb01_ssh_esxa01_mgmt

This time the test was successful. It looks like the NFC service only responds to incoming connections on the management vmkernel interfaces. To investigate this issue further we opened an SR with VMware.

The SR was transferred to VMware engineering and we had to wait a very long time.

The fix VMware engineering provided was to increase the maximum memory that can be used by the NFC process on the vSphere hosts. The commands were:

grpID=$(vsish -e set /sched/groupPathNameToID host vim vmvisor nfcd | cut -d' ' -f 1)
vsish -e set /sched/groups/$grpID/memAllocationInMB max=16

And to check the result:

vsish -e get /sched/groups/$grpID/memAllocationInMB

esxi_commands

After configuring the hosts with this, I performed a new x-vCenter vMotion and this time it was successful. We decided to do a few more between different hosts and all of them were successful.

vmotion_success.PNG

Many thanks to the VMware support representative for keeping the SR open and giving us updates on the progress!

NSX 6.2.3 Guest Introspection Deployment

VMware has announced the end of availability of vCloud Networking and Security 5.5.x, which will commence on September 19. If you are using vCNS it is possible to migrate to NSX.

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2144733

NSX 6.2.3 ships with a default license for NSX for vShield Endpoint, so if you want to use the Guest Introspection services (e.g. Deep Security Anti-Malware) it is no longer required to buy NSX licenses.

If you are planning to upgrade vCNS to NSX there are some caveats to remember, especially if you are using vSphere Auto Deploy.

Host Preparation

After deploying the NSX Manager and registering it with the vCenter Server it is time to deploy the Guest Introspection service. For people who are familiar with NSX the first step to perform is the Host Preparation. If you are using the default NSX for vShield Endpoint license you will not be able to perform this action.

nsx_prepare_cluster

This behavior is by design and does not impact the service deployments. You do not have to perform Host Preparation if you are only using the service deployments of NSX (e.g. Guest Introspection).

Deploy the guest introspection service from the Service Deployments tab.

Service Deployment

The guest introspection service deployment is performed per cluster. If you are deploying the Guest Introspection service to a cluster with vSphere hosts using vSphere Auto Deploy in a stateless configuration the deployment will fail.

nsx_service_vib_manual_install

There is a VMware KB article on how to deploy VXLAN through Auto Deploy.

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2041972

This article does not provide the path to the VXLAN offline bundle on the NSX Manager. You can find the download path of the offline bundle on the following web page on your NSX Manager:

https://<NSX Manager IP>/bin/vdn/nwfabric.properties.
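You can simply pull that properties file with curl from any workstation to see which offline bundles your NSX Manager version advertises (replace the hostname with your own NSX Manager address):

# -k skips certificate validation for the self-signed NSX Manager certificate
curl -k https://nsxmanager.lab.local/bin/vdn/nwfabric.properties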

The VIB used for Guest Introspection is not included in the offline bundle on the NSX Manager that is used to deploy VXLAN through vSphere Auto Deploy, so you have to add this VIB manually to your Auto Deploy image profile. The location of this VIB is not documented, but after some Googling the following blog post helped me:

https://community.hds.com/people/swalker/blog/2015/09/22/deploying-nsx-in-an-autodeploy-environment

This blog post contains the location of the VIB for older versions of the NSX Manager. To find the right location I attached the Hiren's BootCD to the NSX Manager, booted from it, and searched for all .zip files.

bootcd_search_results

The search results show the correct name of the offline bundle and the location of the file. I used the part of the blog about the NSX 6.2 file location as a reference for the location of the .zip file, but apparently the location has changed in NSX 6.2.3 to the same format as used in NSX 6.1.

https://<NSX Manager IP>/bin/offline-bundles/vShield-Endpoint-Mux-6.0.0esx50-3796715.zip

https://<NSX Manager IP>/bin/offline-bundles/vShield-Endpoint-Mux-6.0.0esx55-3796715.zip

https://<NSX Manager IP>/bin/offline-bundles/vShield-Endpoint-Mux-6.0.0esx60-3796715.zip

Add the offline bundle to your image profile and configure vSphere Auto Deploy to use this new image profile. Reboot your vSphere hosts and click Resolve in the NSX Service Deployments tab to verify that the deployment was successful.
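After a host has rebooted with the new image profile you can also check from the ESXi shell whether the Guest Introspection (mux) VIB made it into the image; the exact VIB name may differ per NSX release, so the filter below is deliberately loose:

# List installed VIBs and filter for the endpoint mux package
esxcli software vib list | grep -i mux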

nsx_service_cluster_overview_succes

The only downside is that every time you upgrade NSX you have to find the correct offline bundle file on the NSX Manager. VMware used to have a KB article for vCNS which provided the correct file locations, but they do not have this for NSX.

VSAN 6.2 Upgrade Error

Update 30 March – VMware has released a KB article with steps on how to solve this problem. VMware will be providing a permanent fix in due course. In the meantime they are providing a script that detects stranded objects, broken disk chains and objects with a CBT lock. Where possible, the script will ask whether you want the issue to be fixed for you, align the objects, and bring everything to interim version 2.5. This then allows the upgrade of the on-disk format to proceed to v3. All steps (as well as the script) can be found in the KB article listed below.

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2144881

For a detailed description of the steps in the KB article Cormac Hogan has written a blog post.

http://cormachogan.com/2016/03/31/vsan-6-2-upgrade-failed-realign-objects/

With the announcement of the GA of vSphere 6 Update 2 it was time to upgrade our lab environment to vSphere 6 Update 2 and VSAN 6.2. Upgrading the PSC, VCSA and vSphere hosts went smoothly with no problems. What made me smile was that the VCSA upgrade GUI issue, where the progress bar gets stuck at 70%, was fixed as well! Not only did VMware implement new features in Update 2, but a lot of bug fixes were applied as well.

But now for the serious part: upgrading to VSAN 6.2. When you have updated your vCenter and vSphere hosts to Update 2 it is possible to upgrade your VSAN datastore. This should be as easy as clicking one button. At least, it should be... not for our lab environment 😦

The lab environment was built when vSphere 6 GA became available, and since then it has been upgraded multiple times, used heavily, and on occasion CPR had to be performed to get it running again.

All this made the automatic upgrade to VSAN 6.2 fail with the error: “Failed to realign following VSAN objects”.

vsan_62_upgrade_failure

Luckily, after playing with the lab environment I know my way around checking the current status of VSAN. The health check plugin in the Web Client didn't show any errors, so I used RVC on the vCenter Server Appliance to dig a little deeper.

I ran the ‘vsan.check_state’ command to verify the status of the VSAN cluster. As shown in the picture below, no immediate problems were found.
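For reference, the RVC command takes the cluster as its argument; the inventory path below is a placeholder for your own datacenter and cluster names:

vsan.check_state /localhost/Datacenter/computers/VSAN-Cluster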

vsan_check_state

After this I decided to check the status of one of the objects listed in the error message with the ‘vsan.object_info’ command. The object in the error is a VMDK of the PSC. I randomly checked some other objects listed in the error and almost all of them were VMDKs of VMs on the VSAN datastore.
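The object check looks roughly like this in RVC (cluster path and object UUID are placeholders):

vsan.object_info /localhost/Datacenter/computers/VSAN-Cluster <object-uuid>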

vsan_object_info

I knew it was possible to remove these objects from the VSAN datastore, but since they were VMDKs and not, for instance, folders or VM swap files, this was not possible without breaking the VMs. If you remove an object it will not be available anymore!

The objects that represented a VM swap file could be removed. I powered off the VM and logged on with SSH to a vSphere host in the VSAN cluster. On the host I used the ‘objtool’ command to remove the object.
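The command ran roughly as follows (the object UUID is a placeholder, and the exact objtool flags may differ between ESXi builds, so treat this as a sketch):

# Show the attributes of the object first to confirm it is the swap object you expect
/usr/lib/vmware/osfs/bin/objtool getAttr -u <object-uuid>
# Force-delete the object
/usr/lib/vmware/osfs/bin/objtool delete -u <object-uuid> -f -v 10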

esx_shell_remove_object.PNG

But after removing those objects I still had the problem that the remaining objects represented VMDKs. Since we do not have any storage other than VSAN in our lab environment, it was not possible to Storage vMotion all these VMs to different datastores. We use the default VSAN storage policy with an FTT configuration of 1. With this configuration it is possible to survive one failure, e.g. a disk group or host. With this in mind I decided to perform the upgrade to VSAN 6.2 manually.

When you create a new disk group it will automatically be created with the VSAN 6.2 on-disk format. Since I did not have any new disks I decided to remove the disk groups one by one and recreate them with the disks from the removed disk groups. Actually, this is the same thing the automatic upgrade does!

When removing a disk group I decided to use the option to do a full data migration of the data on the disk group to another disk group. It should also be possible to skip this and let VSAN handle it as a disk group failure; the data will then be resynchronised after the disk group has been removed, resulting in lower availability of the data for a period of time.

After a while I managed to upgrade all the disk groups to the new on-disk format and was able to use the new VSAN 6.2 features. As it is a lab setup we only have a hybrid VSAN configuration, so not all new features are supported (deduplication, erasure coding).
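To confirm that every claimed disk ended up on the new on-disk format you can check from the ESXi shell; this assumes the esxcli output in your build includes a format version field:

# Filter the claimed VSAN disks for their on-disk format version (field name may vary per build)
esxcli vsan storage list | grep -i "version"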

VMCA subordinate CA caveats

vSphere 6 comes with the VMware Certificate Authority service on the VCSA. It’s possible to configure this as a subordinate CA in the existing CA infrastructure.

We tested this configuration in our lab environment and ran into some issues. As I don't have any screenshots of the issues, I will try to explain them briefly.

*Updated with VSAN Health Check error*

 

Java: Certificate does not comply to algorithm constraints

First off, when we configured the VMCA as a subordinate CA and replaced all certificates of the PSC and vCenter, it was no longer possible to connect vRealize Operations or Trend Micro Deep Security to the vCenter Server. Both products would give you a Java error complaining that the certificate did not comply with the algorithm constraints. After a lot of debugging we found the problem. Our Windows CA infrastructure uses the SHA256 algorithm with the option AlternateSignatureAlgorithm=1. This option sets the signature algorithm to RSASSA-PSS.

vROPS and Deep Security do not work with this algorithm and will throw an error. We set up a second CA infrastructure, this time with the option AlternateSignatureAlgorithm=0, so the signature algorithm used is plain SHA256 with RSA. After re-configuring the VMCA as a subordinate CA and replacing the PSC and vCenter certificates it was possible to connect vROPS and Deep Security to the vCenter Server again.
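A quick way to see which signature algorithm an issued certificate actually carries is openssl; the path below is a placeholder for the machine SSL certificate you replaced:

# RSASSA-PSS shows up here when AlternateSignatureAlgorithm=1 was used on the issuing CA
openssl x509 -in /tmp/machine-ssl.crt -noout -text | grep -A1 "Signature Algorithm"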

 

NSX Manager SSO Lookup Service: Failed to Register

After replacing the certificates we were not able to configure the NSX Manager to use the SSO Lookup service.

This happens because, after replacing the certificates, the corresponding service registration does not get updated in the lookup service. VMware has KB articles explaining this and providing a solution, depending on whether you use an external or embedded PSC.

vCenter Server with External PSC

http://kb.vmware.com/selfservice/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=2121701

vCenter Server with Embedded PSC

http://kb.vmware.com/selfservice/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=2121689

 

NSX Manager Host preparation: Not Ready

After replacing the certificates and updating the lookup service with the instructions from the KB articles above, the Host Preparation tab showed the status “Not Ready” for the cluster.

This happens because the EAM service on the VCSA doesn't recognize the new certificate. VMware also has a KB article for this.

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2112577

 

VSAN Health Check: Unexpected status code: 400

After replacing the certificates the VSAN Health Check plugin wouldn't run anymore, failing with the error “Unexpected status code: 400”. I had already encountered this once before, so I knew how to fix it.

Not very surprisingly, VMware has a KB article for this as well.

http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=2133384&sliceId=1&docTypeID=DT_KB_1_1&dialogID=764690653&stateId=1%200%20764700126