VMware has a KB article detailing a bug in ESXi 5.0 that is known to cause a variety of networking issues in iSCSI environments. Until last week I had not encountered this particular bug, so I thought I’d detail my experience troubleshooting it for those still on 5.0 who may run into the same problem.
The customer I was working with had originally called for assistance because their storage array was reporting only 2 of the 4 available paths “up” to each connected iSCSI host. All paths had been up/active until a recent power outage; since then, no amount of rebooting or disabling/re-enabling ports had brought them all back up simultaneously. Their iSCSI configuration was fairly standard: two iSCSI port groups on a single vSwitch per server, with each port group connected to a separate iSCSI network. Each port group had a different NIC specified as an “Active Adapter,” with the other placed under the “Unused Adapters” heading.
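For reference, a configuration like this can be expressed from the ESXi shell roughly as follows. This is a sketch only, with hypothetical names (vSwitch1, vmnic2/vmnic3, and port groups “iSCSI-A”/“iSCSI-B”) standing in for the customer’s actual environment:

```sh
# Both iSCSI uplinks are attached to the same vSwitch.
esxcli network vswitch standard uplink add --vswitch-name=vSwitch1 --uplink-name=vmnic2
esxcli network vswitch standard uplink add --vswitch-name=vSwitch1 --uplink-name=vmnic3

# Each port group overrides the vSwitch teaming policy so exactly one
# NIC is Active; the NIC not listed becomes Unused for that port group.
esxcli network vswitch standard portgroup policy failover set \
    --portgroup-name=iSCSI-A --active-uplinks=vmnic2
esxcli network vswitch standard portgroup policy failover set \
    --portgroup-name=iSCSI-B --active-uplinks=vmnic3
```

This one-active/one-unused layout per port group is what the standard single-vSwitch iSCSI port binding setup requires, and it is exactly the layout the bug interacts with.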
One of the first things I wanted to rule out was a hardware issue related to the power outage. However, after very little troubleshooting I discovered that simply disabling and re-enabling NIC ports on the iSCSI switches would cause the “downed” paths to become active again on the storage array, while the paths that were previously “up” would go down. As expected, a vmkping was never successful through a NIC that was not registering properly on the storage array. Everything appeared to be configured correctly on the array, the switches, and the ESXi hosts, so at this point I had no clear culprit and needed to rule out potential causes one at a time. Luckily, these systems had not yet been placed into production, so I was granted a lot of leeway in my troubleshooting process.
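The connectivity checks themselves are simple; something like the following from the ESXi shell, with hypothetical portal addresses standing in for the array’s real ones:

```sh
# Ping the array's target portal on each iSCSI network.
# On ESXi 5.0, plain vmkping selects a vmkernel interface itself:
vmkping 192.168.10.50    # hypothetical portal on iSCSI network A
vmkping 192.168.20.50    # hypothetical portal on iSCSI network B

# On ESXi 5.1 and later you can force a specific vmkernel port with -I,
# e.g.:  vmkping -I vmk1 192.168.10.50
```

In this environment, the vmkping result tracked the array’s path state exactly: any path the array showed as down also failed vmkping.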
- Test #1. For my first test I wanted to rule out the storage array. Since I was working with this customer remotely, I had them unplug the array from the iSCSI switches and plug it into a spare Linksys switch they had lying around. I then had them plug their laptop into the same switch and assign it an IP address on each of the iSCSI networks. All ping tests to each interface were successful, so at this point I was fairly confident the array was not the cause of the issue.
- Test #2. For my second test I wanted to rule out the switches. I had the customer plug all of the array interfaces back into the original iSCSI switches, then unplug a few ESXi hosts from those switches. They assigned their laptop the same IP addresses the unplugged hosts’ iSCSI port groups had been using and ran additional ping tests from the same switch ports the ESXi hosts had occupied. All ping tests on every interface were successful, so it appeared unlikely that the switches were the culprit.
At this point it seemed almost certain that the ESXi hosts were the cause of the problem: they were the only component having communication issues, as every other component communicated just fine in isolation. It was also evident that NIC failover/failback wasn’t working correctly (given the behavior when we disabled/re-enabled ports), so I moved the iSCSI port groups onto separate vSwitches. BINGO! Within a few seconds of doing this, vmkping succeeded on all ports and the storage array showed all paths active again. Since separate vSwitches are not a required configuration for iSCSI networking on ESXi, I immediately started googling for known bugs. Within a few minutes I ran across an excellent blog post by Josh Townsend and the KB article I linked to above. The bug causes ESXi to actually send traffic down the “unused” NIC during a failover scenario.
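The workaround can be sketched as follows, again with hypothetical names (vSwitch2, vmnic3, port group “iSCSI-B”, software iSCSI adapter vmhba33). The point is simply that each iSCSI port group ends up on its own vSwitch with a single uplink, so there is no “Unused” adapter for the bug to misroute traffic onto:

```sh
# Create a second vSwitch with a single uplink and move one iSCSI
# port group onto it.
esxcli network vswitch standard add --vswitch-name=vSwitch2
esxcli network vswitch standard uplink add --vswitch-name=vSwitch2 --uplink-name=vmnic3
esxcli network vswitch standard portgroup add --vswitch-name=vSwitch2 --portgroup-name=iSCSI-B

# The vmkernel port for iSCSI-B must then be recreated on the new port
# group and re-bound to the software iSCSI adapter, e.g.:
# esxcli iscsi networkportal add --adapter=vmhba33 --nic=vmk2
```

With only one uplink on each vSwitch, the teaming policy has nothing to fail over to, which is why this sidesteps the bug entirely.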
That is why separating the iSCSI port groups “fixed” the issue: there was no unused NIC in the port group for ESXi to mistakenly send traffic to. It also explained why disabling/re-enabling a downed port would cause it to become active again (and vice versa): ESXi had been sending traffic down the unused port, and my disable/re-enable triggered a failover that caused ESXi to send traffic down the active adapter again.
In my case, upgrading to ESXi 5.0 Update 1 completely fixed the issue. I’ll update this post if I run across this problem on any other version of ESXi; in the meantime, note the workaround described above and outlined in both links.
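If you’re unsure whether a host already carries the fix, the version and build number are quick to check from the ESXi shell, and can be compared against the build listed for 5.0 Update 1 in the KB article:

```sh
# Either of these prints the ESXi version and build number:
vmware -vl
esxcli system version get
```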