There’s been a lot of industry news lately regarding Software-Defined Storage, Software-Defined Data Centers and hyper-convergence. After numerous conversations with colleagues and friends about these concepts, I wanted to post my own thoughts on them and how I believe they are related.
First off, hyper-convergence has usually been used to denote the “next stage” in modern converged infrastructure. With many of the popular reference architectures or pre-built systems representing some level of “convergence”, hyper-convergence has come to refer to those systems that combine multiple data center tiers into a single appliance. However, as a term, I’ve come to view “hyper-convergence” as a misnomer. When looking at the modern landscape of integrated infrastructure platforms, there is only “convergence” and “simulated convergence”. Examples of converged infrastructure include Nutanix, Simplivity, et al., while simulated convergence examples can be found in vBlock, VSPEX and FlexPod. And while there is differentiation within the simulated convergence platforms (e.g. pre-built vBlock vs. reference architectures VSPEX/FlexPod), they are only “converged” insofar as their disparate components are cabled and racked together in a branded rack and sometimes managed with common software (e.g. Cloupia). With simulated convergence, each “tier” of the data center is still represented by separate hardware components and an attempt at unity is made through the use of “single-pane” management software. Convergence differs from this in that data center tiers are consolidated into common hardware components, which naturally simplifies the management software as well.
Another interesting difference is that while simulated convergence offers simplified management and automation, convergence gives you these same things plus performance, cost and reduced-complexity benefits. Because convergence moves data center tiers into a common platform, the network, compute and storage sit in closer proximity to each other, enabling greater performance and reduced complexity. Cost savings come not only from hardware consolidation but also from lower operational expenditures.
None of this is to say that simulated convergence is worthless. On the contrary, simulated convergence via management software and reference architecture/pre-built configurations can greatly increase the consumability and ease of management of these separate components. Simulated convergence gives you increased efficiency on legacy platforms that organizations already have in place and already know how to manage. It’s an improvement over traditional processes but it is not actual convergence, which is the next logical progression.
Indeed, say what you will about specific converged offerings but it’s hard to see why convergence as a model wouldn’t be the clear path to simplified software-defined data centers. No matter how much management software and automation you put in front of it, simulated convergence will always require specialized knowledge of various levels of divergent hardware components in order to properly maintain and run that model. You would never deploy a vBlock and only train your support staff on just Cloupia or vCenter with VSI plugins. No, for advanced troubleshooting and configuration an in-depth knowledge of all the network, hypervisor, compute, storage network and array components is necessary as well. Management software can mask the complexity, but it’s still there. It doesn’t move the control plane, it just creates another one.
Converged infrastructure that relies on commodity hardware and is software/virtualization-based shifts the focus from tier-based component management and support to a more holistic data center view. Under the converged model, the deployment and ongoing maintenance of the underlying infrastructure is greatly simplified, allowing for faster application deployment, monitoring and troubleshooting. In short, you spend much less time on your physical infrastructure and more time focusing on the business. Of course, hardware is still necessary on such a system but that’s not where the intelligence lies and, as we’ve seen, there’s much less of it!
Going forward, I’m convinced that the popularity of convergence will only increase. What will be interesting to see is how the major compute/storage vendors handle this shift. As convergence increases, will a storage and compute vendor team up to sell their own converged solution? Will one of the startup convergence companies be acquired? Whatever happens, this will be one of the more exciting areas of IT to be involved with for many years to come. I can’t wait!
Over the past few years there has been no shortage of excellent blog posts detailing how to properly configure resource pools in a vSphere environment. Despite the abundance, quality and availability of this information, resource pools still seem to be the #1 most commonly misconfigured item on every VMware health check I’m involved with. Even though this is well-trodden territory, I wanted to offer my own way of explaining this issue, if for nothing else than to have a place to direct people for information on resource pools.
What follows below is a simple diagram I usually draw on a whiteboard to help explain to customers how resource pools work.
There’s not much to say that the pictures don’t already show. Just remember to keep adjusting your pool share values as new VMs are added to the pool. Also note that while I assigned 8000:4000:2000 to the VMs in the High:Normal:Low pools above, I could have just as easily assigned 8:4:2 to the same VMs and achieved the same results. It’s the ratio between VMs that counts. In either example, a VM in the “High” pool gets twice the resources under contention of a VM in the “Normal” pool and four times those of a VM in the “Low” pool.
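The share arithmetic above can be checked with a few lines of Python. This is only a sketch: the VM names are made up, `pool_shares` assumes you size a pool as the per-VM target multiplied by the number of VMs in it, and it models shares only (no reservations or limits).

```python
from fractions import Fraction


def contention_split(vm_shares):
    """Return each VM's fraction of a fully contended resource."""
    total = sum(vm_shares.values())
    return {vm: Fraction(value, total) for vm, value in vm_shares.items()}


def pool_shares(per_vm_shares, vm_count):
    """Pool-level share value that keeps every VM at its per-VM target (assumed sizing rule)."""
    return per_vm_shares * vm_count


# Hypothetical VMs, one per pool, using the per-VM values from the text.
large = contention_split({"high-vm": 8000, "normal-vm": 4000, "low-vm": 2000})
small = contention_split({"high-vm": 8, "normal-vm": 4, "low-vm": 2})

print(large == small)        # compares the 8000:4000:2000 and 8:4:2 splits
print(large["high-vm"])      # fraction of the resource the "High" VM receives
print(pool_shares(8000, 5))  # value to set on the "High" pool once it holds five VMs
```

Using `Fraction` keeps the comparison exact, which makes the point that only the ratio matters easy to verify.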
Looking for more information on resource pools?
- Understanding Resource Pools in VMware vSphere – Chris Wahl
- Label Resource Pools with Per VM Shares Value – Chris Wahl
- The Resource Pool Priority-Pie Paradox – Duncan Epping
- Shares set on Resource Pools – Duncan Epping
- Custom Shares on a Resource Pool, scripted – Duncan Epping
- Don’t add resource pools for fun, they’re dangerous – Eric Sloof
- Resource pools memory reservations – Frank Denneman
Feel free to send me any other good resource pool links in the comments section and I’ll add them to my list.
Below you’ll find step-by-step instructions on setting up a Cisco UCS environment for the first time. I wanted to post this as a general guideline for those new to UCS who may be setting up their first lab or production environments. It’s important to note that UCS is highly customizable and that configuration settings will be different between environments. So, what you’ll see below is a fairly generic configuration of UCS with an ESXi service profile template. Also important to note is that since the purpose of this is to aid UCS newcomers in setting up UCS for the first time, I’ve done many of these steps manually. Most of the configuration below can be scripted, and pools and policies can be created in the service profile template wizard, but to really learn where things are the first time, I recommend doing it this way.
This is a pretty lengthy blog post, so if you’d like it in .pdf format, click here.
There’s really not more to say on a general level that the pictures don’t already show. Based on how your environment is set up and the type of connectivity you require, the cabling could be much different than what is pictured above. The important things to note, however, are that you will always only connect a particular I/O Module to its associated Fabric Interconnect (as shown above) and for Fibre Channel connections, “Fabric A” goes to “Switch A” and likewise for Fabric B. Each switch is then connected to each storage processor. Think of the Fabric Interconnects in this scenario as separate initiator ports on a single physical server (which is how we’ll configure them in our service profile) and the cabling will make much more sense.
Configuring the Fabric Interconnects
Connect to the console port of Fabric Interconnect (FI) “A”, which will be the primary member of the cluster. Power on FI-A and leave the secondary FI off for now. Verify that the console port parameters on the attached computer are as follows: “9600 baud”, “8 data bits”, “No parity”, “1 stop bit”. You will then be presented with the following prompts (each is shown with the input to enter after it):
Enter the configuration method. (console/gui) ? console
Enter the setup mode; setup newly or restore from backup. (setup/restore) ? setup
You have chosen to setup a new Fabric interconnect. Continue? (y/n): y
Enter the password for “admin”: password
Confirm the password for “admin”: password
Is this Fabric interconnect part of a cluster(select ‘no’ for standalone)? (yes/no) [n]: yes
Enter the switch fabric (A/B) : A
Enter the system name: NameOfSystem (NOTE: “-A” will be appended to the end of the name)
Physical Switch Mgmt0 IPv4 address : X.X.X.X
Physical Switch Mgmt0 IPv4 netmask : X.X.X.X
IPv4 address of the default gateway : X.X.X.X
Cluster IPv4 address : X.X.X.X (NOTE: This IP address will be used for Management)
Configure the DNS Server IPv4 address? (yes/no) [n]: y
DNS IPv4 address : X.X.X.X
Configure the default domain name? (yes/no) [n]: y
Default domain name: domain.com
Apply and save the configuration (select ‘no’ if you want to re-enter)? (yes/no): yes
Now connect to the console port of the secondary FI and power it on. Once again, you will be presented with the following menu items:
Enter the configuration method. (console/gui) ? console
Installer has detected the presence of a peer Fabric interconnect. This Fabric interconnect will be added to the cluster. Continue (y/n) ? y
Enter the admin password of the peer Fabric interconnect: password
Physical Switch Mgmt0 IPv4 address : X.X.X.X
Apply and save the configuration (select ‘no’ if you want to re-enter)? (yes/no): yes
Both Fabric Interconnects should now be configured with basic IP and Cluster IP information. If, for whatever reason, you decide you’d like to erase the Fabric Interconnect configuration and start over from the initial configuration wizard, issue the following commands: “connect local-mgmt” and then “erase configuration”.
After the initial configuration and cabling of Fabric Interconnect A and B is complete, open a browser and connect to the cluster IP address and launch UCS Manager:
Configuring Equipment Policy
Go to the “Equipment” tab and then “Equipment->Policies”:
The chassis discovery policy “Action:” dropdown should be set to the number of links that are connected between an individual IOM and Fabric Interconnect pair. For instance, in the drawing displayed earlier each IOM had four connections to its associated Fabric Interconnect. Thus, a “4 link” policy should be created. This policy could be left at the default value of “1 link” but my personal preference is to set it to the actual number of links that should be connected between an IOM and FI pair. This policy is essentially just specifying how many connections need to be present for a chassis to be discovered.
For environments with redundant power sources/PDUs, “Grid” should be specified for a power policy. If one source fails (which causes a loss of power to one or two power supplies), the surviving power supplies on the other power circuit continue to provide power to the chassis. Both grids in a power redundant system should have the same number of power supplies. Slots 1 and 2 are assigned to grid 1 and slots 3 and 4 are assigned to grid 2.
Go to the “Equipment” tab and then “Fabric Interconnects->Fabric Interconnect A/B” and expand any Fixed or Expansion modules as necessary. Configure the appropriate unconfigured ports as “Server” (connections between IOM and Fabric Interconnect) and “Uplink” (connection to network) as necessary:
For Storage ports, go to the “Equipment” tab and then “Fabric Interconnects->Fabric Interconnect A/B” and in the right-hand pane, select “Configure Unified Ports”. Click “Yes” in the dialog box that follows to acknowledge that a reboot of the module will be necessary to make these changes. On the “Configure Fixed Module Ports” screen, drag the slider just past the ports you want to configure as storage ports and click “Finish”. Select “Yes” on the following screen to confirm that you want to make these changes:
Next, create port channels as necessary on each Fabric Interconnect for Uplink ports. Go to the “LAN” tab, then “LAN->LAN Cloud->FabricA/B->Port Channels->Right-Click and ‘Create Port Channel’”. Then give the port channel a name and select the appropriate ports and click “Finish”:
Select the Port Channel and ensure that it is enabled and is set for the appropriate speed:
Next, configure port channels for your SAN interfaces as necessary. Go to the “SAN” tab and then “SAN Cloud->Fabric A/B->FC Port Channels->Right-Click and ‘Create Port Channel’”. Then give the port channel a name, select the appropriate ports and click “Finish”:
Select the SAN port channel and ensure that it is enabled and set for the appropriate speed:
What follows are instructions for manually updating firmware to the 2.1 release on a system that is being newly installed. Systems that are currently in production will follow a slightly different set of steps (e.g. “Set startup version only”). After the 2.1 release, firmware auto install can be used to automate some of these steps. Release notes should be read before upgrading to any firmware release as the order of these steps may change over time. With that disclaimer out of the way, the first step in updating the firmware is downloading the most recent firmware packages from cisco.com:
There are two files required for B-Series firmware upgrades: an “*.A.bin” file and a “*.B.bin” file. The “*.B.bin” file contains all of the firmware for the B-Series blades. The “*.A.bin” file contains all the firmware for the Fabric Interconnects, I/O Modules and UCS Manager.
After the files have been downloaded, launch UCS Manager and go to the “Equipment” tab. From there navigate to “Firmware Management->Download Firmware”, and upload both .bin packages:
The newly downloaded packages should be visible under the “Equipment” tab “Firmware Management->Packages”.
The next step is to update the adapters, CIMC and IOMs. Do this under the “Equipment” tab “Firmware Management->Installed Firmware->Update Firmware”:
Next, activate the adapters, then UCS Manager and then the I/O Modules under the “Equipment” tab “Firmware Management->Installed Firmware->Activate Firmware”. Choose “Ignore Compatibility Check” anywhere applicable. Make sure to uncheck “Set startup version only”, since this is an initial setup and we aren’t concerned with rebooting running hosts:
Next, activate the subordinate Fabric Interconnect and then the primary Fabric Interconnect:
Creating a KVM IP Pool
Go to the “LAN” tab and then “Pools->root->IP Pools->IP Pool ext-mgmt”. Right-click and select “Create Block of IP addresses”. Next, specify your starting IP address and the total number of IPs you require, as well as the default gateway and primary and secondary DNS servers:
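Before typing a starting address and a size into that dialog, it can help to confirm where the block ends and that it stays inside the management subnet. Here is a minimal sketch using Python’s standard `ipaddress` module; the function name, the subnet and the addresses are all hypothetical examples, not values from this setup.

```python
import ipaddress


def ip_block(start, size, network):
    """Return (first, last) addresses of a block and check it fits the subnet."""
    if size < 1:
        raise ValueError("size must be at least 1")
    net = ipaddress.ip_network(network)
    first = ipaddress.ip_address(start)
    last = first + (size - 1)
    if first not in net or last not in net:
        raise ValueError(f"block {first}-{last} does not fit inside {net}")
    return str(first), str(last)


# Hypothetical values: 16 KVM addresses taken from a documentation-range subnet.
first, last = ip_block("192.0.2.50", 16, "192.0.2.0/24")
print(first, last)
```

The check does not know which addresses are already in use elsewhere on the subnet, so it only catches a block that runs past the subnet boundary.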
Creating a Sub-Organization
Creating a sub-organization is optional. Sub-organizations provide granularity and organization, and are meant to contain servers/pools/policies of different functions. To create a sub-organization, right-click any “root” directory and select “Create Organization”. Specify the name of the organization and any necessary descriptions and select “OK”. The newly created sub-organization will now be visible in most tabs under “root->Sub-Organizations”:
Create a Server Pool
To create a server pool, go to the “Servers” tab and then “Pools->Sub-Organization->Server Pools”. Right-click “Server Pools” and select “Create Server Pool”. From there, give the pool a name and select the servers that should be part of the pool:
Creating a UUID Suffix Pool
Go to the “Servers” tab and then “Pools->Sub-Organizations->UUID Suffix Pool”. Right-click and select “Create UUID Suffix Pool”. Give the pool a name and then create a block of UUID Suffixes. I usually try to create some two letter/number code that will align with my MAC/HBA templates and allow me to easily identify a server (e.g. “11” for production ESXi):
Creating MAC Pools
For each group of servers (e.g. “ESXi_Servers”, “Windows_Servers”, etc.), create two MAC pools: one that will go out the “A” fabric and another that will go out the “B” fabric. Go to the “LAN” tab, then “Pools->root->Sub-Organization”, right-click “MAC Pools” and select “Create MAC Pool”. From there, give each pool a name and MAC address range that will allow you to easily identify the type of server it is (e.g. “11” for production ESXi) and the fabric it should be going out (e.g. “A” or “B”):
Whole blog posts have been written on MAC pool naming conventions. To keep things simple for this initial configuration, I’ve chosen a fairly simple naming convention where “11” denotes a production ESXi server and “A” or “B” denotes which FI traffic should be routed through. If you have multiple UCS pods and multiple sites, consider creating a slightly more complex naming convention that will allow you to easily identify exactly where traffic is coming from by simply reviewing the MAC address information. The same goes for WWNN and WWPN pools as well.
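One way to picture the convention described above is to build the first address of each pool from the server-type code and the fabric letter, then read them back out of any address. This is an illustrative sketch only: the octet layout, the helper names and the `00:25:B5` prefix are assumptions, so substitute whatever prefix and layout your environment actually uses.

```python
HEX_DIGITS = "0123456789ABCDEFabcdef"


def mac_pool_start(type_code, fabric, prefix="00:25:B5"):
    """Build the first MAC address of a pool from a type code and fabric letter."""
    if len(type_code) != 2 or any(c not in HEX_DIGITS for c in type_code):
        raise ValueError("type_code must be two hex digits")
    if fabric not in ("A", "B"):
        raise ValueError("fabric must be 'A' or 'B'")
    # Assumed layout: prefix : type code : fabric letter + 0 : counter
    return f"{prefix}:{type_code.upper()}:{fabric}0:00"


def describe_mac(mac, type_names):
    """Decode the server type and fabric from an address built with the layout above."""
    octets = mac.split(":")
    return type_names.get(octets[3], "unknown"), octets[4][0]


print(mac_pool_start("11", "A"))
print(mac_pool_start("11", "B"))
print(describe_mac("00:25:B5:11:B0:2F", {"11": "production ESXi"}))
```

With a layout like this, a MAC address seen on an upstream switch tells you the server type and the fabric it left through without opening UCS Manager.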
Creating WWNN Pools
To create a WWNN Pool, go to the “SAN” tab, then “Pools->root->Sub-Organization”. Right-click on “WWNN Pools” and select “Create WWNN Pool”. From there, create a pool name and select a WWNN pool range. Each server should have two HBAs and therefore two WWNNs, so the number of WWNNs should be the number of servers in the pool multiplied by 2:
Create WWPN Pools
Each group of servers should have two WWPN Pools, one for the “A” fabric and one for “B”. Go to the “SAN” tab, then “Pools->root->Sub-Organization”. Right-click on “WWPN Pools” and select “Create WWPN Pool”. From there, give the pool a name and WWPN range:
Creating a Network Control Policy
Go to the “LAN” tab, then “Policies->root->Sub-Organizations->Network Control Policies”. From there, right-click “Network Control Policies” and select “Create Network Control Policy”. Give the policy a name and enable CDP:
Go to the “LAN” tab and then “LAN->LAN Cloud->VLANs”. Right-click on “VLANs” and select “Create VLANs”. From there, create a VLAN name and ID:
Go to the “SAN” tab and then “SAN->SAN Cloud->VSANs”. Right-Click “VSANs” and select “Create VSAN”. From there, specify a VSAN name, select “Both Fabrics Configured Differently” and then specify the VSAN and FCoE ID for both fabrics:
After this has been done, go to each FC Port-Channel in “SAN” tab “SAN->SAN Cloud->Fabric A/B->FC Port Channels” and select the appropriate VSAN. Once the VSAN has been selected, “Save Changes”:
Creating vNIC Templates
Each group of servers should have two templates. One going out the “A” side of the fabric and one going out the “B” side. Go to the “LAN” tab, then “Policies->root->Sub-Organization->vNIC Templates”. Right-click on “vNIC Templates” and select “Create vNIC Template”. Give the template a name, specify the Fabric ID and select “Updating Template”. Also specify the appropriate VLANs, MAC Pool and Network Control Policy:
Creating vHBA Templates
Each group of servers should have two templates. One going out the “A” side of the fabric and one going out the “B” side. Go to the “SAN” tab, then “Policies->root->Sub-Organization->vHBA Templates”. Right-click on “vHBA Templates” and select “Create vHBA Template”. Give the template a name, specify the Fabric ID and select “Updating Template”. Also specify the appropriate WWPN Pool:
Creating a BIOS policy
For hypervisors, I always disable Speedstep and Turbo Boost. Go to the “Servers” tab, then “Policies->root->Sub-Organizations->BIOS Policies”. From there, right-click on “BIOS Policies” and select “Create BIOS Policy”. Give the policy a name and under “Processor”, disable “Turbo Boost” and “Enhanced Intel Speedstep”:
Creating a Host Firmware Policy
Go to the “Servers” tab, then “Policies->root->Sub-Organizations->Host Firmware Packages”. Right-click “Host Firmware Packages” and select “Create Host Firmware Package”. Give the policy a name and select the appropriate package:
Create Local Disk Configuration Policy
Go to the “Servers” tab, then “Policies->root->Sub-Organizations->Local Disk Config Policies”. Right-click “Local Disk Config Policies” and select “Create Local Disk Configuration Policy”. Give the policy a name and under “Mode:” select “No Local Storage” (assuming you are booting from SAN):
Create a Maintenance Policy
Go to the “Servers” tab, then “Policies->root->Sub-Organizations->Maintenance Policies”. Right-click “Maintenance Policies” and select “Create Maintenance Policy”. From there, give the policy a name and choose “User ack”. “User ack” just means that the user/admin has to acknowledge any maintenance tasks that require a reboot of the server:
Create a Boot Policy
Go to the “Servers” tab, then “Policies->root->Sub-Organizations->Boot Policy”. Right-click “Boot Policy” and select “Create Boot Policy”. Give the policy a name and add a CD-ROM as the first device in the boot order. Next, go to “vHBAs” and “Add SAN Boot”. Name the vHBAs the same as your vHBA templates. Each “SAN Boot” vHBA will have two “SAN Boot Targets” that will need to be added. The WWNs you enter should match the cabling configuration of your Fabric Interconnects. As an example, the following cabling configuration…:
…should have the following boot policy configuration:
Creating a Service Profile Template
Now that you have created all the appropriate policies, pools and interface templates, you are ready to build your service profile. Go to the “Servers” tab and then “Servers->Service Profile Templates->root->Sub-Organizations”. Right-click on the appropriate sub-organization and select “Create Service Profile Template”. Give the template a name, select “Updating Template” and specify the UUID pool created earlier. An updating template will allow you to modify the template at a later time and have those modifications propagate to any service profiles that were deployed using that template:
In the “Networking” section, select the “Expert” radio button and “Add” six vNICs for ESXi hosts (2 for MGMT, 2 for VMs, 2 for vMotion). After clicking “Add” you will go to the “Create vNIC” dialog box. Immediately select the “Use vNIC Template” checkbox, select vNIC Template A/B and the “VMware” adapter policy. Alternate between the “A” and “B” templates on each vNIC:
In the “Storage” section, specify the local storage policy created earlier and select the “Expert” radio button. Next “Add” two vHBAs. After you click “Add” and are in the “Create vHBA” dialog box, immediately select the “Use vHBA Template” checkbox and give the vHBA a name. Select the appropriate vHBA Template (e.g. vHBA_A->ESXi_HBA_A, etc.) and adapter policy:
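The alternating A/B pattern for those six vNICs can be written out as a quick planning aid before you start clicking through the wizard. This is a sketch, not UCS Manager output: the vNIC and template names below are hypothetical placeholders for whatever you named your own.

```python
def plan_vnics(roles, per_role=2):
    """Return (vnic_name, template_name) pairs, alternating fabrics A and B."""
    plan = []
    for role in roles:
        for index in range(per_role):
            fabric = "AB"[index % 2]  # first vNIC of a role on A, second on B
            plan.append((f"vNIC-{role}-{fabric}", f"vNIC_Template_{fabric}"))
    return plan


# Two vNICs each for management, VM traffic and vMotion, as in the text.
vnic_plan = plan_vnics(["MGMT", "VM", "vMotion"])
for name, template in vnic_plan:
    print(f"{name:<16} -> {template}")
```

Pairing each role with one vNIC per fabric means every traffic type keeps a path if one Fabric Interconnect is unavailable.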
Skip the “Zoning” and “vNIC/vHBA Placement” sections by selecting “Next”. Then, in the “Server Boot Order” section, select the appropriate boot policy:
In the “Maintenance Policy” section, select the appropriate maintenance policy:
In the “Server Assignment” section, leave the “Pool Assignment” and power state options at their default. Select the “Firmware Management” dropdown and select the appropriate firmware management policy:
In “Operational Policies”, select the BIOS policy created earlier and then “Finish”:
Deploying a Service Profile
To deploy a service profile from a template, go to the “Servers” tab, then “Servers->Service Profile Templates->root->Sub-Organizations”. Right-click the appropriate service profile template and select “Create service profiles from template”. Select a naming prefix and the number of service profiles you’d like to create:
To associate a physical server with the newly created profile, right-click the service profile and select “Change service profile association”. In the “Associate Service Profile” dialog box, choose “Select existing server” from the “Server Assignment” drop down menu. Select the appropriate blade and click “OK”:
You can have UCS Manager automatically assign a service profile to a physical blade by associating the service profile template with a server pool. However, most people don’t want the order in which UCS automatically assigns profiles to blades, and the manual method above allows you to assign profiles to specific slots for better organization.
Configuring Call Home
Go to the “Admin” tab and then “Communication Management->Call Home”. In the right-hand pane, turn the admin state to “On” and fill out all required fields:
In the “Profiles” tab, add firstname.lastname@example.org to the “Profile CiscoTAC-1”. Add the internal email address to the “Profile full_txt”:
Under “Call Home Policies”, add the following. More policies could be added but this is a good baseline that will alert you to any major equipment problems:
Under “System Inventory”, select “On” next to “Send Periodically” and change to a desirable interval. Select “Save Changes” and then click the “Send System Inventory Now” button and an email should be sent to email@example.com:
In the “Admin” tab, select “Time Zone Management”. Click “Add NTP Server” in the right-hand pane to add an NTP server and select “Save Changes” at the bottom:
Backing up the Configuration
Go to the “Admin” tab and then “All”. In the right-hand pane, select “Backup Configuration”. From the “Backup Configuration” dialog box, choose “Create Backup Operation”. Change Admin states to “Enabled” and do a “Full State” and then an “All Configuration” backup. Make sure to check “Preserve Identities:” when doing an “All Configuration” backup and save both backups to the local computer and then to an easily accessible network location:
After backing up your configuration you can start your ESXi/Windows/Linux/etc. host configurations! Now that all the basic prep-work has been done, deploying multiple servers from this template should be a breeze. Again, it’s important to note that what is shown above are some common settings typically seen in UCS environments, particularly when setting up ESXi service profile templates. Certainly, there could be much more tweaking (BIOS, QoS settings, MAC Pool naming conventions, etc.) but these settings should give you a general idea of what is needed for a basic UCS config.
I’ve had a number of customers ask me about the steps needed to set up Windows boot from SAN in a Cisco UCS environment. There are a number of resources out there already, but I wanted to go ahead and create my own resource that I could consistently point people to when the question comes up. So, without further ado…
Assuming the service profile has already been built with a boot policy specifying CD-ROM and then SAN storage as boot targets, complete the following steps to install Microsoft Windows in a boot from SAN environment on Cisco UCS:
1. First, download the Cisco UCS drivers from Cisco.com. Use the driver .iso file that matches the level of firmware you are on:
2. Next, boot the server and launch the KVM console. From the “Virtual Media” tab, add the Windows server boot media as well as the drivers .iso file downloaded in the previous step and map the Windows boot media. After the server is booted, zone only one path to your storage array (e.g. vHBA-A -> SPA-0). Once the path has been zoned, you can also register the server on the array and add to the appropriate storage groups. Remember, it is very important that you only present one path to your storage array until multipathing can be configured on Windows after the installation. A failure to do this will result in LUN corruption.
3. Once the installation reaches the point where you select the disk to install Windows on, the installation process will notify you that drivers were not found for the storage device. Go back to the “Virtual Media” tab and map the drivers .iso file:
6. After selecting the appropriate driver, the new drive should appear (you may have to select “Refresh” if it does not show up immediately). Re-map the Windows media and continue with the installation:
7. After Windows is fully installed, configure the desired multipathing software and zone and register the rest of the paths to the array.
That’s about it! This is really a very simple procedure. The most important things to note are to get the appropriate drivers and to zone only one path during installation.
VMware has a KB article detailing a bug present in ESXi 5.0 that has been known to cause a variety of networking issues in iSCSI environments. Until last week, I had not encountered this particular bug and thought I’d detail my experiences troubleshooting this issue for those still on 5.0 that may experience this issue.
The customer I was working with had originally called for assistance because their storage array was only reporting 2 out of 4 available paths “up” to each connected iSCSI host. All paths had originally been up/active until a recent power outage and since then, no manner of rebooting or disabling/re-enabling had been successful in bringing them all back up simultaneously. Their iSCSI configuration was fairly standard, with 2 iSCSI port groups connected to a single vSwitch per-server and each port group connected to separate iSCSI networks. Each port group in this configuration has a different NIC specified as an “Active Adapter” and the other is placed under the “Unused Adapters” heading.
One of the first things that I wanted to rule out was a hardware issue related to the power outage. However, after not much time troubleshooting, I discovered that simply disabling and re-enabling a NIC port on the iSCSI switches would cause the “downed” paths to become active again within the storage array, and the path that was previously “up” would go down. As expected, a vmkping was never successful through a NIC that was not registering properly on the storage array. Everything appeared to be configured correctly within the array, the switches and the ESXi hosts, so at this point I had no clear culprit and needed to rule out potential causes. Luckily these systems had not been placed into production yet, so I was granted a lot of leeway in my troubleshooting process.
- Test #1. For my first test I wanted to rule out the storage array. I was working with this customer remotely, so I had them unplug the array from the iSCSI switches and plug it into a Linksys switch they had lying around. I then had them plug their laptop into this same switch and assign it an IP address on each of the iSCSI networks. All ping tests to each interface were successful, so I was fairly confident at this point that the array was not the cause of this issue.
- Test #2. For my second test I wanted to rule out the switches. I had the customer plug all array interfaces back into the original iSCSI switches. I then had them unplug a few ESXi hosts from the switches. Then they assigned their laptop the same IP addresses as the unplugged ESXi host iSCSI port groups and ran additional ping tests from the same ports the ESXi hosts were using. All ping tests on every interface were successful, so it appeared unlikely that the switches were the culprit.
At this point it appeared almost certain that the ESXi hosts were the cause of the problems. They were the only component that appeared to be having any communication issues, as all other components taken in isolation communicated just fine. It was also evident that something with the NIC failover/failback wasn’t working correctly (given the behavior when we disabled/re-enabled ports), so I put the iSCSI port groups on separate vSwitches. BINGO! Within a few seconds of doing this I could vmkping on all ports and the storage array was showing all ports active again. Given that this is not a required configuration for iSCSI networking on ESXi, I immediately started googling for known bugs. Within a few minutes I ran across this excellent blog post by Josh Townsend and the KB article I linked to above. The issue caused by the bug is that ESXi will actually send traffic down the “unused” NIC during a failover scenario.
This is why separating the iSCSI port groups “fixed” the issue. There was no unused NIC in the port group for ESXi to mistakenly send the traffic to. It also explained the behavior where disabling/re-enabling a downed port would cause it to become active again (and vice versa). In this case ESXi was sending traffic down the unused port, and my disable/re-enable caused a failover event that made ESXi send traffic down the active adapter again.
In my case, upgrading to 5.0 Update 1 completely fixed this issue. I’ll update this post if I run across this problem with any other version of ESXi. In the meantime, just note the workaround I described above, which is also outlined in both links.
Both VMware View and Citrix XenDesktop require permissions within vCenter to provision and manage virtual desktops. VMware and Citrix both have documentation on the exact permissions required for this user account. Creating a service account with the minimal amount of permissions necessary, however, can be cumbersome and as a result, many businesses have elected to just create an account with “Administrator” permissions within vCenter. While much easier to create, this configuration will not win you any points with a security auditor.
To make this process a bit easier I’ve created a couple of quick scripts, one for XenDesktop and one for View, that create “roles” with the minimal permissions necessary for each VDI platform. For XenDesktop, the script will create a role called “Citrix XenDesktop” with the privileges specified here. For View, the script will create a role called “VMware View” with the privileges specified on pages 87-88 here. VMware mentions creating three roles in its documentation, but I just created one with all the permissions necessary for View Manager, Composer and local mode. Removing the “local mode” permissions is easy enough in the script if you don’t think you’re going to use it, and the vast majority of View deployments I’ve seen use Composer, so I didn’t see it as necessary to separate that into a different role either. You’ll also note that I used the privilege “Id” instead of “Name”. The problem I ran into is that “Name” is not unique across privileges (e.g. there is a “Power On” under both “vApp” and “Virtual Machine”) while “Id” is unique. So, for consistency’s sake I just used “Id” to reference every privilege. The only thing that needs to be modified in these scripts is the vCenter IP/hostname entered after “Connect-VIServer”.
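To illustrate the “Name” versus “Id” point, here is a small Python sketch. It is not part of the PowerCLI scripts, and the privilege list is only a hand-picked illustrative subset that I am assuming matches vCenter’s naming. It groups privileges by each field and reports which keys map to more than one privilege.

```python
from collections import defaultdict

# Illustrative subset of vCenter privileges as (Id, Name) pairs.
# Assumed for the example; check your own vCenter for the real list.
PRIVILEGES = [
    ("VirtualMachine.Interact.PowerOn", "Power On"),
    ("VApp.PowerOn", "Power On"),
    ("VirtualMachine.Interact.PowerOff", "Power Off"),
    ("VApp.PowerOff", "Power Off"),
    ("Datastore.AllocateSpace", "Allocate space"),
]


def index_by(field):
    """Group privilege Ids by the chosen field, either 'id' or 'name'."""
    position = {"id": 0, "name": 1}[field]
    groups = defaultdict(list)
    for privilege in PRIVILEGES:
        groups[privilege[position]].append(privilege[0])
    return dict(groups)


def ambiguous(groups):
    """Return only the keys that resolve to more than one privilege."""
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```

With this sample data, `ambiguous(index_by("name"))` flags both “Power On” and “Power Off” as matching two privileges each, while `ambiguous(index_by("id"))` comes back empty, which is why referencing every privilege by “Id” is the safer choice.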
Of course, these scripts could be expanded to automate more tasks, such as creating a user account and giving access to specific folders or clusters, etc., but I will let all the PowerCLI gurus out there handle that. Really, the only goal of these scripts is to automate the particular task that most people skip due to its tedious nature. Feel free to download, critique and expand as necessary.
Documentation for creating custom load evaluators in Citrix has existed for some time. Articles detailing the folly of using the “Default” load evaluator have been around for a while as well. Citrix even has an excellent whitepaper titled “Top 10 items found by Citrix Consulting on Assessments” that lists improper load management as the 2nd overall most common misconfigured item found by Citrix consulting and even gives an example baseline custom load evaluator. Despite all this, environments using the Default load evaluator are still prevalent and make up at least half the Citrix assessments I’m involved with. When words fail to make an impression, sometimes a visual can help:
The problem with the Default load evaluator is clear: it takes user distribution into account but not actual server resource consumption. Citrix load indexes are calculated on a 0-10,000 scale (you can see the value for each server with the “qfarm /load” command), with 10,000 being a “full” server. As you can see above, Server03 is the least busy from a Citrix perspective (since it has the fewest users logged on), despite being the busiest from a server perspective. Further, the Default load evaluator sets the maximum number of users per server at “100” while the environment above will not support more than 25-30. So from both a load distribution and a capacity perspective, the Default load evaluator is clearly ill-suited for any production environment.
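The arithmetic behind this can be sketched in a few lines of Python. This is a simplified model, not Citrix’s actual implementation: I am assuming each rule scales linearly up to 10,000 and that the server reports the highest value among its rules, and the user counts and utilization figures are made up for illustration.

```python
FULL_LOAD = 10_000  # a "full" server on the Citrix load index scale


def user_load(users, max_users):
    """Load from a user-count rule: linear up to max_users, capped at full."""
    return min(FULL_LOAD, round(FULL_LOAD * users / max_users))


def default_evaluator(users):
    """Model of the Default evaluator: user count only, 100-user ceiling."""
    return user_load(users, 100)


def custom_evaluator(users, cpu_pct, mem_pct, max_users=25):
    """Hypothetical custom evaluator: users, CPU and memory rules.

    Simplifying assumption: CPU and memory percentages map linearly onto
    the 0-10,000 scale, and the highest rule value is the server's load.
    """
    rules = [
        user_load(users, max_users),
        round(FULL_LOAD * cpu_pct / 100),
        round(FULL_LOAD * mem_pct / 100),
    ]
    return min(FULL_LOAD, max(rules))
```

Under these assumptions, a server with 20 users running at 95% CPU and 70% memory scores only 2,000 with `default_evaluator(20)` but 9,500 with `custom_evaluator(20, 95, 70)`. The first value invites more logons; the second tells the farm to send them elsewhere.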
A custom load evaluator that accounts for resource consumption takes less than 5 minutes to create and apply to the appropriate servers in your farm. As mentioned previously, the Citrix whitepaper I linked to above has a good baseline custom load evaluator that should get you started. So take the time to make this simple farm optimization. Your users will thank you!