Licensing bug brings down VMware ESX data clusters
Could everyone's VMware licenses really have expired on August 12? That's the question hundreds of major data centers found themselves asking right after midnight, when they realized their virtual machines weren't rebooting or resuming.
In what appears to be a fault with its license validation, virtualized data clusters worldwide running on VMware's ESX hypervisor found themselves unable to boot yesterday. Admins received messages saying their licenses had expired, whether or not they actually had.
"msg.License.product.expired This product has expired," reads a cut-and-paste from a message posted to VMware's support forum. "Be sure that your host machine's date and time are set correctly."
The problem appears limited to the VMware ESX 3.5 and ESXi 3.5 Update 2 hypervisors, and that includes clusters where VMotion is installed. VMotion is a dynamic tool that performs automatic maintenance on virtual servers -- which should presumably include license updates -- and which moves the physical location of virtual servers to better performing systems when necessary.
Not only could virtual machines not be restarted after midnight on August 12, but once suspended, they couldn't be resumed. And though VMotion was relied upon to provide the solution in some cases, it didn't.
"The issue was caused by a piece of code that was mistakenly left enabled for the final release of Update 2," VMware CEO Paul Maritz wrote. "This piece of code was left over from the pre-release versions of Update 2 and was designed to ensure that customers are running on the supported generally available version of Update 2." He went on to accept the blame for not disabling the code for the final Update 2, and for not catching the problem during the QA process.
Maritz is the former Microsoft executive who was in charge of Windows while that company was fighting its antitrust case with the US Justice Department. He replaced Diane Greene, who founded VMware, after EMC's buyout of the software company.
Users of VMware's support forums suggested that others plagued with the problem reset their system clocks to August 10 until patches could be installed. But VMware today advised admins to avoid that route, warning in a Knowledgebase memo about "very serious side affects [sic] that could impact production environments. Any Virtual Machines that sync time with the ESX host and server time sensitive applications would be broken. These include, but are not limited to database servers, mail servers, & domain administration systems."
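To illustrate the kind of failure Maritz describes, and why rolling the clock back sidesteps it, here is a minimal sketch in Python. It is not VMware's code: the function name, the flag, and the hard-coded cutoff are all hypothetical, chosen only to match the August 12 date reported above.

```python
from datetime import datetime

# Hypothetical cutoff for illustration only; not taken from VMware's code.
PRERELEASE_EXPIRY = datetime(2008, 8, 12)


def may_power_on(now, license_valid, prerelease_check_enabled):
    """Return True if a virtual machine would be allowed to start or resume.

    `prerelease_check_enabled` stands in for a build-time switch that
    should be turned off in the final release (an assumption of this sketch).
    """
    if prerelease_check_enabled and now >= PRERELEASE_EXPIRY:
        # The leftover check fires regardless of the customer's real license.
        return False
    return license_valid


just_before = datetime(2008, 8, 11, 23, 59)
just_after = datetime(2008, 8, 12, 0, 1)
clock_rolled_back = datetime(2008, 8, 10)

print(may_power_on(just_before, True, True))        # valid license, before cutoff
print(may_power_on(just_after, True, True))         # valid license, after cutoff
print(may_power_on(clock_rolled_back, True, True))  # host clock reset to August 10
print(may_power_on(just_after, True, False))        # check disabled, as intended
```

In this sketch a valid license makes no difference once the host clock passes the cutoff, which is why resetting the clock appears to work. It does so only by giving every time-dependent guest the wrong date, the side effect VMware's memo warns about.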
Could the damage from virtual system outages be mitigated if the central administration system for virtual clusters, called Virtual Center (VC), were run on its own virtual system instead of a physical system? Or should VC be run on a separate physical system in all cases? Those are among the questions admins had today, during their efforts to piece their clusters back together.
As one support forum user, aenagy, pointed out, VMware's documentation doesn't appear to be against the idea of running VC in a virtual system. "Except, of course, in the very unlikely event that the software determines it has 'expired,'" he added, "and then regardless of where your license server is, you're screwed."