'Amateur' Linux IBM mainframe failure blamed for stranding New Zealand flyers
12:05 pm EDT October 11, 2009 · The president of a design firm that specializes in data center power efficiency, and that was working on a new design last year for the Auckland-based data center that failed Friday morning, told Betanews today that even if changes were being made to that data center, the failure would not have happened had both the original design and the changeover plan been implemented properly.
"What seems strange about this incident is that they are blaming it on a generator failure during testing," stated California Data Center Design Group President Ron Hughes, whose organization was not responsible either for the data center's current design or the changeover. "If this failure did occur during testing, the question I would ask is why didn't the redundant generators assume the load or why didn't they just switch back to utility power."
Though Hughes has no specific knowledge of last Friday's incident, his insight does shed more light on the situation.
"A properly designed Tier 3 data center -- which is the minimum level required for any critical applications -- should have no single points of failure in its design. In other words, the failure of a single piece of equipment should not impact the customer," Hughes told Betanews. "A generator failure is a fairly common event, which is why we build redundancy into a system. In a Tier 3 data center, if you need one generator to carry the load, you install two. If you need two, you install three. This is described as N+1 redundancy. It allows you to have a failure without impacting your ability to operate...In a Tier 3 data center, it should take 2 failure events before the customer is impacted."
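The arithmetic behind Hughes' N+1 rule can be sketched in a few lines of Python. This is only an illustration of the rule as he states it, not a model of the Auckland facility: the function names and the `spares` parameter are hypothetical, and it assumes every generator has the same capacity.

```python
def installed_generators(needed: int, spares: int = 1) -> int:
    """Units to install under N+spares redundancy (N+1 by default)."""
    if needed < 1 or spares < 0:
        raise ValueError("needed must be >= 1 and spares must be >= 0")
    return needed + spares


def carries_load(needed: int, installed: int, failed: int) -> bool:
    """True if the surviving generators can still carry the full load."""
    return installed - failed >= needed


def failures_until_outage(needed: int, installed: int) -> int:
    """Number of generator failures it takes before the load is dropped."""
    return installed - needed + 1


# Hughes' two examples: need one, install two; need two, install three.
for needed in (1, 2):
    installed = installed_generators(needed)
    print(needed, installed, failures_until_outage(needed, installed))
```

Under these assumptions, a site that needs two generators and installs three survives one failure and goes dark only on the second, which is the "2 failure events" Hughes refers to.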
The CEO of Air New Zealand -- one of the few major CEOs anywhere to have been elevated to the top post from a CIO position -- expressed his disgust last weekend over what he describes as the poor handling of a data center failure at his airline's outsourcing partner, IBM. Rob Fyfe's e-mail, made public by IDG's Randal Jackson, excoriates IBM for its handling of a systems outage that took place at 9:30 am local time Friday morning, and that lasted for at least six hours.
During the entire time, ticketing, baggage handling, and traffic rerouting procedures for the entire airline were at a standstill, causing chaos for airports there. This at a time when Air New Zealand was engaged in a public showdown with its chief rivals there, Pacific Blue and Qantas subsidiary Jetstar, challenging them to meet ANZ's standards for flight punctuality.
"In my 30-year working career," Fyfe told his colleagues, "I am struggling to recall a time where I have seen a supplier so slow to react to a catastrophic system failure such as this and so unwilling to accept responsibility and apologize to its client and its client's customers...We were left high and dry and this is simply unacceptable. My expectations of IBM were far higher than the amateur results that were delivered yesterday."
Indeed, even as of this morning, IBM New Zealand has issued no public statements. The data center failure apparently affected all of IBM's customers in the region, not just the airline, although there is no word yet as to the identity of those customers or the extent of damage to their operations.
The move to outsource data center operations to IBM appears to have happened partly under Fyfe's watch as CIO, and was heavily touted at the time in IBM's marketing literature as a "design win" for mainframe-based Linux. Though some mainframe database operations for ANZ came online as early as 1999, the most lucrative move came in August 2002, when the airline replaced its mid-range Windows NT-based in-house network, made up of 150 Compaq z800 workstations, with a single eServer zSeries Linux outsourced mainframe hosted by IBM Global Services. The airline's CIO at the time of the move, Andrew Care, said maintaining the outsourced zSeries would cost his airline 30% less in maintenance fees, and save $600,000 in software licenses.
The migration was seen as a huge loss for Microsoft, whose NT operating system was already well on its way to having been branded a failure for mid-level networks.
IBM established the global airline industry standard software for transaction processing as far back as 1960, in a joint project with American Airlines called the Airlines Control Program, which made possible the original, groundbreaking Sabre system. Since 1979, IBM has sold other airlines a commercial version of this system, called Transaction Processing Facility (TPF).
To this day, the transaction format used by airlines everywhere is based on ACP's half-century-old protocol. It isn't the format that has needed evolution, but rather the software that runs it; and IBM itself has been the key innovator here, developing a new class of software this decade called the Airlines Control System (ALCS). Originally seen as a mid-level alternative to a higher-class TPF system for smaller airlines that couldn't afford big iron, ALCS -- a TPF emulator -- now runs on bigger iron, thanks to the evolution in hardware as well.
Air New Zealand was one of ALCS' biggest customer wins in August 2002. Up to now, the airline has been one of ALCS' more active supporters, contributing a big chunk of new requirements for the software's latest version, according to literature from the UK-based ALCS User Group.
At this point, Air New Zealand may have too much investment tied up in the software to be in any position to migrate its applications to an IBM competitor -- if there even really is one in this field. But the airline's problem may not be so much with the software as with its current host.
According to an ANZ group general manager cited in local radio news reports, the offline incident was traced to a single generator failure at IBM's Newton Data Center in Auckland. Data centers usually have redundant power sources, and normally the Newton center would not be an exception. An August 2008 article in Data Center Journal, by the designer of new energy-efficient data center power generators with redundant sourcing, specifically mentioned the Auckland center as one of his customers at that time.
"I've seen numerous references recently to reducing the amount of redundancy as a way to achieve higher energy efficiency," wrote engineer Ron Hughes, president of California-based Data Center Design Group, referring to his Auckland data center project. "While I have no doubt that it is true, it may not be in the long term interest of the client. Data center outages can be career changing events. That extra redundancy may be the difference between a component failure with little impact and a system-wide outage."
Current ANZ CIO Julia Raue has been overseeing an innovative new information systems project at her airline involving customizable self-serve ticketing kiosks, which customers themselves can change online, using selectable widgets, to suit their airport demands. In an interview with CIO Magazine last month, iGoogle was credited as a design inspiration for the self-serve system. But the entire system revolves around the zSeries mainframe, whose uptime last week appeared to hinge on a single faulty generator.
While CEO Fyfe certainly has understandable reasons for wanting to abandon IBM, with his entire information strategy dependent on the move ANZ made in 2002, he may not have many alternatives open.