Outsourced IT probably hurt Delta Airlines when its power went out

Delta Airlines last night suffered a major power outage at its data center in Atlanta that led to a systemwide shutdown of its computer network, stranding airliners and canceling flights all over the world. You already know that. What you may not know, however, is the likely role in the crisis of IT outsourcing and offshoring.

Whatever the cause of the Delta Airlines power outage, data center recovery pretty much follows the same routine I used 30 years ago when I had a PDP-8 minicomputer living in my basement and heating my house. First you crawl around, find the power shut-off, and turn off the power. I know there is no power, but the point is that when power returns we don’t want a surge damaging equipment. Then you crawl around some more and turn off power to every individual device. Wait in the dark for power to be restored, either by the utility or a generator. Once power has been restored, turn the main power switch back on, then crawl back to every device, turning them back on in a specific order that follows your emergency plan. You do have an emergency plan, right? In the case of the PDP-8, toggle in the code to launch the boot PROM loader (yes, I have done this in complete darkness). Reboot all equipment and check for errors. Once everything is working well together, reconnect data to the outside world.
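
To make the sequence concrete, here is a minimal sketch of that staged bring-up in Python. It is not Delta's procedure or anyone else's; the device list, the health check, and the ordering are invented for illustration. The point is simply that power-on follows the emergency plan's order and the outside-world data links come back last.

```python
# A minimal sketch of a staged data center bring-up. The devices, health checks,
# and ordering are hypothetical; real plans are far more detailed.
import time

# Emergency-plan order: lowest number powers on first.
POWER_ON_PLAN = [
    (1, "core network switches"),
    (2, "SAN storage arrays"),
    (3, "database servers"),
    (4, "application servers"),
    (5, "external data links"),   # reconnect to the outside world last
]

def power_on(device: str) -> None:
    """Stand-in for physically flipping the switch on one device."""
    print(f"powering on: {device}")

def healthy(device: str) -> bool:
    """Stand-in for checking logs and indicators after boot."""
    time.sleep(0.1)  # pretend to wait for the device to settle
    return True

def staged_bring_up() -> None:
    for step, device in sorted(POWER_ON_PLAN):
        power_on(device)
        if not healthy(device):
            raise RuntimeError(f"step {step} failed: {device} did not come up cleanly")
    print("all systems up; safe to resume service")

if __name__ == "__main__":
    staged_bring_up()
```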

Notice the role in this process of crawling around in the dark? How do you do that when your network technicians are mainly in India?

Yes, every data center of a certain size has bodies on-site, but few have enough such bodies to do all the crawling required for a building with thousands of individual devices.

Modern data centers can switch to UPS power very quickly, usually in less than 1/30th of a second. They will run on battery power for a few minutes while the generators come on line. Smarter data centers drop power to the HVAC system until the generators are on line and handling the full load. Smarter IT departments also monitor the quality of the electric power coming into the data center; they can see the effect of bad weather on the power grid. When storms are approaching the area they proactively switch to generator power, which, even if it isn’t needed, is a good test. Better to fire up the generators, have them come into phase with the utility, and then take over the load gracefully rather than all at once. It is doubtful that happened last night at Delta.
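
The proactive switchover idea is easy to sketch. The following Python is illustrative only, with made-up thresholds and a hypothetical weather flag; real facilities do this with dedicated power-monitoring gear and an automatic transfer switch, not a script.

```python
# A rough sketch of proactive generator switchover based on utility power quality.
# Thresholds, the weather flag, and the readings are assumptions for illustration.
from dataclasses import dataclass

NOMINAL_VOLTS = 480.0        # assumed three-phase feed, for illustration
NOMINAL_HZ = 60.0
VOLT_TOLERANCE = 0.05        # flag sags/swells beyond +/- 5%
HZ_TOLERANCE = 0.5

@dataclass
class UtilityReading:
    volts: float
    hertz: float

def power_quality_ok(reading: UtilityReading) -> bool:
    volt_ok = abs(reading.volts - NOMINAL_VOLTS) / NOMINAL_VOLTS <= VOLT_TOLERANCE
    hz_ok = abs(reading.hertz - NOMINAL_HZ) <= HZ_TOLERANCE
    return volt_ok and hz_ok

def should_run_on_generator(reading: UtilityReading, storm_approaching: bool) -> bool:
    # Switch early: either the utility feed already looks shaky,
    # or the weather says it soon might. Either way the generators get a live test.
    return storm_approaching or not power_quality_ok(reading)

if __name__ == "__main__":
    # Example: a sagging feed during an approaching storm triggers the transfer.
    reading = UtilityReading(volts=452.0, hertz=59.9)
    if should_run_on_generator(reading, storm_approaching=True):
        print("start generators, sync to utility phase, then transfer load gracefully")
    else:
        print("stay on utility power")
```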

Delta Airlines was an IBM outsourcing customer; whether it still is today, I don’t know. The company hasn’t returned my call.

Loss of power in a data center usually triggers a disaster recovery plan. When that happens you have two basic choices: switch to your backup systems somewhere else, or fix the outage and recover your primary systems. The problem with going to backup systems is that those backups usually do not have capacity for 100 percent of the workload, so only the most critical functions are moved. Then, once everything is fixed, you have to move your workload back to your production systems. That is often high risk, a major pain, and takes a lot of effort. So in a traditional disaster recovery setup, the preference will always be to recover the primary services.

Anything less than a 100 percent service backup isn’t disaster recovery; it is disaster coping.
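
For what it's worth, that trade-off can be boiled down to a toy decision rule. The capacity and time figures below are invented, not drawn from any real runbook; the sketch only illustrates why a partial backup pushes teams toward recovering the primary unless the outage will clearly outlast the tolerable downtime.

```python
# A toy illustration of the failover-vs-recover trade-off, not a real DR runbook.
def choose_dr_path(backup_capacity_pct: float,
                   est_primary_recovery_hrs: float,
                   max_tolerable_outage_hrs: float) -> str:
    if backup_capacity_pct >= 100.0:
        # True disaster recovery: the backup can carry everything.
        return "fail over to backup site"
    if est_primary_recovery_hrs <= max_tolerable_outage_hrs:
        # Partial backup plus a painful fail-back later: prefer fixing the primary.
        return "recover primary systems in place"
    # Otherwise: disaster *coping* -- move only the most critical functions.
    return "fail over critical functions only, recover primary in parallel"

if __name__ == "__main__":
    print(choose_dr_path(backup_capacity_pct=60.0,
                         est_primary_recovery_hrs=5.0,
                         max_tolerable_outage_hrs=4.0))
```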

Now if the IT support team is thousands of miles away, offshore, the process for restarting hundreds -- perhaps thousands -- of systems can be slow and painful. If you lose the data link between your support team and the data center due to that same power outage, your support team can do nothing until the data link is fixed.

In the old days a smart IT department would put 50 people in the data center with pre-printed documentation on how to recover the systems. They’d go into a massive divide-and-conquer effort to restart everything; one person can work on several systems at the same time. While debugging the systems, the IT team can spot and diagnose network and SAN (Storage Area Network) problems and work shoulder to shoulder with the network team until everything is fixed. Today, due to long-distance data connections, offshore IT folks can only work on a few systems at a time. All discussions with other teams are done via conference calls, and language issues make the challenges much harder.
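
Some rough arithmetic shows why headcount on the floor matters. The numbers below (device count, how many systems one person can juggle, minutes per system) are assumptions for illustration, not measurements from any real recovery.

```python
# Back-of-the-envelope model of divide-and-conquer recovery. All figures are
# invented; the point is how total recovery time scales with people on site.
from math import ceil

def recovery_hours(total_systems: int, technicians: int,
                   systems_per_tech_in_parallel: int = 3,
                   minutes_per_system: float = 20.0) -> float:
    # Each tech juggles a few systems at once; the work is split evenly.
    systems_per_tech = ceil(total_systems / technicians)
    batches = ceil(systems_per_tech / systems_per_tech_in_parallel)
    return batches * minutes_per_system / 60.0

if __name__ == "__main__":
    # 1,500 devices: 50 people on-site vs. a handful of remote hands.
    print(f"50 on-site techs: ~{recovery_hours(1500, 50):.1f} hours")
    print(f"5 remote hands:   ~{recovery_hours(1500, 5):.1f} hours")
```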

A further problem with this scenario is that network, application, and storage support can be on completely separate contracts with vendors who may not play well together. Some vendors simply refuse to cooperate because cooperation isn’t a contract term.

Now I don’t know if any of this applies to Delta Airlines because it is too busy to answer a question from little old me, but I’m sure the answers will appear in coming days. Hopefully other IT departments will learn from Delta’s experience.

23 Responses to Outsourced IT probably hurt Delta Airlines when its power went out

  1. Bob Grant says:

    Sounds like Delta was the first major example of just how stupid this IBM outsourcing crap is.

    • ATL_VM says:

      >>first major example of just how stupid this IBM outsourcing crap is.<<

      You are a prime example of how CLUELESS you idiots are; you see IBM and ASSUME that is the problem. It's not.

      PEOPLE running the IBM software and hardware are the problem. IBM tests it; it's solid technology, and many companies you've never heard of use it without issue. It's the PEOPLE that made the decision NOT to have proper backups in place so that systems would not go down. Generators, UPS, and redundant power should be in place so you never lose power. Ask Google how they do it: they don't lose power, because they have wind, solar, and numerous batteries in addition to utility electricity powering their systems.

      STOP blaming the 'system' for the inequities of PEOPLE. PEOPLE are the damn problem, not the technology, you FOOL.

  2. Bumblefish says:

    Almost makes me yearn for the days of the IBM 30xx water-cooled mainframes. No UPS or generator needed; since it took so much power to run the thing and the refrigerated water equipment, it was better to let it crash and IPL it when the power came back.

  3. barely_normal says:

    One of the requirements I have always thought should be in place is people on the phone who speak English without an accent of any kind. [Certain British accents are nearly unintelligible, as are the Indian varieties.]

    • ATL_VM says:

      >>speak English without an accent of any kind. [Certain British accents are nearly unintelligible, as are the Indian varieties.]<<

      Hey, we agree on something! I find a British accent even harder to understand than heavy Mandarin or Hindi accents. I cringe when I get a British accent and it's thick... I pretty much have to decide whether to hang up and call back or see if I can deal with it... usually I cannot.

      I had a boss with a fairly heavy accent, and it was annoying; we had daily meetings and I had to listen to it... most nerve-wracking thing ever.

      • David Brown says:

        >>British accent even harder to understand<<

        So "English" as spoken in England has an accent? Have we come to the point where accent-free English is what's spoken in Boston, or maybe San Francisco?

        Strangely, I discovered Toronto English to be indistinguishable from English in Bishop, California.

  4. TWP says:

    I can just hear the international conference calls now.

    This is happening, I believe, in every software-intensive industry, including some that people ought to stop and think about. I'm wrapping up a software career in a field credited with saving many lives, and my last job is training Indian engineers.

    You know we're sitting on four million pounds of fuel, one nuclear weapon and a thing that has 270,000 moving parts built by the lowest bidder. Makes you feel good, doesn't it? - Armageddon

    There are worse things that can happen to one than a flight delay, and software has a heck of a lot more than 270,000 moving parts.

    • ATL_VM says:

      Software is NOT the issue. UPS, generators, redundant power supplies: all of these things should be in place so that if one fails, another takes over.

      This is PURE incompetence; a data center for a carrier as big as Delta should never lose power. The FAA is probably telling Delta this RIGHT NOW. This is a never-occur scenario.

      Ask Wall Street: how many times have they had a "glitch" due to software? A couple, maybe. How many times have they lost power? NEVER.

      Why? Because they understand what "backup systems" means; apparently Delta does not.

  5. ATL_VM says:

    Somebody is getting fired

    >>Loss of power in a data center usually triggers a disaster recovery
    plan. When that happens you have two basic choices: switch to your
    backup systems somewhere else or fix the outage and recover your primary
    systems.<<

    Loss of power should NEVER happen in a data center, especially in this era. There are redundant systems; power should NEVER go out in a data center, EVER.

    Whoever was in charge is no longer in charge and probably won't find a job even at McDonald's.

    • rottenapple1 says:

      Most businesses, small and large, that need to be operational 24/7 have generators and a UPS on every single active computer they own. Critical backup systems such as generators are normally tested monthly at a minimum. They have multiple off-site data centers across the country or even the world that can handle 100% of the workload. Because of this, incidents like this normally only happen to major airlines if a major disaster occurs. This is definitely user error that could have been prevented.

  6. Bryan Mills says:

    Delta outsourced its reservation contact centers to a well-known offshore company in India 14 years ago. The cost per employee went down from $46,000 per year for a US employee to $6,000 per year for an Indian employee, according to "Outsourcing Success" by Alpesh Patel. Delta also has its planes repaired in Mexico, according to a Vanity Fair article. They are by no means the only company using offshore services for critical tasks. When you contract something out, you lose a large degree of control over the quality of the service and hope the other company can deliver and not destroy your business. Most US companies are poorly managed and are only thinking about the next couple of years of profits, which is why they rise and fall so quickly instead of seeing steady growth. It's hard for them to go out of business, so they just stay on a roller coaster.

  7. partypop says:

    Outsourcing = problems. Being down a whole day cost hundreds of times what they saved. Idiots.

  8. Richard Saunders says:

    That's one reason I chose to be a network engineer. You cannot offshore your network engineers and expect high uptime at the same time.

  9. Putranto Sangkoyo says:

    OK, let's do some simple, stupid forensics here, quoting from the news: “Delta initially pointed to a loss of electricity from Georgia Power, which serves its Atlanta hub, when its worldwide computer network crashed at 2:30 a.m. Monday”. And then they said: “Delta Air Lines said Tuesday that an internal problem, not the loss of power from a local utility, was to blame for the disruption that caused hundreds of flight cancellations and delayed tens of thousands of travelers Monday.” So initially they didn't know that it was their own electrical system failure; presumably they must have been arguing for some time. And so far no mention of an overseas outsourced data center. Before we go further in analysing “data backup”, RAID, redundancy, routers and so on, if they say that a “power outage” is the cause of all this mess, then shouldn't we get to the basics first, namely electricity? Aside from a UPS, which is just a temporary short-term solution, a “big” and critical data center such as Delta's should have diesel generators installed, no? The next question is how long were they “in the dark”? No matter how long, once the electricity was up again, how long does it take for the system to get back to its “current status”, or, like in the “good old days” of Windows, the “last known good configuration”? Presumably data were not destroyed, just interrupted, cut off, so there was no need for a systemwide data restore, perhaps just some synchronization, which for a sophisticated system like Delta's I presume was designed in, able to synchronize. Then, last but not least, is the entire system “modular”, with a “module priority structure”, so that when the “reservation system” is up, for example, the other systems, which are lower on the priority list, could be synchronized later?

  10. __AKA_CHUP says:

    Pay peanuts and what do you get? Outages happen, but when you need to coordinate multiple teams spread around the globe, underpaid and with most of the resources working in IT just because it pays better than the average job, do not expect to get the best quality service. And trust me, based on individual salaries, support teams in India might be cheaper than USA/Europe (just slightly, as hourly rates in India are increasing), but in terms of how long a task takes to get done, I can tell from experience (I've quantified this in the company I work for) that a task takes 5-10 times longer than it would take a local person. What happens is senior managers are useless when it comes to measuring this or quantifying the cost of incompetence. Keep outsourcing your core services offshore and you will end up going out of business or paying, like Delta, millions of dollars in penalties, compensation, etc.

  11. Tom says:

    The end of STEM jobs for American workers as we know them...
