Skype: Users Rebooting Brought Service Down
A system outage that impacted Skype users for about 48 hours last week has officially been attributed to a multitude of Windows-based clients receiving critical security patches and rebooting at roughly the same time. According to the company, the reboots triggered a flood of logon requests that collided at Skype's network hub, in effect an unintentional denial of service.
Coupled with a reduction in P2P network resources while those Windows reboots were under way, the network simply lacked the capacity to handle the traffic, as the company's Villu Arak explained this morning.
"The high number of restarts affected Skype's network resources," Arak stated on the company's blog. "This caused a flood of log-in requests, which, combined with the lack of peer-to-peer network resources, prompted a chain reaction that had a critical impact."
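To put the "flood of log-in requests" in concrete terms, here is a minimal Python sketch comparing the busiest second of logon traffic when every client reconnects the instant it reboots against the same clients each waiting a random delay first. The function name, client count, and delay window are illustrative assumptions, not details of Skype's actual client or of anything Arak described.

```python
import random
from collections import Counter


def peak_logins_per_second(num_clients, jitter_seconds, seed=0):
    """Return the number of logins landing in the busiest one-second bucket.

    Hypothetical model: every client reboots at t=0 and then waits a random
    whole number of seconds in [0, jitter_seconds) before logging in.
    A jitter of 0 means everyone logs in at once.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    buckets = Counter()
    for _ in range(num_clients):
        delay = rng.randrange(jitter_seconds) if jitter_seconds > 0 else 0
        buckets[delay] += 1
    return max(buckets.values())


if __name__ == "__main__":
    clients = 10_000  # illustrative figure only
    print("no jitter:    ", peak_logins_per_second(clients, 0))
    print("10 min jitter:", peak_logins_per_second(clients, 600))
```

With no delay, the peak equals the full client count. Spreading the same logins over ten minutes cuts the busiest second to a small fraction of that, which is why simultaneous reboots are so much harder on a central hub than the same number of logins arriving over time.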
It's an unusual explanation, which omits the obvious fact that Patch Tuesday happened...on a Tuesday, while the Skype failure upon which the reboots were blamed happened on a Thursday. The new explanation doesn't appear to coincide with status reports given by Arak to customers during the outage.
Last Thursday, while connection problems were on the rise, he attributed the problem to Skype server software as though it had already been diagnosed there. "This problem occurred because of a deficiency in an algorithm within Skype networking software. This controls the interaction between the user's own Skype client and the rest of the Skype network."
This morning's explanation shifts the blame from the server to problems on the client side, or at least server problems that were triggered by unforeseen network conditions on the client side.
Immediately, one of the world's most trusted security researchers - University of Illinois, Urbana-Champaign programmer John C. Bambenek - saw evidence of a possible huge security hole, if Arak's explanation proves to be accurate.
Obviously, the first problem is the need for huge monthly security patches in the first place. But simply addressing that need points to a sadly common consumer behavior, Bambenek pointed out: "The second interesting note, is that if Skype's explanation is true," he wrote, "that means that vast majority of Skype users have machines that don't require a login on boot. Those machines simply happily login as the default user (and I bet almost all have full admin rights) and the login in to Skype (and their other start-on-boot applications)."
If Skype were an application installed mostly on servers as opposed to clients, admins might be cautious enough to reset the defaults so patches and updates didn't always get requested at 3:00 am. That way, requests wouldn't collide and the network would run more smoothly. (Microsoft did not report any problems in actually serving patches at roughly the same time.) Typical customer behavior, Bambenek pointed out, is for consumers to leave software set to its defaults. Download patches at the default time...reboot computers in the default way.
But being a security engineer, Bambenek was smart enough to ask first: All things being equal, why did all those reboots take 48 hours?
At its height, the outage prompted some comments from one of Skype's competitors in the P2P conferencing field, SightSpeed. CTO and founder Aron Rosenberg remarked to one of his company's bloggers that Skype's network infrastructure is actually a kind of star/P2P hybrid topology, in which systems in between the hub and general clients act as supernodes. The identity or location of supernodes isn't planned in advance; rather, certain client systems with higher capacity for marshaling and regulating P2P conferencing traffic get promoted on the fly.
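The on-the-fly promotion Rosenberg describes can be pictured with a short Python sketch. Skype has not published its selection criteria, so the `Node` fields, the ranking rule, and the function names here are all hypothetical stand-ins for "higher capacity."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    bandwidth_kbps: int   # hypothetical measure of capacity
    uptime_hours: float   # hypothetical measure of stability


def elect_supernodes(nodes, count):
    """Promote the `count` nodes with the most bandwidth, ties broken by uptime."""
    ranked = sorted(
        nodes,
        key=lambda n: (n.bandwidth_kbps, n.uptime_hours),
        reverse=True,
    )
    return ranked[:count]


def reelect_after_reboot(nodes, rebooted_names, count):
    """Re-run the election using only the nodes that did not reboot."""
    survivors = [n for n in nodes if n.name not in rebooted_names]
    return elect_supernodes(survivors, count)


if __name__ == "__main__":
    network = [
        Node("office-pc", 8000, 400.0),
        Node("campus-lab", 10000, 700.0),
        Node("home-dsl", 1500, 30.0),
        Node("laptop", 800, 5.0),
    ]
    before = elect_supernodes(network, 2)
    after = reelect_after_reboot(network, {n.name for n in before}, 2)
    print([n.name for n in before])
    print([n.name for n in after])
```

In this toy model, when the two strongest machines reboot together, the network is left promoting whatever remains, however weak. That is the sort of destabilization a design like this has to guard against.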
A Cornell University study in 2006 concluded that Skype's supernode architecture was key to minimizing bandwidth use across the network, while at the same time reducing the amount of network traffic that required noise reduction algorithms. This could be the key to Skype's relatively high quality of service, last week notwithstanding.
SightSpeed doesn't think so. "In theory this is a good idea," writes SightSpeed's Peter D. Csathy, "but the problem happens if your network starts to destabilize. Skype, as a company, has no physical or programmatic control over the most vital piece of its product. Skype instead is at the mercy of and vulnerable to the people who unknowingly run the SuperNodes."
Theoretically, since there would be fewer supernodes than general nodes in the P2P network, a massive software update by Microsoft or anyone else would need less leverage to force those supernodes to reboot at about the same time.
Readers of the independent blog Skype-Watch have expressed skepticism about the company's ability to communicate well with its customer base, especially since it was purchased by eBay. One reader writes on the blog's forum, "Microsoft has released several forced reboot patches in the past, including the last three years - the timeframe that the Skype Bug was present. So what makes this one different? And since there was no massive forced Skype Update, then the Skype community can have this happen [again] until there is one, next month or next time MS releases a forced update security patch. When will Skype update the vast majority of clients to avoid this ticking time bomb? It would seem Skype is not very worried about this happening again, yet doing nothing to update the hive to protect it."
Once again, as was the case with AOL Instant Messenger at the turn of the decade, users who pay nothing find themselves paying the price.