Skype explains outage, offers vouchers to customers
A systemwide failure that took down the popular VoIP service for about 24 hours, beginning at 4 p.m. GMT on December 22, was caused by a server overload, chief information officer Lars Rabbe disclosed on Wednesday. The overload delayed the servers' responses to Skype clients, which in some cases caused the clients to crash.
The problems even prompted CEO Tony Bates to e-mail users of the service, offering them a credit voucher worth about 30 minutes of international airtime and again publicly apologizing for the issues.
About half of all Skype users were running Skype for Windows version 5.0.0.152, which had a bug in how it handled delayed responses from the Skype servers. The client crashed on nearly 40 percent of those installs. Some of the affected clients were also acting as "supernodes," which further exacerbated the problem.
In Skype's network, supernodes play a critical part in handling VoIP traffic, acting as directories and helping to route calls. With many of them offline, the system had a difficult time handling traffic. Their load was passed on to the remaining supernodes, which in turn began to shut down under the massively increased load.
"This further increased the load on remaining supernodes and caused a positive feedback loop, which led to the near complete failures that occurred a few hours after the triggering event," Rabbe explained.
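The feedback loop Rabbe describes can be illustrated with a toy model. The sketch below is not Skype's software or its actual figures: the function name simulate_cascade, the node capacities, and the load value are all hypothetical, and it assumes that traffic is shared evenly among the supernodes still running and that a node shuts down as soon as its share exceeds its capacity.

```python
def simulate_cascade(capacities, total_load, initially_failed):
    """Toy model of a cascading overload (illustrative, not Skype's design).

    capacities       -- hypothetical per-node capacity, in arbitrary load units
    total_load       -- total traffic, assumed to be split evenly among live nodes
    initially_failed -- set of node indexes that crash in the triggering event

    Returns the number of live nodes after the trigger and after each
    further round of shutdowns.
    """
    alive = [c for i, c in enumerate(capacities) if i not in initially_failed]
    history = [len(alive)]
    while alive:
        share = total_load / len(alive)  # load each surviving node must carry
        survivors = [c for c in alive if c >= share]
        if len(survivors) == len(alive):
            break  # every remaining node can carry its share: stable
        alive = survivors  # overloaded nodes shut down, raising the share
        history.append(len(alive))
    return history


# Ten hypothetical supernodes sharing 100 units of load.
capacities = [12, 13, 14, 15, 16, 18, 20, 22, 25, 30]

print(simulate_cascade(capacities, 100, {9}))        # losing one node is absorbed
print(simulate_cascade(capacities, 100, {7, 8, 9}))  # losing three triggers a collapse
```

In this made-up example the network absorbs the loss of a single node, but losing three at once pushes the per-node share above what the weakest survivors can carry. Each round of shutdowns then raises the share for the nodes that remain, which is the same self-reinforcing pattern, in miniature, as the one described above.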
Essentially, a domino effect of failures made much of the network unavailable for nearly a day as Skype worked to stabilize the system. Now, with the system apparently stable, the company is working to prevent a repeat.
Rabbe said the company was looking for bugs in its software like the one in the Windows client, and providing hotfixes as necessary. In addition, the company would review its early warning capabilities to help detect serious problems much earlier and thus prevent a similar major failure.
"Lessons will be learned and we will use this as an opportunity to identify and introduce areas of improvement to our software, further assess and invest in capacity and stability, and develop better processes for outage recovery and communications to our user base," Rabbe concluded.