PDC 2009: Scuttling huge chunks of Vista architecture for a faster Windows 7
The reason Windows Vista seemed slow, and strangely seemed even slower over time, is now abundantly clear to Microsoft's architects: The evolution of computer hardware, particularly the CPU, exceeded anyone's expectations at the time of Vista's premiere in early 2007. The surge in virtualization, coupled with the rise of the multicore era, produced a new reality in which Vista suddenly found itself managing systems with more than 64 cores in total.
Architects had simply not anticipated that the operating system would be managing this many cores this soon -- at least, that appears to be the underlying message we're receiving here at PDC 2009 in Los Angeles. While independent researchers were still speculating about possible performance drop-offs beyond eight cores, server administrators were already seeing them. Windows Vista's design made tradeoffs: efficiencies that could have been obtained through complex methods were sacrificed for the sake of simplicity.
Those tradeoffs were fair enough for the dual-core era, but that era lasted only a short while. Quad-core processors quickly became commonplace, even in laptops, and with Vista's architecture, users could actually feel the lack of scalability. They were investing in quad-core systems earlier in Vista's lifecycle than originally anticipated; when four cores failed to deliver anything close to double the performance of two, and later when Vista's lag seemed to slow their computers down over time, some critical elements of Vista's architecture became not an advantage but a burden.
Microsoft performance expert Mark Russinovich is one of the more popular presenters at PDC each year, mainly because he demonstrates from the very beginning of his talks that he understands exactly what his audience is going through. It's difficult for a performance expert to put a good face on Vista...and Russinovich, to his credit, didn't even try.
After asking the audience how many used Windows 7 on a daily basis (virtually all of the crowd of about 400 people raised their hands), Russinovich quizzed them: "How many people are sticking with Windows Vista because that's so awesome?" He pretended to wait for an answer, and just as the last hands descended, he answered his own question: "Yeah, that's what I thought.
"One of the things we had decided to do with Windows 7 was, we got a message loud and clear, especially with the trend of netbooks, on top of [other] things," he went on. "People wanted small, efficient, fast, battery-efficient operating systems. So we made a tremendous effort from the start to the finish, from the design to the implementation, measurements, tuning, all the way through the process to make sure that Windows 7 was fast and nimble, even though it provided more features. So this is actually the first release of Windows that has a smaller memory footprint than a previous release of Windows, and that's despite adding all [these] features."
To overcome the Vista burden, Windows 7 had to deliver scalability that everyday users could see and appreciate.
As kernel engineer Arun Kishan explained, "When we initially decided to be able to support 256 logical processors, we set the scalability goal to be about 1.3 - 1.4x, up at the high end. And our preliminary TPC-C number was about 1.4x scalability on 128 LPs [logical processors], when compared to a 64 LP system. So that's not bad; but when we dug into that, we saw that about 15% of the CPU time was spent waiting for a contended kernel spinlock." What Kishan means is that while one thread is executing a protected portion of the kernel, other threads must wait their turn. About the only way they can wait without going idle is by spinning in place, repeatedly checking whether the lock has been released yet -- a kind of "running in place" called a spinlock.
"If you think about it, 15% of the time on a 128-processor system is, more than 15 of these CPUs are pretty much full-time just waiting to acquire contended locks. So we're not getting the most out of this hardware."
The part of the older Windows kernel responsible for managing scheduling was the dispatcher, and it was protected by a global lock. "The dispatcher database lock originally protected the integrity of all the scheduler-related data structures," said Kishan. "This includes things like thread priorities, ready queues, any object that you might be able to wait on -- like an event, semaphore, mutex, or I/O completion port -- plus timers and asynchronous procedure calls. All of it was protected by the dispatcher lock.
"Over time, we moved some paths out of the dispatcher lock by introducing additional locks, such as thread locks, timer table locks, processor control block locks, etc.," Kishan continued. "But still, the key thing that the dispatcher lock was used for was to synchronize thread state transitions. So if a thread's running, and it waits on a set of objects and goes into a wait state, that transition was synchronized by the dispatcher lock. The reason that needed a global lock was because the OS provides pretty rich semantics on what applications can do, and an application can wait on a single object, it can wait on a single object with a timeout, it can wait on multiple objects and say, 'I just want to wait on any of these,' or it can say, 'I just want to wait on all of these. It can mix and match types of objects that it's using in any given wait call. So in order to provide this kind of flexibility, the back end had to employ this global dispatcher lock to manage the complexity. But the downside of that, of course, was that it ended up being the most contended lock in most of our workloads, by an order of magnitude or more as you went to these high-end systems."
In the new kernel for Windows 7 and Windows Server 2008 R2, the dispatcher lock is completely gone -- a critical element of Windows architecture up through Vista, absolutely erased. Its replacement is fine-grained locking, with eleven types of locks for the new scheduler -- for threads, processors, timers, objects -- plus rules for the order in which locks may be acquired, to avoid what engineers still call, and rightly so, deadlock. Synchronization at a global level is no longer required, Kishan explained, so many operations are now lock-free. In its place is a kind of parallel wait path made possible by transaction-like semantics, letting threads, and the LPs that execute them, negotiate their wait-state transitions concurrently instead of serializing on one lock.
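The rules Kishan alludes to follow a standard discipline for fine-grained locking: give each lock class a fixed rank, and only ever acquire locks in ascending rank order, so no two CPUs can each end up holding a lock the other is waiting for. A generic sketch, reusing the spinlock above (this is not Microsoft's code; the ranks and structures are invented for illustration):

    /* Hypothetical lock ranks. Code may only acquire locks in
       ascending rank order; if every path obeys the order, a cycle of
       waiters (the precondition for deadlock) can never form. */
    enum lock_rank {
        RANK_THREAD    = 1,   /* per-thread state lock    */
        RANK_PROCESSOR = 2,   /* per-processor queue lock */
        RANK_TIMER     = 3,   /* timer-table lock         */
    };

    struct thread    { spinlock_t lock; /* ...thread state... */ };
    struct processor { spinlock_t lock; /* ...ready queue...  */ };

    /* Readying a thread on a processor touches two structures, so it
       takes the thread lock (rank 1) before the processor lock
       (rank 2). Every path needing both must use the same order. Note
       that no global lock is held: two CPUs readying threads on
       different processors proceed entirely in parallel. */
    void ready_thread(struct thread *t, struct processor *p)
    {
        spin_lock(&t->lock);       /* rank RANK_THREAD */
        spin_lock(&p->lock);       /* rank RANK_PROCESSOR */
        /* ...link t into p's ready queue... */
        spin_unlock(&p->lock);
        spin_unlock(&t->lock);
    }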
But the threads themselves won't really "know" about the change. "Everything works exactly as it did before," Kishan said, "and this is a totally under-the-covers, transparent change to applications, except for the fact that things scale better now."
Next: Speeding processes up by putting processors to sleep...