PDC 2009: Scuttling huge chunks of Vista architecture for a faster Windows 7

Speeding processes up by putting processors to sleep
Some of the design changes Windows 7 architects made may seem counter-intuitive on the surface -- explained too simply, you might think they went the wrong direction.
For example, Mark Russinovich told the audience this morning, the new system is designed to increase the idle time for processors (both logical and physical), to make them latent for longer stretches. Sending processors fewer clock ticks is one way to bring this about. Why? "Timer coalescing means we minimize the number of timer interrupts that come into the system," Russinovich explained, "so that the processors stay idle for longer, and then go into sleep states. And then tick skipping means that we don't send timer interrupts to processors that are sleeping, so we don't wake them up needlessly."
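To make the idea of timer coalescing concrete, here is a minimal sketch in Python. It is not the Windows 7 kernel's algorithm, which Russinovich did not spell out; it assumes, purely for illustration, that each timer carries a due time plus a tolerance by which it may safely fire late. The function name `coalesce_timers` and the sample values are hypothetical.

```python
def coalesce_timers(timers):
    """Return the wake-up times needed to service every timer.

    timers: list of (due_ms, tolerance_ms) pairs. Each timer may fire
    anywhere in the window [due_ms, due_ms + tolerance_ms].
    (The tolerance field is an assumption made for this sketch.)
    """
    # Sort the windows by their latest permissible firing time.
    windows = sorted(((due, due + tol) for due, tol in timers),
                     key=lambda window: window[1])
    wakeups = []
    for earliest, latest in windows:
        # If the last scheduled wake-up already falls inside this window,
        # the timer piggybacks on it and no extra interrupt is needed.
        if wakeups and wakeups[-1] >= earliest:
            continue
        # Otherwise wake as late as this timer allows, so that later
        # timers have the best chance of sharing the same wake-up.
        wakeups.append(latest)
    return wakeups


# Five hypothetical timers, times in milliseconds.
timers = [(100, 50), (120, 50), (140, 20), (300, 10), (305, 10)]
wakeups = coalesce_timers(timers)
print(wakeups)  # [150, 310]
```

Five timers end up served by two wake-ups, which leaves the processor one long idle stretch between them instead of several short ones.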
In other words, keeping processors busy to reduce latency -- one of the methodologies that we were told years ago would help Vista -- actually reduces overall efficiency. Multicore processors work better when their logical processors (LPs) can be put to sleep, or "parked," and their active threads shipped to another LP. Keeping LPs awake put more of a load on the scheduler, and all that scheduling chatter was a burden on Core 0, where all the scheduling activity used to take place -- serially, one call at a time.
Here's another counter-intuitive notion: It's more efficient for a system to use as much memory as possible -- not to fill it with data, necessarily, but to populate memory pages with something. In an illustration for his part of this morning's workshop, Microsoft Distinguished Engineer Landy Wang showed a Windows 7 Task Manager panel where a machine with 8 GB of DRAM, running just a handful of regular processes, ended up with 97 MB free, or completely "zeroed." And that was a good thing.
"A lot of people might think, 'Wow, 97 megabytes doesn't seem like a lot of free memory on a machine of that size,'" said Wang. "And we like to see this row actually be very small, 97 MB, because we figure that free and zero pages won't generally have a lot of use in this system. What we would rather do, if we have free and zero pages, is populate them with speculated disk or network reads, such that if you need the data later, you won't have to wait for a very slow disk or a very slow network to respond. So we will typically take these free and zero pages as we come across them, and pre-populate them with any files you might have read before, or executables we think you might run in the future -- we will get that in advance, so you don't have to wait when you click on something. It's already in memory. But you shouldn't take this low counter as implying that we're using a lot of memory."
Then Wang paused before adding, "We really are using a lot of memory, but we think we're using it in a smart way that you really want us to be using it in."
Microsoft's Arun Kishan explained the page dispatcher lock, and its abolition in Windows 7, replaced by a more complex symbolic system of semantics that lets threads execute in a more parallel, efficient fashion in the end. Locks that never seemed to be a problem in the Windows XP era ended up being a serious obstacle for Vista in more than one respect, as Landy Wang explained: "As we go into higher and higher numbers of cores, the page frame number [PFN] lock was something that we had historically used for nearly 20 years to manage the page frame database array -- a virtually contiguous, although it can be physically sparse, array."
A page frame number entry describes the physical state of the page of memory, Wang reminded attendees -- Is it zeroed out or free, is it on standby, is it active, is it shared, how many processes are communicating with it concurrently? "Basically all the data that we need about the page, such that we can manipulate it into a state transition into any other state that's needed at any point in time. The size of this array is critical, as well as how to best manage the information."
On a 32-bit system with 64 GB of physical memory, at 4K per page, that's about 16 million pages. Each PFN entry is 28 bytes, fitting into a 32-byte segment, for a total database size of about 450 MB of virtual address space. "You would think that's a fairly cheap price, a cheap tax to pay. It's definitely below 1% of the physical memory in your machine, so you would think this is pretty good. But for us it wasn't enough, because we realized that while the physical cost is cheap, the virtual cost is high."
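The arithmetic behind those figures can be checked in a few lines of Python. The sketch takes 64 GB of physical memory, which is what 16 million 4K pages amount to, and the 28-byte entry size quoted above. The 2 GB kernel share of a 32-bit address space is the default Windows split, used here as an assumption to show why Wang calls the virtual cost high.

```python
PAGE_SIZE = 4 * 1024               # 4K pages
PHYSICAL_MEMORY = 64 * 1024**3     # 64 GB of RAM
PFN_ENTRY_SIZE = 28                # bytes per PFN entry, as quoted
KERNEL_VA = 2 * 1024**3            # assumed: 2 GB kernel half of a 32-bit address space

pages = PHYSICAL_MEMORY // PAGE_SIZE        # 16,777,216 pages
db_bytes = pages * PFN_ENTRY_SIZE           # 469,762,048 bytes
db_mib = db_bytes / 1024**2                 # 448.0, i.e. "about 450 MB"

physical_share = db_bytes / PHYSICAL_MEMORY  # about 0.68% of RAM
virtual_share = db_bytes / KERNEL_VA         # about 21.9% of kernel address space

print(f"{pages:,} pages, {db_mib:.0f} MB database")
print(f"{physical_share:.2%} of physical memory")
print(f"{virtual_share:.2%} of a 2 GB kernel address space")
```

Under these assumptions the database costs well under 1% of RAM but more than a fifth of the kernel's virtual address space.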
As was the case with the page dispatcher lock, Windows 7 architects had to do away with certain other methodologies that were implemented for simplicity in Vista, but which failed as workloads increased and cores multiplied.
"The problem with the PFN lock is that the huge majority of all virtual memory operations were synchronized by a single, system-wide PFN lock," remarked Microsoft's Landy Wang. "We had one lock that covered this entire array, and this worked...okay 20 years ago, where a four-processor system was a big system, 64 MB was almost unheard of in a single machine, and so your PFN database was fairly small -- several thousand entries at most -- and you didn't have very many cores contending for it."
But more operations and data structures were tacked onto the PFN lock; at the same time, the number of cores and memory in systems ballooned to proportions that engineers had originally planned for something closer to 2016. That increased the pressure on global locks...and it was in Vista where these old architectures began to fail. It was here where Wang presented an astounding statistic that surprised no one in the room who dealt with this subject personally -- it confirmed what they already knew:
While spinlocks comprised 15% of CPU time on systems with about 16 cores, that number rose terribly, especially with SQL Server. "As you went to 32 processors, SQL Server itself had an 88% PFN lock contention rate. Meaning, nearly nine out of every ten times it tried to get a lock, it had to spin to wait for it...which is pretty high, and would only get worse as time went on."
So this global lock, too, is gone in Windows 7, replaced with a more complex, fine-grained system where each page is given its own lock. As a result, Wang reported, 32-processor configurations running some operations in SQL Server and other applications, ended up running them 15 times faster on Windows Server 2008 R2 than in its WS2K8 predecessor -- all by means of a new lock methodology that is binary-compatible with the old system. The applications' code does not have to change.
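The difference between one system-wide lock and one lock per page can be sketched roughly as follows. This is an illustration of the general technique only, not Microsoft's implementation: the names `PfnEntry`, `PfnDatabase` and `transition` are hypothetical, the state strings are simplified from Wang's description, and Python's own interpreter lock means the threads here do not truly run in parallel.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class PfnEntry:
    """One hypothetical page frame entry, guarded by its own lock."""
    state: str = "free"    # e.g. "zeroed", "free", "standby", "active"
    lock: threading.Lock = field(default_factory=threading.Lock, repr=False)


class PfnDatabase:
    """Array of entries with no array-wide lock."""

    def __init__(self, page_count):
        self.entries = [PfnEntry() for _ in range(page_count)]

    def transition(self, pfn, new_state):
        entry = self.entries[pfn]
        # Only threads touching this same page wait on each other;
        # operations on every other page proceed without contention.
        with entry.lock:
            old_state = entry.state
            entry.state = new_state
            return old_state


def worker(db, pfns, rounds):
    # Each worker cycles its own pages between two states.
    for _ in range(rounds):
        for pfn in pfns:
            db.transition(pfn, "active")
            db.transition(pfn, "standby")


db = PfnDatabase(64)
threads = [
    threading.Thread(target=worker, args=(db, range(i * 16, (i + 1) * 16), 100))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(all(entry.state == "standby" for entry in db.entries))  # True
```

With a single global lock, all four workers would queue behind one another on every call. Here they work on disjoint pages, so none of them ever waits. The callers' code looks the same either way, which is in the spirit of the compatibility Wang described.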
We've said before that, for the end user's intents and purposes, Windows 7 "is 'Vista Service Pack 3.'" But these critical departments of architectural change, where concepts dating back as much as two decades ended up faltering in Vista and were scuttled for seemingly complex but more efficient replacements -- ideas that two years ago may have been considered for 2015 -- make the new operating system more like Windows 9.
However, we know that processor power and virtualization will only continue to explode, and to magnify each other's effects. So the huge changes under the hood for Win7 may actually end up being stopgap measures, before the onset of a time when more drastic sacrifices will be considered.