Sun's New UltraSPARC T2 Has 64 Threads
Sun Microsystems' new UltraSPARC T2 processor, announced yesterday, promises to break new ground in CPU parallelism by offering not only eight cores per chip but also eight threads per core. But will this necessarily mean 64 times the processor power? The answer depends on how you define "thread."
For at least the past half-decade, Sun Microsystems has known that the key to performance improvement in the "post-megahertz" era of microprocessors is to discover how to implement parallelism without dedicating a whole processor core to each thread.
Multicore architecture is efficient in many respects, and it satisfies the "requirement" of Moore's Law to cram more transistors into each new design. But it's inefficient in the sense that it should not always require four times the processors to perform four times (or less) the work, even if we compact everything onto a single die.
Parallelism in computing is the capability to execute multiple sequences of instructions at once. Semiconductor manufacturers have different approaches to this concept: Just prior to the dawn of the multicore era, Intel tried (and in some instances, is still trying) "hyperthreading," which refers to its CPUs' capability to suspend one set of instructions, complete with their respective registers, and concentrate for a time upon a second set. This kind of implicit parallelism enables programs to be compiled for a single-threaded processor, and yet still receive some benefits of being bunched up.
True explicit multithreading -- which Intel initially tried in its first-generation Itanium architecture -- enables the CPU to schedule sequences' execution much more logically, though it depends on programmers to compile their software to discretely instruct the CPU about how those threads should be scheduled. Itanium and Itanium 2 processors don't need multiple cores to pull this off; they use simultaneous multithreading (SMT), which might have revolutionized the CPU industry much earlier than Core Microarchitecture, had Itanium's instruction set been compatible with x86.
Then there is the type of parallelism exhibited by today's graphics cards, which is altogether different. There, a single instruction can be executed on multiple groups of data simultaneously, by way of multiple pipelines. This is very useful in a low-count instruction-set environment where your code spends most of its time shading polygons.
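As a rough illustration of that "one instruction, many pieces of data" idea, here is a minimal Python sketch. The names `simd_execute` and `shade` are hypothetical, invented for this example, and the list comprehension only models the result of the operation; real graphics hardware would process the lanes at the same time across separate pipelines.

```python
from typing import Callable, List


def simd_execute(instruction: Callable[[float], float],
                 lanes: List[float]) -> List[float]:
    """Apply one instruction to every data lane in a single logical step.

    This models only the outcome; actual hardware would handle the
    lanes in parallel rather than one after another.
    """
    return [instruction(value) for value in lanes]


def shade(brightness: float) -> float:
    """Hypothetical shading instruction: halve a pixel's brightness."""
    return brightness * 0.5


# Four pixels stand in for the multiple groups of data.
pixels = [0.2, 0.4, 0.8, 1.0]
shaded = simd_execute(shade, pixels)
print(shaded)  # [0.1, 0.2, 0.4, 0.5]
```

The point of the sketch is that the program issues `shade` once, while the data set it touches can grow as wide as the hardware has pipelines.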
For Sun's UltraSPARC T2 series announced yesterday - the culmination of the long-rumored Niagara 2 project - the so-called "server-on-a-chip" borrows a little from all three of these concepts. In a concerted effort to blast its way right back into the CPU market, where its designs once caught fire among workstation builders, the T2 sports eight cores with eight threads apiece, all on a single chip.
But what's a thread? Or rather, how does Sun think of a "thread"? It isn't clear from the company's marketing literature, but Sun does have its own idea. And it's not being coy about sharing what that idea truly is; in fact, by opening up the T2 architecture and its associated documentation under the General Public License, Sun is being very direct and honest about the fact that its threads are different from others'.
[Image: An up-close look at Sun's UltraSPARC T2 processor. Here you can plainly see the eight processor cores in the center, flanked on the left and right by shared resources such as caching and embedded microcode. Among the features of this microcode are built-in resources for expediting cryptography algorithms. (Courtesy Sun Microsystems)]
In one critical respect, Sun's multithreading and Intel's hyperthreading are quite similar. The way Sun sees it, software doesn't have to know it's being executed in parallel. In other words, you don't have to compile it as multithreaded. As Sun has designed the T2, each of the eight cores is capable of maintaining a virtual machine state (all that work with Java finally paid off) for as many as eight independent threads. Each of those VMs maintains a separate set of registers and resources to give threads the appearance of having a single-threaded processor all to themselves. In fact, Sun's documentation likens the effect to multiplying a single first-generation UltraSPARC processor.
Each virtual machine state, complete with its pipeline full of instructions waiting to be executed and the registers it needs to maintain that state, is referred to as a strand. In fact, Sun might have called its new processor "multi-stranded" if not for the negative connotation. And in the sense that we tend to say a CPU executes a set of instructions, in the virtualized model of UltraSPARC parallelism, the device that executes instructions is called a CMT. It stands for...something. "Chip MultiThreaded," or "Chip MultiThreading," or maybe something else. Never mind the ambiguity; these are the three letters Sun chose.
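To make the notion of a strand more concrete, here is a toy Python model. It is a sketch under stated assumptions, not Sun's design: the `Strand` and `Core` classes, the two-field instruction format, and the simple round-robin policy are all invented for illustration. What it does show is the property described above: every strand keeps its own registers and program counter, so an unmodified single-threaded program sees what looks like a processor of its own.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Toy instruction format (an assumption): (register name, amount to add).
Instruction = Tuple[str, int]


@dataclass
class Strand:
    """Hypothetical per-thread state: private registers and program counter."""
    program: List[Instruction]
    registers: Dict[str, int] = field(default_factory=dict)
    pc: int = 0

    def done(self) -> bool:
        return self.pc >= len(self.program)

    def step(self) -> None:
        # Execute one instruction against this strand's own registers only.
        reg, amount = self.program[self.pc]
        self.registers[reg] = self.registers.get(reg, 0) + amount
        self.pc += 1


@dataclass
class Core:
    """One core interleaving its strands, one instruction per cycle.

    Round-robin selection is an assumed policy chosen for simplicity.
    """
    strands: List[Strand]

    def run(self) -> int:
        cycles = 0
        while any(not s.done() for s in self.strands):
            for strand in self.strands:
                if not strand.done():
                    strand.step()
                    cycles += 1
        return cycles


# Eight strands run near-identical programs that all use register "r1".
core = Core([Strand(program=[("r1", n), ("r1", 1)]) for n in range(8)])
cycles = core.run()
print(cycles)                                     # 16
print([s.registers["r1"] for s in core.strands])  # [1, 2, 3, 4, 5, 6, 7, 8]
```

All eight programs write to a register named `r1`, yet none disturbs another, because each strand carries a separate register set. Note also that the core still retires only one instruction per cycle in this model, which is why eight strands do not automatically mean eight times the throughput.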
Sun's documentation describes the setup like this: "In general, each virtual processor of a CMT processor behaves functionally as if it was an independent processor. This is an important aspect of CMT processors because user code running on a virtual processor does not need to know whether or not that virtual processor is part of a CMT processor."
That's seven "processors" just in one paragraph. But to put it another way, each virtual machine has all the digitally-represented resources necessary to enable running software to be given the impression that it's running on a dedicated, single thread. By default, software is non-privileged, which means it doesn't know of the existence of other threads or even other CMTs. But system software (such as the Solaris operating system) and security programs may be given higher privileges.
Is this necessarily a good thing? Originally, the impetus for developing parallelism for processors was so that software that encountered relatively heavier tasks could break down those tasks into easier-to-digest chunks, and distribute those chunks among the processor's logic units. A program compiled for Itanium, for instance, can break itself down and distribute its functionality among the available SMT threads.
But as Intel discovered with HT as opposed to SMT, compartmentalizing single threads into two pigeon-holes didn't lead to performance improvements across the board. In fact, some testers in 2005 discovered that the faster an HT chip was clocked, the slower it performed certain HT benchmarks - evidence of a real bottleneck.
Could Sun's choice of architecture multiply that problem by four? The question is one of scalability: Specifically, can UltraSPARC T2's performance scale upward in rough proportion with its thread count? Initial performance test numbers revealed by Sun yesterday may leave us scratching our heads. First, Sun made sure we knew these were performance estimates, which may mean they're not based on direct observations. But it's reporting that the T2 scored a 78.3 (we'll assume Sun means "peak" performance and not "base") using the latest SPECint_rate2006 (integer tasks) benchmark, and a 62.3 on the SPECfp_rate2006 (floating-point tasks).
How good is that? In the latest performance rankings from the SPEC organization, an HP ProLiant DL360 G5 using 2.66 GHz Intel Xeon X5355 processors posted a peak observed score of 82.1 in SPECint_rate2006 and a 58.6 on SPECfp_rate2006. Of course, that's with two quad-core chips, not one octo-core. Still, the single T2 chip was capable of running 64 threads, and yet was pretty much matched by a pair of Xeon chips that could run a total of 8.
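The comparison is easier to weigh once the scores are divided by chip and by hardware thread. The short Python calculation below uses only the figures quoted above; the dictionary layout and the `per_unit` helper are just an illustrative way to organize the arithmetic, and dividing a SPEC rate score by thread count is a back-of-the-envelope measure, not an official SPEC metric.

```python
# Scores and thread counts as quoted in the article.
systems = {
    "UltraSPARC T2 (1 chip)": {
        "int_rate": 78.3, "fp_rate": 62.3, "chips": 1, "threads": 64,
    },
    "2x Xeon X5355 (HP DL360 G5)": {
        "int_rate": 82.1, "fp_rate": 58.6, "chips": 2, "threads": 8,
    },
}


def per_unit(score: float, units: int) -> float:
    """Divide a benchmark score evenly across chips or threads."""
    return score / units


results = {}
for name, s in systems.items():
    results[name] = {
        "int_per_chip": per_unit(s["int_rate"], s["chips"]),
        "int_per_thread": per_unit(s["int_rate"], s["threads"]),
        "fp_per_chip": per_unit(s["fp_rate"], s["chips"]),
        "fp_per_thread": per_unit(s["fp_rate"], s["threads"]),
    }

for name, r in results.items():
    print(f"{name}: int/chip={r['int_per_chip']:.2f}, "
          f"int/thread={r['int_per_thread']:.2f}, "
          f"fp/chip={r['fp_per_chip']:.2f}, "
          f"fp/thread={r['fp_per_thread']:.2f}")
```

On the integer test this works out to roughly 78 per chip for the T2 against about 41 per chip for the Xeon, but only about 1.2 per thread for the T2 against about 10.3 per thread for the Xeon. Per chip the T2 looks strong; per thread, each of its 64 threads contributes far less than one Xeon core does.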
There could still be a payoff, however, in the price department. Sun hasn't announced official prices for UltraSPARC T2, but promises a roster that's "starting well below $1,000." That could make its processor power half as expensive as Xeon's. It doesn't necessarily mean servers based on this chip will be half as expensive; and if they're not, then all of Sun's promises of an 8x8 multi-strand future for processors may lead it into a sadly familiar corner of the marketplace where SPARC has gone before.