Top 10 Windows 7 Features #5: Multitouch
For close to two decades now, the design of applications has changed surprisingly little. At their core, apps wait for users to generate input, and they respond -- a client/server model of processing on a very local scale. So in a very real way, what applications do has been a function of how they respond -- the whole graphical environment thingie you've read about has really been a sophisticated way to break down the signals the user gives into tokens the application can readily process.
The big roadblock that has stalled the evolution of applications from where they are now to systems that can respond to such things as voice and language -- sophisticated processes that analyze input before responding to it -- is the token-oriented nature of their current fundamental design. At the core of most typical Windows applications, you'll find a kind of switchboard that's constantly looking for the kinds of simple input signals it already recognizes -- clicking on this button, pulling down this menu command, clicking on the Exit box -- and forwarding the token for that signal to the appropriate routine or method. Grafting natural-language input onto these typical Windows apps would require a very sophisticated parser whose products would be nothing more than substitutes for the mouse, and probably not very good substitutes at that.
If we're ever to move toward an analytical model of user input, we have to introduce some sophisticated processes in-between -- we have to start endowing apps with the capability to ask the question, "What does the user mean?" And while this edition of Top 10 isn't really about natural language processing, it is about Microsoft's first genuine steps toward evolving applications in the direction they need to go if they're ever to absorb natural language as the next big step.
That first step, perhaps unexpectedly, is multitouch. It's a somewhat simpler version of the bigger problem of ascertaining meaning from input: now that Windows apps will be able to process input coming from two or more points on the screen at once, the relationships between those multiple points will need to be analyzed. For example, when a program receives a series of signals that appear to show multiple points, arranged in a generally vertical pattern, moving slowly and then very quickly to the right...could that mean the user wants to "wipe the slate clean?" "Start over?" "Clear screen?" What's the probability of this?
Analyzing input changes everything, and if enough developers invest their true talents in addressing this necessary element of evolution, this could change everything for Windows apps -- everything.
Where we are now on the evolutionary scale is not too far ahead of where we started roundabout 1982. The creation of Common User Access (the use of graphical menus and dialog boxes that follow a standard format) led to the development of a kind of "switchboard" model for processing input. And if you follow the model of Windows programming espoused ever since the days of Charles Petzold's first edition of Programming Windows, that switchboard is the part you build first -- if you're a developer, everything you make your program do follows from what you put in its menu.
Veteran Windows programmers going back to straight C know this switchboard as the WndProc() procedure; and although the languages that have crept into IDEs since the Windows/386 days do use different models and conventions, for the most part they just provide developers with shortcuts to building this basic procedure. Literally, this procedure looks for the unique ID numbers of recognized input signals, called "window messages" (mouse events among them), denoted in the nomenclature by the prefix WM_. A long switch statement goes down a list, checking the most recently received message ID against each possible response, one at a time, and jumping to the routine associated with that response once there's a match. It's like looking for a plug on a telephone switchboard left-to-right, top-to-bottom, every time there's an incoming call.
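The switchboard pattern can be sketched in a few lines of plain C. This is a simplified stand-in, not real Win32 code: the message IDs below mirror the actual values of WM_PAINT, WM_CLOSE, and WM_COMMAND, but a real WndProc() also takes an HWND plus WPARAM/LPARAM arguments and forwards unrecognized messages to DefWindowProc().

```c
/* A minimal sketch of the WndProc() "switchboard." The DEMO_WM_* IDs
 * mirror the real WM_* values from <windows.h>, but the dispatcher is
 * deliberately simplified for illustration. */

enum { DEMO_WM_PAINT = 0x000F, DEMO_WM_CLOSE = 0x0010, DEMO_WM_COMMAND = 0x0111 };
enum Handler { ON_PAINT, ON_CLOSE, ON_COMMAND, ON_UNHANDLED };

/* Check the incoming message ID against each recognized signal, one
 * case at a time, and route it to the matching response -- the
 * "telephone switchboard" lookup described above. */
enum Handler WndProcDemo(unsigned int msg)
{
    switch (msg) {
    case DEMO_WM_PAINT:   return ON_PAINT;     /* repaint the client area */
    case DEMO_WM_COMMAND: return ON_COMMAND;   /* a menu item was chosen */
    case DEMO_WM_CLOSE:   return ON_CLOSE;     /* tear the window down */
    default:              return ON_UNHANDLED; /* DefWindowProc territory */
    }
}
```

Every message that arrives, whatever it means to the user, runs through this same linear lookup -- which is exactly why grafting richer input onto the model is hard.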
Throughout the history of Windows, the evolution of application architecture has typically taken place in one of two ways: Either Microsoft improves some underlying facet of the operating system, which leads to an improvement (or at the very least, an obvious change) in how the user perceives her work immediately; or Microsoft implements a change which developers have to seize upon later in order for users to see the benefits down the road. As you'll certainly recall, the change to the security of the system kernel in Windows Vista was immediate, and its benefits and detriments were felt directly. But when Microsoft began introducing Windows Presentation Foundation (WPF) during the twilight of Windows XP's lifecycle, it took time for developers to transition away from the Microsoft Foundation Classes (MFC), and many still choose not to.
Multitouch is one of those changes that falls into the latter category. Since December, when the company published its white paper on the subject, Microsoft has been calling its latest layer of user input Windows Touch. Windows 7 will have it, and presumably at this point, Vista can be retrofitted with it.
Windows Touch is an expansion of WPF to incorporate the ability to do something the company's engineers call coalescing -- ascertaining whether multiple inputs have a collective meaning. Using two fingers to stretch or shrink a photograph is one example, or at least, it can be: In an application where two fingers may be used for a variety of purposes -- twirling, changing perspective, maybe even minimizing the workspace -- an act of coalescing would involve the application's ability to register the multiple inputs, and then ascertain what the user's intent must be based on the geometry that WPF has returned.
The concept of coalescing was introduced to us last October by Reed Townsend, the company's lead multitouch engineer. As he told attendees at the PDC 2008 conference that month, the revised Win32 Application Programming Interface (which will probably still be called that long after the transition to 64-bit is complete) will contain a new window message called WM_GESTURE, and it will handle part of the job of coalescing motions, in order to return input messages to an application that go just beyond what the mouse pointer can do for itself. Rotation, zooming, and panning, for instance, may require a bit more decision making on the part of Windows Touch -- a kind of on-the-spot forensics that filters out panning motions from scrolling motions, for instance.
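To make the idea of coalescing concrete, here is a toy classifier of the kind Windows Touch performs before raising WM_GESTURE: given where two touch points started and ended, it decides whether the pair reads as a zoom (the spacing between the fingers changed) or a pan (both fingers traveled together, spacing kept). The geometry test and thresholds are invented for illustration; the real API does this filtering internally and reports its verdict through gesture IDs such as GID_ZOOM, GID_PAN, and GID_ROTATE.

```c
/* A toy two-finger "coalescer": classify a pair of touch trajectories
 * as zoom or pan from their geometry. Thresholds are arbitrary demo
 * values in squared pixels. */

typedef struct { int x, y; } Pt;
enum Gesture { GESTURE_NONE, GESTURE_ZOOM, GESTURE_PAN };

/* squared distance between two points (avoids sqrt and libm) */
static long Dist2(Pt a, Pt b)
{
    long dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;
}

enum Gesture Coalesce(Pt a_start, Pt b_start, Pt a_end, Pt b_end)
{
    /* did the spacing between the two fingers change? */
    long spread = Dist2(a_end, b_end) - Dist2(a_start, b_start);
    /* how far did the fingers travel overall? */
    long travel = Dist2(a_start, a_end) + Dist2(b_start, b_end);

    if (spread > 100 || spread < -100)  /* spacing changed: zoom in/out */
        return GESTURE_ZOOM;
    if (travel > 100)                   /* moved in concert: pan */
        return GESTURE_PAN;
    return GESTURE_NONE;
}
```

Even this crude two-way split shows the shift in architecture: the program is no longer matching a known signal against a switchboard, it's weighing evidence about what the user probably meant.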
Since the first editions of the Tablet SDK were produced during XP's lifecycle, Microsoft has been building up a vocabulary of coalesced gestures that developers may be able to utilize through the assembly of gesture messages. Perhaps a down stroke followed by a right stroke may have some unique meaning to an app; and perhaps the length of that right stroke may impart a deeper meaning.
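A gesture vocabulary of the kind described above might look like the sketch below: a sequence of already-coalesced strokes ('D' for down, 'R' for right) is matched against known patterns, with an extended trailing stroke imparting the deeper meaning. The encoding, the patterns, and the meanings are all hypothetical; the Tablet SDK's actual gesture set is richer and delivered through its own message types.

```c
/* A hypothetical stroke vocabulary: each character is one coalesced
 * stroke ('D' = down, 'R' = right). The pattern-to-meaning mapping is
 * invented purely for illustration. */

enum Meaning { MEANING_NONE, MEANING_DELETE, MEANING_DELETE_ALL };

enum Meaning InterpretStrokes(const char *strokes)
{
    /* a down stroke followed by a right stroke */
    if (strokes[0] == 'D' && strokes[1] == 'R') {
        if (strokes[2] == 'R')          /* a longer right stroke... */
            return MEANING_DELETE_ALL;  /* ...deepens the meaning */
        if (strokes[2] == '\0')
            return MEANING_DELETE;
    }
    return MEANING_NONE;
}
```

The point isn't the particular mapping but the layering: once the system coalesces raw points into strokes, the application can assign meaning to sequences of strokes rather than to individual WM_ messages.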
Simply endowing programs with the capability to judge what the user means, based on the input she appears to be giving, changes the architecture of those programs -- particularly in Windows. Born from the company's experiments with Surface, a simple manipulable globe application has prompted Microsoft's engineers to think very differently about how input is processed.
The idea, as we saw last October, is simple enough: Place a globe on the screen which the user can twist and turn, then zoom into to see detail. The project became more involved once the engineers began to appreciate the complexity of the problems they were tackling. Stretching and shrinking a photograph is a simple enough concept, particularly because a photo is a) two-dimensional, and b) rectangular. But a three-dimensional sphere adds new dimensions, as users are likely to want to manipulate the object spherically. "Down" and "up" become curved, sweeping motions; and zooming in on an area becomes a very complex exercise in transformation. Add two-dimensional floating menus to the equation, and suddenly the application has to analyze whether a sweeping motion means, "Get that menu out of my way," "Pin that menu to that building," or "Make that menu smaller for me." And when a big 3D object such as the Seattle Space Needle comes into view, and the user touches its general vicinity, does he mean to be touching the building itself? Or the globe on which the building is positioned?
In order to make input simpler -- to present a globe that folks will already know how to move, zoom into, and examine closely -- the application has to incorporate orders of magnitude more complexity than developers have ever worked with before. The moment the program "guesses wrong," and the globe behaves in a way users didn't expect, they'll feel disconnected from it, as though an object in the real world were to wink out of existence or morph into some foreign substance.
Next: The immediate upside of multitouch...