Google releases its data encoding format to compete with XML
In an effort to solve the problems of bulk and processing time when encoding large databases, Google developed its own alternative to XML. Yesterday, the company began encouraging others to adopt it in place of the industry standard.
There's an argument that open standards are only truly useful when one standard applies to any given category of service -- an argument that was raised in the matter of application formats. Now the broader category of data encoding -- handled nowadays by XML -- is about to receive a big challenge, ironically from the group perceived as the champion of open standards in Internet communication: Google.
Yesterday afternoon, Google publicly released documentation for a system it has been using internally, called Protocol Buffers, inviting others to use it as well. And in a surprising blog post, one of its own software engineers argued that its system was preferable to XML because it's less expensive to deploy, and can more easily scale up to very large databases.
"As nice as XML is, it isn't going to be efficient enough for this scale. When all of your machines and network links are running at capacity, XML is an extremely expensive proposition," wrote Google software engineer Kenton Varda. "Not to mention, writing code to work with the DOM tree can sometimes become unwieldy."
Google's public documentation shows Protocol Buffers (which has yet to be formally abbreviated) is indeed conceptually different from XML, in that it's rooted more in procedural logic than structural declaration. In XML, there's a schema which defines the structures of tables and recordsets, which is separate from the document that relates the contents of records in that structure.
In Protocol Buffers, by contrast, one file contains class-like declarations (Google's documentation calls them messages) whose composition looks much more like C++. They're called .proto files, and they define structural prototypes for tables using object-oriented language with which many programmers are already familiar. Each member of a class -- analogous to an entry in a database -- has characteristics that define its type in memory, just like a variable.
But here, in an unusual departure from the norm, what look like default value assignments for these members are actually numeric tags: each member, whether a string or a numeral, is assigned a unique integer that identifies it within an encoded record. Imagine if data were streamed onto recording tape, the way it used to be in the late 1960s and '70s. It's that streaming of the data sequence, without all the fenceposts, that differentiates Protocol Buffers from XML, by replacing the markups that say when an entry or a record starts and stops with those compact numeric tags.
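To make the contrast concrete, here is a minimal Python sketch of how a numbered member might be written to a byte stream, next to the same record in XML. It follows the key-and-varint scheme described in Google's encoding documentation (a key combining the member's number with a wire type, then the value); the record itself, with a name numbered 1 and an id numbered 2, and the helper function names are assumptions made for illustration, not Google's code.

```python
from xml.etree import ElementTree as ET


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer 7 bits at a time, low-order group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_int_field(tag: int, value: int) -> bytes:
    # Key = (tag << 3) | wire type 0 (varint), followed by the value.
    return encode_varint((tag << 3) | 0) + encode_varint(value)


def encode_string_field(tag: int, value: str) -> bytes:
    # Key = (tag << 3) | wire type 2 (length-delimited), then length, then bytes.
    data = value.encode("utf-8")
    return encode_varint((tag << 3) | 2) + encode_varint(len(data)) + data


# Hypothetical record: name carries tag 1, id carries tag 2.
binary = encode_string_field(1, "Ada") + encode_int_field(2, 150)

# The same record as marked-up XML, with explicit start and stop tags.
person = ET.Element("person")
ET.SubElement(person, "name").text = "Ada"
ET.SubElement(person, "id").text = "150"
xml_bytes = ET.tostring(person)

print(binary.hex(), len(binary))          # 0a03416461109601 8
print(xml_bytes.decode(), len(xml_bytes))
```

The member names never appear in the binary stream, only their numbers, which is where much of the size difference comes from. A two-member record is far too small to confirm or refute Google's published ratios; it only shows the mechanism.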
Setting the data contents then takes place programmatically, using programming language constructs rather than a marked-up data file.
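As a rough illustration of what that looks like, the sketch below uses a hand-written Python dataclass as a stand-in for the kind of data access class the Protocol Buffers compiler generates from a .proto file. The Person record, its members, and the "tag" metadata key are assumptions for illustration; real generated classes come from Google's tools and offer a richer API than this.

```python
from dataclasses import dataclass, field, fields

# The stand-in below mirrors a declaration of roughly this shape,
# modeled on the style of Google's published examples:
#
#   message Person {
#     required string name  = 1;
#     required int32  id    = 2;
#     optional string email = 3;
#   }


@dataclass
class Person:
    # Each member has a type, like a variable, plus its numeric tag.
    name: str = field(default="", metadata={"tag": 1})
    id: int = field(default=0, metadata={"tag": 2})
    email: str = field(default="", metadata={"tag": 3})


# Contents are set with ordinary assignments, not in a marked-up file.
person = Person()
person.name = "Ada"
person.id = 150

# The tags stay with the class definition, not with the data values.
tags = {f.name: f.metadata["tag"] for f in fields(Person)}
print(person)
print(tags)
```

The point of the design is that a program works with typed members directly, rather than walking a DOM tree to find and convert text nodes.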
Under the heading, "Why not just use XML?" an overview page in the Protocol Buffers documentation reads, "Protocol buffers have many advantages over XML for serializing structured data. Protocol buffers: are simpler, are 3 to 10 times smaller, are 20 to 100 times faster, are less ambiguous, [and] generate data access classes that are easier to use programmatically."
Some might argue that, in the effort to solve the bulk problem, Google didn't really invent anything new at all -- it simply reverted to the older concept of the interface definition language (IDL), a defining feature of the era of COM and CORBA. Google anticipated that argument, and yesterday Varda offered a pre-emptive counter-argument to the question, "Isn't it just another IDL?"
"Yes, you could call it that. But, IDLs in general have earned a reputation for being hopelessly complicated," Varda wrote. "On the other hand, one of Protocol Buffers' major design goals is simplicity. By sticking to a simple lists-and-records model that solves the majority of problems and resisting the desire to chase diminishing returns, we believe we have created something that is powerful without being bloated. And, yes, it is very fast -- at least an order of magnitude faster than XML."