Finding the Hidden Meanings in Google Code Search
Generally with a search engine, you get the best results when your queries contain just enough related terms or criteria for the engine to determine a common context. Each new term in a query strengthens the context of the information you're trying to locate - for example, with the Google query strawberry fields -Beatles location.
Historically, finding a good example of source code on the Web has been a matter of crafting the right query that will hit on the common language text that's adjacent to the code you're trying to find. Today, Google's new Code Search facility begins exploring a new premise: Is there a way to apply the same search tool as for a common language query, in order to locate a passage of source code from Google's vast library of open-source pages?
The jury may still be out on this one. Much of what one says in a common language can be deciphered by a search engine like Google without the need for a sophisticated language interpreter; usually, the mere fact that certain terms are used close to one another is enough to establish a common context for them.
But source code isn't lexical by design; it's algebraic. Its reliance on purely variable symbolism is evident in the fact that programmers still use throwaway symbols like foo and bar as utility variables, usually with the lowest and least restrictive scopes.
Suppose, for instance, I were developing a route finder application for a mapping service, such as Google Maps. Something algorithmic that I'd be interested in seeing that pertains to that job would be a search for a so-called "Hamiltonian cycle" or "Hamiltonian circuit" - a round-trip path through a given set of vertices or map points that passes through every point exactly once. Finding the shortest such path is the classic traveling salesman problem.
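To make the target of the search concrete, here is a minimal sketch in Python of the kind of routine I had in mind. It is a brute-force illustration only, not code returned by any search: it tries every round trip that touches each point exactly once and keeps the shortest. The function name, the distance table and its values are all hypothetical, and the approach is only practical for a handful of points.

```python
from itertools import permutations


def shortest_hamiltonian_cycle(dist):
    """Return (length, tour) for the shortest round trip visiting every point once.

    dist is a square table of distances; None marks a missing direct link.
    Returns (None, None) when no such round trip exists.
    """
    n = len(dist)
    best_len, best_tour = None, None
    # Fix point 0 as the start so rotations of one cycle are not retried.
    for perm in permutations(range(1, n)):
        tour = (0,) + perm + (0,)
        total = 0
        for a, b in zip(tour, tour[1:]):
            if dist[a][b] is None:  # no direct link: not a valid cycle
                total = None
                break
            total += dist[a][b]
        if total is not None and (best_len is None or total < best_len):
            best_len, best_tour = total, tour
    return best_len, best_tour


# Hypothetical four-point map; the distances are made up for illustration.
DIST = [
    [None, 2, 9, 10],
    [2, None, 6, 4],
    [9, 6, None, 3],
    [10, 4, 3, None],
]

best_len, best_tour = shortest_hamiltonian_cycle(DIST)
print(best_len, best_tour)
```

Notice that nothing in the listing has to contain the word Hamiltonian except the names and comments I chose to give it, which is precisely the difficulty for a text-based search.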
I can actually find such an algorithm quite quickly using the ordinary Google query line. But what if I want to see a plethora of examples? Since there's nothing specific about an algorithm that dictates the names of the symbols it must use (symbology in most code need only be consistent within the confines of its own modules), what I have to hope for is that the programmers whose works are archived here were generous enough to have supplied comment lines that use the term Hamiltonian, or perhaps thought to use the term either as or within the name of a variable.
The first seemingly meaningful snippet I turned up using Google Code Search was written in Python, from a library that generates a class whose name is Hamiltonian. At first, I thought I was in luck. But judging from the name of the program to which the library belonged -- called "Lanthanide" -- I realized this actually pertained to measuring the dual-photon absorption rate of lanthanide compounds, which are used in the doping of semiconductors. Interesting concept, but clearly the wrong context.
Further down the list of returned search results, I found a snippet of C++ code that represents a class called AbsHamiltonian. But following the logic in my mind and not seeing it pertain to mapping, I felt inclined to track down its terminology once again. This time, I discovered the snippet pertained to the simulation of a Hamiltonian matrix - an array of values used in multi-body mechanics and molecular dynamics, including quantum dynamics. Another very interesting field, but again, a digression.
The third time ended up the charm, as I finally located a snippet that pertained to the branch of science I was interested in, without the detour past two other fields of endeavor that were similarly inspired by the work of Sir William Rowan Hamilton.
In this particular instance, I was interested in some element of source code whose purpose I could describe using common language. But what if I were interested in code not for its context, but for its construction instead? Would I be able to recall the specific way that a certain class was instantiated, or just how many lines there were within the if block I may be looking for?
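For contrast, here is a rough sketch of what a query by construction could look like if a tool parsed the code rather than matching its text. This is not how Google Code Search works; it is an assumption-laden illustration using Python's standard ast module (Python 3.8 or later, for end_lineno). The function name find_constructs, the Route class and the sample source are all hypothetical.

```python
import ast


def find_constructs(source, class_name, min_if_lines):
    """List (kind, line) pairs for direct calls to class_name and for
    if statements spanning at least min_if_lines source lines."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == class_name
        ):
            hits.append(("call", node.lineno))
        elif isinstance(node, ast.If):
            # Count lines from the "if" keyword to the end of the statement.
            if node.end_lineno - node.lineno + 1 >= min_if_lines:
                hits.append(("if", node.lineno))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical source to search; Route is an invented class name.
SAMPLE = """\
route = Route(points)
if route.closed:
    total = route.length()
    print(total)
"""

hits = find_constructs(SAMPLE, "Route", 3)
print(hits)
```

A matcher like this only sees calls written as a bare name, so it would miss an instantiation made through a module prefix or an alias; it is meant to show the shape of the question, not to answer it in general.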
Although Google Code Search is one way to attract traffic to the wide, wonderful sea of public source code, it may also be a lazy way for Google to provide access to it while avoiding having to catalog and categorize it all, hoping that its otherwise powerful query line tool will compensate for the lack of "handles" that would make code snippets truly useful.
Of course, more mischievous minds than mine were the first to try this new feature out and pronounce their findings: Someone has already discovered that software license key generators also fall under the category of "public source code," and located techniques for generating access keys to commercial software.
Google's is not the first attempt at a search engine for source code. Koders.com has been running a similar search line for quite some time now, and it also offers a tie-in feature with Microsoft Visual Studio. This way, programmers can find, download and link their code to pre-existing solutions to problems they may already be trying to solve, such as finding the simplest route that links all given points on a map.
Google may need to adopt similar tools if it has any interest in being competitive in this field; otherwise, this first rendition of Code Search feels like going deep sea fishing with a crossbow. It's a nice place, and it's a nice tool, but they don't mix.