Yahoo chases spammers, builds Webmap on Hadoop open source framework
Yahoo has turned to the Hadoop open source cluster framework for crunching the data from "500 million users per month and billions of interesting events per day," said Yahoo's Ajay Anand, speaking at the Cloud Computing Expo in New York City this week.
Researchers from Yahoo Mail have already used Hadoop as a platform for finding botnets that are spewing out spam. Other Yahoo researchers have collaborated on Webmap, "a gigantic table of information [showing] every Web site, page, and link that Yahoo knows about," Anand told a packed hotel conference room on Tuesday.
First launched as an open source project back in 2005, Hadoop is a framework for running applications on large clusters built from inexpensive commodity hardware.
The framework is based on the Google File System (GFS) and Google's MapReduce, a computational paradigm that divides an application into many small fragments of work, each of which can be executed on any computing node in the cluster.
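To make the model concrete, here is a minimal word-count sketch written against Hadoop's Java MapReduce API. The class names follow the org.apache.hadoop.mapreduce packages, and details such as Job.getInstance vary between Hadoop releases, so treat it as illustrative rather than definitive.

// A minimal word-count sketch of the MapReduce model using Hadoop's Java API.
// API details vary between Hadoop releases; this is an illustrative sketch.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: each node receives a fragment of the input and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: the counts for each word are gathered from all mappers and summed.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}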
Although Hadoop itself is written in Java, the Hadoop team at Yahoo has pitched in along the way with contributions that include Pig Latin, a high-level language that generates MapReduce jobs. "Pig Latin is much simpler [for users] to understand," he contended. Myriad other contributions to the project have included Facebook's Hive, a data warehousing layer.
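To give a sense of the contrast Anand was drawing, the same word count can be expressed in a handful of Pig Latin statements. The sketch below embeds them in Java through Pig's PigServer API; the input file, alias names, and output path are hypothetical examples, and the exact API varies by Pig release.

// A hedged sketch of running Pig Latin from Java via PigServer.
// The input file, aliases, and output path are hypothetical.
import java.io.IOException;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
  public static void main(String[] args) throws IOException {
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // Each Pig Latin statement below is compiled by Pig into MapReduce jobs.
    pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

    // STORE triggers execution of the generated MapReduce jobs on the cluster.
    pig.store("counts", "wordcount-output");
  }
}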
Why did Yahoo join in on the open source project, anyway? "Data is the next big opportunity for the Web," Anand explained, giving credit to Dr. Tim Berners-Lee's description of Web 2.0. "The second phase [of the Web] is based on doing some intelligent kinds of analysis on the various kinds of data you have."
Aside from analyzing its log files, Yahoo was initially interested in creating better search indexes, optimizing ads, and experimenting with machine learning. But Web 2.0 also offers "opportunities to focus on things you wouldn't necessarily otherwise do."
Anand said that the Hadoop framework provides the capabilities needed for drilling down into huge volumes of data: massive scalability; cost-effective performance on commodity hardware; a reliable infrastructure able to cope with disk failures; and the ability to share computing resources across applications.
If one department of a company requires 4,000 computers to run an application, for example, it no longer has to go out and buy them. Instead, the same servers can be shared by multiple departments.
The Hadoop Distributed File System (HDFS) achieves reliability by replicating data across multiple hosts. Under the Hadoop approach, data storage scales horizontally, whereas metadata -- or "data about the data" -- scales vertically. Essentially, tasks are moved to "where the data is, for better performance," he said.
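As a rough sketch of what that replication looks like from an application's point of view, the following Java snippet writes a file to HDFS and requests three replicas through the standard dfs.replication setting. The path, replication factor, and cluster configuration are illustrative assumptions, not details of Yahoo's deployment.

// A small sketch of writing a file to HDFS with an explicit replication factor.
// The path and the factor of 3 are illustrative; the cluster address is assumed
// to come from the Hadoop configuration files on the classpath.
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Ask HDFS to keep three copies of each block on different hosts.
    conf.set("dfs.replication", "3");

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/example.txt");

    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hadoop".getBytes(StandardCharsets.UTF_8));
    }

    // The metadata (held by the name node) records the replication factor.
    short replicas = fs.getFileStatus(file).getReplication();
    System.out.println("Replication factor: " + replicas);
  }
}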
Yahoo's Hadoop team, which started out with five engineers or so, has since mushroomed to 15 or 20. Meanwhile, less technically inclined researchers at Yahoo have learned Pig Latin for programming the framework. Hadoop can also be programmed using scripting languages and C/C++, in addition to Java, he noted.
Hadoop has "encouraged researchers at Yahoo to get new insights into data," according to Anand. Many of them have "started skunkworks project [and] scaled up to 400 nodes" without having to go through an approval process, since new hardware didn't need to get purchased.
Computational processing is a lot faster, too. Yahoo built its entire Webmap on Hadoop in just 70 hours, the audience was told.