Why Hadoop?
Hadoop is attracting a lot of attention as companies consider how to handle problems involving large amounts of data (terabytes to petabytes). It also draws interest from companies looking at distributed computing, and it is sometimes seen as an alternative to an RDBMS. So what is Hadoop, and how would it be used in the enterprise now and in the near future?
What is Hadoop?
Briefly, Hadoop is an Apache open source project modeled on technologies that Google developed to power its search engine and other products (Google Maps, Analytics, etc.). Hadoop consists of a number of sub-projects, but the main ones are the Hadoop Distributed File System (HDFS), MapReduce and HBase (a multi-dimensional sorted map used as a data store). HDFS creates a massive, durable, cost-efficient file system by replicating file data three ways (the default setting) across a cluster of commodity servers.
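To make the replication idea concrete, here is a minimal sketch of three-way replication across a small cluster. It is an illustration only: the function names, server names and the simple round-robin placement are assumptions made for this example, and the real HDFS placement policy is more involved (it takes racks into account, for instance).

```python
def place_replicas(block_ids, servers, replication=3):
    """Assign each block to `replication` distinct servers, round-robin.

    Hypothetical helper for illustration; not the HDFS placement policy.
    """
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    placement = {}
    for i, block_id in enumerate(block_ids):
        # Consecutive servers, wrapping around the end of the list.
        placement[block_id] = [
            servers[(i + k) % len(servers)] for k in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_servers):
    """Return the blocks that still have at least one live replica."""
    return [
        block_id
        for block_id, holders in placement.items()
        if any(server not in failed_servers for server in holders)
    ]
```

With five servers and three replicas per block, losing any two servers still leaves every block readable, which is what makes inexpensive commodity hardware acceptable.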
For more on Hadoop: http://hadoop.apache.org
Use Case #1: Searching large quantities of unstructured data quickly.
Most of the interest in Hadoop comes from companies that want to store, search and analyze large amounts of data, in the terabyte to petabyte range. Just as the Google File System (GFS) underpins Google’s search engine, Hadoop’s HDFS underpins Yahoo!’s search engine. Because the impetus for creating Hadoop was Internet search, its main use so far has been searching unstructured data such as web pages and other documents.
Use Case #2: Accessible, programmable distributed computing paradigm.
Distributed computing has long been seen as a shining light at the end of the processing tunnel, but it has gone largely unused in enterprise environments because of its complexity. The MapReduce framework is starting to change that: it is a conceptually simple programming model that brings distributed computing within reach. MapReduce combined with HDFS has proven successful because it abstracts away any knowledge of where the data lives. MapReduce works by pushing the processing to the data rather than moving the data to the processing. This approach has become viable only as computing power and memory have increased to the point where complex computations can be completed quickly on commodity hardware.
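The programming model described above can be sketched in a few lines. This is a single-process illustration of the map, shuffle and reduce steps using a word count, not the Hadoop API; the function names are assumptions chosen for this example. In a real cluster the framework would run the map step on the servers that already hold each piece of data.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over all documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

The developer writes only the map and reduce functions. Grouping the intermediate pairs and deciding where each piece of work runs is left to the framework, which is what keeps the model simple.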
Use Case #3: RDBMS replacement for large data sets.
There is little reason to replace your RDBMS with Hadoop if your database is of a moderate size (in the gigabyte range). Hadoop starts to look legitimate when the cost/benefit of purchasing ever larger servers to run the RDBMS becomes worse than that of purchasing and maintaining a large cluster of commodity servers. Retraining your developers from SQL-based to NoSQL-based development counts against Hadoop and further complicates the picture. Where Hadoop wins out is when the RDBMS can no longer support the queries being run against a massive data set. Restructuring the data set for Hadoop and HBase can bring query times back to a more reasonable level. This is not a trivial amount of work, but it may be necessary if other alternatives fail.
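As a rough picture of what restructuring for a sorted-map store means, the sketch below keeps rows ordered by a row key so that a query becomes a scan over one contiguous key range. Everything here is hypothetical and in-memory: the SortedTable class, its methods and the customer#date key layout are assumptions for illustration, not the HBase client API.

```python
import bisect


class SortedTable:
    """Toy stand-in for a sorted key-value store (not the HBase API)."""

    def __init__(self):
        self._keys = []  # row keys, kept in sorted order
        self._rows = {}  # row key -> {column: value}

    def put(self, row_key, column, value):
        """Insert or update one cell, keeping row keys sorted."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def scan(self, start, stop):
        """Yield (row_key, columns) for start <= row_key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]
```

If orders are stored under keys such as "cust42#2010-01-05", then all of one customer's orders for January 2010 come back from scan("cust42#2010-01", "cust42#2010-02") without touching any other rows. Designing the row key around the queries you need is a large part of the restructuring work.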