Hadoop – What’s in a name?

Juliet:
“What’s in a name? That which we call a rose By any other name would smell as sweet.”

Romeo and Juliet

What is Hadoop?

It is the name of a baby elephant. Isn’t that amusing? Look at the Hadoop logo.

[Image: Hadoop logo]

Do you see the baby elephant? Doug Cutting was looking for a name for his project that wasn’t already a web domain and wasn’t trademarked. He tried various names and finally settled on the name of a yellow stuffed elephant that his son used to play with.

Let us look at the ancestors of Hadoop.

Lucene -> Nutch -> Hadoop.

Apache Lucene is a high-performance, full-featured text search engine. Nutch is an extension of Lucene and is a complete web search engine. The major challenge for Nutch was handling distributed processing, redundancy, automatic failover, load balancing and scalability.

Google had published papers on the Google File System and the MapReduce framework, both of which were instrumental in scaling up its search engine. Doug quickly realized that these two tools could be applied to Nutch, and that is how Hadoop was born.

Again, what’s in a name? Lucene is the middle name of his wife, and his toddler son used Nutch as the all-purpose word for a meal. That is a bit of the history behind the names.

Now for the serious talk

Hadoop is an open source framework for processing massive amounts of data in a distributed computing environment. It is designed to scale from a single server to thousands of machines.

Salient Characteristics of Hadoop:

  • Scalable – Hadoop scales linearly as nodes are added to or removed from the cluster.
  • Reliable – It is designed to run on commodity hardware, so it is architected with the assumption that hardware will fail frequently.
  • Cost Effective – Hadoop is designed to run on commodity hardware and relies heavily on parallel computing.
  • Flexible – It is schemaless and can gracefully handle unstructured and semi-structured data.

Why do we need Hadoop?

Most of the data available in the world is unstructured, and enterprises have barely begun to explore it. The exponential growth of data led the computing world to look for unconventional ways to process it, and that’s where Hadoop comes into the picture. Before throwing light on Hadoop’s components, let us compare SQL and Hadoop.

Hadoop is meant for processing large sets of unstructured data, whereas SQL is designed to process structured data. SQL and Hadoop can be complementary, as SQL is a query language that can be implemented on top of Hadoop as an execution engine.

  • Vertical Scaling and Horizontal Scaling – In general, Hadoop scales out horizontally whereas SQL databases scale up vertically. To process massive datasets, a SQL database usually requires a very high-end server, whereas Hadoop needs a set of commodity machines. Scaling out is what makes Hadoop cost effective.
  • Key/Value Pairs and Tables – Hadoop is schemaless, is designed primarily for unstructured data, and works on key/value pairs. SQL has a schema, keeps its data in tables, and is designed primarily for structured data and legacy applications.
  • MapReduce and SQL – MapReduce is a functional programming paradigm in which the programmer spells out how to compute the result, whereas SQL is a declarative language in which the programmer states what result is wanted.
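
To make the "how" versus "what" distinction concrete, here is a minimal sketch in Python. It is not Hadoop code. The `sales` data and both function names are made up for illustration, and the standard-library `sqlite3` module simply stands in for a SQL engine. Both functions compute the same per-region totals: one states the result as a query, the other spells out the key/value steps by hand.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, amount) records.
sales = [("east", 10), ("west", 5), ("east", 7), ("west", 3)]


def total_with_sql(rows):
    """Declarative: state what result is wanted; the engine decides how."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    result = dict(
        conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
    )
    conn.close()
    return result


def total_with_map_reduce(rows):
    """MapReduce style: spell out how pairs are emitted, grouped and combined."""
    # Map: emit one (key, value) pair per input record.
    pairs = [(region, amount) for region, amount in rows]
    # Group: collect all values that share a key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Reduce: combine the values of each key into a single result.
    return {key: sum(values) for key, values in grouped.items()}
```

Calling `total_with_sql(sales)` and `total_with_map_reduce(sales)` should give the same dictionary of totals. The second version is longer, but every step is under the programmer's control, which is the trade-off the last bullet describes.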

To handle massive sets of data, Hadoop has two components:

  • MapReduce
  • HDFS

MapReduce

MapReduce relies primarily on two components: Mappers and Reducers. A Mapper transforms and filters the data, and a Reducer aggregates it.
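
As a rough illustration of those two roles, here is a small single-process sketch in Python. It does not use the Hadoop API. The log format ("LEVEL service message") and the names `mapper`, `reducer` and `run_job` are assumptions made for this example. The mapper filters out everything except error lines and transforms each one into a (service, 1) pair, and the reducer aggregates the pairs into a count per service.

```python
from collections import defaultdict


def mapper(line):
    """Filter and transform: keep ERROR lines, emit a (service, 1) pair."""
    parts = line.split(maxsplit=2)  # assumed format: "LEVEL service message"
    if len(parts) >= 2 and parts[0] == "ERROR":
        yield parts[1], 1


def reducer(key, values):
    """Aggregate: sum all the values emitted for one key."""
    return key, sum(values)


def run_job(lines):
    """Run mappers, group their output by key, then run reducers."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


# Hypothetical input records.
logs = [
    "ERROR auth timeout",
    "INFO auth ok",
    "ERROR db down",
    "ERROR auth denied",
    "",
]
error_counts = run_job(logs)
```

In a real cluster the mappers and reducers would run on many machines and the grouping step would move data across the network. Here everything happens in one loop, purely to show the shape of the computation.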

HDFS

Hadoop follows a master/slave architecture for both distributed storage and distributed computing. The distributed storage layer is called HDFS (Hadoop Distributed File System).

More details in the next post.

This article is based on the book “Hadoop in Action” by Chuck Lam.
