“What’s in a name? That which we call a rose By any other name would smell as sweet.”
Romeo and Juliet
What is Hadoop?
It is the name of a toy elephant. Amusing, isn't it? Look at the Hadoop logo.
Do you see the baby elephant? When Doug Cutting was looking for a name for his project, one that wasn't already a web domain and wasn't trademarked, he tried various names and finally settled on the name of the stuffed yellow elephant his son used to play with.
Let us look at the ancestors of Hadoop:
Lucene -> Nutch -> Hadoop.
Apache Lucene is a high-performance, full-featured text search library. Nutch extends Lucene into a complete web search engine. The major challenges with Nutch were distributed processing, redundancy, automatic failover, load balancing, and scalability.
Around that time, Google published papers on the Google File System and the MapReduce framework, both of which were instrumental in scaling up its search engine. Doug quickly realized the applicability of these two tools to Nutch, and that is how Hadoop was born.
Again, what's in a name? Lucene is his wife's middle name, and Nutch was his toddler son's all-purpose word for a meal. That is the bit of history behind the names.
Now, the serious talk:
Hadoop is an open-source framework for massive data processing in a distributed computing environment. It is designed to scale from a single server to thousands of machines.
Salient Characteristics of Hadoop:
- Scalable – Hadoop scales nearly linearly as nodes are added or removed.
- Reliable – It is designed to run on commodity hardware, so its architecture assumes frequent hardware failures and recovers from them automatically.
- Cost-effective – Hadoop runs on commodity hardware and relies heavily on parallel computing, avoiding expensive specialized servers.
- Flexible – It is schema-less and can gracefully handle unstructured and semi-structured data.
Why do we need Hadoop?
Most of the data in the world is unstructured, and enterprises have barely begun to explore it. The exponential growth of data led the computing world to look for unconventional ways to process it; that is where Hadoop comes into the picture. Before throwing light on Hadoop's components, let us compare SQL and Hadoop.
Hadoop is meant for processing large sets of unstructured data, whereas SQL is designed to process structured data. SQL and Hadoop can be complementary: SQL is a query language, and it can be implemented on top of Hadoop as the execution engine.
- Vertical Scaling and Horizontal Scaling – In general, Hadoop scales out horizontally, whereas SQL databases scale up vertically. To process massive datasets, SQL typically requires a very high-end server, whereas Hadoop needs only a set of commodity machines. Scaling out is what makes Hadoop cost-effective.
- Key-Value Pairs and Tables – Hadoop is schema-less, primarily designed for unstructured data, and operates on key-value pairs. SQL has a schema, stores data in tables, and is primarily designed for structured data and legacy applications.
- MapReduce and SQL – MapReduce is a functional paradigm in which the programmer specifies how to achieve the result, whereas SQL is a declarative language in which the programmer specifies what to achieve.
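The "what" versus "how" contrast can be made concrete with a small sketch. This is plain Python, not Hadoop code, and the employee data is made up for illustration: the SQL query states the desired result, while the MapReduce-style version spells out each step.

```python
from collections import defaultdict

# Hypothetical dataset, for illustration only: (department, salary) pairs.
employees = [
    ("sales", 1000),
    ("sales", 1500),
    ("engineering", 2000),
]

# Declarative (SQL): you state WHAT you want:
#   SELECT dept, SUM(salary) FROM employees GROUP BY dept;

# Functional (MapReduce-style): you spell out HOW to get it:
totals = defaultdict(int)
for dept, salary in employees:   # map step: emit (dept, salary) pairs
    totals[dept] += salary       # reduce step: aggregate values per key

print(dict(totals))  # {'sales': 2500, 'engineering': 2000}
```

Both produce the same grouped sums; the difference is who decides the execution strategy, the query engine or the programmer.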
To handle massive sets of data, Hadoop has two core components: HDFS for distributed storage and MapReduce for distributed processing.
MapReduce, in turn, relies on two kinds of functions: mappers and reducers. A mapper transforms and filters the data; a reducer aggregates it.
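The mapper/reducer division of labor can be sketched with the classic word-count example. This is a single-process Python sketch, not real Hadoop code; in a real cluster the framework handles the shuffle/sort between the two phases and runs mappers and reducers in parallel across machines.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Transform and filter: emit a (word, 1) pair per word, skipping non-words."""
    for word in line.lower().split():
        if word.isalpha():        # filtering
            yield (word, 1)       # transformation into a key-value pair

def reducer(key, values):
    """Aggregate: sum all the counts emitted for one key."""
    return (key, sum(values))

lines = ["Hadoop is open source", "Hadoop scales out"]

# Map phase: run the mapper over every input line.
pairs = [kv for line in lines for kv in mapper(line)]

# Shuffle/sort phase: group pairs by key (the Hadoop framework does this for you).
pairs.sort(key=itemgetter(0))
counts = dict(
    reducer(key, (v for _, v in group))
    for key, group in groupby(pairs, key=itemgetter(0))
)

print(counts["hadoop"])  # 2
```

Because each mapper sees only its own slice of the input and each reducer sees only one key's values, both phases parallelize naturally across a cluster.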
Hadoop follows a master/slave architecture for both distributed storage and distributed computing. The distributed storage layer is called HDFS (Hadoop Distributed File System).
More details in the next post.
This article is based on the book "Hadoop in Action" by Chuck Lam.