Simple Hadoop Overview

As per the Hadoop website:

Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

I’d like to provide as clear and concise a description of Hadoop as I can: an introduction to an extremely powerful tool. Note that throughout, a node is a computer or server in the cluster.

Need

Hadoop was designed to address several core problems facing the computational and data needs of the 21st century.

  1. The system must distribute data across the cluster evenly and safely.
  2. The system must tolerate partial failure of a node. That is, if a node goes down, operations within the cluster continue with no change in the final outcome.
  3. If there is a failure, the system should be able to recover the data through the existence of backups (later referred to as replicated blocks).
  4. When a node is brought back online, it should be able to rejoin the system immediately.
  5. The system must maintain linear scalability, meaning that adding resources increases performance linearly, just as removing resources decreases it linearly.

When talking about Hadoop, I’m also referring to a number of components and extensions that exist to operate the architecture.

Components

There are two core areas, each with a few components, that you should be familiar with before taking in the big picture.

HDFS

HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster. Data is split into blocks and distributed across multiple nodes in the cluster. Each block is typically 64 MB or 128 MB in size (64 MB by default). Files are split into those blocks, and the final block only occupies as much space as the remaining data requires. So, a 100 MB file with a 64 MB block size is stored in a 64 MB block and a 36 MB block.
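
To make that arithmetic concrete, here is a quick sketch in plain Java (not a Hadoop API call) of how a file of a given size maps onto blocks, assuming the 64 MB default from above:

    // Sketch: how many HDFS blocks a file occupies, and how big the last one is.
    // Plain arithmetic for illustration only; HDFS handles this internally.
    public class BlockMath {
        public static void main(String[] args) {
            long blockSize = 64L * 1024 * 1024;   // 64 MB default block size
            long fileSize  = 100L * 1024 * 1024;  // a 100 MB file

            long fullBlocks  = fileSize / blockSize;             // 1 full 64 MB block
            long remainder   = fileSize % blockSize;             // 36 MB left over
            long totalBlocks = fullBlocks + (remainder > 0 ? 1 : 0);

            System.out.println("Blocks: " + totalBlocks);                        // 2
            System.out.println("Last block MB: " + remainder / (1024 * 1024));   // 36
        }
    }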

Each block is replicated multiple times; by default, the replication factor is three. Replicas are stored on different nodes, which ensures both reliability and availability. The factor is adjustable, but keep in mind the storage cost of too many replicas. Because of this software-implemented redundancy, RAIDing drives is not necessary.
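
Both the replication factor and the block size are configurable; clusters normally set them in hdfs-site.xml, but here is a minimal sketch using Hadoop’s Java Configuration class, assuming the Hadoop 1.x-era property names:

    import org.apache.hadoop.conf.Configuration;

    // Sketch: overriding replication and block size in code rather than hdfs-site.xml.
    // Property names assume a Hadoop 1.x-era install.
    public class ReplicationSettings {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.setInt("dfs.replication", 3);                  // three replicas (the default)
            conf.setLong("dfs.block.size", 64L * 1024 * 1024);  // 64 MB blocks
            System.out.println("Replication: " + conf.get("dfs.replication"));
        }
    }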

Files in HDFS are written once; there is no ability to modify a file once it is in HDFS. Also, HDFS is intended for reading large files by streaming through the whole file, not for random-access reads.
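
A minimal sketch of that access pattern using Hadoop’s FileSystem API (the path below is hypothetical): open a file, stream through it from start to finish, and never seek around to rewrite pieces of it.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: the intended HDFS access pattern -- write once, then stream reads.
    public class StreamRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/county/COUNTY_DATA.txt");  // hypothetical path
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(file)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // process each line in order; no random-access updates
                }
            }
        }
    }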

NameNode

The NameNode is the master of the data in the cluster. It maintains the references to all of the replicated blocks of data in the DataNodes, and manages redistribution and restoration when nodes are removed or added. There must be one in the cluster to operate HDFS, and it must be running to access the cluster.

It stores all metadata about where blocks exist in RAM, and it also maintains an on-disk file for restoring that metadata.

In configuring the NameNode, you can specify rack distribution of nodes to balance replication distribution. This way, if the switch on a single rack goes down, data is still accessible through other replicated DataNodes.

DataNode

DataNodes are machines which contain the blocks of data. The data is stored in what resembles a filesystem.

Note: The total number of DataNodes is the size of the cluster.

MapReduce

MapReduce is the system of executing processes on the data in the cluster.

There are two key phases, with minor phases in between: Map and Reduce. I will provide a more in-depth look at MapReduce in another article.

In summary, the Map phase is when a function is executed across all individual records of a data set. These records are most likely files, but can be other records of data. Just remember that the Mapper function executes individually against each record. In a multi-node cluster, multiple Mappers will execute for a single job, because multiple DataNodes contain the data and can run the work locally. The output of each Mapper is a key and a value (represented as <k, v>) for each record processed. That output is organized and passed to Reducers as a unique key and the set of values found for that key. The Reducer then finalizes the process and writes its output to HDFS.
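
The classic illustration of those two phases is word counting: each Mapper emits <word, 1> for every word in its records, and each Reducer sums the values for one word. A minimal sketch using Hadoop’s Java MapReduce API (class names are my own):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: runs against each input record (here, a line of text) and emits <word, 1>.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // the <k, v> output for this record
            }
        }
    }

    // Reducer: receives one unique key plus all of its values, and writes the final <k, v> to HDFS.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }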

JobTracker

The JobTracker receives the request from the client to execute a MapReduce job (along with the payload required to execute it, for example a jar containing the job’s classes). One exists within the cluster, and it does not have to run on the NameNode. More to follow after the next component.
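
A sketch of what that client request looks like in code: a driver class that packages the job (the jar, the Mapper and Reducer classes from the sketch above, and hypothetical input/output paths) and submits it to the cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Driver: packages the job and submits it; in a Hadoop 1.x cluster the
    // JobTracker receives this request. The paths below are hypothetical.
    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");

            job.setJarByClass(WordCountDriver.class);   // the jar with the job's classes
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path("/data/input"));
            FileOutputFormat.setOutputPath(job, new Path("/data/output"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }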

TaskTracker

The TaskTracker resides on every DataNode. That is the appropriate place because TaskTrackers execute the Mappers and Reducers for a MapReduce job, and they require access to the data.

During operation, the TaskTracker has a particular number of map slots available for a job to use. When one is empty, the TaskTracker heartbeats to the JobTracker, asking it for work. The JobTracker does not order TaskTrackers to work! This is important to Hadoop’s acceptance of partial system failure. If a TaskTracker doesn’t ask for work because it’s down, the JobTracker is spared asking a dead node to do work and hanging or crashing.
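
To make the pull model concrete, here is a conceptual sketch (not Hadoop’s internal code; every class and method name here is hypothetical) of a worker with a free slot heartbeating in to ask a coordinator for work, rather than the coordinator pushing work out:

    import java.util.ArrayDeque;
    import java.util.Queue;

    // Conceptual sketch of the pull model: a worker asks for work when it has a
    // free slot; the coordinator never pushes tasks to workers it hasn't heard from.
    // All names are hypothetical, not Hadoop internals.
    public class PullModelSketch {

        static class Coordinator {
            private final Queue<String> pendingTasks = new ArrayDeque<>();

            void submit(String task) {
                pendingTasks.add(task);
            }

            // Called only when a worker heartbeats in asking for work.
            String heartbeat(String workerId, int freeSlots) {
                if (freeSlots > 0 && !pendingTasks.isEmpty()) {
                    return pendingTasks.poll();
                }
                return null;   // nothing to hand out, or the worker has no free slots
            }
        }

        public static void main(String[] args) {
            Coordinator coordinator = new Coordinator();
            coordinator.submit("map-task-1");
            coordinator.submit("map-task-2");

            // A worker with two free map slots asks for work; a dead worker simply
            // never heartbeats, so it is never assigned anything.
            System.out.println("tracker-A was handed: " + coordinator.heartbeat("tracker-A", 2));
        }
    }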

The JobTracker checks with the NameNode on the locality of data for the requesting TaskTracker, to see if any of the Map tasks are a better fit than others, but it will issue another task if need be. Locality is not required for execution! This means the TaskTracker executing a map task requiring data COUNTY_DATA may not have COUNTY_DATA in its blocks; it will simply retrieve it over the network.

The other important thing to know about Hadoop’s job execution is that it employs speculative execution. This means that if the TaskTracker executing a task takes too long to finish, the JobTracker will issue the same task to another node, accept the results of whichever finishes first, and ignore anything else it receives.
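
Speculative execution can be toggled per job. A minimal sketch, assuming the Hadoop 1.x-era property names:

    import org.apache.hadoop.conf.Configuration;

    // Sketch: enabling speculative execution for map and reduce tasks.
    // Property names assume a Hadoop 1.x-era cluster.
    public class SpeculativeSettings {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.setBoolean("mapred.map.tasks.speculative.execution", true);
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
            System.out.println("Map speculation: "
                    + conf.get("mapred.map.tasks.speculative.execution"));
        }
    }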

Summary

The plug-in ability of a cluster running Hadoop is phenomenal, and it fully meets the data distribution and availability requirements listed above.

The MapReduce architecture allows for amazingly fast operations on huge, huge data sets.

There are many tools and interfaces we can use with Hadoop, and so I will try to provide updates as I go.

Finally, if you are new to Hadoop and MapReduce development, have no fear. After a few practice programs, you’ll be off and running in no time!
