How to Run Hadoop MapReduce Jobs Without a Cluster




This document is intended to help Java developers get started running Hadoop and experimenting with MapReduce jobs without setting up any cluster on their end. To understand this document you need basic theoretical knowledge of Hadoop, HDFS and MapReduce jobs. It is also advisable to have some prior knowledge of […]
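
As a taste of what the full post walks through, here is a minimal sketch of a driver that runs a job entirely in-process using Hadoop's local job runner and the local file system, so no cluster and no HDFS are needed. The class name, settings-in-code style and command-line paths are illustrative assumptions, not the post's exact code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A minimal driver that runs a MapReduce job in-process, using the local
// job runner and the local file system instead of a cluster.
public class LocalModeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Local mode: no JobTracker/ResourceManager and no HDFS required.
        conf.set("mapreduce.framework.name", "local");
        conf.set("fs.defaultFS", "file:///");

        Job job = Job.getInstance(conf, "local-mode demo");
        job.setJarByClass(LocalModeDriver.class);
        // No mapper/reducer set: the identity Mapper and Reducer are used,
        // so the job simply copies its input records to the output.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. a local text file
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Running it with a local input file and a not-yet-existing output directory produces the usual part-r-00000 output, just as a cluster run would.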

Read more

Word Count – Hadoop MapReduce Example




Word count is the typical example with which Hadoop MapReduce developers get hands-on experience. This sample MapReduce job counts the number of occurrences of each word in the provided input files. What are the minimum requirements? Input text files – any text file; a Cloudera test VM; and the mapper, reducer and driver classes to process the […]
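
For reference, here is a compact sketch along the lines of the standard Hadoop WordCount example, showing the three pieces the post describes (class names and argument handling are the usual placeholders):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in each input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: wires the mapper, combiner and reducer together and submits the job.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}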

Read more

Hadoop Distributed File System (HDFS) and MapReduce




The Hadoop Distributed File System (HDFS) is a fault-tolerant and self-healing distributed file system designed to turn a cluster of industry-standard servers into a massively scalable pool of storage. Developed specifically for large-scale data-processing workloads where scalability, flexibility and throughput are critical, HDFS accepts data in any format regardless of schema, optimizes for high-bandwidth streaming, […]
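
As a small illustration of the client-side programming interface, the sketch below writes a file into HDFS and streams it back using the FileSystem API. The path is hypothetical, and it assumes core-site.xml on the classpath points fs.defaultFS at your cluster (or the Cloudera test VM):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Writes a small file to HDFS and streams it back using the FileSystem API.
public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumes fs.defaultFS (e.g. hdfs://localhost:8020) is supplied by
        // core-site.xml on the classpath.
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/hdfs-demo.txt"); // hypothetical path

        // Write: HDFS stores the bytes as-is, regardless of format or schema.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the data is streamed back from the DataNodes holding the blocks.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}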

Read more

Apache YARN – Hadoop NextGen MapReduce




MapReduce has undergone a complete overhaul in hadoop-0.23, and we now have what we call MapReduce 2.0 (MRv2), or YARN. The fundamental idea of MRv2 is to split the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application […]
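
From a job developer's point of view the switch is mostly configuration: the same Job API is used, but the submission goes to the ResourceManager, which launches a per-job ApplicationMaster to schedule and monitor the tasks. The sketch below is illustrative only; the class name and addresses are assumptions, and these properties normally live in mapred-site.xml and yarn-site.xml rather than in code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: submitting an ordinary MapReduce job through YARN (MRv2).
// The Job API is unchanged from MRv1; the job is handed to the global
// ResourceManager, and a per-job ApplicationMaster schedules and monitors
// the tasks inside containers granted by the RM.
public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Route the submission through the ResourceManager rather than the
        // old JobTracker. Normally set in mapred-site.xml.
        conf.set("mapreduce.framework.name", "yarn");
        // Hypothetical RM address for a single-node setup; usually read from
        // yarn-site.xml instead of being hard-coded.
        conf.set("yarn.resourcemanager.address", "localhost:8032");

        Job job = Job.getInstance(conf, "mrv2 demo");
        job.setJarByClass(YarnSubmitSketch.class);
        // Identity mapper/reducer by default: the job just copies its input.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}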

Read more

Hadoop Distributed File System (HDFS) for Big Data




The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining […]
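
To make that distribution visible, here is a small sketch, assuming a reachable HDFS and an existing file path passed as an argument, that asks the NameNode which DataNodes hold each block of the file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Lists which DataNodes hold each block of a file, illustrating how HDFS
// spreads a large file's storage across many servers in the cluster.
public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path(args[0]); // e.g. a large file already stored in HDFS

        FileStatus status = fs.getFileStatus(path);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}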

Read more

Hadoop FileInputFormat – Partitioning in MapReduce




Partitioning in MapReduce: as you may know, when a job (the MapReduce term for a program) is run, the input goes to the mapper, and the output of the mapper goes to the reducer. Ever wondered how many mappers and how many reducers are required for a job execution? What parameters are taken into consideration for deciding the number of […]
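
To make the two knobs concrete, here is a sketch of a custom Partitioner together with the driver settings that fix the number of reducers; the FirstCharPartitioner class and its routing rule are made up for illustration. The number of mappers is not set directly: it follows from the number of input splits that FileInputFormat computes from the input size, the HDFS block size and the configured split-size limits.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;

// A custom Partitioner decides which reduce task each map-output (key, value)
// pair is sent to. Here, keys starting with a digit all go to reducer 0 and
// the remaining keys are spread over the other reducers by hash.
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String k = key.toString();
        if (numReduceTasks <= 1 || k.isEmpty()) {
            return 0; // a single reducer (or empty key): nothing to partition
        }
        if (Character.isDigit(k.charAt(0))) {
            return 0;
        }
        return 1 + (k.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1);
    }

    // Driver-side wiring (sketch): the number of reducers is set explicitly,
    // while the number of mappers falls out of the input splits computed by
    // the job's FileInputFormat.
    public static void configure(Job job) {
        job.setPartitionerClass(FirstCharPartitioner.class);
        job.setNumReduceTasks(4); // four reduce tasks -> four output partitions
    }
}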

Read more