Hadoop Hive Introduction

Hadoop Hive Overview
Hadoop Hive is very similar to Apache Pig. It lets you create tables and load external files into them using SQL. It then creates the MapReduce jobs, in Java, for you. Java is a very wordy language, so using Pig or Hive is simpler. Some have said that Hadoop Hive is a data warehouse tool (Bluntly put, […]

Hadoop TeraSort Benchmark Example

Hadoop TeraSort is one of Hadoop’s widely used benchmarks. The Hadoop distribution contains both the input generator and the sorting implementation: TeraGen generates the input and TeraSort performs the sort. Here, we provide a short tutorial on using the Hadoop TeraSort benchmark. Hadoop TeraSort follows the terabyte (TB) sort competition: it benchmarks Hadoop by sorting large data files, e.g. 1 TB or 1 PB (1000 x 1 TB). […]

Hadoop Cluster Overview

Hadoop cluster
In talking about a Hadoop cluster, we first need to define two terms: cluster and node. A cluster is a collection of nodes. A node is a process running on a virtual or physical machine or in a container. We say process because the machine would be running other programs besides Hadoop. When Hadoop is not running in cluster […]
