Hadoop TeraSort Benchmark Example




Hadoop TeraSort is one of Hadoop’s widely used benchmarks. Hadoop’s distribution contains both the input generator and sorting implementations: the TeraGen generates the input and TeraSort conducts the sorting. Here, we provide a short tutorial for using the Hadoop TeraSort benchmark.

 

Hadoop Terasort provides terabyte (TB) sort competition to run Hadoop benchmarking with sorting large data files e.g. 1TB or 1PB (1000x 1TB). The performance of HDFS and MapReduce workloads are evaluated with the speed of TeraSort computation, for example, sorting 1 terabyte was done in 3.48 minutes in 2008 by Yahoo! Inc. with 910 x 4 dual-core processors, but sorting 494.6 terabytes was done in the same amount of time in 2013 with 2100 nodes x hexa-core processors. The combination of hardware setup and software configuration does accelerate the performance of Hadoop and TeraSort program is used to measure the performance of a Hadoop system. There are three packages to conduct the benchmark: TeraGen which generates input data given by the size, TeraSort which sorts the input data files, and TeraValidate which validates the results. The elapsed time of TeraSort is the measure of the performance of Hadoop.

 

  • Generate a input data with 1GB (Change the size as you wish) for Hadoop Terasort
hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-*examples*.jar  teragen `expr 1024 * 1024 * 1024` /user/$USER/terasort-input

Options

Add this option between teragen and the size above. This is a number of map tasks.

  • -Dmapred.map.tasks=(vcpu numbers – 1)

Hadoop TeraSort

hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-*examples*.jar  terasort /user/$USER/terasort-input /user/$USER/terasort-output

Options

Add this option between terasort and the size above. This is a number of reduce tasks.

  • -Dmapred.reduce.task=(vcpu numbers divided by 2)



Validate the sorted output data of TeraSort

TeraValidate ensures that the output data of Hadoop TeraSort is globally sorted.

The syntax for TeraValidate:

$ hadoop jar hadoop-*examples*.jar teravalidate 
<output dir> <terasort-validate dir>

Options

One reduce task is fine because teravalidate is a simple program to combine results.

  • -Dmapred.reduce.task=1

Clean Up

Also, do not forget to clean up the terasort data between runs (and after testing is finished).

 

References
  • Page: http://bdaafall2015.readthedocs.io/en/latest/terasort.html
  • http://www.informit.com/articles/article.aspx?p=2453563&seqNum=2