How to shuffle cards🃏 like a pro?
Shuffle
- Different from the conventional ‘shuffle’ function, shuffle is an important stage in Map-Reduce. It works between Map and Reduce stage. Basically, after the master split the job, there are 3 main steps on Map worker: partition, spill & sort and merge [1]. While on Reduce worker, there are: fetch, sort(merge), reduce.
- In Map, it partition incoming data based on reducer (in memory); then when data size approaches threshold, it will spill them (into disk), and manipulate them; then merge them into a sorted data structure file.
- In Reduce, it fetches data from Map worker thru HTTP GET; sorts partly-sorted data from different Map into a larger sorted one; then reduces them.
- So the Shuffle stage happens between merge step of Map and fetch step of Reduce. To improve the performance, we can have a combine stage after merge, i.e. in word count application, an output of [“aaa”, {1, 1, 1}] can be combined into [“aaa”, {3}] [2], so to save storage space and decrease network transmission load.
- Also, on Reduce worker, we can set whether to store input data (from Map) into memory or disk, and the threshold of it. When storing data into disk, there can be 2 types: hash-based and sort-based, where the former may produce more tiny partitions and the latter may consume more time [3].
References
- Hadoop MapReduce Shuffle Procedure by Hokein
- MapReduce: detailed explanation of Shuffle procedure
- Comparion of Shuffle between Spark and Hadoop