I am very new to Hadoop and still in the learning phase.
I noticed that during the Shuffle and Sort phase, a spill occurs once the MapOutputBuffer reaches 80% capacity (I believe this threshold is also configurable).
Why is the spill phase needed at all?
Is it because MapOutputBuffer is a circular buffer, and if we never emptied it, data would be overwritten and memory would leak?
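For context, both the buffer size and the spill threshold are ordinary job configuration properties. A minimal driver-side sketch, assuming the Hadoop 2.x+ property names mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent, might look like this (the class name and the values are illustrative only):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Size of the in-memory map output buffer in MB (default: 100).
        conf.setInt("mapreduce.task.io.sort.mb", 256);

        // Fraction of the buffer that triggers a spill (default: 0.80).
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);

        Job job = Job.getInstance(conf, "spill-config-demo");
        // ... set mapper, reducer, input and output paths as usual ...
    }
}
```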
I have written an article that covers this topic in detail: http://0x0fff.com/hadoop-mapreduce-comprehensive-description/
Generally speaking:
> A spill happens when there is not enough memory to hold all of the mapper output. The amount of memory available for this is set by mapreduce.task.io.sort.mb.
> The spill is triggered when 80% of the buffer is occupied because the spill runs in a separate thread, so it does not interfere with the mapper. If the buffer reached 100% utilization, the mapper thread would have to stop and wait for the spill thread to free up space; the 80% threshold is chosen to avoid this (a toy sketch of this behaviour follows below).
> A spill happens at least once, when the mapper finishes, because the mapper output has to be sorted and saved to disk so that the reducer processes can read it. There is no point in inventing a separate function for that final "save to disk", since in general it performs the same task.
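To make the second point concrete, here is a toy, self-contained sketch in plain Java (not Hadoop's actual MapOutputBuffer implementation): a producer thread stands in for the mapper and only blocks when the buffer is 100% full, while a background thread spills once the 80% threshold is crossed and performs one final spill when the producer finishes. The class and constant names are made up for illustration.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class SpillThresholdDemo {
    private static final int CAPACITY = 100;        // pretend buffer slots
    private static final int SPILL_THRESHOLD = 80;  // spill at 80%, like the default

    private int used = 0;
    private boolean done = false;
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition stateChanged = lock.newCondition();

    // Stands in for the mapper: keeps writing records, blocks only at 100%.
    void produce(int records) throws InterruptedException {
        for (int i = 0; i < records; i++) {
            lock.lock();
            try {
                while (used == CAPACITY) {
                    System.out.println("mapper blocked: buffer 100% full");
                    stateChanged.await();
                }
                used++;
                stateChanged.signalAll();
            } finally {
                lock.unlock();
            }
        }
        lock.lock();
        try {
            done = true;
            stateChanged.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Stands in for the spill thread: drains the buffer at the threshold,
    // plus one final drain when the producer is finished.
    void spill() throws InterruptedException {
        while (true) {
            lock.lock();
            try {
                while (used < SPILL_THRESHOLD && !done) {
                    stateChanged.await();
                }
                if (used > 0) {
                    System.out.println("spilling " + used + " records to disk");
                    used = 0;                      // pretend: sort + write to disk
                    stateChanged.signalAll();
                }
                if (done && used == 0) {
                    return;                        // final spill already happened
                }
            } finally {
                lock.unlock();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        SpillThresholdDemo demo = new SpillThresholdDemo();
        Thread spiller = new Thread(() -> {
            try {
                demo.spill();
            } catch (InterruptedException ignored) {
            }
        });
        spiller.start();
        demo.produce(250);   // more records than the buffer can hold at once
        spiller.join();
    }
}
```

Running it prints a few "spilling ... records to disk" lines and, only if the producer ever outruns the spiller, a "mapper blocked" line, which is exactly the stall that the 80% headroom is meant to avoid.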