External Sorting

2019-04-17 18:11:34

External sorting refers to a sorting algorithm that can handle extremely large amounts of data. Generally speaking, the data processed by external sorting cannot be loaded into the internal memory at one time, and can only be placed on the external memory (usually a hard disk) with slow reading and writing. Outer sorting usually adopts a “sort-merge” strategy. In the sorting stage, the amount of data that can be placed in the memory is first read, sorted and output to a temporary file, and then the data to be sorted is organized into multiple ordered temporary files. Afterwards, these temporary files are combined into a large ordered file in the merging stage, that is, the sorted result.

An example of external sorting is external merge sort (External merge sort), which reads some data that can be placed in memory, and outputs it as a sequence after sorting in memory (that is, internal data Temporary files in an orderly manner), and then merge after processing all the data.

For example, if you want to sort 900 MB of data, but there is only 100 MB of available memory on the machine, the outer merge sort operation is as follows:

  • Read 100 MB of data into the memory, and use some conventional methods (such as quick sort, heap sort, merge sort, etc.) to complete the sort in the memory.
  • Write the sorted data to disk.
  • Repeat steps 1 and 2 until all the data is stored in different 100 MB blocks (temporary files). In this example, there are 900 MB of data, and the size of a single temporary file is 100 MB, so 9 temporary files will be generated.
  • Read in the first 10 MB (= 100 MB / (9 blocks + 1)) of each temporary file (sequence) data into the input buffer in the memory, and the last 10 MB as output Buffer. (In practice, the input buffer should be appropriately reduced, and the output buffer can be increased appropriately to get better results.)
  • Execute the nine-way merge algorithm and output the result to the output buffer. Once the output buffer is full, write the data in the buffer to the target file and clear the buffer. Once one of the 9 input buffers becomes empty, read the next 10M data from the file associated with this buffer, unless the file has been read.

This is the key step for “external merge sort” to complete the sort outside of main memory – because the “merge algorithm” only does one round of access to each chunk in sequence (Merge), each block does not need to be completely loaded into main memory.

Leave a Comment

Your email address will not be published.