Sort a 2GB File

20 January, 2014 - 12 min read

External Sorting:

Given a huge file with 2GB size with one string per line.Which algorithm to be used to sort the file and why?
Concept:

The catch here is "2GB" file. By saying "2GB", the question stresses that we cannot bring the entire data in memory at the same time. Therefore we will have to sort the data in chunks. This can be done by a class of sorting algorithms called external sorting. External sorting works as follows:
Suppose we have 200 MB of available memory,

Read 200 MB of data in main memory and sort it by some conventional method like quicksort in O(n log n).
Write the sorted data back to a temporary file on disk.
Repeat steps 1 and 2 until all the data is in sorted 200 MB chunks (2GB/200MB = 10 chunks). We will now merge these chunks into a single output file.
Read the first 15 MB of of each chunk (15*10 = 150 MB total) into input buffer in main memory and allocate remaining 50 MB for output buffer.
Perform a 10-way merge on the input buffer and store the result in the output buffer. If the output buffer is full, write it to the final sorted file on the disk and empty it. If any of the 10 input buffers get empty, fill it again with the next 20 MB of the associated 200 MB chunk until no more data from the chunk is available.

One example of external sorting is the external merge sort algorithm, which sorts chunks that each fit in RAM, then merges the sorted chunks together. For example, for sorting 900 megabytes of data using only 100 megabytes of RAM:

Read 100 MB of the data in main memory and sort by some conventional method, like quicksort.
Write the sorted data to disk.
Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks (there are 900MB / 100MB = 9 chunks), which now need to be merged into one single output file.
Read the first 10 MB (= 100MB / (9 chunks + 1)) of each sorted chunk into input buffers in main memory and allocate the remaining 10 MB for an output buffer. (In practice, it might provide better performance to make the output buffer larger and the input buffers slightly smaller.)
Perform a 9-way merge and store the result in the output buffer. Whenever the output buffer fills, write it to the final sorted file and empty it. Whenever any of the 9 input buffers empties, fill it with the next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is available. This is the key step that makes external merge sort work externally -- because the merge algorithm only makes one pass sequentially through each of the chunks, each chunk does not have to be loaded completely; rather, sequential parts of the chunk can be loaded as needed.

Another Example:

Imagine you have the numbers 1 - 9

9  7  2  6  3  4  8  5  1

And let's suppose that only 3 fit in memory at a time.

So you'd break them into chunks of 3 and sort each, storing each result in a separate file:

279
346
158

Now you'd open each of the three files as streams and read the first value from each:

2 3 1

Output the lowest value 1, and get the next value from that stream, now you have:

2 3 5

Output the next lowest value 2, and continue onwards until you've outputted the entire sorted list.

External Sorting--This term is used to refer to sorting methods that are employed when the data to be sorted is too large to fit in primary memory.

Characteristics of External Sorting

During the sort, some of the data must be stored externally. Typically the data will be stored on tape or disk.
The cost of accessing data is significantly greater than either bookkeeping or comparison costs.
There may be severe restrictions on access. For example, if tape is used, items must be accessed sequentially.

Criteria for Developing an External Sorting Algorithm

Minimize number of times an item is accessed.
Access items in sequential order

Sort Merge

Sort merge is the strategy of choice for external sorting because it:

Accesses records sequentially
Minimizes block accesses
Gives a stable sort

Sort Merge Strategy

Divide the file into runs such that the size of a run is small enough to fit into main memory
Sort each run in main memory using a fast in-memory sorting algorithm
Merge the resulting runs together into successively bigger runs, until the file is sorted.

Balanced Multiway Merging

Strategy

Select an equal number of I/O units (e.g., tapes, disks)
Sort Step

*   Select half of the I/O units to be output files
*   Using an in-memory sorting algorithm, create the initial runs and divide them evenly among the output files

Merge Step

*   __(a) __On each merge step, alternate the I/O units so that on one step, a unit is an input file and on the next step it is an output file
*   __(b)__Read one run from each of the input files, merge the runs, and store the resulting run on an output file. Alternate output files so that the runs are evenly distributed on the output files.
*   __(c)__Repeat step __b__ until all the runs on the input files have been merged.
*   __(d)__Repeat steps __a-c__ until the file has been sorted

Performance

If M records can be sorted in-memory and the file consists of N records, then the number of initial runs is N / M.
If there are 2P I/O units available, then the number of subsequent passes is ceiling(log_P(N / M)) since each pass reduces the number of runs by P. Here ceiling(x) means the smallest integer greater than or equal to x.

If you have 1 million integers, how would you sort them efficiently? (modify a specific sorting algorithm to solve this).

You could easily sort 1 million integers in memory on an ordinary computer nowadays, this problem imposes artificial memory restrictions. The most suitable algorithm for this task is Merge Sort. We’ll have to modify it to sort chunks of data from the original file/input into memory and merge them with an already sorted file.

Here’s an implementation in C++:

   1: void read_chunk(FILE* f, int* v, int &n) {

   2:    int k = 0;

   3:    while(!feof(f) && k < n) {

   4:      fscanf(f, "%i", v + k);

   5:      k++;

   6:    }

   7:    n = k;

   8: }

   9: void merge(int * v, int p, int q) {

  10:   int* c = new int[q - p + 1];

  11:   int mid = (p + q) / 2;

  12:   int i = p;

  13:   int j = mid + 1;

  14:   int k = 0;

  15:   while (i <= mid && j <= q) {

  16:     if (v[i] < v[j]) c[k++] = v[i++];

  17:     else c[k++] = v[j++];

  18:   }

  19:   for (int t = i; t <= mid; t++) c[k++] = v[t];

  20:   for (int t = j; t <= q; t++) c[k++] = v[t];

  21:   for (int t = 0; t < k; t++) v[p + t] = c[t];

  22:   delete[] c;

  23: }

  24: void merge_sort(int* v, int p, int q) {

  25:   if (p < q) {

  26:     int mid = (p + q) / 2;

  27:     merge_sort(v, p, mid);

  28:     merge_sort(v, mid + 1, q);

  29:     merge(v, p, q);

  30:   }

  31: }

  32: void file_merge(int* v, int n, char* filename) {

  33:   FILE* dest = fopen("temp_file", "w");

  34:   FILE* source = fopen(filename, "r");

  35:   int i = 0;

  36:   int a;

  37:   bool read = true;

  38:   while(!feof(source) && i < n) {

  39:     if (read)

  40:       fscanf(source, "%i", &a);

  41:     if (a < v[i]) {

  42:       fprintf(dest, "\n%i", a);

  43:       read = true;

  44:     }

  45:     else {

  46:       fprintf(dest, "\n%i", v[i]);

  47:       read = false;

  48:       i++;

  49:     }

  50:   }

  51:   while(!feof(source)) {

  52:     if (read)

  53:      fscanf(source, "%i", &a);

  54:     read = true;

  55:     fprintf(dest, "\n%i", a);

  56:   }

  57:   for (int k = i; k < n; k++) fprintf(dest, "\n%i", v[k]);

  58:   fclose(source);

  59:   fclose(dest);

  60:   remove(filename);

  61:   rename("temp_file", filename);

  62: }

  63: #define MAX_READ 10000

  64: void sort(char* filename) {

  65:   FILE* source = fopen(filename, "r");

  66:   int v[MAX_READ];

  67:   fclose(fopen("dest_temp_file", "w"));

  68:   while (!feof(source)) {

  69:     int n = MAX_READ;

  70:     read_chunk(source, v, n);

  71:     merge_sort(v, 0, n - 1);

  72:     file_merge(v, n, "dest_temp_file");

  73:   }

  74:   fclose(source);

  75:   remove(filename);

  76:   rename("dest_temp_file", filename);

  77: }