What is this algorithm called and what is the time complexity? -
let's call amount of ram available r.
we have unsorted file of 10 gigs 1 column of keys (duplicates allowed).
you split file k files, each of have size r. sort each file , write file disk.
you read (10 / r) gigs each file input buffers. perform k-way merge read first key first file , compare every other key in input buffers find minimum. add output buffer should hold (10 / r) gigs of data.
once output buffer full, write disk final sorted file.
repeat process until k files have been read. if input buffer empty, fill next (10 / r) gigs of corresponding file until file has been entirely read. can buffer refilling in parallel.
what official name algorithm? k - way merge sort?
the first part, split k files o((n / k) log (n / k)) second part, merge o(nk)?
if wrong, can have explanation? if external merge sort, how optimize further?
this textbook external merge sort time complexity o(n log n)
here's wikipedia's entry on (linked above):
one example of external sorting external merge sort algorithm, sorts chunks each fit in ram, merges sorted chunks together. example, sorting 900 megabytes of data using 100 megabytes of ram:
1) read 100 mb of data in main memory , sort conventional method, quicksort.
2) write sorted data disk.
3) repeat steps 1 , 2 until of data in sorted 100 mb chunks (there 900mb / 100mb = 9 chunks), need merged 1 single output file.
4) read first 10 mb (= 100mb / (9 chunks + 1)) of each sorted chunk input buffers in main memory , allocate remaining 10 mb output buffer. (in practice, might provide better performance make output buffer larger , input buffers smaller.)
5) perform 9-way merge , store result in output buffer. whenever output buffer fills, write final sorted file , empty it. whenever of 9 input buffers empties, fill next 10 mb of associated 100 mb sorted chunk until no more data chunk available. key step makes external merge sort work externally -- because merge algorithm makes 1 pass sequentially through each of chunks, each chunk not have loaded completely; rather, sequential parts of chunk can loaded needed.
historically, instead of sort, replacement-selection algorithm used perform initial distribution, produce on average half many output chunks of double length.
Comments
Post a Comment