algorithm - Uniform sampling of k integers from [0:n) -
my goal sample k integers 0, ... n-1 without duplication. order of sampled integers doesn't matter. @ every each call (which occurs often), n , k vary not (n 250,000 , k 2,000). i've come following amortized o(k) algorithm:
- prepare array items 0, 1, 2, ... , n-1. takes o(n) since n relatively stable, cost can made amortized constant.
- sample random number r [0:i] = n - 1. here cost in fact related n, n not big, dependency not critical.
- swap rth item , ith item in array a.
- decrease 1.
- repeat k times steps 2~4; have random permutation of length k @ tail of a. copy this.
- we should roll initial state (0, ... , n-1) keep cost of step 1 constant. can done push r stack of length k @ each pass of step 2. preparation of stack requires amortized constant cost.
i think uniform sampling of permutation/combination should exhaustively studied problem, either (1) there better solution, or @ least (2) solution (minor modification of) well-known solution. thus,
- in case (1), want know better solution.
- in case (2), want find reference.
please me. thanks.
if
klessn-- say, less half ofn-- efficient solution keep numbers generated in hash table (actually, hash set, since there no value associated key). if random number happens in hash table, reject , generate 1 in place. actual values ofk,nsuggested (k ∼ 2000; n ∼ 250,000) expected number of rejections generatekunique samples less 10, hardly noticeable. size of hash table o(k), , can deleted @ end of sample generation.it possible simulate fyk shuffle algorithm using hash table instead of vector of
nvalues, thereby avoiding having reject generated random numbers. if using vectora, start initializinga[i]i, every0 ≤ < k. hash tableh, start empty hash table, , use conventionh[i]considerediif keyinot in hash table. step 3 in algorithm -- "swapa[r]a[i]" -- becomes "addh[r]next element of sample , seth[r]h[i]". note unnecessary seth[i]because element never referred again: subsequent random numbersrgenerate range not includei.because hash table in case contains both keys , values, larger hash set used in alternative 1, above, , increased size (and consequent increase in memory cache misses) cause more overhead saved eliminating rejections. however, has advantage of working if
kclosen.finally, in proposed algorithm, quite easy restore
ain o(k) time. valuea[j]have been modified algorithm if:a.
n − k ≤ j < n, orb. there
isuchn − k ≤ < n,a[i] = j.consequently, can restore vector
alooking @ eacha[i]n − k ≤ < n: first, ifa[i] < n−k, seta[a[i]]a[i]; then, unconditionally seta[i]i.
Comments
Post a Comment