algorithm - Uniform sampling of k integers from [0:n) -


my goal sample k integers 0, ... n-1 without duplication. order of sampled integers doesn't matter. @ every each call (which occurs often), n , k vary not (n 250,000 , k 2,000). i've come following amortized o(k) algorithm:

  1. prepare array items 0, 1, 2, ... , n-1. takes o(n) since n relatively stable, cost can made amortized constant.
  2. sample random number r [0:i] = n - 1. here cost in fact related n, n not big, dependency not critical.
  3. swap rth item , ith item in array a.
  4. decrease 1.
  5. repeat k times steps 2~4; have random permutation of length k @ tail of a. copy this.
  6. we should roll initial state (0, ... , n-1) keep cost of step 1 constant. can done push r stack of length k @ each pass of step 2. preparation of stack requires amortized constant cost.

i think uniform sampling of permutation/combination should exhaustively studied problem, either (1) there better solution, or @ least (2) solution (minor modification of) well-known solution. thus,

  • in case (1), want know better solution.
  • in case (2), want find reference.

please me. thanks.

  1. if k less n -- say, less half of n -- efficient solution keep numbers generated in hash table (actually, hash set, since there no value associated key). if random number happens in hash table, reject , generate 1 in place. actual values of k , n suggested (k ∼ 2000; n ∼ 250,000) expected number of rejections generate k unique samples less 10, hardly noticeable. size of hash table o(k), , can deleted @ end of sample generation.

  2. it possible simulate fyk shuffle algorithm using hash table instead of vector of n values, thereby avoiding having reject generated random numbers. if using vector a, start initializing a[i] i, every 0 ≤ < k. hash table h, start empty hash table, , use convention h[i] considered i if key i not in hash table. step 3 in algorithm -- "swap a[r] a[i]" -- becomes "add h[r] next element of sample , set h[r] h[i]". note unnecessary set h[i] because element never referred again: subsequent random numbers r generate range not include i.

    because hash table in case contains both keys , values, larger hash set used in alternative 1, above, , increased size (and consequent increase in memory cache misses) cause more overhead saved eliminating rejections. however, has advantage of working if k close n.

  3. finally, in proposed algorithm, quite easy restore a in o(k) time. value a[j] have been modified algorithm if:

    a. n − k ≤ j < n, or

    b. there i such n − k ≤ < n , a[i] = j.

    consequently, can restore vector a looking @ each a[i] n − k ≤ < n: first, if a[i] < n−k, set a[a[i]] a[i]; then, unconditionally set a[i] i.


Comments

Popular posts from this blog

What is happening when Matlab is starting a "parallel pool"? -

angular - DownloadURL return null in below code -

php - Cannot override Laravel Spark authentication with own implementation -