I generally use a slightly simpler algorithm in various different projects:
//[0, bound)
static unsigned long random_bounded(unsigned long bound)
{
unsigned long ret;
const unsigned long max_mod_bound = (1 + ~bound) % bound;
if (bound < 2)
return 0;
do
ret = random_integer();
while (ret < max_mod_bound);
return ret % bound;
}
//[min, max_plus_one)
static unsigned long random_range(unsigned long min, unsigned long max_plus_one)
{
return random_bounded(max_plus_one - min) + min;
}
Is the motivation behind using Lemire that you avoid the division (via
the modulo) in favor of a multiplication?
On Sun, 24 Mar 2019 at 21:47:50 +0100, Jason A. Donenfeld wrote:
> I generally use a slightly simpler algorithm in various different projects:
>
> //[0, bound)
> static unsigned long random_bounded(unsigned long bound)
> {
> unsigned long ret;
> const unsigned long max_mod_bound = (1 + ~bound) % bound;
>
> if (bound < 2)
> return 0;
> do
> ret = random_integer();
> while (ret < max_mod_bound);
> return ret % bound;
> }
>
> Is the motivation behind using Lemire that you avoid the division (via
> the modulo) in favor of a multiplication?
Yes. If we define eps = max_mod_bound * ldexp(1.0, -BITS_PER_LONG) as
the probability of one retry, and retries = eps / (1 - eps) as the
expected number of retries, then both algorithms take 1+retries
random_integer()s.
The above agorithm takes 2 divisions, always. Divides are slow, and
usually not pipelined, so two in short succession gets a latency penalty.
Lemire's mutiplicative algorithm takes 1 multiplication on the fast
path (probability 1 - 2*eps on average), 1 additional division on the slow
path (probability 2*eps), and 1 multiplication per retry.
In the common case when bound is much less than ULONG_MAX, eps is
tiny and the fast path is taken almost all the time, and it's
a huge win.
Even in the absolute worst case of bound = ULONG_MAX/2 + 2 when
eps ~ 0.5 (2 multiplies, 0.5 divide; there's no 2*eps penalty in
this case), it's faster as long as 2 mutiplies cost less than 1.5
divides.
I you want simpler code, we could omit the fast path and stil get
a speedup. But a predictable branch for a divide seemed like
a worthwhile trade.
(FYI, this all came about as a side project of a kernel-janitor project
to replace "prandom_u32() % range" by "prandom_u32() * range >> 32".
I'm also annoyed that get_random_u32() and get_random_u64() have
separate buffers, even if EFFICIENT_UNALIGNED_ACCESS, but that's
a separate complaint.)
P.S. The cited paper calls your algorithm the "OpenBSD algorithm"
and has a bunch of benchmarks comparing it to others in Fisher-Yates
shuffles of sizes 1e3..1e9.
Including all overhead (base PRNG, shuffle), it's 3x slower for
32-bit operations and 8x slower for 64-bit up to arrays of size
1e6, after which cache misses slow all algorithms, reducing the
ratio.
If you want a faster division-based agorithm, the "Java algorithm"
does 1+retries divides:
unsigned long java(unsigned long s)
{
unsigned long x, r;
do {
x = random_integer();
r = x % s;
} while (x - r > -s);
return r;
}