2020-03-28 16:43:49

by George Spelvin

Subject: [RFC PATCH v1 09/50] <linux/random.h> prandom_u32_max() for power-of-2 ranges

Add special-case optimization for compile-time constant power-of-2
ranges (as a simple shift) so it's *always* at least as efficient as
"prandom_u32() % range".

This saves all callers from having to worry about the issue. You
may use "% range" or "& (range - 1)" if you know the range is a
power of 2, but you don't have to.

Rename the parameter from the un-mnemonic "ep_ro" to "range".

Also expand comments, and add to prandom_u32() documentation
that "prandom_u32() % range" should ''not'' be used.

Signed-off-by: George Spelvin <[email protected]>
Cc: Stephen Hemminger <[email protected]>
Cc: Hannes Frederic Sowa <[email protected]>
---
include/linux/random.h | 56 ++++++++++++++++++++++++++++++++++--------
lib/random32.c | 9 ++++---
2 files changed, 52 insertions(+), 13 deletions(-)

diff --git a/include/linux/random.h b/include/linux/random.h
index f189c927fdea5..109772175c833 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -125,20 +125,56 @@ void prandom_seed_full_state(struct rnd_state __percpu *pcpu_state);
DO_ONCE(prandom_seed_full_state, (pcpu_state))

/**
- * prandom_u32_max - returns a pseudo-random number in interval [0, ep_ro)
- * @ep_ro: right open interval endpoint
+ * prandom_u32_max - returns a pseudo-random number in interval [0, @range)
+ * @range: Upper bound on the return value; the first value never returned.
*
- * Returns a pseudo-random number that is in interval [0, ep_ro). Note
- * that the result depends on PRNG being well distributed in [0, ~0U]
- * u32 space. Here we use maximally equidistributed combined Tausworthe
- * generator, that is, prandom_u32(). This is useful when requesting a
- * random index of an array containing ep_ro elements, for example.
+ * Returns a pseudo-random number 0 <= @x < @range.
+ * This is useful when requesting a random index into an array containing
+ * @range elements, for example. This is similar to, but more efficient
+ * than, "prandom_u32() % @range".
*
- * Returns: pseudo-random number in interval [0, ep_ro)
+ * It's also a tiny bit more uniform than the modulo algorithm. If @range
+ * does not divide 1<<32, then some results will occur ceil((1<<32)/@range)
+ * times, while others will occur floor((1<<32)/@range) times. The modulo
+ * algorithm makes the lowest (1<<32) % @range possible results occur
+ * more often, and the higher values all occur slightly less often.
+ *
+ * A multiplicative algorithm spreads out the over- and underrepresented
+ * outputs evenly across the range. For example, if @range == 7, then
+ * 2, 4 and 6 occur slightly less often than 0, 1, 3 or 5.
+ *
+ * (This all assumes the PRNG is well distributed in [0, ~0U] u32 space,
+ * a property that is provided by prandom_u32().)
+ *
+ * It is possible to extend this to avoid all bias and return perfectly
+ * uniform pseudorandom numbers by discarding and regenerating sometimes,
+ * but if your needs are that stringent, you might want to use a stronger
+ * generator (like get_random_u32()).
+ *
+ * Ref:
+ * Fast Random Integer Generation in an Interval
+ * Daniel Lemire
+ * https://arxiv.org/abs/1805.10941
+ *
+ * Return: pseudo-random number in interval [0, @range)
*/
-static inline u32 prandom_u32_max(u32 ep_ro)
+static inline u32 prandom_u32_max(u32 range)
{
- return (u32)(((u64) prandom_u32() * ep_ro) >> 32);
+ /*
+ * If the range is a compile-time constant power of 2, then use
+ * a simple shift. This is mathematically equivalent to the
+ * multiplication, but GCC 8.3 doesn't optimize that perfectly.
+ *
+ * We could do an AND with a mask, but
+ * 1) The shift is the same speed on a decent CPU,
+ * 2) It's generally smaller code (smaller immediate), and
+ * 3) Many PRNGs have trouble with their low-order bits;
+ * using the msbits is generally preferred.
+ */
+ if (__builtin_constant_p(range) && (range & (range - 1)) == 0)
+ return prandom_u32() / (u32)(0x100000000 / range);
+ else
+ return reciprocal_scale(prandom_u32(), range);
}

/*
diff --git a/lib/random32.c b/lib/random32.c
index 763b920a6206c..fc369f4e550c2 100644
--- a/lib/random32.c
+++ b/lib/random32.c
@@ -75,15 +75,18 @@ EXPORT_SYMBOL(prandom_u32_state);
* A 32 bit pseudo-random number is generated using a fast
* algorithm suitable for simulation. This algorithm is NOT
* considered safe for cryptographic use.
+ *
+ * To generate a random number in a specified interval, use
+ * prandom_u32_max(), which uses reciprocal_scale(). Please do
+ * NOT use a modulo operation (prandom_u32() % range), which is
+ * more expensive.
*/
u32 prandom_u32(void)
{
struct rnd_state *state = &get_cpu_var(net_rand_state);
- u32 res;
+ u32 res = prandom_u32_state(state);

- res = prandom_u32_state(state);
put_cpu_var(net_rand_state);
-
return res;
}
EXPORT_SYMBOL(prandom_u32);
--
2.26.0


2020-03-28 17:33:38

by Stephen Hemminger

Subject: Re: [RFC PATCH v1 09/50] <linux/random.h> prandom_u32_max() for power-of-2 ranges

On Sat, 16 Mar 2019 02:32:04 -0400
George Spelvin <[email protected]> wrote:

> +static inline u32 prandom_u32_max(u32 range)
> {
> - return (u32)(((u64) prandom_u32() * ep_ro) >> 32);
> + /*
> + * If the range is a compile-time constant power of 2, then use
> + * a simple shift. This is mathematically equivalent to the
> + * multiplication, but GCC 8.3 doesn't optimize that perfectly.
> + *
> + * We could do an AND with a mask, but
> + * 1) The shift is the same speed on a decent CPU,
> + * 2) It's generally smaller code (smaller immediate), and
> + * 3) Many PRNGs have trouble with their low-order bits;
> + * using the msbits is generally preferred.
> + */
> + if (__builtin_constant_p(range) && (range & (range - 1)) == 0)
> + return prandom_u32() / (u32)(0x100000000 / range);
> + else
> + return reciprocal_scale(prandom_u32(), range);


The optimization is good, but I don't think that the compiler
is able to propagate the constant property into the function.
Did you actually check the generated code?

2020-03-28 18:11:35

by George Spelvin

Subject: Re: [RFC PATCH v1 09/50] <linux/random.h> prandom_u32_max() for power-of-2 ranges

On Sat, Mar 28, 2020 at 10:32:29AM -0700, Stephen Hemminger wrote:
> On Sat, 16 Mar 2019 02:32:04 -0400
> George Spelvin <[email protected]> wrote:
>
>> +static inline u32 prandom_u32_max(u32 range)
>> {
>> - return (u32)(((u64) prandom_u32() * ep_ro) >> 32);
>> + /*
>> + * If the range is a compile-time constant power of 2, then use
>> + * a simple shift. This is mathematically equivalent to the
>> + * multiplication, but GCC 8.3 doesn't optimize that perfectly.
>> + *
>> + * We could do an AND with a mask, but
>> + * 1) The shift is the same speed on a decent CPU,
>> + * 2) It's generally smaller code (smaller immediate), and
>> + * 3) Many PRNGs have trouble with their low-order bits;
>> + * using the msbits is generally preferred.
>> + */
>> + if (__builtin_constant_p(range) && (range & (range - 1)) == 0)
>> + return prandom_u32() / (u32)(0x100000000 / range);
>> + else
>> + return reciprocal_scale(prandom_u32(), range);

> The optimization is good, but I don't think that the compiler
> is able to propagate the constant property into the function.
> Did you actually check the generated code?

Yes, I checked repeatedly during development. I just rechecked the
exact code (it's been a while), and verified that

unsigned foo(void)
{
	return prandom_u32_max(256);
}

compiles to
foo:
.LFB1:
	.cfi_startproc
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	call	prandom_u32@PLT
	shrl	$24, %eax
	addq	$8, %rsp
	.cfi_def_cfa_offset 8
	ret
	.cfi_endproc
.LFE1:
	.size	foo, .-foo

But you prompted me to check a few other architectures, and
it's true for them too. E.g. m68k:

foo:
	jsr prandom_u32
	moveq #24,%d1
	lsr.l %d1,%d0
	rts

(68k is one architecture where the mask is faster than the shift,
so I could handle it separately, but it makes the code even uglier.
Basically, use masks for small ranges, and shifts for large ranges,
and an arch-dependent threshold that depends on the available
immediate constant range.)

ARM, PowerPC, and MIPS all have some hideously large function preamble
code, but the core is a right shift. E.g.

foo:
.LFB1:
	.cfi_startproc
	stwu 1,-16(1)
	.cfi_def_cfa_offset 16
	mflr 0
	.cfi_register 65, 0
	bcl 20,31,.L2
.L2:
	stw 30,8(1)
	.cfi_offset 30, -8
	mflr 30
	addis 30,30,.LCTOC1-.L2@ha
	stw 0,20(1)
	addi 30,30,.LCTOC1-.L2@l
	.cfi_offset 65, 4
	bl prandom_u32+32768@plt
	lwz 0,20(1)
	lwz 30,8(1)
	addi 1,1,16
	.cfi_restore 30
	.cfi_def_cfa_offset 0
	srwi 3,3,24
	mtlr 0
	.cfi_restore 65
	blr