2022-01-12 23:52:09

by Eric Biggers

[permalink] [raw]
Subject: Re: [PATCH crypto 1/2] lib/crypto: blake2s-generic: reduce code size on small systems

On Tue, Jan 11, 2022 at 02:49:33PM +0100, Jason A. Donenfeld wrote:
> Re-wind the loops entirely on kernels optimized for code size. This is
> really not good at all performance-wise. But on m68k, it shaves off 4k
> of code size, which is apparently important.
>
> Cc: Geert Uytterhoeven <[email protected]>
> Cc: Herbert Xu <[email protected]>
> Cc: Ard Biesheuvel <[email protected]>
> Signed-off-by: Jason A. Donenfeld <[email protected]>
> ---
> lib/crypto/blake2s-generic.c | 30 ++++++++++++++++++------------
> 1 file changed, 18 insertions(+), 12 deletions(-)
>
> diff --git a/lib/crypto/blake2s-generic.c b/lib/crypto/blake2s-generic.c
> index 75ccb3e633e6..990f000e22ee 100644
> --- a/lib/crypto/blake2s-generic.c
> +++ b/lib/crypto/blake2s-generic.c
> @@ -46,7 +46,7 @@ void blake2s_compress_generic(struct blake2s_state *state, const u8 *block,
> {
> u32 m[16];
> u32 v[16];
> - int i;
> + int i, j;
>
> WARN_ON(IS_ENABLED(DEBUG) &&
> (nblocks > 1 && inc != BLAKE2S_BLOCK_SIZE));
> @@ -86,17 +86,23 @@ void blake2s_compress_generic(struct blake2s_state *state, const u8 *block,
> G(r, 6, v[2], v[ 7], v[ 8], v[13]); \
> G(r, 7, v[3], v[ 4], v[ 9], v[14]); \
> } while (0)
> - ROUND(0);
> - ROUND(1);
> - ROUND(2);
> - ROUND(3);
> - ROUND(4);
> - ROUND(5);
> - ROUND(6);
> - ROUND(7);
> - ROUND(8);
> - ROUND(9);
> -
> + if (IS_ENABLED(CONFIG_CC_OPTIMIZE_FOR_SIZE)) {
> + for (i = 0; i < 10; ++i) {
> + for (j = 0; j < 8; ++j)
> + G(i, j, v[j % 4], v[((j + (j / 4)) % 4) + 4], v[((j + 2 * (j / 4)) % 4) + 8], v[((j + 3 * (j / 4)) % 4) + 12]);
> + }

How about unrolling the inner loop but not the outer one? Wouldn't that give
most of the benefit, without hurting performance as much?

If you stay with this approach and don't unroll either loop, can you use 'r' and
'i' instead of 'i' and 'j', to match the naming in G()?

Also, please wrap lines at 80 columns.

- Eric


2022-01-12 23:53:37

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH crypto 1/2] lib/crypto: blake2s-generic: reduce code size on small systems

On Wed, Jan 12, 2022 at 7:32 PM Eric Biggers <[email protected]> wrote:
> How about unrolling the inner loop but not the outer one? Wouldn't that give
> most of the benefit, without hurting performance as much?
>
> If you stay with this approach and don't unroll either loop, can you use 'r' and
> 'i' instead of 'i' and 'j', to match the naming in G()?

All this might work, sure. But as mentioned earlier, I've abandoned
this entirely, as I don't think this patch is necessary. See the v3
patchset instead:

https://lore.kernel.org/linux-crypto/[email protected]/

2022-01-13 01:12:23

by David Laight

[permalink] [raw]
Subject: RE: [PATCH crypto 1/2] lib/crypto: blake2s-generic: reduce code size on small systems

From: Jason A. Donenfeld
> Sent: 12 January 2022 18:51
>
> On Wed, Jan 12, 2022 at 7:32 PM Eric Biggers <[email protected]> wrote:
> > How about unrolling the inner loop but not the outer one? Wouldn't that give
> > most of the benefit, without hurting performance as much?
> >
> > If you stay with this approach and don't unroll either loop, can you use 'r' and
> > 'i' instead of 'i' and 'j', to match the naming in G()?
>
> All this might work, sure. But as mentioned earlier, I've abandoned
> this entirely, as I don't think this patch is necessary. See the v3
> patchset instead:
>
> https://lore.kernel.org/linux-crypto/[email protected]/

I think you mentioned in another thread that the buffers (eg for IPv6
addresses) are actually often quite short.

For short buffers the 'rolled-up' loop may be of similar performance
to the unrolled one because of the time taken to read all the instructions
into the I-cache and decode them.
If the loop ends up small enough it will fit into the 'decoded loop
buffer' of modern Intel x86 cpu and won't even need decoding on
each iteration.

I really suspect that the heavily unrolled loop is only really fast
for big buffers and/or when it is already in the I-cache.
In real life I wonder how often that actually happens?
Especially for the uses the kernel is making of the code.

You need to benchmark single executions of the function
(doable on x86 with the performance monitor cycle counter)
to get typical/best clocks/byte figures rather than a
big average for repeated operation on a long buffer.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2022-01-13 01:13:56

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH crypto 1/2] lib/crypto: blake2s-generic: reduce code size on small systems

Hi David,

On 1/12/22, David Laight <[email protected]> wrote:
> I think you mentioned in another thread that the buffers (eg for IPv6
> addresses) are actually often quite short.
>
> For short buffers the 'rolled-up' loop may be of similar performance
> to the unrolled one because of the time taken to read all the instructions
> into the I-cache and decode them.
> If the loop ends up small enough it will fit into the 'decoded loop
> buffer' of modern Intel x86 cpu and won't even need decoding on
> each iteration.
>
> I really suspect that the heavily unrolled loop is only really fast
> for big buffers and/or when it is already in the I-cache.
> In real life I wonder how often that actually happens?
> Especially for the uses the kernel is making of the code.
>
> You need to benchmark single executions of the function
> (doable on x86 with the performance monitor cycle counter)
> to get typical/best clocks/byte figures rather than a
> big average for repeated operation on a long buffer.
>
> David

This patch has been dropped entirely from future revisions. The latest
as of writing is at:

https://lore.kernel.org/linux-crypto/[email protected]/

If you'd like to do something with blake2s, by all means submit a
patch and include various rationale and metrics and benchmarks. I do
not intend to do that myself and do not think my particular patch here
should be merged. But if you'd like to do something, feel free to CC
me for a review. However, as mentioned, I don't think much needs to be
done here.

Again, v3 is here:
https://lore.kernel.org/linux-crypto/[email protected]/

Thanks,
Jason