2022-01-21 18:57:19

by Miles Chen

[permalink] [raw]
Subject: [PATCH] lib/crypto: blake2s: fix a CFI failure

From: Miles Chen <[email protected]>

With CONFIG_CFI_CLANG=y, we observe a CFI failure of
blake2s_compress_generic.

Reverting commit 6048fdcc5f26 ("lib/crypto: blake2s: include as built-in")
is a solution for this problem. So I looked into the patch
and I think it is caused by the weak symbols use by blake2s_compress().

To fix it, remove the weak symbol and use CRYPTO_ARCH_HAVE_LIB_BLAKE2S
to select blake2s_compress_arch/blake2s_compress_generic.

log:
[ 0.000000][ T0] Kernel panic - not syncing: CFI failure (target: blake2s_compress_generic+0x0/0x1444)
[ 0.000000][ T0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.16.0-mainline-06981-g076c855b846e #1
[ 0.000000][ T0] Hardware name: MT6873 (DT)
[ 0.000000][ T0] Call trace:
[ 0.000000][ T0] dump_backtrace+0xfc/0x1dc
[ 0.000000][ T0] dump_stack_lvl+0xa8/0x11c
[ 0.000000][ T0] panic+0x194/0x464
[ 0.000000][ T0] __cfi_check_fail+0x54/0x58
[ 0.000000][ T0] __cfi_slowpath_diag+0x354/0x4b0
[ 0.000000][ T0] blake2s_update+0x14c/0x178
[ 0.000000][ T0] _extract_entropy+0xf4/0x29c
[ 0.000000][ T0] crng_initialize_primary+0x24/0x94
[ 0.000000][ T0] rand_initialize+0x2c/0x6c
[ 0.000000][ T0] start_kernel+0x2f8/0x65c
[ 0.000000][ T0] __primary_switched+0xc4/0x7be4
[ 0.000000][ T0] Rebooting in 5 seconds..

Fixes: 6048fdcc5f26 ("lib/crypto: blake2s: include as built-in")
Signed-off-by: Miles Chen <[email protected]>
---
lib/crypto/blake2s-generic.c | 3 +--
lib/crypto/blake2s.c | 6 ++++++
2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/lib/crypto/blake2s-generic.c b/lib/crypto/blake2s-generic.c
index 75ccb3e633e6..22fa3ea1689e 100644
--- a/lib/crypto/blake2s-generic.c
+++ b/lib/crypto/blake2s-generic.c
@@ -38,8 +38,7 @@ static inline void blake2s_increment_counter(struct blake2s_state *state,
}

void blake2s_compress(struct blake2s_state *state, const u8 *block,
- size_t nblocks, const u32 inc)
- __weak __alias(blake2s_compress_generic);
+ size_t nblocks, const u32 inc);

void blake2s_compress_generic(struct blake2s_state *state, const u8 *block,
size_t nblocks, const u32 inc)
diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c
index 93f2ae051370..4055aa593ec4 100644
--- a/lib/crypto/blake2s.c
+++ b/lib/crypto/blake2s.c
@@ -16,6 +16,12 @@
#include <linux/init.h>
#include <linux/bug.h>

+#if IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S)
+# define blake2s_compress blake2s_compress_arch
+#else
+# define blake2s_compress blake2s_compress_generic
+#endif
+
void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen)
{
__blake2s_update(state, in, inlen, blake2s_compress);
--
2.18.0


2022-01-21 19:02:45

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH] lib/crypto: blake2s: fix a CFI failure

(+ Sami, Eric)

On Wed, 19 Jan 2022 at 10:00, Jason A. Donenfeld <[email protected]> wrote:
>
> Hi Miles,
>
> Thanks for the patch. Could you let me know which architecture and
> compiler this was broken on? If I had to guess, I'd wager arm32, and
> you hit this by enabling optimized blake2s?
>
> If so, I'm not sure the problem is with weak symbols. Why should CFI
> break weak symbols? Rather, perhaps the issue is that the function is
> defined in blake2s-core.S? Are there some CFI macros we need for that
> definition?
>

We should try to understand why CFI thinks the prototypes of the two
symbols are different. There are still a number of issues with CFI, so
papering over them by reverting stuff that we want for good reasons is
not the way to go imo.

In the short term, you can work around it by avoiding the indirect
call to blake2s_compress, e.g.,

diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c
index 93f2ae051370..fef2ff678431 100644
--- a/lib/crypto/blake2s.c
+++ b/lib/crypto/blake2s.c
@@ -16,9 +16,15 @@
#include <linux/init.h>
#include <linux/bug.h>

+static void __blake2s_compress(struct blake2s_state *state, const u8 *block,
+ size_t nblocks, const u32 inc)
+{
+ return blake2s_compress(state, block, nblocks, inc);
+}
+
void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen)
{
- __blake2s_update(state, in, inlen, blake2s_compress);
+ __blake2s_update(state, in, inlen, __blake2s_compress);
}
EXPORT_SYMBOL(blake2s_update);

2022-01-21 19:03:12

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH] lib/crypto: blake2s: fix a CFI failure

On Wed, 19 Jan 2022 at 10:09, Ard Biesheuvel <[email protected]> wrote:
>
> (+ Sami, Eric)
>
> On Wed, 19 Jan 2022 at 10:00, Jason A. Donenfeld <[email protected]> wrote:
> >
> > Hi Miles,
> >
> > Thanks for the patch. Could you let me know which architecture and
> > compiler this was broken on? If I had to guess, I'd wager arm32, and
> > you hit this by enabling optimized blake2s?
> >
> > If so, I'm not sure the problem is with weak symbols. Why should CFI
> > break weak symbols? Rather, perhaps the issue is that the function is
> > defined in blake2s-core.S? Are there some CFI macros we need for that
> > definition?
> >
>
> We should try to understand why CFI thinks the prototypes of the two
> symbols are different. There are still a number of issues with CFI, so
> papering over them by reverting stuff that we want for good reasons is
> not the way to go imo.
>
> In the short term, you can work around it by avoiding the indirect
> call to blake2s_compress, e.g.,
>
> diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c
> index 93f2ae051370..fef2ff678431 100644
> --- a/lib/crypto/blake2s.c
> +++ b/lib/crypto/blake2s.c
> @@ -16,9 +16,15 @@
> #include <linux/init.h>
> #include <linux/bug.h>
>
> +static void __blake2s_compress(struct blake2s_state *state, const u8 *block,
> + size_t nblocks, const u32 inc)
> +{
> + return blake2s_compress(state, block, nblocks, inc);
> +}
> +
> void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen)
> {
> - __blake2s_update(state, in, inlen, blake2s_compress);
> + __blake2s_update(state, in, inlen, __blake2s_compress);
> }
> EXPORT_SYMBOL(blake2s_update);

Ehm, maybe not. As Jason points out, the typedef does not have quite
the right type, so that is most likely the culprit, and this
workaround would trigger CFI in exactly the same way.

Interestingly, the compiler does not seem to mind, right? Or are you
seeing any build time warnings on the reference to blake2s_compress in
the call to __blake2s_update() ?

2022-01-21 19:06:40

by Miles Chen

[permalink] [raw]
Subject: Re: [PATCH] lib/crypto: blake2s: fix a CFI failure

Hi,

>Hi Miles,
>
>I'm actually not able to reproduce your oops. I'm using vanilla clang
>13, cross compiling for arm64, with thin LTO enabled and CFI enabled.
>Kernel seems to run fine.
>
>
>Are there other settings that are needed to trigger this? Do you see
>it in upstream clang or just the Android fork of clang?
>
I will try another clang (the previous version I use).
I am using Android fork of clang and there is a clang upgrade in this merge.

Miles

2022-01-21 19:06:50

by Miles Chen

[permalink] [raw]
Subject: Re: [PATCH] lib/crypto: blake2s: fix a CFI failure

Hi,

> We should try to understand why CFI thinks the prototypes of the two
> symbols are different. There are still a number of issues with CFI, so
> papering over them by reverting stuff that we want for good reasons is
> not the way to go imo.
>
> In the short term, you can work around it by avoiding the indirect
> call to blake2s_compress, e.g.,

Thanks for the patch. I tried it and the issue remains.
As Jason said, he cannot reproduce this issue. I will try another version
of clang next.

Miles

2022-01-21 19:07:41

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH] lib/crypto: blake2s: fix a CFI failure

On Wed, 19 Jan 2022 at 11:06, Miles Chen <[email protected]> wrote:
>
> Hi,
>
> >Hi Miles,
> >
> >I'm actually not able to reproduce your oops. I'm using vanilla clang
> >13, cross compiling for arm64, with thin LTO enabled and CFI enabled.
> >Kernel seems to run fine.
> >
> >
> >Are there other settings that are needed to trigger this? Do you see
> >it in upstream clang or just the Android fork of clang?
> >
> I will try another clang (the previous version I use).
> I am using Android fork of clang and there is a clang upgrade in this merge.
>

One thing that could be worth a try is to make __blake2s_update() and
__blake2s_final() __always_inline rather than just inline, which by
itself does not appear to be sufficient for the code to get inlined.
(If it were, the indirect call should have disappeared as well)

Given that indirect calls suck on x86, we should probably apply that
change in any case, regardless of CFI.

2022-01-21 19:12:48

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH] lib/crypto: blake2s: fix a CFI failure

The below kludge of a patch fixes the issue. Still unclear whether we
should go with something like this or get clang fixed or what.

diff --git a/arch/arm/crypto/blake2s-shash.c b/arch/arm/crypto/blake2s-shash.c
index 17c1c3bfe2f5..be8cde5f1719 100644
--- a/arch/arm/crypto/blake2s-shash.c
+++ b/arch/arm/crypto/blake2s-shash.c
@@ -13,12 +13,12 @@
static int crypto_blake2s_update_arm(struct shash_desc *desc,
const u8 *in, unsigned int inlen)
{
- return crypto_blake2s_update(desc, in, inlen, blake2s_compress);
+ return crypto_blake2s_update(desc, in, inlen);
}

static int crypto_blake2s_final_arm(struct shash_desc *desc, u8 *out)
{
- return crypto_blake2s_final(desc, out, blake2s_compress);
+ return crypto_blake2s_final(desc, out);
}

#define BLAKE2S_ALG(name, driver_name, digest_size) \
diff --git a/arch/x86/crypto/blake2s-shash.c b/arch/x86/crypto/blake2s-shash.c
index f9e2fecdb761..c81ffedb4865 100644
--- a/arch/x86/crypto/blake2s-shash.c
+++ b/arch/x86/crypto/blake2s-shash.c
@@ -18,12 +18,12 @@
static int crypto_blake2s_update_x86(struct shash_desc *desc,
const u8 *in, unsigned int inlen)
{
- return crypto_blake2s_update(desc, in, inlen, blake2s_compress);
+ return crypto_blake2s_update(desc, in, inlen);
}

static int crypto_blake2s_final_x86(struct shash_desc *desc, u8 *out)
{
- return crypto_blake2s_final(desc, out, blake2s_compress);
+ return crypto_blake2s_final(desc, out);
}

#define BLAKE2S_ALG(name, driver_name, digest_size) \
diff --git a/crypto/blake2s_generic.c b/crypto/blake2s_generic.c
index 72fe480f9bd6..050874588a84 100644
--- a/crypto/blake2s_generic.c
+++ b/crypto/blake2s_generic.c
@@ -5,6 +5,7 @@
* Copyright (C) 2015-2019 Jason A. Donenfeld <[email protected]>. All
Rights Reserved.
*/

+#define FORCE_BLAKE2S_GENERIC
#include <crypto/internal/blake2s.h>
#include <crypto/internal/hash.h>

@@ -15,12 +16,12 @@
static int crypto_blake2s_update_generic(struct shash_desc *desc,
const u8 *in, unsigned int inlen)
{
- return crypto_blake2s_update(desc, in, inlen, blake2s_compress_generic);
+ return crypto_blake2s_update(desc, in, inlen);
}

static int crypto_blake2s_final_generic(struct shash_desc *desc, u8 *out)
{
- return crypto_blake2s_final(desc, out, blake2s_compress_generic);
+ return crypto_blake2s_final(desc, out);
}

#define BLAKE2S_ALG(name, driver_name, digest_size) \
diff --git a/include/crypto/internal/blake2s.h
b/include/crypto/internal/blake2s.h
index d39cfa0d333e..fec7eead93fc 100644
--- a/include/crypto/internal/blake2s.h
+++ b/include/crypto/internal/blake2s.h
@@ -24,14 +24,14 @@ static inline void blake2s_set_lastblock(struct
blake2s_state *state)
state->f[0] = -1;
}

-typedef void (*blake2s_compress_t)(struct blake2s_state *state,
- const u8 *block, size_t nblocks, u32 inc);
-
/* Helper functions for BLAKE2s shared by the library and shash APIs */

+#ifdef FORCE_BLAKE2S_GENERIC
+#define blake2s_compress blake2s_compress_generic
+#endif
+
static inline void __blake2s_update(struct blake2s_state *state,
- const u8 *in, size_t inlen,
- blake2s_compress_t compress)
+ const u8 *in, size_t inlen)
{
const size_t fill = BLAKE2S_BLOCK_SIZE - state->buflen;

@@ -39,7 +39,7 @@ static inline void __blake2s_update(struct
blake2s_state *state,
return;
if (inlen > fill) {
memcpy(state->buf + state->buflen, in, fill);
- (*compress)(state, state->buf, 1, BLAKE2S_BLOCK_SIZE);
+ blake2s_compress(state, state->buf, 1, BLAKE2S_BLOCK_SIZE);
state->buflen = 0;
in += fill;
inlen -= fill;
@@ -47,7 +47,7 @@ static inline void __blake2s_update(struct
blake2s_state *state,
if (inlen > BLAKE2S_BLOCK_SIZE) {
const size_t nblocks = DIV_ROUND_UP(inlen, BLAKE2S_BLOCK_SIZE);
/* Hash one less (full) block than strictly possible */
- (*compress)(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
+ blake2s_compress(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
in += BLAKE2S_BLOCK_SIZE * (nblocks - 1);
inlen -= BLAKE2S_BLOCK_SIZE * (nblocks - 1);
}
@@ -55,13 +55,12 @@ static inline void __blake2s_update(struct
blake2s_state *state,
state->buflen += inlen;
}

-static inline void __blake2s_final(struct blake2s_state *state, u8 *out,
- blake2s_compress_t compress)
+static inline void __blake2s_final(struct blake2s_state *state, u8 *out)
{
blake2s_set_lastblock(state);
memset(state->buf + state->buflen, 0,
BLAKE2S_BLOCK_SIZE - state->buflen); /* Padding */
- (*compress)(state, state->buf, 1, state->buflen);
+ blake2s_compress(state, state->buf, 1, state->buflen);
cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
memcpy(out, state->h, state->outlen);
}
@@ -98,21 +97,19 @@ static inline int crypto_blake2s_init(struct
shash_desc *desc)
}

static inline int crypto_blake2s_update(struct shash_desc *desc,
- const u8 *in, unsigned int inlen,
- blake2s_compress_t compress)
+ const u8 *in, unsigned int inlen)
{
struct blake2s_state *state = shash_desc_ctx(desc);

- __blake2s_update(state, in, inlen, compress);
+ __blake2s_update(state, in, inlen);
return 0;
}

-static inline int crypto_blake2s_final(struct shash_desc *desc, u8 *out,
- blake2s_compress_t compress)
+static inline int crypto_blake2s_final(struct shash_desc *desc, u8 *out)
{
struct blake2s_state *state = shash_desc_ctx(desc);

- __blake2s_final(state, out, compress);
+ __blake2s_final(state, out);
return 0;
}

diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c
index 9364f79937b8..a13f01ff53a7 100644
--- a/lib/crypto/blake2s.c
+++ b/lib/crypto/blake2s.c
@@ -18,14 +18,14 @@

void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen)
{
- __blake2s_update(state, in, inlen, blake2s_compress);
+ __blake2s_update(state, in, inlen);
}
EXPORT_SYMBOL(blake2s_update);

void blake2s_final(struct blake2s_state *state, u8 *out)
{
WARN_ON(IS_ENABLED(DEBUG) && !out);
- __blake2s_final(state, out, blake2s_compress);
+ __blake2s_final(state, out);
memzero_explicit(state, sizeof(*state));
}
EXPORT_SYMBOL(blake2s_final);

2022-01-21 19:15:40

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH] lib/crypto: blake2s: fix a CFI failure

On Wed, Jan 19, 2022 at 1:19 PM Ard Biesheuvel <[email protected]> wrote:
>
> I'd prefer it if we could avoid magic #define's like this.

I'll send something that just replaces it with a simple bool.

2022-01-21 19:19:01

by David Laight

[permalink] [raw]
Subject: RE: [PATCH] lib/crypto: blake2s: fix a CFI failure

From: Ard Biesheuvel
> Sent: 19 January 2022 12:19
...
> - (*compress)(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
> + if (IS_ENABLED(CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S))
> + (*compress)(state, in, nblocks - 1, BLAKE2S_BLOCK_SIZE);
> + else
> + blake2s_compress_generic(state, in, nblocks - 1,
> + BLAKE2S_BLOCK_SIZE);

Isn't that a candidate for a 'static call' ?

And, maybe all these inlined functions should be real functions?
No point having all the bloat on every call site.
Much better to call a real function and used the cached instructions.

Although, having looked at the source and the generated code for
x86-64 and arm64 I'm not sure I'd want to try to generate
optimised assembler for it.
(Unless you can a instruction that does exactly what the code wants.)

Basically the compiler can merge the instructions for 4 of the
G() expansions so that they can execute in parallel on a multi-issue
cpu. Doing that by hand will be error prone.
Each G() expansion is pretty much a register dependency chain,
not much chance of parallel execution.

There are clearly optimisations for the top/bottom of the loop.
But they can be done to the generic C version.

The real problem is lack of registers - the code needs 16 for the
v[] array plus a few extras.
So some have to spill to stack.

The unrolled code is about 1200 instructions on arm64 and x86-64.
Each of the 10 rounds reads all 16 of the u32 input values.
So that is about 8 (1200/160) instructions for each read.
Which means there is plenty of memory bandwidth for other
reads.

So 'rolling up' the rounds - which adds in the blake2s_sigma[]
reads could easily be 'almost free'.
Certainly on x86 where you are just (well should be just) adding an
extra memory uop for each input buffer reads.

I'm not sure the 8 G() calls can be folded into two sets of 4
while still getting the compiler to interleave the generated code.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)