2022-01-11 13:49:55

by Jason A. Donenfeld

[permalink] [raw]
Subject: [PATCH crypto 0/2] smaller blake2s code size on m68k and other small platforms

Hi,

Geert emailed me this afternoon concerned about blake2s codesize on m68k
and other small systems. We identified two extremely effective ways of
chopping down the size. One of them moves some wireguard-specific things
into wireguard proper. The other one adds a slower codepath for
CONFIG_CC_OPTIMIZE_FOR_SIZE configurations. I really don't like that
slower codepath, but since it is configuration gated, at least it stays
out of the way except for people who know they need a tiny kernel image

Thanks,
Jason

Jason A. Donenfeld (2):
lib/crypto: blake2s-generic: reduce code size on small systems
lib/crypto: blake2s: move hmac construction into wireguard

drivers/net/wireguard/noise.c | 45 ++++++++++++++++++++++++++++++-----
include/crypto/blake2s.h | 3 ---
lib/crypto/blake2s-generic.c | 30 +++++++++++++----------
lib/crypto/blake2s-selftest.c | 31 ------------------------
lib/crypto/blake2s.c | 37 ----------------------------
5 files changed, 57 insertions(+), 89 deletions(-)

--
2.34.1



2022-01-11 13:49:59

by Jason A. Donenfeld

[permalink] [raw]
Subject: [PATCH crypto 1/2] lib/crypto: blake2s-generic: reduce code size on small systems

Re-wind the loops entirely on kernels optimized for code size. This is
really not good at all performance-wise. But on m68k, it shaves off 4k
of code size, which is apparently important.

Cc: Geert Uytterhoeven <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Signed-off-by: Jason A. Donenfeld <[email protected]>
---
lib/crypto/blake2s-generic.c | 30 ++++++++++++++++++------------
1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/lib/crypto/blake2s-generic.c b/lib/crypto/blake2s-generic.c
index 75ccb3e633e6..990f000e22ee 100644
--- a/lib/crypto/blake2s-generic.c
+++ b/lib/crypto/blake2s-generic.c
@@ -46,7 +46,7 @@ void blake2s_compress_generic(struct blake2s_state *state, const u8 *block,
{
u32 m[16];
u32 v[16];
- int i;
+ int i, j;

WARN_ON(IS_ENABLED(DEBUG) &&
(nblocks > 1 && inc != BLAKE2S_BLOCK_SIZE));
@@ -86,17 +86,23 @@ void blake2s_compress_generic(struct blake2s_state *state, const u8 *block,
G(r, 6, v[2], v[ 7], v[ 8], v[13]); \
G(r, 7, v[3], v[ 4], v[ 9], v[14]); \
} while (0)
- ROUND(0);
- ROUND(1);
- ROUND(2);
- ROUND(3);
- ROUND(4);
- ROUND(5);
- ROUND(6);
- ROUND(7);
- ROUND(8);
- ROUND(9);
-
+ if (IS_ENABLED(CONFIG_CC_OPTIMIZE_FOR_SIZE)) {
+ for (i = 0; i < 10; ++i) {
+ for (j = 0; j < 8; ++j)
+ G(i, j, v[j % 4], v[((j + (j / 4)) % 4) + 4], v[((j + 2 * (j / 4)) % 4) + 8], v[((j + 3 * (j / 4)) % 4) + 12]);
+ }
+ } else {
+ ROUND(0);
+ ROUND(1);
+ ROUND(2);
+ ROUND(3);
+ ROUND(4);
+ ROUND(5);
+ ROUND(6);
+ ROUND(7);
+ ROUND(8);
+ ROUND(9);
+ }
#undef G
#undef ROUND

--
2.34.1


2022-01-11 13:50:02

by Jason A. Donenfeld

[permalink] [raw]
Subject: [PATCH crypto 2/2] lib/crypto: blake2s: move hmac construction into wireguard

Basically nobody should use blake2s in an HMAC construction; it already
has a keyed variant. But for unfortunately historical reasons, Noise,
used by WireGuard, uses HKDF quite strictly, which means we have to use
this. Because this really shouldn't be used by others, this commit moves
it into wireguard's noise.c locally, so that kernels that aren't using
WireGuard don't get this superfluous code baked in. On m68k systems,
this shaves off ~314 bytes.

Cc: Geert Uytterhoeven <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Jason A. Donenfeld <[email protected]>
---
drivers/net/wireguard/noise.c | 45 ++++++++++++++++++++++++++++++-----
include/crypto/blake2s.h | 3 ---
lib/crypto/blake2s-selftest.c | 31 ------------------------
lib/crypto/blake2s.c | 37 ----------------------------
4 files changed, 39 insertions(+), 77 deletions(-)

diff --git a/drivers/net/wireguard/noise.c b/drivers/net/wireguard/noise.c
index c0cfd9b36c0b..720952b92e78 100644
--- a/drivers/net/wireguard/noise.c
+++ b/drivers/net/wireguard/noise.c
@@ -302,6 +302,41 @@ void wg_noise_set_static_identity_private_key(
static_identity->static_public, private_key);
}

+static void hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen, const size_t keylen)
+{
+ struct blake2s_state state;
+ u8 x_key[BLAKE2S_BLOCK_SIZE] __aligned(__alignof__(u32)) = { 0 };
+ u8 i_hash[BLAKE2S_HASH_SIZE] __aligned(__alignof__(u32));
+ int i;
+
+ if (keylen > BLAKE2S_BLOCK_SIZE) {
+ blake2s_init(&state, BLAKE2S_HASH_SIZE);
+ blake2s_update(&state, key, keylen);
+ blake2s_final(&state, x_key);
+ } else
+ memcpy(x_key, key, keylen);
+
+ for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
+ x_key[i] ^= 0x36;
+
+ blake2s_init(&state, BLAKE2S_HASH_SIZE);
+ blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
+ blake2s_update(&state, in, inlen);
+ blake2s_final(&state, i_hash);
+
+ for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
+ x_key[i] ^= 0x5c ^ 0x36;
+
+ blake2s_init(&state, BLAKE2S_HASH_SIZE);
+ blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
+ blake2s_update(&state, i_hash, BLAKE2S_HASH_SIZE);
+ blake2s_final(&state, i_hash);
+
+ memcpy(out, i_hash, BLAKE2S_HASH_SIZE);
+ memzero_explicit(x_key, BLAKE2S_BLOCK_SIZE);
+ memzero_explicit(i_hash, BLAKE2S_HASH_SIZE);
+}
+
/* This is Hugo Krawczyk's HKDF:
* - https://eprint.iacr.org/2010/264.pdf
* - https://tools.ietf.org/html/rfc5869
@@ -322,14 +357,14 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,
((third_len || third_dst) && (!second_len || !second_dst))));

/* Extract entropy from data into secret */
- blake2s256_hmac(secret, data, chaining_key, data_len, NOISE_HASH_LEN);
+ hmac(secret, data, chaining_key, data_len, NOISE_HASH_LEN);

if (!first_dst || !first_len)
goto out;

/* Expand first key: key = secret, data = 0x1 */
output[0] = 1;
- blake2s256_hmac(output, output, secret, 1, BLAKE2S_HASH_SIZE);
+ hmac(output, output, secret, 1, BLAKE2S_HASH_SIZE);
memcpy(first_dst, output, first_len);

if (!second_dst || !second_len)
@@ -337,8 +372,7 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,

/* Expand second key: key = secret, data = first-key || 0x2 */
output[BLAKE2S_HASH_SIZE] = 2;
- blake2s256_hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1,
- BLAKE2S_HASH_SIZE);
+ hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1, BLAKE2S_HASH_SIZE);
memcpy(second_dst, output, second_len);

if (!third_dst || !third_len)
@@ -346,8 +380,7 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,

/* Expand third key: key = secret, data = second-key || 0x3 */
output[BLAKE2S_HASH_SIZE] = 3;
- blake2s256_hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1,
- BLAKE2S_HASH_SIZE);
+ hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1, BLAKE2S_HASH_SIZE);
memcpy(third_dst, output, third_len);

out:
diff --git a/include/crypto/blake2s.h b/include/crypto/blake2s.h
index bc3fb59442ce..4e30e1799e61 100644
--- a/include/crypto/blake2s.h
+++ b/include/crypto/blake2s.h
@@ -101,7 +101,4 @@ static inline void blake2s(u8 *out, const u8 *in, const u8 *key,
blake2s_final(&state, out);
}

-void blake2s256_hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen,
- const size_t keylen);
-
#endif /* _CRYPTO_BLAKE2S_H */
diff --git a/lib/crypto/blake2s-selftest.c b/lib/crypto/blake2s-selftest.c
index 5d9ea53be973..409e4b728770 100644
--- a/lib/crypto/blake2s-selftest.c
+++ b/lib/crypto/blake2s-selftest.c
@@ -15,7 +15,6 @@
* #include <stdio.h>
*
* #include <openssl/evp.h>
- * #include <openssl/hmac.h>
*
* #define BLAKE2S_TESTVEC_COUNT 256
*
@@ -58,16 +57,6 @@
* }
* printf("};\n\n");
*
- * printf("static const u8 blake2s_hmac_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {\n");
- *
- * HMAC(EVP_blake2s256(), key, sizeof(key), buf, sizeof(buf), hash, NULL);
- * print_vec(hash, BLAKE2S_OUTBYTES);
- *
- * HMAC(EVP_blake2s256(), buf, sizeof(buf), key, sizeof(key), hash, NULL);
- * print_vec(hash, BLAKE2S_OUTBYTES);
- *
- * printf("};\n");
- *
* return 0;
*}
*/
@@ -554,15 +543,6 @@ static const u8 blake2s_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {
0xd6, 0x98, 0x6b, 0x07, 0x10, 0x65, 0x52, 0x65, },
};

-static const u8 blake2s_hmac_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {
- { 0xce, 0xe1, 0x57, 0x69, 0x82, 0xdc, 0xbf, 0x43, 0xad, 0x56, 0x4c, 0x70,
- 0xed, 0x68, 0x16, 0x96, 0xcf, 0xa4, 0x73, 0xe8, 0xe8, 0xfc, 0x32, 0x79,
- 0x08, 0x0a, 0x75, 0x82, 0xda, 0x3f, 0x05, 0x11, },
- { 0x77, 0x2f, 0x0c, 0x71, 0x41, 0xf4, 0x4b, 0x2b, 0xb3, 0xc6, 0xb6, 0xf9,
- 0x60, 0xde, 0xe4, 0x52, 0x38, 0x66, 0xe8, 0xbf, 0x9b, 0x96, 0xc4, 0x9f,
- 0x60, 0xd9, 0x24, 0x37, 0x99, 0xd6, 0xec, 0x31, },
-};
-
bool __init blake2s_selftest(void)
{
u8 key[BLAKE2S_KEY_SIZE];
@@ -607,16 +587,5 @@ bool __init blake2s_selftest(void)
}
}

- if (success) {
- blake2s256_hmac(hash, buf, key, sizeof(buf), sizeof(key));
- success &= !memcmp(hash, blake2s_hmac_testvecs[0], BLAKE2S_HASH_SIZE);
-
- blake2s256_hmac(hash, key, buf, sizeof(key), sizeof(buf));
- success &= !memcmp(hash, blake2s_hmac_testvecs[1], BLAKE2S_HASH_SIZE);
-
- if (!success)
- pr_err("blake2s256_hmac self-test: FAIL\n");
- }
-
return success;
}
diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c
index 93f2ae051370..9364f79937b8 100644
--- a/lib/crypto/blake2s.c
+++ b/lib/crypto/blake2s.c
@@ -30,43 +30,6 @@ void blake2s_final(struct blake2s_state *state, u8 *out)
}
EXPORT_SYMBOL(blake2s_final);

-void blake2s256_hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen,
- const size_t keylen)
-{
- struct blake2s_state state;
- u8 x_key[BLAKE2S_BLOCK_SIZE] __aligned(__alignof__(u32)) = { 0 };
- u8 i_hash[BLAKE2S_HASH_SIZE] __aligned(__alignof__(u32));
- int i;
-
- if (keylen > BLAKE2S_BLOCK_SIZE) {
- blake2s_init(&state, BLAKE2S_HASH_SIZE);
- blake2s_update(&state, key, keylen);
- blake2s_final(&state, x_key);
- } else
- memcpy(x_key, key, keylen);
-
- for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
- x_key[i] ^= 0x36;
-
- blake2s_init(&state, BLAKE2S_HASH_SIZE);
- blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
- blake2s_update(&state, in, inlen);
- blake2s_final(&state, i_hash);
-
- for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
- x_key[i] ^= 0x5c ^ 0x36;
-
- blake2s_init(&state, BLAKE2S_HASH_SIZE);
- blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
- blake2s_update(&state, i_hash, BLAKE2S_HASH_SIZE);
- blake2s_final(&state, i_hash);
-
- memcpy(out, i_hash, BLAKE2S_HASH_SIZE);
- memzero_explicit(x_key, BLAKE2S_BLOCK_SIZE);
- memzero_explicit(i_hash, BLAKE2S_HASH_SIZE);
-}
-EXPORT_SYMBOL(blake2s256_hmac);
-
static int __init blake2s_mod_init(void)
{
if (!IS_ENABLED(CONFIG_CRYPTO_MANAGER_DISABLE_TESTS) &&
--
2.34.1


2022-01-11 14:44:08

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH crypto 2/2] lib/crypto: blake2s: move hmac construction into wireguard

On Tue, 11 Jan 2022 at 14:49, Jason A. Donenfeld <[email protected]> wrote:
>
> Basically nobody should use blake2s in an HMAC construction; it already
> has a keyed variant. But for unfortunately historical reasons, Noise,

-ly

> used by WireGuard, uses HKDF quite strictly, which means we have to use
> this. Because this really shouldn't be used by others, this commit moves
> it into wireguard's noise.c locally, so that kernels that aren't using
> WireGuard don't get this superfluous code baked in. On m68k systems,
> this shaves off ~314 bytes.
>
> Cc: Geert Uytterhoeven <[email protected]>
> Cc: Herbert Xu <[email protected]>
> Cc: Ard Biesheuvel <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Jason A. Donenfeld <[email protected]>

Acked-by: Ard Biesheuvel <[email protected]>

> ---
> drivers/net/wireguard/noise.c | 45 ++++++++++++++++++++++++++++++-----
> include/crypto/blake2s.h | 3 ---
> lib/crypto/blake2s-selftest.c | 31 ------------------------
> lib/crypto/blake2s.c | 37 ----------------------------
> 4 files changed, 39 insertions(+), 77 deletions(-)
>
> diff --git a/drivers/net/wireguard/noise.c b/drivers/net/wireguard/noise.c
> index c0cfd9b36c0b..720952b92e78 100644
> --- a/drivers/net/wireguard/noise.c
> +++ b/drivers/net/wireguard/noise.c
> @@ -302,6 +302,41 @@ void wg_noise_set_static_identity_private_key(
> static_identity->static_public, private_key);
> }
>
> +static void hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen, const size_t keylen)
> +{
> + struct blake2s_state state;
> + u8 x_key[BLAKE2S_BLOCK_SIZE] __aligned(__alignof__(u32)) = { 0 };
> + u8 i_hash[BLAKE2S_HASH_SIZE] __aligned(__alignof__(u32));
> + int i;
> +
> + if (keylen > BLAKE2S_BLOCK_SIZE) {
> + blake2s_init(&state, BLAKE2S_HASH_SIZE);
> + blake2s_update(&state, key, keylen);
> + blake2s_final(&state, x_key);
> + } else
> + memcpy(x_key, key, keylen);
> +
> + for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
> + x_key[i] ^= 0x36;
> +
> + blake2s_init(&state, BLAKE2S_HASH_SIZE);
> + blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
> + blake2s_update(&state, in, inlen);
> + blake2s_final(&state, i_hash);
> +
> + for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
> + x_key[i] ^= 0x5c ^ 0x36;
> +
> + blake2s_init(&state, BLAKE2S_HASH_SIZE);
> + blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
> + blake2s_update(&state, i_hash, BLAKE2S_HASH_SIZE);
> + blake2s_final(&state, i_hash);
> +
> + memcpy(out, i_hash, BLAKE2S_HASH_SIZE);
> + memzero_explicit(x_key, BLAKE2S_BLOCK_SIZE);
> + memzero_explicit(i_hash, BLAKE2S_HASH_SIZE);
> +}
> +
> /* This is Hugo Krawczyk's HKDF:
> * - https://eprint.iacr.org/2010/264.pdf
> * - https://tools.ietf.org/html/rfc5869
> @@ -322,14 +357,14 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,
> ((third_len || third_dst) && (!second_len || !second_dst))));
>
> /* Extract entropy from data into secret */
> - blake2s256_hmac(secret, data, chaining_key, data_len, NOISE_HASH_LEN);
> + hmac(secret, data, chaining_key, data_len, NOISE_HASH_LEN);
>
> if (!first_dst || !first_len)
> goto out;
>
> /* Expand first key: key = secret, data = 0x1 */
> output[0] = 1;
> - blake2s256_hmac(output, output, secret, 1, BLAKE2S_HASH_SIZE);
> + hmac(output, output, secret, 1, BLAKE2S_HASH_SIZE);
> memcpy(first_dst, output, first_len);
>
> if (!second_dst || !second_len)
> @@ -337,8 +372,7 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,
>
> /* Expand second key: key = secret, data = first-key || 0x2 */
> output[BLAKE2S_HASH_SIZE] = 2;
> - blake2s256_hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1,
> - BLAKE2S_HASH_SIZE);
> + hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1, BLAKE2S_HASH_SIZE);
> memcpy(second_dst, output, second_len);
>
> if (!third_dst || !third_len)
> @@ -346,8 +380,7 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,
>
> /* Expand third key: key = secret, data = second-key || 0x3 */
> output[BLAKE2S_HASH_SIZE] = 3;
> - blake2s256_hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1,
> - BLAKE2S_HASH_SIZE);
> + hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1, BLAKE2S_HASH_SIZE);
> memcpy(third_dst, output, third_len);
>
> out:
> diff --git a/include/crypto/blake2s.h b/include/crypto/blake2s.h
> index bc3fb59442ce..4e30e1799e61 100644
> --- a/include/crypto/blake2s.h
> +++ b/include/crypto/blake2s.h
> @@ -101,7 +101,4 @@ static inline void blake2s(u8 *out, const u8 *in, const u8 *key,
> blake2s_final(&state, out);
> }
>
> -void blake2s256_hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen,
> - const size_t keylen);
> -
> #endif /* _CRYPTO_BLAKE2S_H */
> diff --git a/lib/crypto/blake2s-selftest.c b/lib/crypto/blake2s-selftest.c
> index 5d9ea53be973..409e4b728770 100644
> --- a/lib/crypto/blake2s-selftest.c
> +++ b/lib/crypto/blake2s-selftest.c
> @@ -15,7 +15,6 @@
> * #include <stdio.h>
> *
> * #include <openssl/evp.h>
> - * #include <openssl/hmac.h>
> *
> * #define BLAKE2S_TESTVEC_COUNT 256
> *
> @@ -58,16 +57,6 @@
> * }
> * printf("};\n\n");
> *
> - * printf("static const u8 blake2s_hmac_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {\n");
> - *
> - * HMAC(EVP_blake2s256(), key, sizeof(key), buf, sizeof(buf), hash, NULL);
> - * print_vec(hash, BLAKE2S_OUTBYTES);
> - *
> - * HMAC(EVP_blake2s256(), buf, sizeof(buf), key, sizeof(key), hash, NULL);
> - * print_vec(hash, BLAKE2S_OUTBYTES);
> - *
> - * printf("};\n");
> - *
> * return 0;
> *}
> */
> @@ -554,15 +543,6 @@ static const u8 blake2s_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {
> 0xd6, 0x98, 0x6b, 0x07, 0x10, 0x65, 0x52, 0x65, },
> };
>
> -static const u8 blake2s_hmac_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {
> - { 0xce, 0xe1, 0x57, 0x69, 0x82, 0xdc, 0xbf, 0x43, 0xad, 0x56, 0x4c, 0x70,
> - 0xed, 0x68, 0x16, 0x96, 0xcf, 0xa4, 0x73, 0xe8, 0xe8, 0xfc, 0x32, 0x79,
> - 0x08, 0x0a, 0x75, 0x82, 0xda, 0x3f, 0x05, 0x11, },
> - { 0x77, 0x2f, 0x0c, 0x71, 0x41, 0xf4, 0x4b, 0x2b, 0xb3, 0xc6, 0xb6, 0xf9,
> - 0x60, 0xde, 0xe4, 0x52, 0x38, 0x66, 0xe8, 0xbf, 0x9b, 0x96, 0xc4, 0x9f,
> - 0x60, 0xd9, 0x24, 0x37, 0x99, 0xd6, 0xec, 0x31, },
> -};
> -
> bool __init blake2s_selftest(void)
> {
> u8 key[BLAKE2S_KEY_SIZE];
> @@ -607,16 +587,5 @@ bool __init blake2s_selftest(void)
> }
> }
>
> - if (success) {
> - blake2s256_hmac(hash, buf, key, sizeof(buf), sizeof(key));
> - success &= !memcmp(hash, blake2s_hmac_testvecs[0], BLAKE2S_HASH_SIZE);
> -
> - blake2s256_hmac(hash, key, buf, sizeof(key), sizeof(buf));
> - success &= !memcmp(hash, blake2s_hmac_testvecs[1], BLAKE2S_HASH_SIZE);
> -
> - if (!success)
> - pr_err("blake2s256_hmac self-test: FAIL\n");
> - }
> -
> return success;
> }
> diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c
> index 93f2ae051370..9364f79937b8 100644
> --- a/lib/crypto/blake2s.c
> +++ b/lib/crypto/blake2s.c
> @@ -30,43 +30,6 @@ void blake2s_final(struct blake2s_state *state, u8 *out)
> }
> EXPORT_SYMBOL(blake2s_final);
>
> -void blake2s256_hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen,
> - const size_t keylen)
> -{
> - struct blake2s_state state;
> - u8 x_key[BLAKE2S_BLOCK_SIZE] __aligned(__alignof__(u32)) = { 0 };
> - u8 i_hash[BLAKE2S_HASH_SIZE] __aligned(__alignof__(u32));
> - int i;
> -
> - if (keylen > BLAKE2S_BLOCK_SIZE) {
> - blake2s_init(&state, BLAKE2S_HASH_SIZE);
> - blake2s_update(&state, key, keylen);
> - blake2s_final(&state, x_key);
> - } else
> - memcpy(x_key, key, keylen);
> -
> - for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
> - x_key[i] ^= 0x36;
> -
> - blake2s_init(&state, BLAKE2S_HASH_SIZE);
> - blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
> - blake2s_update(&state, in, inlen);
> - blake2s_final(&state, i_hash);
> -
> - for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
> - x_key[i] ^= 0x5c ^ 0x36;
> -
> - blake2s_init(&state, BLAKE2S_HASH_SIZE);
> - blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
> - blake2s_update(&state, i_hash, BLAKE2S_HASH_SIZE);
> - blake2s_final(&state, i_hash);
> -
> - memcpy(out, i_hash, BLAKE2S_HASH_SIZE);
> - memzero_explicit(x_key, BLAKE2S_BLOCK_SIZE);
> - memzero_explicit(i_hash, BLAKE2S_HASH_SIZE);
> -}
> -EXPORT_SYMBOL(blake2s256_hmac);
> -
> static int __init blake2s_mod_init(void)
> {
> if (!IS_ENABLED(CONFIG_CRYPTO_MANAGER_DISABLE_TESTS) &&
> --
> 2.34.1
>

2022-01-11 18:10:56

by Jason A. Donenfeld

[permalink] [raw]
Subject: [PATCH crypto v2 0/2] reduce code size from blake2s on m68k and other small platforms

Hi,

Geert emailed me this afternoon concerned about blake2s codesize on m68k
and other small systems. We identified two effective ways of chopping
down the size. One of them moves some wireguard-specific things into
wireguard proper. The other one adds a slower codepath for small
machines to blake2s. This worked, and was v1 of this patchset, but I
wasn't so much of a fan. Then someone pointed out that the generic C
SHA-1 implementation is still unrolled, which is a *lot* of extra code.
Simply rerolling that saves about as much as v1 did. So, we instead do
that in this v2 patchset. SHA-1 is being phased out, and soon it won't
be included at all (hopefully). And nothing performance-oriented has
anything to do with it anyway.

The result of these two patches mitigates Geert's feared code size
increase for 5.17.

Thanks,
Jason


Jason A. Donenfeld (2):
lib/crypto: blake2s: move hmac construction into wireguard
lib/crypto: sha1: re-roll loops to reduce code size

drivers/net/wireguard/noise.c | 45 +++++++++++--
include/crypto/blake2s.h | 3 -
lib/crypto/blake2s-selftest.c | 31 ---------
lib/crypto/blake2s.c | 37 -----------
lib/sha1.c | 117 ++++++++--------------------------
5 files changed, 64 insertions(+), 169 deletions(-)

--
2.34.1


2022-01-11 18:11:05

by Jason A. Donenfeld

[permalink] [raw]
Subject: [PATCH crypto v2 1/2] lib/crypto: blake2s: move hmac construction into wireguard

Basically nobody should use blake2s in an HMAC construction; it already
has a keyed variant. But unfortunately for historical reasons, Noise,
used by WireGuard, uses HKDF quite strictly, which means we have to use
this. Because this really shouldn't be used by others, this commit moves
it into wireguard's noise.c locally, so that kernels that aren't using
WireGuard don't get this superfluous code baked in. On m68k systems,
this shaves off ~314 bytes.

Cc: Geert Uytterhoeven <[email protected]>
Cc: Herbert Xu <[email protected]>
Acked-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Jason A. Donenfeld <[email protected]>
---
drivers/net/wireguard/noise.c | 45 ++++++++++++++++++++++++++++++-----
include/crypto/blake2s.h | 3 ---
lib/crypto/blake2s-selftest.c | 31 ------------------------
lib/crypto/blake2s.c | 37 ----------------------------
4 files changed, 39 insertions(+), 77 deletions(-)

diff --git a/drivers/net/wireguard/noise.c b/drivers/net/wireguard/noise.c
index c0cfd9b36c0b..720952b92e78 100644
--- a/drivers/net/wireguard/noise.c
+++ b/drivers/net/wireguard/noise.c
@@ -302,6 +302,41 @@ void wg_noise_set_static_identity_private_key(
static_identity->static_public, private_key);
}

+static void hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen, const size_t keylen)
+{
+ struct blake2s_state state;
+ u8 x_key[BLAKE2S_BLOCK_SIZE] __aligned(__alignof__(u32)) = { 0 };
+ u8 i_hash[BLAKE2S_HASH_SIZE] __aligned(__alignof__(u32));
+ int i;
+
+ if (keylen > BLAKE2S_BLOCK_SIZE) {
+ blake2s_init(&state, BLAKE2S_HASH_SIZE);
+ blake2s_update(&state, key, keylen);
+ blake2s_final(&state, x_key);
+ } else
+ memcpy(x_key, key, keylen);
+
+ for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
+ x_key[i] ^= 0x36;
+
+ blake2s_init(&state, BLAKE2S_HASH_SIZE);
+ blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
+ blake2s_update(&state, in, inlen);
+ blake2s_final(&state, i_hash);
+
+ for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
+ x_key[i] ^= 0x5c ^ 0x36;
+
+ blake2s_init(&state, BLAKE2S_HASH_SIZE);
+ blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
+ blake2s_update(&state, i_hash, BLAKE2S_HASH_SIZE);
+ blake2s_final(&state, i_hash);
+
+ memcpy(out, i_hash, BLAKE2S_HASH_SIZE);
+ memzero_explicit(x_key, BLAKE2S_BLOCK_SIZE);
+ memzero_explicit(i_hash, BLAKE2S_HASH_SIZE);
+}
+
/* This is Hugo Krawczyk's HKDF:
* - https://eprint.iacr.org/2010/264.pdf
* - https://tools.ietf.org/html/rfc5869
@@ -322,14 +357,14 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,
((third_len || third_dst) && (!second_len || !second_dst))));

/* Extract entropy from data into secret */
- blake2s256_hmac(secret, data, chaining_key, data_len, NOISE_HASH_LEN);
+ hmac(secret, data, chaining_key, data_len, NOISE_HASH_LEN);

if (!first_dst || !first_len)
goto out;

/* Expand first key: key = secret, data = 0x1 */
output[0] = 1;
- blake2s256_hmac(output, output, secret, 1, BLAKE2S_HASH_SIZE);
+ hmac(output, output, secret, 1, BLAKE2S_HASH_SIZE);
memcpy(first_dst, output, first_len);

if (!second_dst || !second_len)
@@ -337,8 +372,7 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,

/* Expand second key: key = secret, data = first-key || 0x2 */
output[BLAKE2S_HASH_SIZE] = 2;
- blake2s256_hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1,
- BLAKE2S_HASH_SIZE);
+ hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1, BLAKE2S_HASH_SIZE);
memcpy(second_dst, output, second_len);

if (!third_dst || !third_len)
@@ -346,8 +380,7 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,

/* Expand third key: key = secret, data = second-key || 0x3 */
output[BLAKE2S_HASH_SIZE] = 3;
- blake2s256_hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1,
- BLAKE2S_HASH_SIZE);
+ hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1, BLAKE2S_HASH_SIZE);
memcpy(third_dst, output, third_len);

out:
diff --git a/include/crypto/blake2s.h b/include/crypto/blake2s.h
index bc3fb59442ce..4e30e1799e61 100644
--- a/include/crypto/blake2s.h
+++ b/include/crypto/blake2s.h
@@ -101,7 +101,4 @@ static inline void blake2s(u8 *out, const u8 *in, const u8 *key,
blake2s_final(&state, out);
}

-void blake2s256_hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen,
- const size_t keylen);
-
#endif /* _CRYPTO_BLAKE2S_H */
diff --git a/lib/crypto/blake2s-selftest.c b/lib/crypto/blake2s-selftest.c
index 5d9ea53be973..409e4b728770 100644
--- a/lib/crypto/blake2s-selftest.c
+++ b/lib/crypto/blake2s-selftest.c
@@ -15,7 +15,6 @@
* #include <stdio.h>
*
* #include <openssl/evp.h>
- * #include <openssl/hmac.h>
*
* #define BLAKE2S_TESTVEC_COUNT 256
*
@@ -58,16 +57,6 @@
* }
* printf("};\n\n");
*
- * printf("static const u8 blake2s_hmac_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {\n");
- *
- * HMAC(EVP_blake2s256(), key, sizeof(key), buf, sizeof(buf), hash, NULL);
- * print_vec(hash, BLAKE2S_OUTBYTES);
- *
- * HMAC(EVP_blake2s256(), buf, sizeof(buf), key, sizeof(key), hash, NULL);
- * print_vec(hash, BLAKE2S_OUTBYTES);
- *
- * printf("};\n");
- *
* return 0;
*}
*/
@@ -554,15 +543,6 @@ static const u8 blake2s_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {
0xd6, 0x98, 0x6b, 0x07, 0x10, 0x65, 0x52, 0x65, },
};

-static const u8 blake2s_hmac_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {
- { 0xce, 0xe1, 0x57, 0x69, 0x82, 0xdc, 0xbf, 0x43, 0xad, 0x56, 0x4c, 0x70,
- 0xed, 0x68, 0x16, 0x96, 0xcf, 0xa4, 0x73, 0xe8, 0xe8, 0xfc, 0x32, 0x79,
- 0x08, 0x0a, 0x75, 0x82, 0xda, 0x3f, 0x05, 0x11, },
- { 0x77, 0x2f, 0x0c, 0x71, 0x41, 0xf4, 0x4b, 0x2b, 0xb3, 0xc6, 0xb6, 0xf9,
- 0x60, 0xde, 0xe4, 0x52, 0x38, 0x66, 0xe8, 0xbf, 0x9b, 0x96, 0xc4, 0x9f,
- 0x60, 0xd9, 0x24, 0x37, 0x99, 0xd6, 0xec, 0x31, },
-};
-
bool __init blake2s_selftest(void)
{
u8 key[BLAKE2S_KEY_SIZE];
@@ -607,16 +587,5 @@ bool __init blake2s_selftest(void)
}
}

- if (success) {
- blake2s256_hmac(hash, buf, key, sizeof(buf), sizeof(key));
- success &= !memcmp(hash, blake2s_hmac_testvecs[0], BLAKE2S_HASH_SIZE);
-
- blake2s256_hmac(hash, key, buf, sizeof(key), sizeof(buf));
- success &= !memcmp(hash, blake2s_hmac_testvecs[1], BLAKE2S_HASH_SIZE);
-
- if (!success)
- pr_err("blake2s256_hmac self-test: FAIL\n");
- }
-
return success;
}
diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c
index 93f2ae051370..9364f79937b8 100644
--- a/lib/crypto/blake2s.c
+++ b/lib/crypto/blake2s.c
@@ -30,43 +30,6 @@ void blake2s_final(struct blake2s_state *state, u8 *out)
}
EXPORT_SYMBOL(blake2s_final);

-void blake2s256_hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen,
- const size_t keylen)
-{
- struct blake2s_state state;
- u8 x_key[BLAKE2S_BLOCK_SIZE] __aligned(__alignof__(u32)) = { 0 };
- u8 i_hash[BLAKE2S_HASH_SIZE] __aligned(__alignof__(u32));
- int i;
-
- if (keylen > BLAKE2S_BLOCK_SIZE) {
- blake2s_init(&state, BLAKE2S_HASH_SIZE);
- blake2s_update(&state, key, keylen);
- blake2s_final(&state, x_key);
- } else
- memcpy(x_key, key, keylen);
-
- for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
- x_key[i] ^= 0x36;
-
- blake2s_init(&state, BLAKE2S_HASH_SIZE);
- blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
- blake2s_update(&state, in, inlen);
- blake2s_final(&state, i_hash);
-
- for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
- x_key[i] ^= 0x5c ^ 0x36;
-
- blake2s_init(&state, BLAKE2S_HASH_SIZE);
- blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
- blake2s_update(&state, i_hash, BLAKE2S_HASH_SIZE);
- blake2s_final(&state, i_hash);
-
- memcpy(out, i_hash, BLAKE2S_HASH_SIZE);
- memzero_explicit(x_key, BLAKE2S_BLOCK_SIZE);
- memzero_explicit(i_hash, BLAKE2S_HASH_SIZE);
-}
-EXPORT_SYMBOL(blake2s256_hmac);
-
static int __init blake2s_mod_init(void)
{
if (!IS_ENABLED(CONFIG_CRYPTO_MANAGER_DISABLE_TESTS) &&
--
2.34.1


2022-01-11 18:11:07

by Jason A. Donenfeld

[permalink] [raw]
Subject: [PATCH crypto v2 2/2] lib/crypto: sha1: re-roll loops to reduce code size

With SHA-1 no longer being used for anything performance oriented, and
also soon to be phased out entirely, we can make up for the space added
by unrolled BLAKE2s by simply re-rolling SHA-1. Since SHA-1 is so much
more complex, re-rolling it more or less takes care of the code size
added by BLAKE2s. And eventually, hopefully we'll see SHA-1 removed
entirely from most small kernel builds.

Cc: Geert Uytterhoeven <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Signed-off-by: Jason A. Donenfeld <[email protected]>
---
lib/sha1.c | 117 ++++++++++++-----------------------------------------
1 file changed, 25 insertions(+), 92 deletions(-)

diff --git a/lib/sha1.c b/lib/sha1.c
index 9bd1935a1472..f2acfa294e64 100644
--- a/lib/sha1.c
+++ b/lib/sha1.c
@@ -9,6 +9,7 @@
#include <linux/kernel.h>
#include <linux/export.h>
#include <linux/bitops.h>
+#include <linux/string.h>
#include <crypto/sha1.h>
#include <asm/unaligned.h>

@@ -83,109 +84,41 @@
*/
void sha1_transform(__u32 *digest, const char *data, __u32 *array)
{
- __u32 A, B, C, D, E;
+ u32 d[5];
+ unsigned int i = 0;

- A = digest[0];
- B = digest[1];
- C = digest[2];
- D = digest[3];
- E = digest[4];
+ memcpy(d, digest, sizeof(d));

/* Round 1 - iterations 0-16 take their input from 'data' */
- T_0_15( 0, A, B, C, D, E);
- T_0_15( 1, E, A, B, C, D);
- T_0_15( 2, D, E, A, B, C);
- T_0_15( 3, C, D, E, A, B);
- T_0_15( 4, B, C, D, E, A);
- T_0_15( 5, A, B, C, D, E);
- T_0_15( 6, E, A, B, C, D);
- T_0_15( 7, D, E, A, B, C);
- T_0_15( 8, C, D, E, A, B);
- T_0_15( 9, B, C, D, E, A);
- T_0_15(10, A, B, C, D, E);
- T_0_15(11, E, A, B, C, D);
- T_0_15(12, D, E, A, B, C);
- T_0_15(13, C, D, E, A, B);
- T_0_15(14, B, C, D, E, A);
- T_0_15(15, A, B, C, D, E);
+ for (; i < 16; ++i)
+ T_0_15(i, d[(-6 - i) % 5], d[(-5 - i) % 5],
+ d[(-4 - i) % 5], d[(-3 - i) % 5], d[(-2 - i) % 5]);

/* Round 1 - tail. Input from 512-bit mixing array */
- T_16_19(16, E, A, B, C, D);
- T_16_19(17, D, E, A, B, C);
- T_16_19(18, C, D, E, A, B);
- T_16_19(19, B, C, D, E, A);
+ for (; i < 20; ++i)
+ T_16_19(i, d[(-6 - i) % 5], d[(-5 - i) % 5],
+ d[(-4 - i) % 5], d[(-3 - i) % 5], d[(-2 - i) % 5]);

/* Round 2 */
- T_20_39(20, A, B, C, D, E);
- T_20_39(21, E, A, B, C, D);
- T_20_39(22, D, E, A, B, C);
- T_20_39(23, C, D, E, A, B);
- T_20_39(24, B, C, D, E, A);
- T_20_39(25, A, B, C, D, E);
- T_20_39(26, E, A, B, C, D);
- T_20_39(27, D, E, A, B, C);
- T_20_39(28, C, D, E, A, B);
- T_20_39(29, B, C, D, E, A);
- T_20_39(30, A, B, C, D, E);
- T_20_39(31, E, A, B, C, D);
- T_20_39(32, D, E, A, B, C);
- T_20_39(33, C, D, E, A, B);
- T_20_39(34, B, C, D, E, A);
- T_20_39(35, A, B, C, D, E);
- T_20_39(36, E, A, B, C, D);
- T_20_39(37, D, E, A, B, C);
- T_20_39(38, C, D, E, A, B);
- T_20_39(39, B, C, D, E, A);
+ for (; i < 40; ++i)
+ T_20_39(i, d[(-6 - i) % 5], d[(-5 - i) % 5],
+ d[(-4 - i) % 5], d[(-3 - i) % 5], d[(-2 - i) % 5]);

/* Round 3 */
- T_40_59(40, A, B, C, D, E);
- T_40_59(41, E, A, B, C, D);
- T_40_59(42, D, E, A, B, C);
- T_40_59(43, C, D, E, A, B);
- T_40_59(44, B, C, D, E, A);
- T_40_59(45, A, B, C, D, E);
- T_40_59(46, E, A, B, C, D);
- T_40_59(47, D, E, A, B, C);
- T_40_59(48, C, D, E, A, B);
- T_40_59(49, B, C, D, E, A);
- T_40_59(50, A, B, C, D, E);
- T_40_59(51, E, A, B, C, D);
- T_40_59(52, D, E, A, B, C);
- T_40_59(53, C, D, E, A, B);
- T_40_59(54, B, C, D, E, A);
- T_40_59(55, A, B, C, D, E);
- T_40_59(56, E, A, B, C, D);
- T_40_59(57, D, E, A, B, C);
- T_40_59(58, C, D, E, A, B);
- T_40_59(59, B, C, D, E, A);
+ for (; i < 60; ++i)
+ T_40_59(i, d[(-6 - i) % 5], d[(-5 - i) % 5],
+ d[(-4 - i) % 5], d[(-3 - i) % 5], d[(-2 - i) % 5]);

/* Round 4 */
- T_60_79(60, A, B, C, D, E);
- T_60_79(61, E, A, B, C, D);
- T_60_79(62, D, E, A, B, C);
- T_60_79(63, C, D, E, A, B);
- T_60_79(64, B, C, D, E, A);
- T_60_79(65, A, B, C, D, E);
- T_60_79(66, E, A, B, C, D);
- T_60_79(67, D, E, A, B, C);
- T_60_79(68, C, D, E, A, B);
- T_60_79(69, B, C, D, E, A);
- T_60_79(70, A, B, C, D, E);
- T_60_79(71, E, A, B, C, D);
- T_60_79(72, D, E, A, B, C);
- T_60_79(73, C, D, E, A, B);
- T_60_79(74, B, C, D, E, A);
- T_60_79(75, A, B, C, D, E);
- T_60_79(76, E, A, B, C, D);
- T_60_79(77, D, E, A, B, C);
- T_60_79(78, C, D, E, A, B);
- T_60_79(79, B, C, D, E, A);
-
- digest[0] += A;
- digest[1] += B;
- digest[2] += C;
- digest[3] += D;
- digest[4] += E;
+ for (; i < 80; ++i)
+ T_60_79(i, d[(-6 - i) % 5], d[(-5 - i) % 5],
+ d[(-4 - i) % 5], d[(-3 - i) % 5], d[(-2 - i) % 5]);
+
+ digest[0] += d[0];
+ digest[1] += d[1];
+ digest[2] += d[2];
+ digest[3] += d[3];
+ digest[4] += d[4];
}
EXPORT_SYMBOL(sha1_transform);

--
2.34.1


2022-01-11 22:05:22

by Jason A. Donenfeld

[permalink] [raw]
Subject: [PATCH crypto v3 0/2] reduce code size from blake2s on m68k and other small platforms

Hi,

Geert emailed me this afternoon concerned about blake2s codesize on m68k
and other small systems. We identified two effective ways of chopping
down the size. One of them moves some wireguard-specific things into
wireguard proper. The other one adds a slower codepath for small
machines to blake2s. This worked, and was v1 of this patchset, but I
wasn't so much of a fan. Then someone pointed out that the generic C
SHA-1 implementation is still unrolled, which is a *lot* of extra code.
Simply rerolling that saves about as much as v1 did. So, we instead do
that in this patchset. SHA-1 is being phased out, and soon it won't
be included at all (hopefully). And nothing performance-oriented has
anything to do with it anyway.

The result of these two patches mitigates Geert's feared code size
increase for 5.17.

v3 improves on v2 by making the re-rolling of SHA-1 much simpler,
resulting in even larger code size reduction and much better
performance. The reason I'm sending yet a third version in such a short
amount of time is because the trick here feels obvious and substantial
enough that I'd hate for Geert to waste time measuring the impact of the
previous commit.

Thanks,
Jason

Jason A. Donenfeld (2):
lib/crypto: blake2s: move hmac construction into wireguard
lib/crypto: sha1: re-roll loops to reduce code size

drivers/net/wireguard/noise.c | 45 ++++++++++++++---
include/crypto/blake2s.h | 3 --
lib/crypto/blake2s-selftest.c | 31 ------------
lib/crypto/blake2s.c | 37 --------------
lib/sha1.c | 95 ++++++-----------------------------
5 files changed, 53 insertions(+), 158 deletions(-)

--
2.34.1


2022-01-11 22:05:25

by Jason A. Donenfeld

[permalink] [raw]
Subject: [PATCH crypto v3 1/2] lib/crypto: blake2s: move hmac construction into wireguard

Basically nobody should use blake2s in an HMAC construction; it already
has a keyed variant. But unfortunately for historical reasons, Noise,
used by WireGuard, uses HKDF quite strictly, which means we have to use
this. Because this really shouldn't be used by others, this commit moves
it into wireguard's noise.c locally, so that kernels that aren't using
WireGuard don't get this superfluous code baked in. On m68k systems,
this shaves off ~314 bytes.

Cc: Geert Uytterhoeven <[email protected]>
Cc: Herbert Xu <[email protected]>
Acked-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Jason A. Donenfeld <[email protected]>
---
drivers/net/wireguard/noise.c | 45 ++++++++++++++++++++++++++++++-----
include/crypto/blake2s.h | 3 ---
lib/crypto/blake2s-selftest.c | 31 ------------------------
lib/crypto/blake2s.c | 37 ----------------------------
4 files changed, 39 insertions(+), 77 deletions(-)

diff --git a/drivers/net/wireguard/noise.c b/drivers/net/wireguard/noise.c
index c0cfd9b36c0b..720952b92e78 100644
--- a/drivers/net/wireguard/noise.c
+++ b/drivers/net/wireguard/noise.c
@@ -302,6 +302,41 @@ void wg_noise_set_static_identity_private_key(
static_identity->static_public, private_key);
}

+static void hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen, const size_t keylen)
+{
+ struct blake2s_state state;
+ u8 x_key[BLAKE2S_BLOCK_SIZE] __aligned(__alignof__(u32)) = { 0 };
+ u8 i_hash[BLAKE2S_HASH_SIZE] __aligned(__alignof__(u32));
+ int i;
+
+ if (keylen > BLAKE2S_BLOCK_SIZE) {
+ blake2s_init(&state, BLAKE2S_HASH_SIZE);
+ blake2s_update(&state, key, keylen);
+ blake2s_final(&state, x_key);
+ } else
+ memcpy(x_key, key, keylen);
+
+ for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
+ x_key[i] ^= 0x36;
+
+ blake2s_init(&state, BLAKE2S_HASH_SIZE);
+ blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
+ blake2s_update(&state, in, inlen);
+ blake2s_final(&state, i_hash);
+
+ for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
+ x_key[i] ^= 0x5c ^ 0x36;
+
+ blake2s_init(&state, BLAKE2S_HASH_SIZE);
+ blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
+ blake2s_update(&state, i_hash, BLAKE2S_HASH_SIZE);
+ blake2s_final(&state, i_hash);
+
+ memcpy(out, i_hash, BLAKE2S_HASH_SIZE);
+ memzero_explicit(x_key, BLAKE2S_BLOCK_SIZE);
+ memzero_explicit(i_hash, BLAKE2S_HASH_SIZE);
+}
+
/* This is Hugo Krawczyk's HKDF:
* - https://eprint.iacr.org/2010/264.pdf
* - https://tools.ietf.org/html/rfc5869
@@ -322,14 +357,14 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,
((third_len || third_dst) && (!second_len || !second_dst))));

/* Extract entropy from data into secret */
- blake2s256_hmac(secret, data, chaining_key, data_len, NOISE_HASH_LEN);
+ hmac(secret, data, chaining_key, data_len, NOISE_HASH_LEN);

if (!first_dst || !first_len)
goto out;

/* Expand first key: key = secret, data = 0x1 */
output[0] = 1;
- blake2s256_hmac(output, output, secret, 1, BLAKE2S_HASH_SIZE);
+ hmac(output, output, secret, 1, BLAKE2S_HASH_SIZE);
memcpy(first_dst, output, first_len);

if (!second_dst || !second_len)
@@ -337,8 +372,7 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,

/* Expand second key: key = secret, data = first-key || 0x2 */
output[BLAKE2S_HASH_SIZE] = 2;
- blake2s256_hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1,
- BLAKE2S_HASH_SIZE);
+ hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1, BLAKE2S_HASH_SIZE);
memcpy(second_dst, output, second_len);

if (!third_dst || !third_len)
@@ -346,8 +380,7 @@ static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,

/* Expand third key: key = secret, data = second-key || 0x3 */
output[BLAKE2S_HASH_SIZE] = 3;
- blake2s256_hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1,
- BLAKE2S_HASH_SIZE);
+ hmac(output, output, secret, BLAKE2S_HASH_SIZE + 1, BLAKE2S_HASH_SIZE);
memcpy(third_dst, output, third_len);

out:
diff --git a/include/crypto/blake2s.h b/include/crypto/blake2s.h
index bc3fb59442ce..4e30e1799e61 100644
--- a/include/crypto/blake2s.h
+++ b/include/crypto/blake2s.h
@@ -101,7 +101,4 @@ static inline void blake2s(u8 *out, const u8 *in, const u8 *key,
blake2s_final(&state, out);
}

-void blake2s256_hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen,
- const size_t keylen);
-
#endif /* _CRYPTO_BLAKE2S_H */
diff --git a/lib/crypto/blake2s-selftest.c b/lib/crypto/blake2s-selftest.c
index 5d9ea53be973..409e4b728770 100644
--- a/lib/crypto/blake2s-selftest.c
+++ b/lib/crypto/blake2s-selftest.c
@@ -15,7 +15,6 @@
* #include <stdio.h>
*
* #include <openssl/evp.h>
- * #include <openssl/hmac.h>
*
* #define BLAKE2S_TESTVEC_COUNT 256
*
@@ -58,16 +57,6 @@
* }
* printf("};\n\n");
*
- * printf("static const u8 blake2s_hmac_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {\n");
- *
- * HMAC(EVP_blake2s256(), key, sizeof(key), buf, sizeof(buf), hash, NULL);
- * print_vec(hash, BLAKE2S_OUTBYTES);
- *
- * HMAC(EVP_blake2s256(), buf, sizeof(buf), key, sizeof(key), hash, NULL);
- * print_vec(hash, BLAKE2S_OUTBYTES);
- *
- * printf("};\n");
- *
* return 0;
*}
*/
@@ -554,15 +543,6 @@ static const u8 blake2s_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {
0xd6, 0x98, 0x6b, 0x07, 0x10, 0x65, 0x52, 0x65, },
};

-static const u8 blake2s_hmac_testvecs[][BLAKE2S_HASH_SIZE] __initconst = {
- { 0xce, 0xe1, 0x57, 0x69, 0x82, 0xdc, 0xbf, 0x43, 0xad, 0x56, 0x4c, 0x70,
- 0xed, 0x68, 0x16, 0x96, 0xcf, 0xa4, 0x73, 0xe8, 0xe8, 0xfc, 0x32, 0x79,
- 0x08, 0x0a, 0x75, 0x82, 0xda, 0x3f, 0x05, 0x11, },
- { 0x77, 0x2f, 0x0c, 0x71, 0x41, 0xf4, 0x4b, 0x2b, 0xb3, 0xc6, 0xb6, 0xf9,
- 0x60, 0xde, 0xe4, 0x52, 0x38, 0x66, 0xe8, 0xbf, 0x9b, 0x96, 0xc4, 0x9f,
- 0x60, 0xd9, 0x24, 0x37, 0x99, 0xd6, 0xec, 0x31, },
-};
-
bool __init blake2s_selftest(void)
{
u8 key[BLAKE2S_KEY_SIZE];
@@ -607,16 +587,5 @@ bool __init blake2s_selftest(void)
}
}

- if (success) {
- blake2s256_hmac(hash, buf, key, sizeof(buf), sizeof(key));
- success &= !memcmp(hash, blake2s_hmac_testvecs[0], BLAKE2S_HASH_SIZE);
-
- blake2s256_hmac(hash, key, buf, sizeof(key), sizeof(buf));
- success &= !memcmp(hash, blake2s_hmac_testvecs[1], BLAKE2S_HASH_SIZE);
-
- if (!success)
- pr_err("blake2s256_hmac self-test: FAIL\n");
- }
-
return success;
}
diff --git a/lib/crypto/blake2s.c b/lib/crypto/blake2s.c
index 93f2ae051370..9364f79937b8 100644
--- a/lib/crypto/blake2s.c
+++ b/lib/crypto/blake2s.c
@@ -30,43 +30,6 @@ void blake2s_final(struct blake2s_state *state, u8 *out)
}
EXPORT_SYMBOL(blake2s_final);

-void blake2s256_hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen,
- const size_t keylen)
-{
- struct blake2s_state state;
- u8 x_key[BLAKE2S_BLOCK_SIZE] __aligned(__alignof__(u32)) = { 0 };
- u8 i_hash[BLAKE2S_HASH_SIZE] __aligned(__alignof__(u32));
- int i;
-
- if (keylen > BLAKE2S_BLOCK_SIZE) {
- blake2s_init(&state, BLAKE2S_HASH_SIZE);
- blake2s_update(&state, key, keylen);
- blake2s_final(&state, x_key);
- } else
- memcpy(x_key, key, keylen);
-
- for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
- x_key[i] ^= 0x36;
-
- blake2s_init(&state, BLAKE2S_HASH_SIZE);
- blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
- blake2s_update(&state, in, inlen);
- blake2s_final(&state, i_hash);
-
- for (i = 0; i < BLAKE2S_BLOCK_SIZE; ++i)
- x_key[i] ^= 0x5c ^ 0x36;
-
- blake2s_init(&state, BLAKE2S_HASH_SIZE);
- blake2s_update(&state, x_key, BLAKE2S_BLOCK_SIZE);
- blake2s_update(&state, i_hash, BLAKE2S_HASH_SIZE);
- blake2s_final(&state, i_hash);
-
- memcpy(out, i_hash, BLAKE2S_HASH_SIZE);
- memzero_explicit(x_key, BLAKE2S_BLOCK_SIZE);
- memzero_explicit(i_hash, BLAKE2S_HASH_SIZE);
-}
-EXPORT_SYMBOL(blake2s256_hmac);
-
static int __init blake2s_mod_init(void)
{
if (!IS_ENABLED(CONFIG_CRYPTO_MANAGER_DISABLE_TESTS) &&
--
2.34.1


2022-01-11 22:05:34

by Jason A. Donenfeld

[permalink] [raw]
Subject: [PATCH crypto v3 2/2] lib/crypto: sha1: re-roll loops to reduce code size

With SHA-1 no longer being used for anything performance oriented, and
also soon to be phased out entirely, we can make up for the space added
by unrolled BLAKE2s by simply re-rolling SHA-1. Since SHA-1 is so much
more complex, re-rolling it more or less takes care of the code size
added by BLAKE2s. And eventually, hopefully we'll see SHA-1 removed
entirely from most small kernel builds.

Cc: Geert Uytterhoeven <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Signed-off-by: Jason A. Donenfeld <[email protected]>
---
lib/sha1.c | 95 ++++++++----------------------------------------------
1 file changed, 14 insertions(+), 81 deletions(-)

diff --git a/lib/sha1.c b/lib/sha1.c
index 9bd1935a1472..0494766fc574 100644
--- a/lib/sha1.c
+++ b/lib/sha1.c
@@ -9,6 +9,7 @@
#include <linux/kernel.h>
#include <linux/export.h>
#include <linux/bitops.h>
+#include <linux/string.h>
#include <crypto/sha1.h>
#include <asm/unaligned.h>

@@ -55,7 +56,8 @@
#define SHA_ROUND(t, input, fn, constant, A, B, C, D, E) do { \
__u32 TEMP = input(t); setW(t, TEMP); \
E += TEMP + rol32(A,5) + (fn) + (constant); \
- B = ror32(B, 2); } while (0)
+ B = ror32(B, 2); \
+ TEMP = E; E = D; D = C; C = B; B = A; A = TEMP; } while (0)

#define T_0_15(t, A, B, C, D, E) SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )
#define T_16_19(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )
@@ -84,6 +86,7 @@
void sha1_transform(__u32 *digest, const char *data, __u32 *array)
{
__u32 A, B, C, D, E;
+ unsigned int i = 0;

A = digest[0];
B = digest[1];
@@ -92,94 +95,24 @@ void sha1_transform(__u32 *digest, const char *data, __u32 *array)
E = digest[4];

/* Round 1 - iterations 0-16 take their input from 'data' */
- T_0_15( 0, A, B, C, D, E);
- T_0_15( 1, E, A, B, C, D);
- T_0_15( 2, D, E, A, B, C);
- T_0_15( 3, C, D, E, A, B);
- T_0_15( 4, B, C, D, E, A);
- T_0_15( 5, A, B, C, D, E);
- T_0_15( 6, E, A, B, C, D);
- T_0_15( 7, D, E, A, B, C);
- T_0_15( 8, C, D, E, A, B);
- T_0_15( 9, B, C, D, E, A);
- T_0_15(10, A, B, C, D, E);
- T_0_15(11, E, A, B, C, D);
- T_0_15(12, D, E, A, B, C);
- T_0_15(13, C, D, E, A, B);
- T_0_15(14, B, C, D, E, A);
- T_0_15(15, A, B, C, D, E);
+ for (; i < 16; ++i)
+ T_0_15(i, A, B, C, D, E);

/* Round 1 - tail. Input from 512-bit mixing array */
- T_16_19(16, E, A, B, C, D);
- T_16_19(17, D, E, A, B, C);
- T_16_19(18, C, D, E, A, B);
- T_16_19(19, B, C, D, E, A);
+ for (; i < 20; ++i)
+ T_16_19(i, A, B, C, D, E);

/* Round 2 */
- T_20_39(20, A, B, C, D, E);
- T_20_39(21, E, A, B, C, D);
- T_20_39(22, D, E, A, B, C);
- T_20_39(23, C, D, E, A, B);
- T_20_39(24, B, C, D, E, A);
- T_20_39(25, A, B, C, D, E);
- T_20_39(26, E, A, B, C, D);
- T_20_39(27, D, E, A, B, C);
- T_20_39(28, C, D, E, A, B);
- T_20_39(29, B, C, D, E, A);
- T_20_39(30, A, B, C, D, E);
- T_20_39(31, E, A, B, C, D);
- T_20_39(32, D, E, A, B, C);
- T_20_39(33, C, D, E, A, B);
- T_20_39(34, B, C, D, E, A);
- T_20_39(35, A, B, C, D, E);
- T_20_39(36, E, A, B, C, D);
- T_20_39(37, D, E, A, B, C);
- T_20_39(38, C, D, E, A, B);
- T_20_39(39, B, C, D, E, A);
+ for (; i < 40; ++i)
+ T_20_39(i, A, B, C, D, E);

/* Round 3 */
- T_40_59(40, A, B, C, D, E);
- T_40_59(41, E, A, B, C, D);
- T_40_59(42, D, E, A, B, C);
- T_40_59(43, C, D, E, A, B);
- T_40_59(44, B, C, D, E, A);
- T_40_59(45, A, B, C, D, E);
- T_40_59(46, E, A, B, C, D);
- T_40_59(47, D, E, A, B, C);
- T_40_59(48, C, D, E, A, B);
- T_40_59(49, B, C, D, E, A);
- T_40_59(50, A, B, C, D, E);
- T_40_59(51, E, A, B, C, D);
- T_40_59(52, D, E, A, B, C);
- T_40_59(53, C, D, E, A, B);
- T_40_59(54, B, C, D, E, A);
- T_40_59(55, A, B, C, D, E);
- T_40_59(56, E, A, B, C, D);
- T_40_59(57, D, E, A, B, C);
- T_40_59(58, C, D, E, A, B);
- T_40_59(59, B, C, D, E, A);
+ for (; i < 60; ++i)
+ T_40_59(i, A, B, C, D, E);

/* Round 4 */
- T_60_79(60, A, B, C, D, E);
- T_60_79(61, E, A, B, C, D);
- T_60_79(62, D, E, A, B, C);
- T_60_79(63, C, D, E, A, B);
- T_60_79(64, B, C, D, E, A);
- T_60_79(65, A, B, C, D, E);
- T_60_79(66, E, A, B, C, D);
- T_60_79(67, D, E, A, B, C);
- T_60_79(68, C, D, E, A, B);
- T_60_79(69, B, C, D, E, A);
- T_60_79(70, A, B, C, D, E);
- T_60_79(71, E, A, B, C, D);
- T_60_79(72, D, E, A, B, C);
- T_60_79(73, C, D, E, A, B);
- T_60_79(74, B, C, D, E, A);
- T_60_79(75, A, B, C, D, E);
- T_60_79(76, E, A, B, C, D);
- T_60_79(77, D, E, A, B, C);
- T_60_79(78, C, D, E, A, B);
- T_60_79(79, B, C, D, E, A);
+ for (; i < 80; ++i)
+ T_60_79(i, A, B, C, D, E);

digest[0] += A;
digest[1] += B;
--
2.34.1