Subject: [PATCH 00/13] crypto: x86 - yield FPU context during long loops

This is an offshoot of the previous patch series at:
https://lore.kernel.org/linux-crypto/[email protected]

Add a kernel_fpu_yield() function for x86 crypto drivers to call
periodically during long loops.

Test results
============
I created 28 tcrypt modules so modprobe can run concurrent tests,
added 1 MiB functional and speed tests to tcrypt, and ran three processes
spawning 28 subprocesses (one per physical CPU core) each looping forever
through all the tcrypt test modes. This keeps the system quite busy,
generating RCU stalls and soft lockups during both generic and x86
crypto function processing.

In conjunction with these patch series:
* [PATCH 0/8] crypto: kernel-doc for assembly language
https://lore.kernel.org/linux-crypto/[email protected]
* [PATCH 0/3] crypto/rcu: suppress unnecessary CPU stall warnings
https://lore.kernel.org/linux-crypto/[email protected]
* [PATCH 0/3] crypto: yield at end of operations
https://lore.kernel.org/linux-crypto/[email protected]

while using the default RCU values (60 s stalls, 21 s expedited stalls),
several nights of testing did not result in any RCU stall warnings or soft
lockups in any of these preemption modes:
preempt=none
preempt=voluntary
preempt=full

Setting the shortest possible RCU timeouts (3 s, 20 ms) did still result
in RCU stalls, but only about one every 2 hours, and they were no longer
concentrated in particular modules like sha512_ssse3 and sm4-generic.

systemd usually crashes and restarts when its journal becomes full from
all the tcrypt printk messages. Without the patches, that triggered more
RCU stall reports and soft lockups; with the patches, only userspace
seems perturbed.


Robert Elliott (13):
x86: protect simd.h header file
x86: add yield FPU context utility function
crypto: x86/sha - yield FPU context during long loops
crypto: x86/crc - yield FPU context during long loops
crypto: x86/sm3 - yield FPU context during long loops
crypto: x86/ghash - use u8 rather than char
crypto: x86/ghash - restructure FPU context saving
crypto: x86/ghash - yield FPU context during long loops
crypto: x86/poly - yield FPU context only when needed
crypto: x86/aegis - yield FPU context during long loops
crypto: x86/blake - yield FPU context only when needed
crypto: x86/chacha - yield FPU context only when needed
crypto: x86/aria - yield FPU context only when needed

arch/x86/crypto/aegis128-aesni-glue.c | 49 ++++++---
arch/x86/crypto/aria_aesni_avx_glue.c | 7 +-
arch/x86/crypto/blake2s-glue.c | 41 +++----
arch/x86/crypto/chacha_glue.c | 22 ++--
arch/x86/crypto/crc32-pclmul_glue.c | 49 +++++----
arch/x86/crypto/crc32c-intel_glue.c | 118 ++++++++++++++------
arch/x86/crypto/crct10dif-pclmul_glue.c | 65 ++++++++---
arch/x86/crypto/ghash-clmulni-intel_asm.S | 6 +-
arch/x86/crypto/ghash-clmulni-intel_glue.c | 37 +++++--
arch/x86/crypto/nhpoly1305-avx2-glue.c | 22 ++--
arch/x86/crypto/nhpoly1305-sse2-glue.c | 22 ++--
arch/x86/crypto/poly1305_glue.c | 47 ++++----
arch/x86/crypto/polyval-clmulni_glue.c | 46 +++++---
arch/x86/crypto/sha1_avx2_x86_64_asm.S | 6 +-
arch/x86/crypto/sha1_ni_asm.S | 8 +-
arch/x86/crypto/sha1_ssse3_glue.c | 120 +++++++++++++++++----
arch/x86/crypto/sha256_ni_asm.S | 8 +-
arch/x86/crypto/sha256_ssse3_glue.c | 115 ++++++++++++++++----
arch/x86/crypto/sha512_ssse3_glue.c | 89 ++++++++++++---
arch/x86/crypto/sm3_avx_glue.c | 34 +++++-
arch/x86/include/asm/simd.h | 23 ++++
include/crypto/internal/blake2s.h | 8 +-
lib/crypto/blake2s-generic.c | 12 +--
23 files changed, 687 insertions(+), 267 deletions(-)

--
2.38.1


Subject: [PATCH 09/13] crypto: x86/poly - yield FPU context only when needed

The x86 assembly language implementations using SIMD process data
between kernel_fpu_begin() and kernel_fpu_end() calls. That
disables scheduler preemption, which prevents the CPU core from being
used by other threads.

The update() and finup() functions might be called to process
large quantities of data, which can result in RCU stalls and
soft lockups.

Rather than break the processing into 4 KiB passes, each of which
unilaterally calls kernel_fpu_begin() and kernel_fpu_end(),
periodically check if the kernel scheduler wants to run something
else on the CPU. If so, yield the kernel FPU context and let the
scheduler intervene.

Suggested-by: Herbert Xu <[email protected]>
Signed-off-by: Robert Elliott <[email protected]>
---
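Note for reviewers: this trades one FPU context section per 4 KiB pass for
a single section that is only interrupted on demand. As a rough worked
example (hypothetical 1 MiB update):

    1 MiB / 4 KiB per pass = 256 kernel_fpu_begin()/kernel_fpu_end() pairs (old)
    vs. 1 pair, plus kernel_fpu_yield() calls that end and restart the
    context only when need_resched() is true (new)
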
arch/x86/crypto/nhpoly1305-avx2-glue.c | 22 +++++++-----
arch/x86/crypto/nhpoly1305-sse2-glue.c | 22 +++++++-----
arch/x86/crypto/poly1305_glue.c | 47 ++++++++++++--------------
arch/x86/crypto/polyval-clmulni_glue.c | 46 +++++++++++++++----------
4 files changed, 79 insertions(+), 58 deletions(-)

diff --git a/arch/x86/crypto/nhpoly1305-avx2-glue.c b/arch/x86/crypto/nhpoly1305-avx2-glue.c
index 46b036204ed9..4afbfd35afda 100644
--- a/arch/x86/crypto/nhpoly1305-avx2-glue.c
+++ b/arch/x86/crypto/nhpoly1305-avx2-glue.c
@@ -22,15 +22,21 @@ static int nhpoly1305_avx2_update(struct shash_desc *desc,
if (srclen < 64 || !crypto_simd_usable())
return crypto_nhpoly1305_update(desc, src, srclen);

- do {
- unsigned int n = min_t(unsigned int, srclen, SZ_4K);
+ kernel_fpu_begin();
+ for (;;) {
+ const unsigned int chunk = min(srclen, 4096U);
+
+ crypto_nhpoly1305_update_helper(desc, src, chunk, nh_avx2);
+ srclen -= chunk;
+
+ if (!srclen)
+ break;
+
+ src += chunk;
+ kernel_fpu_yield();
+ }
+ kernel_fpu_end();

- kernel_fpu_begin();
- crypto_nhpoly1305_update_helper(desc, src, n, nh_avx2);
- kernel_fpu_end();
- src += n;
- srclen -= n;
- } while (srclen);
return 0;
}

diff --git a/arch/x86/crypto/nhpoly1305-sse2-glue.c b/arch/x86/crypto/nhpoly1305-sse2-glue.c
index 4a4970d75107..f5c757f6f781 100644
--- a/arch/x86/crypto/nhpoly1305-sse2-glue.c
+++ b/arch/x86/crypto/nhpoly1305-sse2-glue.c
@@ -22,15 +22,21 @@ static int nhpoly1305_sse2_update(struct shash_desc *desc,
if (srclen < 64 || !crypto_simd_usable())
return crypto_nhpoly1305_update(desc, src, srclen);

- do {
- unsigned int n = min_t(unsigned int, srclen, SZ_4K);
+ kernel_fpu_begin();
+ for (;;) {
+ const unsigned int chunk = min(srclen, 4096U);
+
+ crypto_nhpoly1305_update_helper(desc, src, chunk, nh_sse2);
+ srclen -= chunk;
+
+ if (!srclen)
+ break;
+
+ src += chunk;
+ kernel_fpu_yield();
+ }
+ kernel_fpu_end();

- kernel_fpu_begin();
- crypto_nhpoly1305_update_helper(desc, src, n, nh_sse2);
- kernel_fpu_end();
- src += n;
- srclen -= n;
- } while (srclen);
return 0;
}

diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c
index 1dfb8af48a3c..13e2e134b458 100644
--- a/arch/x86/crypto/poly1305_glue.c
+++ b/arch/x86/crypto/poly1305_glue.c
@@ -15,20 +15,13 @@
#include <asm/intel-family.h>
#include <asm/simd.h>

-asmlinkage void poly1305_init_x86_64(void *ctx,
- const u8 key[POLY1305_BLOCK_SIZE]);
-asmlinkage void poly1305_blocks_x86_64(void *ctx, const u8 *inp,
- const size_t len, const u32 padbit);
-asmlinkage void poly1305_emit_x86_64(void *ctx, u8 mac[POLY1305_DIGEST_SIZE],
- const u32 nonce[4]);
-asmlinkage void poly1305_emit_avx(void *ctx, u8 mac[POLY1305_DIGEST_SIZE],
- const u32 nonce[4]);
-asmlinkage void poly1305_blocks_avx(void *ctx, const u8 *inp, const size_t len,
- const u32 padbit);
-asmlinkage void poly1305_blocks_avx2(void *ctx, const u8 *inp, const size_t len,
- const u32 padbit);
-asmlinkage void poly1305_blocks_avx512(void *ctx, const u8 *inp,
- const size_t len, const u32 padbit);
+asmlinkage void poly1305_init_x86_64(void *ctx, const u8 key[POLY1305_BLOCK_SIZE]);
+asmlinkage void poly1305_blocks_x86_64(void *ctx, const u8 *inp, unsigned int len, u32 padbit);
+asmlinkage void poly1305_emit_x86_64(void *ctx, u8 mac[POLY1305_DIGEST_SIZE], const u32 nonce[4]);
+asmlinkage void poly1305_emit_avx(void *ctx, u8 mac[POLY1305_DIGEST_SIZE], const u32 nonce[4]);
+asmlinkage void poly1305_blocks_avx(void *ctx, const u8 *inp, unsigned int len, const u32 padbit);
+asmlinkage void poly1305_blocks_avx2(void *ctx, const u8 *inp, unsigned int len, u32 padbit);
+asmlinkage void poly1305_blocks_avx512(void *ctx, const u8 *inp, unsigned int len, u32 padbit);

static __ro_after_init DEFINE_STATIC_KEY_FALSE(poly1305_use_avx);
static __ro_after_init DEFINE_STATIC_KEY_FALSE(poly1305_use_avx2);
@@ -86,7 +79,7 @@ static void poly1305_simd_init(void *ctx, const u8 key[POLY1305_BLOCK_SIZE])
poly1305_init_x86_64(ctx, key);
}

-static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len,
+static void poly1305_simd_blocks(void *ctx, const u8 *inp, unsigned int len,
const u32 padbit)
{
struct poly1305_arch_internal *state = ctx;
@@ -103,21 +96,25 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len,
return;
}

- do {
- const size_t bytes = min_t(size_t, len, SZ_4K);
+ kernel_fpu_begin();
+ for (;;) {
+ const unsigned int chunk = min(len, 4096U);

- kernel_fpu_begin();
if (IS_ENABLED(CONFIG_AS_AVX512) && static_branch_likely(&poly1305_use_avx512))
- poly1305_blocks_avx512(ctx, inp, bytes, padbit);
+ poly1305_blocks_avx512(ctx, inp, chunk, padbit);
else if (static_branch_likely(&poly1305_use_avx2))
- poly1305_blocks_avx2(ctx, inp, bytes, padbit);
+ poly1305_blocks_avx2(ctx, inp, chunk, padbit);
else
- poly1305_blocks_avx(ctx, inp, bytes, padbit);
- kernel_fpu_end();
+ poly1305_blocks_avx(ctx, inp, chunk, padbit);
+ len -= chunk;

- len -= bytes;
- inp += bytes;
- } while (len);
+ if (!len)
+ break;
+
+ inp += chunk;
+ kernel_fpu_yield();
+ }
+ kernel_fpu_end();
}

static void poly1305_simd_emit(void *ctx, u8 mac[POLY1305_DIGEST_SIZE],
diff --git a/arch/x86/crypto/polyval-clmulni_glue.c b/arch/x86/crypto/polyval-clmulni_glue.c
index 8fa58b0f3cb3..a3d72e87d58d 100644
--- a/arch/x86/crypto/polyval-clmulni_glue.c
+++ b/arch/x86/crypto/polyval-clmulni_glue.c
@@ -45,8 +45,8 @@ struct polyval_desc_ctx {
u32 bytes;
};

-asmlinkage void clmul_polyval_update(const struct polyval_tfm_ctx *keys,
- const u8 *in, size_t nblocks, u8 *accumulator);
+asmlinkage void clmul_polyval_update(const struct polyval_tfm_ctx *keys, const u8 *in,
+ unsigned int nblocks, u8 *accumulator);
asmlinkage void clmul_polyval_mul(u8 *op1, const u8 *op2);

static inline struct polyval_tfm_ctx *polyval_tfm_ctx(struct crypto_shash *tfm)
@@ -55,27 +55,40 @@ static inline struct polyval_tfm_ctx *polyval_tfm_ctx(struct crypto_shash *tfm)
}

static void internal_polyval_update(const struct polyval_tfm_ctx *keys,
- const u8 *in, size_t nblocks, u8 *accumulator)
+ const u8 *in, unsigned int nblocks, u8 *accumulator)
{
- if (likely(crypto_simd_usable())) {
- kernel_fpu_begin();
- clmul_polyval_update(keys, in, nblocks, accumulator);
- kernel_fpu_end();
- } else {
+ if (!crypto_simd_usable()) {
polyval_update_non4k(keys->key_powers[NUM_KEY_POWERS-1], in,
nblocks, accumulator);
+ return;
}
+
+ kernel_fpu_begin();
+ for (;;) {
+ const unsigned int chunks = min(nblocks, 4096U / POLYVAL_BLOCK_SIZE);
+
+ clmul_polyval_update(keys, in, chunks, accumulator);
+ nblocks -= chunks;
+
+ if (!nblocks)
+ break;
+
+ in += chunks * POLYVAL_BLOCK_SIZE;
+ kernel_fpu_yield();
+ }
+ kernel_fpu_end();
}

static void internal_polyval_mul(u8 *op1, const u8 *op2)
{
- if (likely(crypto_simd_usable())) {
- kernel_fpu_begin();
- clmul_polyval_mul(op1, op2);
- kernel_fpu_end();
- } else {
+ if (!crypto_simd_usable()) {
polyval_mul_non4k(op1, op2);
+ return;
}
+
+ kernel_fpu_begin();
+ clmul_polyval_mul(op1, op2);
+ kernel_fpu_end();
}

static int polyval_x86_setkey(struct crypto_shash *tfm,
@@ -113,7 +126,6 @@ static int polyval_x86_update(struct shash_desc *desc,
struct polyval_desc_ctx *dctx = shash_desc_ctx(desc);
const struct polyval_tfm_ctx *tctx = polyval_tfm_ctx(desc->tfm);
u8 *pos;
- unsigned int nblocks;
unsigned int n;

if (dctx->bytes) {
@@ -131,9 +143,9 @@ static int polyval_x86_update(struct shash_desc *desc,
tctx->key_powers[NUM_KEY_POWERS-1]);
}

- while (srclen >= POLYVAL_BLOCK_SIZE) {
- /* Allow rescheduling every 4K bytes. */
- nblocks = min(srclen, 4096U) / POLYVAL_BLOCK_SIZE;
+ if (srclen >= POLYVAL_BLOCK_SIZE) {
+ const unsigned int nblocks = srclen / POLYVAL_BLOCK_SIZE;
+
internal_polyval_update(tctx, src, nblocks, dctx->buffer);
srclen -= nblocks * POLYVAL_BLOCK_SIZE;
src += nblocks * POLYVAL_BLOCK_SIZE;
--
2.38.1

Subject: [PATCH 06/13] crypto: x86/ghash - use u8 rather than char

Use consistent, unambiguous types (u8 rather than char)
for the source and destination buffer pointer arguments of
the asm functions.

Declare them with asmlinkage as well.

Signed-off-by: Robert Elliott <[email protected]>
---
arch/x86/crypto/ghash-clmulni-intel_asm.S | 6 +++---
arch/x86/crypto/ghash-clmulni-intel_glue.c | 6 +++---
2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/crypto/ghash-clmulni-intel_asm.S b/arch/x86/crypto/ghash-clmulni-intel_asm.S
index 09cf9271b83a..ad860836f75b 100644
--- a/arch/x86/crypto/ghash-clmulni-intel_asm.S
+++ b/arch/x86/crypto/ghash-clmulni-intel_asm.S
@@ -96,7 +96,7 @@ SYM_FUNC_END(__clmul_gf128mul_ble)
* This supports 64-bit CPUs.
*
* Return: none (but @dst is updated)
- * Prototype: asmlinkage void clmul_ghash_mul(char *dst, const u128 *shash)
+ * Prototype: asmlinkage void clmul_ghash_mul(u8 *dst, const u128 *shash)
*/
SYM_FUNC_START(clmul_ghash_mul)
FRAME_BEGIN
@@ -122,8 +122,8 @@ SYM_FUNC_END(clmul_ghash_mul)
* This supports 64-bit CPUs.
*
* Return: none (but @dst is updated)
- * Prototype: asmlinkage clmul_ghash_update(char *dst, const char *src,
- * unsigned int srclen, const u128 *shash);
+ * Prototype: asmlinkage void clmul_ghash_update(u8 *dst, const u8 *src,
+ * unsigned int srclen, const u128 *shash);
*/
SYM_FUNC_START(clmul_ghash_update)
FRAME_BEGIN
diff --git a/arch/x86/crypto/ghash-clmulni-intel_glue.c b/arch/x86/crypto/ghash-clmulni-intel_glue.c
index 1f1a95f3dd0c..beac4b2eddf6 100644
--- a/arch/x86/crypto/ghash-clmulni-intel_glue.c
+++ b/arch/x86/crypto/ghash-clmulni-intel_glue.c
@@ -23,10 +23,10 @@
#define GHASH_BLOCK_SIZE 16
#define GHASH_DIGEST_SIZE 16

-void clmul_ghash_mul(char *dst, const u128 *shash);
+asmlinkage void clmul_ghash_mul(u8 *dst, const u128 *shash);

-void clmul_ghash_update(char *dst, const char *src, unsigned int srclen,
- const u128 *shash);
+asmlinkage void clmul_ghash_update(u8 *dst, const u8 *src, unsigned int srclen,
+ const u128 *shash);

struct ghash_async_ctx {
struct cryptd_ahash *cryptd_tfm;
--
2.38.1

Subject: [PATCH 04/13] crypto: x86/crc - yield FPU context during long loops

The x86 assembly language implementations using SIMD process data
between kernel_fpu_begin() and kernel_fpu_end() calls. That
disables scheduler preemption, which prevents the CPU core from being
used by other threads.

The update() and finup() functions might be called to process
large quantities of data, which can result in RCU stalls and
soft lockups.

Periodically check if the kernel scheduler wants to run something
else on the CPU. If so, yield the kernel FPU context and let the
scheduler intervene.

For crc32c, add a pre-alignment step so the assembly language
function is not repeatedly called with an unaligned starting
address.

Fixes: 78c37d191dd6 ("crypto: crc32 - add crc32 pclmulqdq implementation and wrappers for table implementation")
Fixes: 6a8ce1ef3940 ("crypto: crc32c - Optimize CRC32C calculation with PCLMULQDQ instruction")
Fixes: 0b95a7f85718 ("crypto: crct10dif - Glue code to cast accelerated CRCT10DIF assembly as a crypto transform")
Suggested-by: Herbert Xu <[email protected]>
Signed-off-by: Robert Elliott <[email protected]>
---
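Note for reviewers: a worked example of the 16-byte pre-alignment step,
using hypothetical values:

    data % 16 = 9, len = 1000
    prealign  = 16 - 9 = 7 bytes handled by the non-SIMD fallback
    data is then 16-byte aligned with len = 993, and the PCLMUL routine
    runs over it in passes of at most 4 KiB
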
arch/x86/crypto/crc32-pclmul_glue.c | 49 +++++-----
arch/x86/crypto/crc32c-intel_glue.c | 118 +++++++++++++++++-------
arch/x86/crypto/crct10dif-pclmul_glue.c | 65 ++++++++++---
3 files changed, 165 insertions(+), 67 deletions(-)

diff --git a/arch/x86/crypto/crc32-pclmul_glue.c b/arch/x86/crypto/crc32-pclmul_glue.c
index 98cf3b4e4c9f..3692b50faf1c 100644
--- a/arch/x86/crypto/crc32-pclmul_glue.c
+++ b/arch/x86/crypto/crc32-pclmul_glue.c
@@ -41,41 +41,50 @@
#define CHKSUM_BLOCK_SIZE 1
#define CHKSUM_DIGEST_SIZE 4

-#define PCLMUL_MIN_LEN 64L /* minimum size of buffer
- * for crc32_pclmul_le_16 */
-#define SCALE_F 16L /* size of xmm register */
+#define PCLMUL_MIN_LEN 64U /* minimum size of buffer for crc32_pclmul_le_16 */
+#define SCALE_F 16U /* size of xmm register */
#define SCALE_F_MASK (SCALE_F - 1)

-u32 crc32_pclmul_le_16(unsigned char const *buffer, size_t len, u32 crc32);
+asmlinkage u32 crc32_pclmul_le_16(const u8 *buffer, unsigned int len, u32 crc32);

-static u32 __attribute__((pure))
- crc32_pclmul_le(u32 crc, unsigned char const *p, size_t len)
+static u32 crc32_pclmul_le(u32 crc, const u8 *p, unsigned int len)
{
unsigned int iquotient;
unsigned int iremainder;
- unsigned int prealign;

if (len < PCLMUL_MIN_LEN + SCALE_F_MASK || !crypto_simd_usable())
return crc32_le(crc, p, len);

- if ((long)p & SCALE_F_MASK) {
+ if ((unsigned long)p & SCALE_F_MASK) {
/* align p to 16 byte */
- prealign = SCALE_F - ((long)p & SCALE_F_MASK);
+ unsigned int prealign = SCALE_F - ((unsigned long)p & SCALE_F_MASK);

crc = crc32_le(crc, p, prealign);
len -= prealign;
- p = (unsigned char *)(((unsigned long)p + SCALE_F_MASK) &
- ~SCALE_F_MASK);
+ p += prealign;
}
- iquotient = len & (~SCALE_F_MASK);
+ iquotient = len & ~SCALE_F_MASK;
iremainder = len & SCALE_F_MASK;

- kernel_fpu_begin();
- crc = crc32_pclmul_le_16(p, iquotient, crc);
- kernel_fpu_end();
+ if (iquotient) {
+ kernel_fpu_begin();
+ for (;;) {
+ const unsigned int chunk = min(iquotient, 4096U);

- if (iremainder)
- crc = crc32_le(crc, p + iquotient, iremainder);
+ crc = crc32_pclmul_le_16(p, chunk, crc);
+ iquotient -= chunk;
+ p += chunk;
+
+ if (iquotient < PCLMUL_MIN_LEN)
+ break;
+
+ kernel_fpu_yield();
+ }
+ kernel_fpu_end();
+ }
+
+ if (iquotient || iremainder)
+ crc = crc32_le(crc, p, iquotient + iremainder);

return crc;
}
@@ -120,8 +129,7 @@ static int crc32_pclmul_update(struct shash_desc *desc, const u8 *data,
}

/* No final XOR 0xFFFFFFFF, like crc32_le */
-static int __crc32_pclmul_finup(u32 *crcp, const u8 *data, unsigned int len,
- u8 *out)
+static int __crc32_pclmul_finup(u32 *crcp, const u8 *data, unsigned int len, u8 *out)
{
*(__le32 *)out = cpu_to_le32(crc32_pclmul_le(*crcp, data, len));
return 0;
@@ -144,8 +152,7 @@ static int crc32_pclmul_final(struct shash_desc *desc, u8 *out)
static int crc32_pclmul_digest(struct shash_desc *desc, const u8 *data,
unsigned int len, u8 *out)
{
- return __crc32_pclmul_finup(crypto_shash_ctx(desc->tfm), data, len,
- out);
+ return __crc32_pclmul_finup(crypto_shash_ctx(desc->tfm), data, len, out);
}

static struct shash_alg alg = {
diff --git a/arch/x86/crypto/crc32c-intel_glue.c b/arch/x86/crypto/crc32c-intel_glue.c
index feccb5254c7e..932574661ef5 100644
--- a/arch/x86/crypto/crc32c-intel_glue.c
+++ b/arch/x86/crypto/crc32c-intel_glue.c
@@ -35,19 +35,24 @@

#ifdef CONFIG_X86_64
/*
- * use carryless multiply version of crc32c when buffer
- * size is >= 512 to account
- * for fpu state save/restore overhead.
+ * only use crc_pcl() (carryless multiply version of crc32c) when buffer
+ * size is >= 512 to account for fpu state save/restore overhead.
*/
#define CRC32C_PCL_BREAKEVEN 512

-asmlinkage unsigned int crc_pcl(const u8 *buffer, int len,
- unsigned int crc_init);
+/*
+ * only pass aligned buffers to crc_pcl() to avoid special handling
+ * in each pass
+ */
+#define ALIGN_CRCPCL 16U
+#define ALIGN_CRCPCL_MASK (ALIGN_CRCPCL - 1)
+
+asmlinkage u32 crc_pcl(const u8 *buffer, u64 len, u32 crc_init);
#endif /* CONFIG_X86_64 */

-static u32 crc32c_intel_le_hw_byte(u32 crc, unsigned char const *data, size_t length)
+static u32 crc32c_intel_le_hw_byte(u32 crc, const u8 *data, unsigned int len)
{
- while (length--) {
+ while (len--) {
asm("crc32b %1, %0"
: "+r" (crc) : "rm" (*data));
data++;
@@ -56,7 +61,7 @@ static u32 crc32c_intel_le_hw_byte(u32 crc, unsigned char const *data, size_t le
return crc;
}

-static u32 __pure crc32c_intel_le_hw(u32 crc, unsigned char const *p, size_t len)
+static u32 __pure crc32c_intel_le_hw(u32 crc, const u8 *p, unsigned int len)
{
unsigned int iquotient = len / SCALE_F;
unsigned int iremainder = len % SCALE_F;
@@ -69,8 +74,7 @@ static u32 __pure crc32c_intel_le_hw(u32 crc, unsigned char const *p, size_t len
}

if (iremainder)
- crc = crc32c_intel_le_hw_byte(crc, (unsigned char *)ptmp,
- iremainder);
+ crc = crc32c_intel_le_hw_byte(crc, (u8 *)ptmp, iremainder);

return crc;
}
@@ -110,8 +114,8 @@ static int crc32c_intel_update(struct shash_desc *desc, const u8 *data,
return 0;
}

-static int __crc32c_intel_finup(u32 *crcp, const u8 *data, unsigned int len,
- u8 *out)
+static int __crc32c_intel_finup(const u32 *crcp, const u8 *data,
+ unsigned int len, u8 *out)
{
*(__le32 *)out = ~cpu_to_le32(crc32c_intel_le_hw(*crcp, data, len));
return 0;
@@ -134,8 +138,7 @@ static int crc32c_intel_final(struct shash_desc *desc, u8 *out)
static int crc32c_intel_digest(struct shash_desc *desc, const u8 *data,
unsigned int len, u8 *out)
{
- return __crc32c_intel_finup(crypto_shash_ctx(desc->tfm), data, len,
- out);
+ return __crc32c_intel_finup(crypto_shash_ctx(desc->tfm), data, len, out);
}

static int crc32c_intel_cra_init(struct crypto_tfm *tfm)
@@ -149,47 +152,96 @@ static int crc32c_intel_cra_init(struct crypto_tfm *tfm)

#ifdef CONFIG_X86_64
static int crc32c_pcl_intel_update(struct shash_desc *desc, const u8 *data,
- unsigned int len)
+ unsigned int len)
{
u32 *crcp = shash_desc_ctx(desc);
+ u32 crc;
+
+ BUILD_BUG_ON(CRC32C_PCL_BREAKEVEN > 4096U);

/*
* use faster PCL version if datasize is large enough to
* overcome kernel fpu state save/restore overhead
*/
- if (len >= CRC32C_PCL_BREAKEVEN && crypto_simd_usable()) {
- kernel_fpu_begin();
- *crcp = crc_pcl(data, len, *crcp);
- kernel_fpu_end();
- } else
+ if (len < CRC32C_PCL_BREAKEVEN + ALIGN_CRCPCL_MASK || !crypto_simd_usable()) {
*crcp = crc32c_intel_le_hw(*crcp, data, len);
+ return 0;
+ }
+
+ crc = *crcp;
+ /*
+ * Although crc_pcl() supports unaligned buffers, it is more efficient
+ * handling a 16-byte aligned buffer.
+ */
+ if ((unsigned long)data & ALIGN_CRCPCL_MASK) {
+ unsigned int prealign = ALIGN_CRCPCL - ((unsigned long)data & ALIGN_CRCPCL_MASK);
+
+ crc = crc32c_intel_le_hw(crc, data, prealign);
+ len -= prealign;
+ data += prealign;
+ }
+
+ kernel_fpu_begin();
+ for (;;) {
+ const unsigned int chunk = min(len, 4096U);
+
+ crc = crc_pcl(data, chunk, crc);
+ len -= chunk;
+
+ if (!len)
+ break;
+
+ data += chunk;
+ kernel_fpu_yield();
+ }
+ kernel_fpu_end();
+
+ *crcp = crc;
return 0;
}

-static int __crc32c_pcl_intel_finup(u32 *crcp, const u8 *data, unsigned int len,
- u8 *out)
+static int __crc32c_pcl_intel_finup(const u32 *crcp, const u8 *data,
+ unsigned int len, u8 *out)
{
- if (len >= CRC32C_PCL_BREAKEVEN && crypto_simd_usable()) {
- kernel_fpu_begin();
- *(__le32 *)out = ~cpu_to_le32(crc_pcl(data, len, *crcp));
- kernel_fpu_end();
- } else
- *(__le32 *)out =
- ~cpu_to_le32(crc32c_intel_le_hw(*crcp, data, len));
+ u32 crc;
+
+ BUILD_BUG_ON(CRC32C_PCL_BREAKEVEN > 4096U);
+
+ if (len < CRC32C_PCL_BREAKEVEN || !crypto_simd_usable()) {
+ *(__le32 *)out = ~cpu_to_le32(crc32c_intel_le_hw(*crcp, data, len));
+ return 0;
+ }
+
+ crc = *crcp;
+ kernel_fpu_begin();
+ for (;;) {
+ const unsigned int chunk = min(len, 4096U);
+
+ crc = crc_pcl(data, chunk, crc);
+ len -= chunk;
+
+ if (!len)
+ break;
+
+ data += chunk;
+ kernel_fpu_yield();
+ }
+ kernel_fpu_end();
+
+ *(__le32 *)out = ~cpu_to_le32(crc);
return 0;
}

static int crc32c_pcl_intel_finup(struct shash_desc *desc, const u8 *data,
- unsigned int len, u8 *out)
+ unsigned int len, u8 *out)
{
return __crc32c_pcl_intel_finup(shash_desc_ctx(desc), data, len, out);
}

static int crc32c_pcl_intel_digest(struct shash_desc *desc, const u8 *data,
- unsigned int len, u8 *out)
+ unsigned int len, u8 *out)
{
- return __crc32c_pcl_intel_finup(crypto_shash_ctx(desc->tfm), data, len,
- out);
+ return __crc32c_pcl_intel_finup(crypto_shash_ctx(desc->tfm), data, len, out);
}
#endif /* CONFIG_X86_64 */

diff --git a/arch/x86/crypto/crct10dif-pclmul_glue.c b/arch/x86/crypto/crct10dif-pclmul_glue.c
index 71291d5af9f4..4d39eac94289 100644
--- a/arch/x86/crypto/crct10dif-pclmul_glue.c
+++ b/arch/x86/crypto/crct10dif-pclmul_glue.c
@@ -34,6 +34,8 @@
#include <asm/cpu_device_id.h>
#include <asm/simd.h>

+#define PCLMUL_MIN_LEN 16U /* minimum size of buffer for crc_t10dif_pcl */
+
asmlinkage u16 crc_t10dif_pcl(u16 init_crc, const u8 *buf, size_t len);

struct chksum_desc_ctx {
@@ -49,17 +51,36 @@ static int chksum_init(struct shash_desc *desc)
return 0;
}

-static int chksum_update(struct shash_desc *desc, const u8 *data,
- unsigned int length)
+static int chksum_update(struct shash_desc *desc, const u8 *data, unsigned int len)
{
struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);
+ u16 crc;
+
+ if (len < PCLMUL_MIN_LEN || !crypto_simd_usable()) {
+ ctx->crc = crc_t10dif_generic(ctx->crc, data, len);
+ return 0;
+ }
+
+ crc = ctx->crc;
+ kernel_fpu_begin();
+ for (;;) {
+ const unsigned int chunk = min(len, 4096U);
+
+ crc = crc_t10dif_pcl(crc, data, chunk);
+ len -= chunk;
+ data += chunk;
+
+ if (len < PCLMUL_MIN_LEN)
+ break;
+
+ kernel_fpu_yield();
+ }
+ kernel_fpu_end();
+
+ if (len)
+ crc = crc_t10dif_generic(crc, data, len);

- if (length >= 16 && crypto_simd_usable()) {
- kernel_fpu_begin();
- ctx->crc = crc_t10dif_pcl(ctx->crc, data, length);
- kernel_fpu_end();
- } else
- ctx->crc = crc_t10dif_generic(ctx->crc, data, length);
+ ctx->crc = crc;
return 0;
}

@@ -73,12 +94,30 @@ static int chksum_final(struct shash_desc *desc, u8 *out)

static int __chksum_finup(__u16 crc, const u8 *data, unsigned int len, u8 *out)
{
- if (len >= 16 && crypto_simd_usable()) {
- kernel_fpu_begin();
- *(__u16 *)out = crc_t10dif_pcl(crc, data, len);
- kernel_fpu_end();
- } else
+ if (len < PCLMUL_MIN_LEN || !crypto_simd_usable()) {
*(__u16 *)out = crc_t10dif_generic(crc, data, len);
+ return 0;
+ }
+
+ kernel_fpu_begin();
+ for (;;) {
+ const unsigned int chunk = min(len, 4096U);
+
+ crc = crc_t10dif_pcl(crc, data, chunk);
+ len -= chunk;
+ data += chunk;
+
+ if (len < PCLMUL_MIN_LEN)
+ break;
+
+ kernel_fpu_yield();
+ }
+ kernel_fpu_end();
+
+ if (len)
+ crc = crc_t10dif_generic(crc, data, len);
+
+ *(__u16 *)out = crc;
return 0;
}

--
2.38.1

Subject: [PATCH 02/13] x86: add yield FPU context utility function

Add a function that may be called to avoid hogging the CPU
between kernel_fpu_begin() and kernel_fpu_end() calls.

Signed-off-by: Robert Elliott <[email protected]>
---
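Note for reviewers: the intended call pattern looks like the sketch below
(process_chunk() is a placeholder for an arbitrary SIMD helper, not a real
kernel function):

	kernel_fpu_begin();
	while (len) {
		unsigned int chunk = min(len, 4096U);

		process_chunk(buf, chunk);	/* placeholder SIMD work */
		buf += chunk;
		len -= chunk;

		if (len)
			kernel_fpu_yield();	/* drops and re-acquires the
						 * FPU context only if
						 * need_resched() is true */
	}
	kernel_fpu_end();
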
arch/x86/include/asm/simd.h | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/arch/x86/include/asm/simd.h b/arch/x86/include/asm/simd.h
index bd9c672a2792..2c887dec95a2 100644
--- a/arch/x86/include/asm/simd.h
+++ b/arch/x86/include/asm/simd.h
@@ -3,6 +3,7 @@
#define _ASM_X86_SIMD_H

#include <asm/fpu/api.h>
+#include <linux/sched.h>

/*
* may_use_simd - whether it is allowable at this time to issue SIMD
@@ -13,4 +14,22 @@ static __must_check inline bool may_use_simd(void)
return irq_fpu_usable();
}

+/**
+ * kernel_fpu_yield - yield the FPU context if the scheduler needs the CPU
+ *
+ * Call this periodically during long loops between kernel_fpu_begin()
+ * and kernel_fpu_end() calls to avoid hogging the CPU if the
+ * scheduler wants to use the CPU for another thread.
+ *
+ * Return: none
+ */
+static inline void kernel_fpu_yield(void)
+{
+ if (need_resched()) {
+ kernel_fpu_end();
+ cond_resched();
+ kernel_fpu_begin();
+ }
+}
+
#endif /* _ASM_X86_SIMD_H */
--
2.38.1

Subject: [PATCH 01/13] x86: protect simd.h header file

Add the usual #ifndef/#define guard around the contents of simd.h
so it doesn't confuse the C preprocessor if the file is included
more than once.

Fixes: 801201aa2564 ("crypto: move x86 to the generic version of ablk_helper")
Signed-off-by: Robert Elliott <[email protected]>
---
arch/x86/include/asm/simd.h | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/x86/include/asm/simd.h b/arch/x86/include/asm/simd.h
index a341c878e977..bd9c672a2792 100644
--- a/arch/x86/include/asm/simd.h
+++ b/arch/x86/include/asm/simd.h
@@ -1,4 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_SIMD_H
+#define _ASM_X86_SIMD_H

#include <asm/fpu/api.h>

@@ -10,3 +12,5 @@ static __must_check inline bool may_use_simd(void)
{
return irq_fpu_usable();
}
+
+#endif /* _ASM_X86_SIMD_H */
--
2.38.1

Subject: [PATCH 03/13] crypto: x86/sha - yield FPU context during long loops

The x86 assembly language implementations using SIMD process data
between kernel_fpu_begin() and kernel_fpu_end() calls. That
disables scheduler preemption, which prevents the CPU core from being
used by other threads.

The update() and finup() functions might be called to process
large quantities of data, which can result in RCU stalls and
soft lockups.

Periodically check if the kernel scheduler wants to run something
else on the CPU. If so, yield the kernel FPU context and let the
scheduler intervene.

Fixes: 66be89515888 ("crypto: sha1 - SSSE3 based SHA1 implementation for x86-64")
Fixes: 8275d1aa6422 ("crypto: sha256 - Create module providing optimized SHA256 routines using SSSE3, AVX or AVX2 instructions.")
Fixes: 87de4579f92d ("crypto: sha512 - Create module providing optimized SHA512 routines using SSSE3, AVX or AVX2 instructions.")
Fixes: aa031b8f702e ("crypto: x86/sha512 - load based on CPU features")
Suggested-by: Herbert Xu <[email protected]>
Signed-off-by: Robert Elliott <[email protected]>
---
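Note for reviewers: the per-pass block count is derived from the same 4 KiB
budget used elsewhere in the series, e.g. (standard block sizes assumed):

    4096 / SHA1_BLOCK_SIZE   = 4096 / 64  = 64 blocks per pass
    4096 / SHA256_BLOCK_SIZE = 4096 / 64  = 64 blocks per pass
    4096 / SHA512_BLOCK_SIZE = 4096 / 128 = 32 blocks per pass
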
arch/x86/crypto/sha1_avx2_x86_64_asm.S | 6 +-
arch/x86/crypto/sha1_ni_asm.S | 8 +-
arch/x86/crypto/sha1_ssse3_glue.c | 120 ++++++++++++++++++++-----
arch/x86/crypto/sha256_ni_asm.S | 8 +-
arch/x86/crypto/sha256_ssse3_glue.c | 115 +++++++++++++++++++-----
arch/x86/crypto/sha512_ssse3_glue.c | 89 ++++++++++++++----
6 files changed, 277 insertions(+), 69 deletions(-)

diff --git a/arch/x86/crypto/sha1_avx2_x86_64_asm.S b/arch/x86/crypto/sha1_avx2_x86_64_asm.S
index c3ee9334cb0f..df03fbb2c42c 100644
--- a/arch/x86/crypto/sha1_avx2_x86_64_asm.S
+++ b/arch/x86/crypto/sha1_avx2_x86_64_asm.S
@@ -58,9 +58,9 @@
/*
* SHA-1 implementation with Intel(R) AVX2 instruction set extensions.
*
- *This implementation is based on the previous SSSE3 release:
- *Visit http://software.intel.com/en-us/articles/
- *and refer to improving-the-performance-of-the-secure-hash-algorithm-1/
+ * This implementation is based on the previous SSSE3 release:
+ * Visit http://software.intel.com/en-us/articles/
+ * and refer to improving-the-performance-of-the-secure-hash-algorithm-1/
*
*/

diff --git a/arch/x86/crypto/sha1_ni_asm.S b/arch/x86/crypto/sha1_ni_asm.S
index a69595b033c8..d513b85e242c 100644
--- a/arch/x86/crypto/sha1_ni_asm.S
+++ b/arch/x86/crypto/sha1_ni_asm.S
@@ -75,7 +75,7 @@
.text

/**
- * sha1_ni_transform - Calculate SHA1 hash using the x86 SHA-NI feature set
+ * sha1_transform_ni - Calculate SHA1 hash using the x86 SHA-NI feature set
* @digest: address of current 20-byte hash value (%rdi, DIGEST_PTR macro)
* @data: address of data (%rsi, DATA_PTR macro);
* data size must be a multiple of 64 bytes
@@ -94,9 +94,9 @@
* The non-indented lines are instructions related to the message schedule.
*
* Return: none
- * Prototype: asmlinkage void sha1_ni_transform(u32 *digest, const u8 *data, int blocks)
+ * Prototype: asmlinkage void sha1_transform_ni(u32 *digest, const u8 *data, int blocks)
*/
-SYM_TYPED_FUNC_START(sha1_ni_transform)
+SYM_TYPED_FUNC_START(sha1_transform_ni)
push %rbp
mov %rsp, %rbp
sub $FRAME_SIZE, %rsp
@@ -294,7 +294,7 @@ SYM_TYPED_FUNC_START(sha1_ni_transform)
pop %rbp

RET
-SYM_FUNC_END(sha1_ni_transform)
+SYM_FUNC_END(sha1_transform_ni)

.section .rodata.cst16.PSHUFFLE_BYTE_FLIP_MASK, "aM", @progbits, 16
.align 16
diff --git a/arch/x86/crypto/sha1_ssse3_glue.c b/arch/x86/crypto/sha1_ssse3_glue.c
index 44340a1139e0..b269b455fbbe 100644
--- a/arch/x86/crypto/sha1_ssse3_glue.c
+++ b/arch/x86/crypto/sha1_ssse3_glue.c
@@ -41,9 +41,7 @@ static int sha1_update(struct shash_desc *desc, const u8 *data,
*/
BUILD_BUG_ON(offsetof(struct sha1_state, state) != 0);

- kernel_fpu_begin();
sha1_base_do_update(desc, data, len, sha1_xform);
- kernel_fpu_end();

return 0;
}
@@ -54,28 +52,46 @@ static int sha1_finup(struct shash_desc *desc, const u8 *data,
if (!crypto_simd_usable())
return crypto_sha1_finup(desc, data, len, out);

- kernel_fpu_begin();
if (len)
sha1_base_do_update(desc, data, len, sha1_xform);
sha1_base_do_finalize(desc, sha1_xform);
- kernel_fpu_end();

return sha1_base_finish(desc, out);
}

-asmlinkage void sha1_transform_ssse3(struct sha1_state *state,
- const u8 *data, int blocks);
+asmlinkage void sha1_transform_ssse3(u32 *digest, const u8 *data, int blocks);
+
+void __sha1_transform_ssse3(struct sha1_state *state, const u8 *data, int blocks)
+{
+ if (blocks <= 0)
+ return;
+
+ kernel_fpu_begin();
+ for (;;) {
+ const int chunks = min(blocks, 4096 / SHA1_BLOCK_SIZE);
+
+ sha1_transform_ssse3(state->state, data, chunks);
+ data += chunks * SHA1_BLOCK_SIZE;
+ blocks -= chunks;
+
+ if (blocks <= 0)
+ break;
+
+ kernel_fpu_yield();
+ }
+ kernel_fpu_end();
+}

static int sha1_ssse3_update(struct shash_desc *desc, const u8 *data,
unsigned int len)
{
- return sha1_update(desc, data, len, sha1_transform_ssse3);
+ return sha1_update(desc, data, len, __sha1_transform_ssse3);
}

static int sha1_ssse3_finup(struct shash_desc *desc, const u8 *data,
unsigned int len, u8 *out)
{
- return sha1_finup(desc, data, len, out, sha1_transform_ssse3);
+ return sha1_finup(desc, data, len, out, __sha1_transform_ssse3);
}

/* Add padding and return the message digest. */
@@ -113,19 +129,39 @@ static void unregister_sha1_ssse3(void)
crypto_unregister_shash(&sha1_ssse3_alg);
}

-asmlinkage void sha1_transform_avx(struct sha1_state *state,
- const u8 *data, int blocks);
+asmlinkage void sha1_transform_avx(u32 *digest, const u8 *data, int blocks);
+
+void __sha1_transform_avx(struct sha1_state *state, const u8 *data, int blocks)
+{
+ if (blocks <= 0)
+ return;
+
+ kernel_fpu_begin();
+ for (;;) {
+ const int chunks = min(blocks, 4096 / SHA1_BLOCK_SIZE);
+
+ sha1_transform_avx(state->state, data, chunks);
+ data += chunks * SHA1_BLOCK_SIZE;
+ blocks -= chunks;
+
+ if (blocks <= 0)
+ break;
+
+ kernel_fpu_yield();
+ }
+ kernel_fpu_end();
+}

static int sha1_avx_update(struct shash_desc *desc, const u8 *data,
unsigned int len)
{
- return sha1_update(desc, data, len, sha1_transform_avx);
+ return sha1_update(desc, data, len, __sha1_transform_avx);
}

static int sha1_avx_finup(struct shash_desc *desc, const u8 *data,
unsigned int len, u8 *out)
{
- return sha1_finup(desc, data, len, out, sha1_transform_avx);
+ return sha1_finup(desc, data, len, out, __sha1_transform_avx);
}

static int sha1_avx_final(struct shash_desc *desc, u8 *out)
@@ -175,8 +211,28 @@ static void unregister_sha1_avx(void)

#define SHA1_AVX2_BLOCK_OPTSIZE 4 /* optimal 4*64 bytes of SHA1 blocks */

-asmlinkage void sha1_transform_avx2(struct sha1_state *state,
- const u8 *data, int blocks);
+asmlinkage void sha1_transform_avx2(u32 *digest, const u8 *data, int blocks);
+
+void __sha1_transform_avx2(struct sha1_state *state, const u8 *data, int blocks)
+{
+ if (blocks <= 0)
+ return;
+
+ kernel_fpu_begin();
+ for (;;) {
+ const int chunks = min(blocks, 4096 / SHA1_BLOCK_SIZE);
+
+ sha1_transform_avx2(state->state, data, chunks);
+ data += chunks * SHA1_BLOCK_SIZE;
+ blocks -= chunks;
+
+ if (blocks <= 0)
+ break;
+
+ kernel_fpu_yield();
+ }
+ kernel_fpu_end();
+}

static bool avx2_usable(void)
{
@@ -193,9 +249,9 @@ static void sha1_apply_transform_avx2(struct sha1_state *state,
{
/* Select the optimal transform based on data block size */
if (blocks >= SHA1_AVX2_BLOCK_OPTSIZE)
- sha1_transform_avx2(state, data, blocks);
+ __sha1_transform_avx2(state, data, blocks);
else
- sha1_transform_avx(state, data, blocks);
+ __sha1_transform_avx(state, data, blocks);
}

static int sha1_avx2_update(struct shash_desc *desc, const u8 *data,
@@ -245,19 +301,39 @@ static void unregister_sha1_avx2(void)
}

#ifdef CONFIG_AS_SHA1_NI
-asmlinkage void sha1_ni_transform(struct sha1_state *digest, const u8 *data,
- int rounds);
+asmlinkage void sha1_transform_ni(u32 *digest, const u8 *data, int rounds);
+
+void __sha1_transform_ni(struct sha1_state *state, const u8 *data, int blocks)
+{
+ if (blocks <= 0)
+ return;
+
+ kernel_fpu_begin();
+ for (;;) {
+ const int chunks = min(blocks, 4096 / SHA1_BLOCK_SIZE);
+
+ sha1_transform_ni(state->state, data, chunks);
+ data += chunks * SHA1_BLOCK_SIZE;
+ blocks -= chunks;
+
+ if (blocks <= 0)
+ break;
+
+ kernel_fpu_yield();
+ }
+ kernel_fpu_end();
+}

static int sha1_ni_update(struct shash_desc *desc, const u8 *data,
- unsigned int len)
+ unsigned int len)
{
- return sha1_update(desc, data, len, sha1_ni_transform);
+ return sha1_update(desc, data, len, __sha1_transform_ni);
}

static int sha1_ni_finup(struct shash_desc *desc, const u8 *data,
- unsigned int len, u8 *out)
+ unsigned int len, u8 *out)
{
- return sha1_finup(desc, data, len, out, sha1_ni_transform);
+ return sha1_finup(desc, data, len, out, __sha1_transform_ni);
}

static int sha1_ni_final(struct shash_desc *desc, u8 *out)
diff --git a/arch/x86/crypto/sha256_ni_asm.S b/arch/x86/crypto/sha256_ni_asm.S
index e7a3b9939327..29458ec970a9 100644
--- a/arch/x86/crypto/sha256_ni_asm.S
+++ b/arch/x86/crypto/sha256_ni_asm.S
@@ -79,7 +79,7 @@
.text

/**
- * sha256_ni_transform - Calculate SHA256 hash using the x86 SHA-NI feature set
+ * sha256_transform_ni - Calculate SHA256 hash using the x86 SHA-NI feature set
* @digest: address of current 32-byte hash value (%rdi, DIGEST_PTR macro)
* @data: address of data (%rsi, DATA_PTR macro);
* data size must be a multiple of 64 bytes
@@ -98,9 +98,9 @@
* The non-indented lines are instructions related to the message schedule.
*
* Return: none
- * Prototype: asmlinkage void sha256_ni_transform(u32 *digest, const u8 *data, int blocks)
+ * Prototype: asmlinkage void sha256_transform_ni(u32 *digest, const u8 *data, int blocks)
*/
-SYM_TYPED_FUNC_START(sha256_ni_transform)
+SYM_TYPED_FUNC_START(sha256_transform_ni)
shl $6, NUM_BLKS /* convert to bytes */
jz .Ldone_hash
add DATA_PTR, NUM_BLKS /* pointer to end of data */
@@ -329,7 +329,7 @@ SYM_TYPED_FUNC_START(sha256_ni_transform)
.Ldone_hash:

RET
-SYM_FUNC_END(sha256_ni_transform)
+SYM_FUNC_END(sha256_transform_ni)

.section .rodata.cst256.K256, "aM", @progbits, 256
.align 64
diff --git a/arch/x86/crypto/sha256_ssse3_glue.c b/arch/x86/crypto/sha256_ssse3_glue.c
index 3a5f6be7dbba..43927cf3d06e 100644
--- a/arch/x86/crypto/sha256_ssse3_glue.c
+++ b/arch/x86/crypto/sha256_ssse3_glue.c
@@ -40,8 +40,28 @@
#include <linux/string.h>
#include <asm/simd.h>

-asmlinkage void sha256_transform_ssse3(struct sha256_state *state,
- const u8 *data, int blocks);
+asmlinkage void sha256_transform_ssse3(u32 *digest, const u8 *data, int blocks);
+
+void __sha256_transform_ssse3(struct sha256_state *state, const u8 *data, int blocks)
+{
+ if (blocks <= 0)
+ return;
+
+ kernel_fpu_begin();
+ for (;;) {
+ const int chunks = min(blocks, 4096 / SHA256_BLOCK_SIZE);
+
+ sha256_transform_ssse3(state->state, data, chunks);
+ data += chunks * SHA256_BLOCK_SIZE;
+ blocks -= chunks;
+
+ if (blocks <= 0)
+ break;
+
+ kernel_fpu_yield();
+ }
+ kernel_fpu_end();
+}

static int _sha256_update(struct shash_desc *desc, const u8 *data,
unsigned int len, sha256_block_fn *sha256_xform)
@@ -58,9 +78,7 @@ static int _sha256_update(struct shash_desc *desc, const u8 *data,
*/
BUILD_BUG_ON(offsetof(struct sha256_state, state) != 0);

- kernel_fpu_begin();
sha256_base_do_update(desc, data, len, sha256_xform);
- kernel_fpu_end();

return 0;
}
@@ -71,11 +89,9 @@ static int sha256_finup(struct shash_desc *desc, const u8 *data,
if (!crypto_simd_usable())
return crypto_sha256_finup(desc, data, len, out);

- kernel_fpu_begin();
if (len)
sha256_base_do_update(desc, data, len, sha256_xform);
sha256_base_do_finalize(desc, sha256_xform);
- kernel_fpu_end();

return sha256_base_finish(desc, out);
}
@@ -83,13 +99,13 @@ static int sha256_finup(struct shash_desc *desc, const u8 *data,
static int sha256_ssse3_update(struct shash_desc *desc, const u8 *data,
unsigned int len)
{
- return _sha256_update(desc, data, len, sha256_transform_ssse3);
+ return _sha256_update(desc, data, len, __sha256_transform_ssse3);
}

static int sha256_ssse3_finup(struct shash_desc *desc, const u8 *data,
unsigned int len, u8 *out)
{
- return sha256_finup(desc, data, len, out, sha256_transform_ssse3);
+ return sha256_finup(desc, data, len, out, __sha256_transform_ssse3);
}

/* Add padding and return the message digest. */
@@ -143,19 +159,39 @@ static void unregister_sha256_ssse3(void)
ARRAY_SIZE(sha256_ssse3_algs));
}

-asmlinkage void sha256_transform_avx(struct sha256_state *state,
- const u8 *data, int blocks);
+asmlinkage void sha256_transform_avx(u32 *digest, const u8 *data, int blocks);
+
+void __sha256_transform_avx(struct sha256_state *state, const u8 *data, int blocks)
+{
+ if (blocks <= 0)
+ return;
+
+ kernel_fpu_begin();
+ for (;;) {
+ const int chunks = min(blocks, 4096 / SHA256_BLOCK_SIZE);
+
+ sha256_transform_avx(state->state, data, chunks);
+ data += chunks * SHA256_BLOCK_SIZE;
+ blocks -= chunks;
+
+ if (blocks <= 0)
+ break;
+
+ kernel_fpu_yield();
+ }
+ kernel_fpu_end();
+}

static int sha256_avx_update(struct shash_desc *desc, const u8 *data,
unsigned int len)
{
- return _sha256_update(desc, data, len, sha256_transform_avx);
+ return _sha256_update(desc, data, len, __sha256_transform_avx);
}

static int sha256_avx_finup(struct shash_desc *desc, const u8 *data,
unsigned int len, u8 *out)
{
- return sha256_finup(desc, data, len, out, sha256_transform_avx);
+ return sha256_finup(desc, data, len, out, __sha256_transform_avx);
}

static int sha256_avx_final(struct shash_desc *desc, u8 *out)
@@ -219,19 +255,39 @@ static void unregister_sha256_avx(void)
ARRAY_SIZE(sha256_avx_algs));
}

-asmlinkage void sha256_transform_rorx(struct sha256_state *state,
- const u8 *data, int blocks);
+asmlinkage void sha256_transform_rorx(u32 *state, const u8 *data, int blocks);
+
+void __sha256_transform_avx2(struct sha256_state *state, const u8 *data, int blocks)
+{
+ if (blocks <= 0)
+ return;
+
+ kernel_fpu_begin();
+ for (;;) {
+ const int chunks = min(blocks, 4096 / SHA256_BLOCK_SIZE);
+
+ sha256_transform_rorx(state->state, data, chunks);
+ data += chunks * SHA256_BLOCK_SIZE;
+ blocks -= chunks;
+
+ if (blocks <= 0)
+ break;
+
+ kernel_fpu_yield();
+ }
+ kernel_fpu_end();
+}

static int sha256_avx2_update(struct shash_desc *desc, const u8 *data,
unsigned int len)
{
- return _sha256_update(desc, data, len, sha256_transform_rorx);
+ return _sha256_update(desc, data, len, __sha256_transform_avx2);
}

static int sha256_avx2_finup(struct shash_desc *desc, const u8 *data,
unsigned int len, u8 *out)
{
- return sha256_finup(desc, data, len, out, sha256_transform_rorx);
+ return sha256_finup(desc, data, len, out, __sha256_transform_avx2);
}

static int sha256_avx2_final(struct shash_desc *desc, u8 *out)
@@ -294,19 +350,38 @@ static void unregister_sha256_avx2(void)
}

#ifdef CONFIG_AS_SHA256_NI
-asmlinkage void sha256_ni_transform(struct sha256_state *digest,
- const u8 *data, int rounds);
+asmlinkage void sha256_transform_ni(u32 *digest, const u8 *data, int rounds);
+
+void __sha256_transform_ni(struct sha256_state *state, const u8 *data, int blocks)
+{
+ if (blocks <= 0)
+ return;
+
+ kernel_fpu_begin();
+ for (;;) {
+ const int chunks = min(blocks, 4096 / SHA256_BLOCK_SIZE);

+ sha256_transform_ni(state->state, data, chunks);
+ data += chunks * SHA256_BLOCK_SIZE;
+ blocks -= chunks;
+
+ if (blocks <= 0)
+ break;
+
+ kernel_fpu_yield();
+ }
+ kernel_fpu_end();
+}
static int sha256_ni_update(struct shash_desc *desc, const u8 *data,
unsigned int len)
{
- return _sha256_update(desc, data, len, sha256_ni_transform);
+ return _sha256_update(desc, data, len, __sha256_transform_ni);
}

static int sha256_ni_finup(struct shash_desc *desc, const u8 *data,
unsigned int len, u8 *out)
{
- return sha256_finup(desc, data, len, out, sha256_ni_transform);
+ return sha256_finup(desc, data, len, out, __sha256_transform_ni);
}

static int sha256_ni_final(struct shash_desc *desc, u8 *out)
diff --git a/arch/x86/crypto/sha512_ssse3_glue.c b/arch/x86/crypto/sha512_ssse3_glue.c
index 6d3b85e53d0e..cb6aad9d5052 100644
--- a/arch/x86/crypto/sha512_ssse3_glue.c
+++ b/arch/x86/crypto/sha512_ssse3_glue.c
@@ -39,8 +39,28 @@
#include <asm/cpu_device_id.h>
#include <asm/simd.h>

-asmlinkage void sha512_transform_ssse3(struct sha512_state *state,
- const u8 *data, int blocks);
+asmlinkage void sha512_transform_ssse3(u64 *digest, const u8 *data, int blocks);
+
+void __sha512_transform_ssse3(struct sha512_state *state, const u8 *data, int blocks)
+{
+ if (blocks <= 0)
+ return;
+
+ kernel_fpu_begin();
+ for (;;) {
+ const int chunks = min(blocks, 4096 / SHA512_BLOCK_SIZE);
+
+ sha512_transform_ssse3(&state->state[0], data, chunks);
+ data += chunks * SHA512_BLOCK_SIZE;
+ blocks -= chunks;
+
+ if (blocks <= 0)
+ break;
+
+ kernel_fpu_yield();
+ }
+ kernel_fpu_end();
+}

static int sha512_update(struct shash_desc *desc, const u8 *data,
unsigned int len, sha512_block_fn *sha512_xform)
@@ -57,9 +77,7 @@ static int sha512_update(struct shash_desc *desc, const u8 *data,
*/
BUILD_BUG_ON(offsetof(struct sha512_state, state) != 0);

- kernel_fpu_begin();
sha512_base_do_update(desc, data, len, sha512_xform);
- kernel_fpu_end();

return 0;
}
@@ -70,11 +88,9 @@ static int sha512_finup(struct shash_desc *desc, const u8 *data,
if (!crypto_simd_usable())
return crypto_sha512_finup(desc, data, len, out);

- kernel_fpu_begin();
if (len)
sha512_base_do_update(desc, data, len, sha512_xform);
sha512_base_do_finalize(desc, sha512_xform);
- kernel_fpu_end();

return sha512_base_finish(desc, out);
}
@@ -82,13 +98,13 @@ static int sha512_finup(struct shash_desc *desc, const u8 *data,
static int sha512_ssse3_update(struct shash_desc *desc, const u8 *data,
unsigned int len)
{
- return sha512_update(desc, data, len, sha512_transform_ssse3);
+ return sha512_update(desc, data, len, __sha512_transform_ssse3);
}

static int sha512_ssse3_finup(struct shash_desc *desc, const u8 *data,
unsigned int len, u8 *out)
{
- return sha512_finup(desc, data, len, out, sha512_transform_ssse3);
+ return sha512_finup(desc, data, len, out, __sha512_transform_ssse3);
}

/* Add padding and return the message digest. */
@@ -142,8 +158,29 @@ static void unregister_sha512_ssse3(void)
ARRAY_SIZE(sha512_ssse3_algs));
}

-asmlinkage void sha512_transform_avx(struct sha512_state *state,
- const u8 *data, int blocks);
+asmlinkage void sha512_transform_avx(u64 *digest, const u8 *data, int blocks);
+
+void __sha512_transform_avx(struct sha512_state *state, const u8 *data, int blocks)
+{
+ if (blocks <= 0)
+ return;
+
+ kernel_fpu_begin();
+ for (;;) {
+ const int chunks = min(blocks, 4096 / SHA512_BLOCK_SIZE);
+
+ sha512_transform_avx(state->state, data, chunks);
+ data += chunks * SHA512_BLOCK_SIZE;
+ blocks -= chunks;
+
+ if (blocks <= 0)
+ break;
+
+ kernel_fpu_yield();
+ }
+ kernel_fpu_end();
+}
+
static bool avx_usable(void)
{
if (!cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL)) {
@@ -158,13 +195,13 @@ static bool avx_usable(void)
static int sha512_avx_update(struct shash_desc *desc, const u8 *data,
unsigned int len)
{
- return sha512_update(desc, data, len, sha512_transform_avx);
+ return sha512_update(desc, data, len, __sha512_transform_avx);
}

static int sha512_avx_finup(struct shash_desc *desc, const u8 *data,
unsigned int len, u8 *out)
{
- return sha512_finup(desc, data, len, out, sha512_transform_avx);
+ return sha512_finup(desc, data, len, out, __sha512_transform_avx);
}

/* Add padding and return the message digest. */
@@ -218,19 +255,39 @@ static void unregister_sha512_avx(void)
ARRAY_SIZE(sha512_avx_algs));
}

-asmlinkage void sha512_transform_rorx(struct sha512_state *state,
- const u8 *data, int blocks);
+asmlinkage void sha512_transform_rorx(u64 *digest, const u8 *data, int blocks);
+
+void __sha512_transform_avx2(struct sha512_state *state, const u8 *data, int blocks)
+{
+ if (blocks <= 0)
+ return;
+
+ kernel_fpu_begin();
+ for (;;) {
+ const int chunks = min(blocks, 4096 / SHA512_BLOCK_SIZE);
+
+ sha512_transform_rorx(state->state, data, chunks);
+ data += chunks * SHA512_BLOCK_SIZE;
+ blocks -= chunks;
+
+ if (blocks <= 0)
+ break;
+
+ kernel_fpu_yield();
+ }
+ kernel_fpu_end();
+}

static int sha512_avx2_update(struct shash_desc *desc, const u8 *data,
unsigned int len)
{
- return sha512_update(desc, data, len, sha512_transform_rorx);
+ return sha512_update(desc, data, len, __sha512_transform_avx2);
}

static int sha512_avx2_finup(struct shash_desc *desc, const u8 *data,
unsigned int len, u8 *out)
{
- return sha512_finup(desc, data, len, out, sha512_transform_rorx);
+ return sha512_finup(desc, data, len, out, __sha512_transform_avx2);
}

/* Add padding and return the message digest. */
--
2.38.1

Subject: [PATCH 05/13] crypto: x86/sm3 - yield FPU context during long loops

The x86 assembly language implementations using SIMD process data
between kernel_fpu_begin() and kernel_fpu_end() calls. That
disables scheduler preemption, which prevents the CPU core from being
used by other threads.

The update() and finup() functions might be called to process
large quantities of data, which can result in RCU stalls and
soft lockups.

Periodically check if the kernel scheduler wants to run something
else on the CPU. If so, yield the kernel FPU context and let the
scheduler intervene.

Fixes: 930ab34d906d ("crypto: x86/sm3 - add AVX assembly implementation")
Suggested-by: Herbert Xu <[email protected]>
Signed-off-by: Robert Elliott <[email protected]>
---
arch/x86/crypto/sm3_avx_glue.c | 34 +++++++++++++++++++++++++++++-----
1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/arch/x86/crypto/sm3_avx_glue.c b/arch/x86/crypto/sm3_avx_glue.c
index 661b6f22ffcd..9e4b21c0e748 100644
--- a/arch/x86/crypto/sm3_avx_glue.c
+++ b/arch/x86/crypto/sm3_avx_glue.c
@@ -25,8 +25,7 @@ static int sm3_avx_update(struct shash_desc *desc, const u8 *data,
{
struct sm3_state *sctx = shash_desc_ctx(desc);

- if (!crypto_simd_usable() ||
- (sctx->count % SM3_BLOCK_SIZE) + len < SM3_BLOCK_SIZE) {
+ if (((sctx->count % SM3_BLOCK_SIZE) + len < SM3_BLOCK_SIZE) || !crypto_simd_usable()) {
sm3_update(sctx, data, len);
return 0;
}
@@ -38,7 +37,19 @@ static int sm3_avx_update(struct shash_desc *desc, const u8 *data,
BUILD_BUG_ON(offsetof(struct sm3_state, state) != 0);

kernel_fpu_begin();
- sm3_base_do_update(desc, data, len, sm3_transform_avx);
+ for (;;) {
+ const unsigned int chunk = min(len, 4096U);
+
+ sm3_base_do_update(desc, data, chunk, sm3_transform_avx);
+
+ len -= chunk;
+
+ if (!len)
+ break;
+
+ data += chunk;
+ kernel_fpu_yield();
+ }
kernel_fpu_end();

return 0;
@@ -58,8 +69,21 @@ static int sm3_avx_finup(struct shash_desc *desc, const u8 *data,
}

kernel_fpu_begin();
- if (len)
- sm3_base_do_update(desc, data, len, sm3_transform_avx);
+ if (len) {
+ for (;;) {
+ const unsigned int chunk = min(len, 4096U);
+
+ sm3_base_do_update(desc, data, chunk, sm3_transform_avx);
+ len -= chunk;
+
+ if (!len)
+ break;
+
+ data += chunk;
+ kernel_fpu_yield();
+ }
+ }
+
sm3_base_do_finalize(desc, sm3_transform_avx);
kernel_fpu_end();

--
2.38.1

Subject: [PATCH 13/13] crypto: x86/aria - yield FPU context only when needed

The x86 assembly language implementations using SIMD process data
between kernel_fpu_begin() and kernel_fpu_end() calls. That
disables scheduler preemption, which prevents the CPU core from being
used by other threads.

In CTR mode, rather than break the processing into 256-byte
passes, each of which unilaterally calls kernel_fpu_begin() and
kernel_fpu_end(), periodically check if the kernel scheduler wants
to run something else on the CPU. If so, yield the kernel FPU
context and let the scheduler intervene.

Signed-off-by: Robert Elliott <[email protected]>
---
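Note for reviewers: for a hypothetical 4 KiB walk chunk, the old code
entered and left the FPU context 16 times (4096 / 256 bytes per 16-way
pass); the new code enters it once, and kernel_fpu_yield() only breaks the
section when need_resched() is true.
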
arch/x86/crypto/aria_aesni_avx_glue.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/crypto/aria_aesni_avx_glue.c b/arch/x86/crypto/aria_aesni_avx_glue.c
index c561ea4fefa5..6657ce576e6c 100644
--- a/arch/x86/crypto/aria_aesni_avx_glue.c
+++ b/arch/x86/crypto/aria_aesni_avx_glue.c
@@ -5,6 +5,7 @@
* Copyright (c) 2022 Taehee Yoo <[email protected]>
*/

+#include <asm/simd.h>
#include <crypto/algapi.h>
#include <crypto/internal/simd.h>
#include <crypto/aria.h>
@@ -85,17 +86,19 @@ static int aria_avx_ctr_encrypt(struct skcipher_request *req)
const u8 *src = walk.src.virt.addr;
u8 *dst = walk.dst.virt.addr;

+ kernel_fpu_begin();
while (nbytes >= ARIA_AESNI_PARALLEL_BLOCK_SIZE) {
u8 keystream[ARIA_AESNI_PARALLEL_BLOCK_SIZE];

- kernel_fpu_begin();
aria_ops.aria_ctr_crypt_16way(ctx, dst, src, keystream,
walk.iv);
- kernel_fpu_end();
dst += ARIA_AESNI_PARALLEL_BLOCK_SIZE;
src += ARIA_AESNI_PARALLEL_BLOCK_SIZE;
nbytes -= ARIA_AESNI_PARALLEL_BLOCK_SIZE;
+
+ kernel_fpu_yield();
}
+ kernel_fpu_end();

while (nbytes >= ARIA_BLOCK_SIZE) {
u8 keystream[ARIA_BLOCK_SIZE];
--
2.38.1

Date: 2022-12-20 03:59:13
From: Herbert Xu
Subject: Re: [PATCH 03/13] crypto: x86/sha - yield FPU context during long loops

On Mon, Dec 19, 2022 at 04:02:13PM -0600, Robert Elliott wrote:
>
> +void __sha1_transform_ssse3(struct sha1_state *state, const u8 *data, int blocks)
> +{
> + if (blocks <= 0)
> + return;
> +
> + kernel_fpu_begin();
> + for (;;) {
> + const int chunks = min(blocks, 4096 / SHA1_BLOCK_SIZE);
> +
> + sha1_transform_ssse3(state->state, data, chunks);
> + data += chunks * SHA1_BLOCK_SIZE;
> + blocks -= chunks;
> +
> + if (blocks <= 0)
> + break;
> +
> + kernel_fpu_yield();

Shouldn't this check the MAY_SLEEP flag?

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Date: 2022-12-20 20:14:47
From: Eric Biggers
Subject: Re: [PATCH 00/13] crypto: x86 - yield FPU context during long loops

On Mon, Dec 19, 2022 at 04:02:10PM -0600, Robert Elliott wrote:
> This is an offshoot of the previous patch series at:
> https://lore.kernel.org/linux-crypto/[email protected]
>
> Add a kernel_fpu_yield() function for x86 crypto drivers to call
> periodically during long loops.
>
> Test results
> ============
> I created 28 tcrypt modules so modprobe can run concurrent tests,
> added 1 MiB functional and speed tests to tcrypt, and ran three processes
> spawning 28 subprocesses (one per physical CPU core) each looping forever
> through all the tcrypt test modes. This keeps the system quite busy,
> generating RCU stalls and soft lockups during both generic and x86
> crypto function processing.
>
> In conjunction with these patch series:
> * [PATCH 0/8] crypto: kernel-doc for assembly language
> https://lore.kernel.org/linux-crypto/[email protected]
> * [PATCH 0/3] crypto/rcu: suppress unnecessary CPU stall warnings
> https://lore.kernel.org/linux-crypto/[email protected]
> * [PATCH 0/3] crypto: yield at end of operations
> https://lore.kernel.org/linux-crypto/[email protected]
>
> while using the default RCU values (60 s stalls, 21 s expedited stalls),
> several nights of testing did not result in any RCU stall warnings or soft
> lockups in any of these preemption modes:
> preempt=none
> preempt=voluntary
> preempt=full
>
> Setting the shortest possible RCU timeouts (3 s, 20 ms) did still result
> in RCU stalls, but only about one every 2 hours, and they were no longer
> concentrated in particular modules like sha512_ssse3 and sm4-generic.
>
> systemd usually crashes and restarts when its journal becomes full from
> all the tcrypt printk messages. Without the patches, that triggered more
> RCU stall reports and soft lockups; with the patches, only userspace
> seems perturbed.
>

Where does this patch series apply to?

- Eric

Date: 2022-12-30 09:08:52
From: Herbert Xu
Subject: Re: [PATCH 03/13] crypto: x86/sha - yield FPU context during long loops

On Mon, Dec 19, 2022 at 04:02:13PM -0600, Robert Elliott wrote:
>
> diff --git a/arch/x86/crypto/sha1_avx2_x86_64_asm.S b/arch/x86/crypto/sha1_avx2_x86_64_asm.S
> index c3ee9334cb0f..df03fbb2c42c 100644
> --- a/arch/x86/crypto/sha1_avx2_x86_64_asm.S
> +++ b/arch/x86/crypto/sha1_avx2_x86_64_asm.S
> @@ -58,9 +58,9 @@
> /*
> * SHA-1 implementation with Intel(R) AVX2 instruction set extensions.
> *
> - *This implementation is based on the previous SSSE3 release:
> - *Visit http://software.intel.com/en-us/articles/
> - *and refer to improving-the-performance-of-the-secure-hash-algorithm-1/
> + * This implementation is based on the previous SSSE3 release:
> + * Visit http://software.intel.com/en-us/articles/
> + * and refer to improving-the-performance-of-the-secure-hash-algorithm-1/

Could you please leave out changes which are not related to the
main purpose of this patch?

Put them into a separate patch if necessary.

> diff --git a/arch/x86/crypto/sha1_ssse3_glue.c b/arch/x86/crypto/sha1_ssse3_glue.c
> index 44340a1139e0..b269b455fbbe 100644
> --- a/arch/x86/crypto/sha1_ssse3_glue.c
> +++ b/arch/x86/crypto/sha1_ssse3_glue.c
> @@ -41,9 +41,7 @@ static int sha1_update(struct shash_desc *desc, const u8 *data,
>         */
>        BUILD_BUG_ON(offsetof(struct sha1_state, state) != 0);
>
> -      kernel_fpu_begin();
>        sha1_base_do_update(desc, data, len, sha1_xform);
> -      kernel_fpu_end();

Moving kernel_fpu_begin/kernel_fpu_end down seems to be entirely
unnecessary as you could already call kernel_fpu_yield deep down
the stack with the current code.
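
That is, roughly this shape (a sketch only, keeping the helper names from the existing glue code):

        /* glue layer keeps its existing begin/end placement */
        kernel_fpu_begin();
        sha1_base_do_update(desc, data, len, sha1_xform);
        kernel_fpu_end();

        /* ...and the block function invoked via sha1_xform simply calls
         * kernel_fpu_yield() between chunks, using the FPU context it
         * inherits from the caller.
         */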

Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2023-01-12 08:09:20

by Herbert Xu

Subject: Re: [PATCH 03/13] crypto: x86/sha - yield FPU context during long loops

On Mon, Dec 19, 2022 at 04:02:13PM -0600, Robert Elliott wrote:
>
> @@ -41,9 +41,7 @@ static int sha1_update(struct shash_desc *desc, const u8 *data,

I just realised a show-stopper with this patch-set. We don't
have a desc->flags field that tells us whether we can sleep or
not.

I'm currently doing a patch-set for per-request keys and I will
add a flags field to shash_desc so we could use that for your
patch-set too.
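
Roughly, going by the field that existed before commit 877b5691f27a (the per-request-key series may shape it differently):

struct shash_desc {
        struct crypto_shash *tfm;
        u32 flags;      /* hypothetical re-added field, e.g. CRYPTO_TFM_REQ_MAY_SLEEP */
        void *__ctx[] __aligned(ARCH_SLAB_MINALIGN);
};

        /* which an x86 glue loop could then consult: */
        if (desc->flags & CRYPTO_TFM_REQ_MAY_SLEEP)
                kernel_fpu_yield();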

Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2023-01-12 19:08:38

by Eric Biggers

Subject: Re: [PATCH 03/13] crypto: x86/sha - yield FPU context during long loops

On Thu, Jan 12, 2023 at 04:05:55PM +0800, Herbert Xu wrote:
> On Mon, Dec 19, 2022 at 04:02:13PM -0600, Robert Elliott wrote:
> >
> > @@ -41,9 +41,7 @@ static int sha1_update(struct shash_desc *desc, const u8 *data,
>
> I just realised a show-stopper with this patch-set. We don't
> have a desc->flags field that tells us whether we can sleep or
> not.
>
> I'm currently doing a patch-set for per-request keys and I will
> add a flags field to shash_desc so we could use that for your
> patch-set too.
>

Right, this used to exist, but it didn't actually do anything, and it had
suffered heavily from bitrot. For example, some callers specified MAY_SLEEP
when actually they couldn't sleep. IIRC, some callers also didn't even bother
initializing the flags, so they were passing uninitialized memory. So I removed
it in commit 877b5691f27a ("crypto: shash - remove shash_desc::flags").

Has there been any consideration of just adding the crypto_shash_update_large()
helper function that I had mentioned in the commit message of 877b5691f27a?
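
For reference, one possible shape for such a helper; the chunk size and the cond_resched() placement are assumptions here, not what 877b5691f27a actually specified:

static inline int crypto_shash_update_large(struct shash_desc *desc,
                                            const u8 *data, unsigned int len)
{
        /* Sketch only: feed the underlying shash in bounded chunks so SIMD
         * implementations never hold the FPU for too long, and give the
         * scheduler a chance to run in between.
         */
        while (len) {
                unsigned int chunk = min_t(unsigned int, len, PAGE_SIZE);
                int err = crypto_shash_update(desc, data, chunk);

                if (err)
                        return err;
                data += chunk;
                len -= chunk;
                cond_resched();
        }
        return 0;
}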

- Eric

2023-01-13 02:39:46

by Herbert Xu

Subject: Re: [PATCH 03/13] crypto: x86/sha - yield FPU context during long loops

On Thu, Jan 12, 2023 at 10:46:42AM -0800, Eric Biggers wrote:
>
> Right, this used to exist, but it didn't actually do anything, and it had
> suffered heavily from bitrot. For example, some callers specified MAY_SLEEP
> when actually they couldn't sleep. IIRC, some callers also didn't even bother
> initializing the flags, so they were passing uninitialized memory. So I removed
> it in commit 877b5691f27a ("crypto: shash - remove shash_desc::flags").
>
> Has there been any consideration of just adding the crypto_shash_update_large()
> helper function that I had mentioned in the commit message of 877b5691f27a?

I had forgotten about this :)

Perhaps we should just convert any users that trigger these warnings
over to ahash? The shash interface was never meant to process large
amounts of data anyway.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2023-01-13 02:53:24

by Herbert Xu

Subject: Re: [PATCH 03/13] crypto: x86/sha - yield FPU context during long loops

On Fri, Jan 13, 2023 at 10:36:08AM +0800, Herbert Xu wrote:
>
> Perhaps we should just convert any users that trigger these warnings
> over to ahash? The shash interface was never meant to process large
> amounts of data anyway.

We could even add some length checks in shash to ensure that
all large updates fail with a big bright warning once the existing
users have been converted.
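
For example, something as simple as this (the threshold is purely illustrative):

/* Hypothetical guard, e.g. called at the top of crypto_shash_update(): */
static inline int shash_check_update_len(unsigned int len)
{
        if (WARN_ONCE(len > SZ_64K,
                      "shash update of %u bytes; convert this user to ahash\n",
                      len))
                return -EINVAL;
        return 0;
}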

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Subject: RE: [PATCH 03/13] crypto: x86/sha - yield FPU context during long loops



> -----Original Message-----
> From: Herbert Xu <[email protected]>
> Sent: Thursday, January 12, 2023 8:38 PM
> To: Eric Biggers <[email protected]>
> Subject: Re: [PATCH 03/13] crypto: x86/sha - yield FPU context during long
> loops
>
> On Fri, Jan 13, 2023 at 10:36:08AM +0800, Herbert Xu wrote:
> >
> > Perhaps we should just convert any users that trigger these warnings
> > over to ahash? The shash interface was never meant to process large
> > amounts of data anyway.
>
> We could even add some length checks in shash to ensure that
> all large updates fail with a big bright warning once the existing
> users have been converted.

The call trace that triggered this whole topic was checking module
signatures during boot (thousands of files totaling 2.4 GB):
[ 29.729849] ? sha512_finup.part.0+0x1de/0x230 [sha512_ssse3]
[ 29.729851] ? pkcs7_digest+0xaf/0x1f0
[ 29.729854] ? pkcs7_verify+0x61/0x540
[ 29.729856] ? verify_pkcs7_message_sig+0x4a/0xe0
[ 29.729859] ? pkcs7_parse_message+0x174/0x1b0
[ 29.729861] ? verify_pkcs7_signature+0x4c/0x80
[ 29.729862] ? mod_verify_sig+0x74/0x90
[ 29.729867] ? module_sig_check+0x87/0xd0
[ 29.729868] ? load_module+0x4e/0x1fc0
[ 29.729871] ? xfs_file_read_iter+0x70/0xe0 [xfs]
[ 29.729955] ? __kernel_read+0x118/0x290
[ 29.729959] ? ima_post_read_file+0xac/0xc0
[ 29.729962] ? kernel_read_file+0x211/0x2a0
[ 29.729965] ? __do_sys_finit_module+0x93/0xf0

pkcs7_digest() uses shash like this:
        /* Allocate the hashing algorithm we're going to need and find out how
         * big the hash operational data will be.
         */
        tfm = crypto_alloc_shash(sinfo->sig->hash_algo, 0, 0);
        if (IS_ERR(tfm))
                return (PTR_ERR(tfm) == -ENOENT) ? -ENOPKG : PTR_ERR(tfm);

        desc_size = crypto_shash_descsize(tfm) + sizeof(*desc);
        sig->digest_size = crypto_shash_digestsize(tfm);

        ret = -ENOMEM;
        sig->digest = kmalloc(sig->digest_size, GFP_KERNEL);
        if (!sig->digest)
                goto error_no_desc;

        desc = kzalloc(desc_size, GFP_KERNEL);
        if (!desc)
                goto error_no_desc;

        desc->tfm = tfm;

        /* Digest the message [RFC2315 9.3] */
        ret = crypto_shash_digest(desc, pkcs7->data, pkcs7->data_len,
                                  sig->digest);
        if (ret < 0)
                goto error;
        pr_devel("MsgDigest = [%*ph]\n", 8, sig->digest);

There is a crypto_ahash_digest() available. Interestingly, the number of
users of each one happens to be identical:
$ grep -Er --include '*.[chS]' "crypto_shash_digest\(" | wc -l
37
$ grep -Er --include '*.[chS]' "crypto_ahash_digest\(" | wc -l
37


2023-01-16 03:44:22

by Herbert Xu

Subject: Re: [PATCH 03/13] crypto: x86/sha - yield FPU context during long loops

On Fri, Jan 13, 2023 at 07:35:07PM +0000, Elliott, Robert (Servers) wrote:
>
> pkcs7_digest() uses shash like this:
>         /* Allocate the hashing algorithm we're going to need and find out how
>          * big the hash operational data will be.
>          */
>         tfm = crypto_alloc_shash(sinfo->sig->hash_algo, 0, 0);
>         if (IS_ERR(tfm))
>                 return (PTR_ERR(tfm) == -ENOENT) ? -ENOPKG : PTR_ERR(tfm);
>
>         desc_size = crypto_shash_descsize(tfm) + sizeof(*desc);
>         sig->digest_size = crypto_shash_digestsize(tfm);
>
>         ret = -ENOMEM;
>         sig->digest = kmalloc(sig->digest_size, GFP_KERNEL);
>         if (!sig->digest)
>                 goto error_no_desc;
>
>         desc = kzalloc(desc_size, GFP_KERNEL);
>         if (!desc)
>                 goto error_no_desc;
>
>         desc->tfm = tfm;
>
>         /* Digest the message [RFC2315 9.3] */
>         ret = crypto_shash_digest(desc, pkcs7->data, pkcs7->data_len,
>                                   sig->digest);
>         if (ret < 0)
>                 goto error;
>         pr_devel("MsgDigest = [%*ph]\n", 8, sig->digest);

As this path is sleepable, the conversion should be fairly trivial
with crypto_wait_req.
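
Roughly like this, for instance. This is a sketch only: it assumes pkcs7->data sits in linearly-mapped memory so sg_init_one() is valid, and it omits the sig->digest allocation shown above (which would use crypto_ahash_digestsize(tfm) instead):

        struct crypto_ahash *tfm;
        struct ahash_request *req;
        struct scatterlist sg;
        DECLARE_CRYPTO_WAIT(wait);
        int ret;

        tfm = crypto_alloc_ahash(sinfo->sig->hash_algo, 0, 0);
        if (IS_ERR(tfm))
                return (PTR_ERR(tfm) == -ENOENT) ? -ENOPKG : PTR_ERR(tfm);

        req = ahash_request_alloc(tfm, GFP_KERNEL);
        if (!req) {
                ret = -ENOMEM;
                goto out_free_tfm;
        }

        /* Assumes pkcs7->data is not vmalloc'ed; otherwise the buffer
         * would need to be mapped into a scatterlist page by page.
         */
        sg_init_one(&sg, pkcs7->data, pkcs7->data_len);
        ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG |
                                        CRYPTO_TFM_REQ_MAY_SLEEP,
                                   crypto_req_done, &wait);
        ahash_request_set_crypt(req, &sg, sig->digest, pkcs7->data_len);

        /* Sleep until the (possibly asynchronous) digest completes. */
        ret = crypto_wait_req(crypto_ahash_digest(req), &wait);

        ahash_request_free(req);
out_free_tfm:
        crypto_free_ahash(tfm);
        return ret;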

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt