Subject: [PATCH] crypto: x86/sha512 - load based on CPU features

x86-optimized crypto modules built as modules rather than built into
the kernel end up as .ko files in the filesystem, e.g., in
/usr/lib/modules. If the filesystem driver itself is a module, these
might not be available when the crypto API is initialized, resulting
in the generic implementation being used (e.g., sha512_transform
rather than sha512_transform_avx2).

In one test case, CPU utilization in the sha512 function dropped
from 15.34% to 7.18% after forcing loading of the optimized module.

Add module aliases for this x86-optimized crypto module based on CPU
feature bits so udev gets a chance to load it later in the boot
process, once the filesystems are all available.

Signed-off-by: Robert Elliott <[email protected]>
---
arch/x86/crypto/sha512_ssse3_glue.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/arch/x86/crypto/sha512_ssse3_glue.c b/arch/x86/crypto/sha512_ssse3_glue.c
index 30e70f4fe2f7..6d3b85e53d0e 100644
--- a/arch/x86/crypto/sha512_ssse3_glue.c
+++ b/arch/x86/crypto/sha512_ssse3_glue.c
@@ -36,6 +36,7 @@
 #include <linux/types.h>
 #include <crypto/sha2.h>
 #include <crypto/sha512_base.h>
+#include <asm/cpu_device_id.h>
 #include <asm/simd.h>
 
 asmlinkage void sha512_transform_ssse3(struct sha512_state *state,
@@ -284,6 +285,13 @@ static int register_sha512_avx2(void)
 			ARRAY_SIZE(sha512_avx2_algs));
 	return 0;
 }
+static const struct x86_cpu_id module_cpu_ids[] = {
+	X86_MATCH_FEATURE(X86_FEATURE_AVX2, NULL),
+	X86_MATCH_FEATURE(X86_FEATURE_AVX, NULL),
+	X86_MATCH_FEATURE(X86_FEATURE_SSSE3, NULL),
+	{}
+};
+MODULE_DEVICE_TABLE(x86cpu, module_cpu_ids);
 
 static void unregister_sha512_avx2(void)
 {
@@ -294,6 +302,8 @@ static void unregister_sha512_avx2(void)
 
 static int __init sha512_ssse3_mod_init(void)
 {
+	if (!x86_match_cpu(module_cpu_ids))
+		return -ENODEV;
 
 	if (register_sha512_ssse3())
 		goto fail;
--
2.37.1


From: Herbert Xu
Date: 2022-08-19 11:08:36
Subject: Re: [PATCH] crypto: x86/sha512 - load based on CPU features

Robert Elliott <[email protected]> wrote:
> x86-optimized crypto modules built as modules rather than built into
> the kernel end up as .ko files in the filesystem, e.g., in
> /usr/lib/modules. If the filesystem driver itself is a module, these
> might not be available when the crypto API is initialized, resulting
> in the generic implementation being used (e.g., sha512_transform
> rather than sha512_transform_avx2).
>
> In one test case, CPU utilization in the sha512 function dropped
> from 15.34% to 7.18% after forcing loading of the optimized module.
>
> Add module aliases for this x86-optimized crypto module based on CPU
> feature bits so udev gets a chance to load it later in the boot
> process, once the filesystems are all available.
>
> Signed-off-by: Robert Elliott <[email protected]>
> ---
> arch/x86/crypto/sha512_ssse3_glue.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)

Patch applied. Thanks.
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

From: Elliott, Robert (Servers)
Date: Friday, August 19, 2022 5:37 PM
Subject: RE: [PATCH] crypto: x86/sha512 - load based on CPU features

> -----Original Message-----
> From: Herbert Xu <[email protected]>
> Sent: Friday, August 19, 2022 6:05 AM
> Subject: Re: [PATCH] crypto: x86/sha512 - load based on CPU features
>
> Robert Elliott <[email protected]> wrote:
> > [...]
>
> Patch applied. Thanks.

If no problems are reported with sha512, I'll post a series applying
this technique to all the other x86 crypto modules.

Do other architectures have the same issue, or is something else done
to ensure that modules are loaded?

MODULE_DEVICE_TABLE() is already used for autoloading by crypto
modules in three other architectures:

1. One arm (32-bit) module, matching on certain Arm extensions:
arch/arm/crypto/crc32-ce-glue.c:MODULE_DEVICE_TABLE(cpu, crc32_cpu_feature);

2. One arm64 module, matching on certain Arm extensions:
arch/arm64/crypto/ghash-ce-glue.c:MODULE_DEVICE_TABLE(cpu, ghash_cpu_feature);

3. All of the sparc modules share a global solution:
arch/sparc/crypto/crop_devid.c:MODULE_DEVICE_TABLE(of, crypto_opcode_match);

/* This is a dummy device table linked into all of the crypto
* opcode drivers. It serves to trigger the module autoloading
* mechanisms in userspace which scan the OF device tree and
* load any modules which have device table entries that
* match OF device nodes.
*/

Each sparc crypto module .c file includes that file:
aes_glue.c:#include "crop_devid.c"
camellia_glue.c:#include "crop_devid.c"
crc32c_glue.c:#include "crop_devid.c"
des_glue.c:#include "crop_devid.c"
md5_glue.c:#include "crop_devid.c"
sha1_glue.c:#include "crop_devid.c"
sha256_glue.c:#include "crop_devid.c"
sha512_glue.c:#include "crop_devid.c"
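
For reference, crop_devid.c itself is tiny; a minimal sketch of the
pattern follows (reconstructed from memory, so the exact match
entries may differ from what is in the tree):

#include <linux/module.h>
#include <linux/of_device.h>

/* Never bound to a device; exists only so modpost emits "of:..."
 * aliases into modules.alias, which udev uses to autoload the
 * module when a matching OF device tree node is present.
 */
static const struct of_device_id crypto_opcode_match[] = {
	{ .name = "cpu", .compatible = "sun4v" },
	{},
};
MODULE_DEVICE_TABLE(of, crypto_opcode_match);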


From: Elliott, Robert (Servers)
Date: Fri, Aug 26, 2022 02:40:58 +0000
Subject: RE: [PATCH] crypto: x86/sha512 - load based on CPU features

> -----Original Message-----
> From: Elliott, Robert (Servers) <[email protected]>
> Sent: Friday, August 19, 2022 5:37 PM
> To: Herbert Xu <[email protected]>
> Subject: RE: [PATCH] crypto: x86/sha512 - load based on CPU features
>
> > -----Original Message-----
> > From: Herbert Xu <[email protected]>
> > Sent: Friday, August 19, 2022 6:05 AM
> > Subject: Re: [PATCH] crypto: x86/sha512 - load based on CPU features
> >
> > Robert Elliott <[email protected]> wrote:
> > > [...]
> >
> > Patch applied. Thanks.
>
> If no problems are reported with sha512, I'll post a series applying
> this technique to all the other x86 crypto modules.

Suggestion: please revert the sha512-x86 patch for a while.

I've been running with this technique applied to all x86 modules for
a few weeks and noticed a potential problem.

I've seen several cases of "rcu_preempt detected expedited stalls"
with stack dumps pointing to the x86 crypto functions:
rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 12-... } 22 jiffies s: 277 root: 0x1/.

(It doesn't always happen; I haven't found a definitive recipe to
trigger the problem yet.)

Three instances occurred during boot, with the stack trace pointing
to the sha512-x86 function (my system is set to use SHA-512 for
module signing, so it's using that algorithm a lot):
module_sig_check/mod_verify_sig/
pkcs7_verify/pkcs7_digest/
sha512_finup/sha512_base_do_update

I also triggered three of them by running "modprobe tcrypt mode=n"
in a loop with n from 0 to 1000; all were in the aes-x86 module:
tcrypt: testing encryption speed of sync skcipher cts(cbc(aes)) using cts(cbc(aes-aesni))
tcrypt: testing encryption speed of sync skcipher cfb(aes) using cfb(aes-aesni)
tcrypt: testing decryption speed of sync skcipher cfb(aes) using cfb(aes-aesni)

In that case the stack traces pointed to:
do_test/prepare_alloc_pages/test_skcipher_speed.cold

Theory:
I noticed the proposed aria-x86 module has these calls surrounding
its main data processing functions:
kernel_fpu_begin();
kernel_fpu_end();

The sha-x86 and aes-x86 implementations use those as well. For
example, sha512_update() is structured as:
kernel_fpu_begin();
sha512_base_do_update(desc, data, len, sha512_xform);
kernel_fpu_end();

and aesni_encrypt() is structured as:
kernel_fpu_begin();
aesni_enc(ctx, dst, src);
kernel_fpu_end();

I noticed that kernel_fpu_begin() includes this:
preempt_disable();

and kernel_fpu_end() has:
preempt_enable();

Is that sufficient to cause the RCU problems?

Although preempt_disable() doesn't disable interrupts, it does block
the scheduler from using the CPU (a few of the arm AES functions mess
with interrupts, but none of the others seem to do so). So, I suspect
that could lead to RCU problems and soft lockups, but not hard
lockups.

Do these functions need to break up their processing into smaller
chunks (e.g., a few megabytes), calling kernel_fpu_end() periodically
to allow the scheduler to take over the CPUs if needed? If so, what
chunk size would be appropriate?
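
For illustration, sha512_update() restructured that way might look
something like this sketch (FPU_CHUNK_BYTES is a hypothetical
placeholder, not a tested value, and the !crypto_simd_usable()
fallback to the generic code is omitted):

/* Hypothetical per-FPU-section limit; the right value is the open
 * question above.
 */
#define FPU_CHUNK_BYTES	4096

static int sha512_update(struct shash_desc *desc, const u8 *data,
			 unsigned int len, sha512_block_fn *sha512_xform)
{
	/* Process bounded chunks, dropping the FPU (and re-enabling
	 * preemption) between chunks so the scheduler and RCU can make
	 * progress on this CPU. sha512_base_do_update() buffers
	 * partial blocks, so arbitrary chunk boundaries are safe.
	 */
	while (len) {
		unsigned int chunk = min_t(unsigned int, len,
					   FPU_CHUNK_BYTES);

		kernel_fpu_begin();
		sha512_base_do_update(desc, data, chunk, sha512_xform);
		kernel_fpu_end();

		data += chunk;
		len -= chunk;
	}

	return 0;
}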



Example stack trace:
[ 29.729811] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 12-... } 22 jiffies s: 277 root: 0x1/.
[ 29.729815] rcu: blocking rcu_node structures (internal RCU debug): l=1:0-13:0x1000/.
[ 29.729818] Task dump for CPU 12:
[ 29.729819] task:modprobe state:R running task stack: 0 pid: 1703 ppid: 1702 flags:0x0000400a
[ 29.729822] Call Trace:
[ 29.729823] <TASK>
[ 29.729825] ? sysvec_apic_timer_interrupt+0xab/0xc0
[ 29.729830] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 29.729835] ? loop1+0x3f2/0x98f [sha512_ssse3]
[ 29.729839] ? crypto_create_tfm_node+0x33/0x100
[ 29.729843] ? nowork+0x6/0x6 [sha512_ssse3]
[ 29.729844] ? sha512_base_do_update.isra.0+0xeb/0x160 [sha512_ssse3]
[ 29.729847] ? nowork+0x6/0x6 [sha512_ssse3]
[ 29.729849] ? sha512_finup.part.0+0x1de/0x230 [sha512_ssse3]
[ 29.729851] ? pkcs7_digest+0xaf/0x1f0
[ 29.729854] ? pkcs7_verify+0x61/0x540
[ 29.729856] ? verify_pkcs7_message_sig+0x4a/0xe0
[ 29.729859] ? pkcs7_parse_message+0x174/0x1b0
[ 29.729861] ? verify_pkcs7_signature+0x4c/0x80
[ 29.729862] ? mod_verify_sig+0x74/0x90
[ 29.729867] ? module_sig_check+0x87/0xd0
[ 29.729868] ? load_module+0x4e/0x1fc0
[ 29.729871] ? xfs_file_read_iter+0x70/0xe0 [xfs]
[ 29.729955] ? __kernel_read+0x118/0x290
[ 29.729959] ? ima_post_read_file+0xac/0xc0
[ 29.729962] ? kernel_read_file+0x211/0x2a0
[ 29.729965] ? __do_sys_finit_module+0x93/0xf0
[ 29.729967] ? __do_sys_finit_module+0x93/0xf0
[ 29.729969] ? do_syscall_64+0x58/0x80
[ 29.729971] ? syscall_exit_to_user_mode+0x17/0x40
[ 29.729973] ? do_syscall_64+0x67/0x80
[ 29.729975] ? exit_to_user_mode_prepare+0x18f/0x1f0
[ 29.729976] ? syscall_exit_to_user_mode+0x17/0x40
[ 29.729977] ? do_syscall_64+0x67/0x80
[ 29.729979] ? syscall_exit_to_user_mode+0x17/0x40
[ 29.729980] ? do_syscall_64+0x67/0x80
[ 29.729982] ? exc_page_fault+0x70/0x170
[ 29.729983] ? entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 29.729986] </TASK>
[ 29.753810] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { } 236 jiffies s: 281 root: 0x0/.


From: Herbert Xu
Date: 2022-08-26 03:08:53
Subject: Re: [PATCH] crypto: x86/sha512 - load based on CPU features

On Fri, Aug 26, 2022 at 02:40:58AM +0000, Elliott, Robert (Servers) wrote:
>
> Suggestion: please revert the sha512-x86 patch for a while.

This problem would have existed anyway if the module was built
into the kernel.

> Do these functions need to break up their processing into smaller
> chunks (e.g., a few megabytes), calling kernel_fpu_end() periodically
> to allow the scheduler to take over the CPUs if needed? If so, what
> chunk size would be appropriate?

Yes, these should be limited to 4K each. It appears that all the
sha* helpers in arch/x86/crypto have the same problem.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

From: Tim Chen
Date: 2022-08-30 18:01:09
Subject: Re: [PATCH] crypto: x86/sha512 - load based on CPU features

On Fri, 2022-08-26 at 10:52 +0800, Herbert Xu wrote:
> On Fri, Aug 26, 2022 at 02:40:58AM +0000, Elliott, Robert (Servers) wrote:
> > Suggestion: please revert the sha512-x86 patch for a while.
>
> This problem would have existed anyway if the module was built
> into the kernel.
>
> > Do these functions need to break up their processing into smaller
> > chunks (e.g., a few megabytes), calling kernel_fpu_end() periodically
> > to allow the scheduler to take over the CPUs if needed? If so, what
> > chunk size would be appropriate?
>
> Yes, these should be limited to 4K each. It appears that all the
> sha* helpers in arch/x86/crypto have the same problem.
>

I think limiting the chunk size to 4K is reasonable to prevent the
CPU from being hogged by the crypto code for too long.

Tim