Hi Tom,
While looking at a different issue on a GNR system I noticed that
during the boot of the kdump kernel it crashes when probing iaa_crypto
due to a divide by zero in rebalance_wq_table. The problem is that the
kdump kernel comes up with a single cpu, and if there are multiple iaa
devices cpus_per_iaa is going to be calculated to be 0, and then the
'if ((cpu % cpus_per_iaa) == 0)' in rebalance_wq_table results in a
divide by zero. I reproduced it with the 6.8 eln kernel, and so far
have reproduced it on GNR, EMR, and SRF systems. I'm assuming the same
will be the case on and SPR system with IAA devices enabled if I can
find one.
Should save_iaa_wq return an error if the number of iaa devices is greater
than the number of cpus?
[ 17.242696] idxd: crypto: iaa_crypto now ENABLED
[ 17.248641] divide error: 0000 [#1] PREEMPT SMP NOPTI
[ 17.254358] CPU: 0 PID: 396 Comm: systemd-udevd Not tainted 6.8.0-63.eln136.1.x86_64 #1
[ 17.263399] Hardware name: Intel Corporation AvenueCity/AvenueCity, BIOS BHSDCRB1.IPC.2780.D02.2311070514 11/07/2023
[ 17.275266] RIP: 0010:rebalance_wq_table.part.0+0x163/0x220 [iaa_crypto]
[ 17.282851] Code: 85 c0 74 c1 8b 35 6d ed f3 c2 31 db 48 39 f3 73 4d 48 89 da 4c 89 f7 e8 9b 5a 26 c1 3b 05 55 ed f3 c2 89 c6 73 38 31 d2 89 d8 <f7> 35 9f 76 00 00 83 fa 01 41 83 d5 00 44 89 ef e8 68 f9 ff ff 85
[ 17.303974] RSP: 0018:ffa0000001147bb0 EFLAGS: 00010246
[ 17.309895] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000ffffffff
[ 17.317956] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 17.326016] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
[ 17.334076] R10: ff1100005bff93c0 R11: 0000000000000003 R12: ffffffff826cbba8
[ 17.342137] R13: 00000000ffffffff R14: ff1100005bff93c0 R15: ff110000563968e0
[ 17.350197] FS: 00007f0697de8540(0000) GS:ff1100005ba00000(0000) knlGS:0000000000000000
[ 17.359333] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 17.365834] CR2: 000055bf003ad358 CR3: 0000000046632003 CR4: 0000000000f71eb0
[ 17.373900] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 17.381960] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 17.390020] PKRU: 55555554
[ 17.393113] Call Trace:
[ 17.395905] <TASK>
[ 17.398310] ? die+0x36/0x90
[ 17.401600] ? do_trap+0xda/0x100
[ 17.405373] ? rebalance_wq_table.part.0+0x163/0x220 [iaa_crypto]
[ 17.412265] ? do_error_trap+0x65/0x80
[ 17.416519] ? rebalance_wq_table.part.0+0x163/0x220 [iaa_crypto]
[ 17.423412] ? exc_divide_error+0x38/0x50
[ 17.427970] ? rebalance_wq_table.part.0+0x163/0x220 [iaa_crypto]
[ 17.434861] ? asm_exc_divide_error+0x1a/0x20
[ 17.439805] ? rebalance_wq_table.part.0+0x163/0x220 [iaa_crypto]
[ 17.446696] iaa_crypto_probe+0x117/0x2e0 [iaa_crypto]
[ 17.452514] really_probe+0x19b/0x3e0
[ 17.456674] ? __pfx___driver_attach+0x10/0x10
[ 17.461715] __driver_probe_device+0x78/0x160
[ 17.466659] driver_probe_device+0x1f/0xa0
[ 17.471313] __driver_attach+0xba/0x1c0
[ 17.475665] bus_for_each_dev+0x8c/0xe0
[ 17.480028] bus_add_driver+0x116/0x220
[ 17.484380] driver_register+0x5c/0x100
[ 17.488731] iaa_crypto_init_module+0xe5/0xff0 [iaa_crypto]
[ 17.495043] ? __pfx_iaa_crypto_init_module+0x10/0x10 [iaa_crypto]
[ 17.502032] do_one_initcall+0x58/0x310
[ 17.506385] do_init_module+0x60/0x240
[ 17.510640] __do_sys_init_module+0x17a/0x1b0
[ 17.515587] do_syscall_64+0x81/0x160
[ 17.519746] ? handle_mm_fault+0xdd/0x360
[ 17.524302] ? do_user_addr_fault+0x2fe/0x670
[ 17.529248] ? exc_page_fault+0x6b/0x150
[ 17.533697] entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 17.539413] RIP: 0033:0x7f0698a2ef1e
[ 17.543479] Code: 48 8b 0d 05 af 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d d2 ae 0e 00 f7 d8 64 89 01 48
[ 17.564605] RSP: 002b:00007ffe27da0918 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[ 17.573156] RAX: ffffffffffffffda RBX: 000055beffb45cb0 RCX: 00007f0698a2ef1e
[ 17.581216] RDX: 000055beffb78ba0 RSI: 0000000000026400 RDI: 000055bf00386cf0
[ 17.589276] RBP: 000055bf00386cf0 R08: 000055beffb70340 R09: 0000000000026010
[ 17.597337] R10: 0000000000000005 R11: 0000000000000246 R12: 000055beffb78ba0
[ 17.605397] R13: 000055beffb71110 R14: 0000000000000000 R15: 000055beffb45fe0
[ 17.613459] </TASK>
Regards,
Jerry
Hi Jerry,
On Tue, 2024-03-19 at 13:51 -0700, Jerry Snitselaar wrote:
> Hi Tom,
>
> While looking at a different issue on a GNR system I noticed that
> during the boot of the kdump kernel it crashes when probing
> iaa_crypto
> due to a divide by zero in rebalance_wq_table. The problem is that
> the
> kdump kernel comes up with a single cpu, and if there are multiple
> iaa
> devices cpus_per_iaa is going to be calculated to be 0, and then the
> 'if ((cpu % cpus_per_iaa) == 0)' in rebalance_wq_table results in a
> divide by zero. I reproduced it with the 6.8 eln kernel, and so far
> have reproduced it on GNR, EMR, and SRF systems. I'm assuming the
> same
> will be the case on and SPR system with IAA devices enabled if I can
> find one.
>
Good catch, I've never tested that before. Thanks for reporting it.
> Should save_iaa_wq return an error if the number of iaa devices is
> greater
> than the number of cpus?
>
No, you should still be able to use the driver with just one cpu, maybe
it just always maps to the same device. I'll take a look and come up
with a fix.
Tom
>
> [ 17.242696] idxd: crypto: iaa_crypto now ENABLED
> [ 17.248641] divide error: 0000 [#1] PREEMPT SMP NOPTI
> [ 17.254358] CPU: 0 PID: 396 Comm: systemd-udevd Not tainted
> 6.8.0-63.eln136.1.x86_64 #1
> [ 17.263399] Hardware name: Intel Corporation
> AvenueCity/AvenueCity, BIOS BHSDCRB1.IPC.2780.D02.2311070514
> 11/07/2023
> [ 17.275266] RIP: 0010:rebalance_wq_table.part.0+0x163/0x220
> [iaa_crypto]
> [ 17.282851] Code: 85 c0 74 c1 8b 35 6d ed f3 c2 31 db 48 39 f3
> 73 4d 48 89 da 4c 89 f7 e8 9b 5a 26 c1 3b 05 55 ed f3 c2 89 c6 73 38
> 31 d2 89 d8 <f7> 35 9f 76 00 00 83 fa 01 41 83 d5 00 44 89 ef e8 68
> f9 ff ff 85
> [ 17.303974] RSP: 0018:ffa0000001147bb0 EFLAGS: 00010246
> [ 17.309895] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
> 00000000ffffffff
> [ 17.317956] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> 0000000000000000
> [ 17.326016] RBP: 0000000000000000 R08: 0000000000000000 R09:
> 0000000000000001
> [ 17.334076] R10: ff1100005bff93c0 R11: 0000000000000003 R12:
> ffffffff826cbba8
> [ 17.342137] R13: 00000000ffffffff R14: ff1100005bff93c0 R15:
> ff110000563968e0
> [ 17.350197] FS: 00007f0697de8540(0000)
> GS:ff1100005ba00000(0000) knlGS:0000000000000000
> [ 17.359333] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 17.365834] CR2: 000055bf003ad358 CR3: 0000000046632003 CR4:
> 0000000000f71eb0
> [ 17.373900] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 17.381960] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7:
> 0000000000000400
> [ 17.390020] PKRU: 55555554
> [ 17.393113] Call Trace:
> [ 17.395905] <TASK>
> [ 17.398310] ? die+0x36/0x90
> [ 17.401600] ? do_trap+0xda/0x100
> [ 17.405373] ? rebalance_wq_table.part.0+0x163/0x220
> [iaa_crypto]
> [ 17.412265] ? do_error_trap+0x65/0x80
> [ 17.416519] ? rebalance_wq_table.part.0+0x163/0x220
> [iaa_crypto]
> [ 17.423412] ? exc_divide_error+0x38/0x50
> [ 17.427970] ? rebalance_wq_table.part.0+0x163/0x220
> [iaa_crypto]
> [ 17.434861] ? asm_exc_divide_error+0x1a/0x20
> [ 17.439805] ? rebalance_wq_table.part.0+0x163/0x220
> [iaa_crypto]
> [ 17.446696] iaa_crypto_probe+0x117/0x2e0 [iaa_crypto]
> [ 17.452514] really_probe+0x19b/0x3e0
> [ 17.456674] ? __pfx___driver_attach+0x10/0x10
> [ 17.461715] __driver_probe_device+0x78/0x160
> [ 17.466659] driver_probe_device+0x1f/0xa0
> [ 17.471313] __driver_attach+0xba/0x1c0
> [ 17.475665] bus_for_each_dev+0x8c/0xe0
> [ 17.480028] bus_add_driver+0x116/0x220
> [ 17.484380] driver_register+0x5c/0x100
> [ 17.488731] iaa_crypto_init_module+0xe5/0xff0 [iaa_crypto]
> [ 17.495043] ? __pfx_iaa_crypto_init_module+0x10/0x10
> [iaa_crypto]
> [ 17.502032] do_one_initcall+0x58/0x310
> [ 17.506385] do_init_module+0x60/0x240
> [ 17.510640] __do_sys_init_module+0x17a/0x1b0
> [ 17.515587] do_syscall_64+0x81/0x160
> [ 17.519746] ? handle_mm_fault+0xdd/0x360
> [ 17.524302] ? do_user_addr_fault+0x2fe/0x670
> [ 17.529248] ? exc_page_fault+0x6b/0x150
> [ 17.533697] entry_SYSCALL_64_after_hwframe+0x6e/0x76
> [ 17.539413] RIP: 0033:0x7f0698a2ef1e
> [ 17.543479] Code: 48 8b 0d 05 af 0e 00 f7 d8 64 89 01 48 83 c8
> ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00
> 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d d2 ae 0e 00 f7 d8
> 64 89 01 48
> [ 17.564605] RSP: 002b:00007ffe27da0918 EFLAGS: 00000246
> ORIG_RAX: 00000000000000af
> [ 17.573156] RAX: ffffffffffffffda RBX: 000055beffb45cb0 RCX:
> 00007f0698a2ef1e
> [ 17.581216] RDX: 000055beffb78ba0 RSI: 0000000000026400 RDI:
> 000055bf00386cf0
> [ 17.589276] RBP: 000055bf00386cf0 R08: 000055beffb70340 R09:
> 0000000000026010
> [ 17.597337] R10: 0000000000000005 R11: 0000000000000246 R12:
> 000055beffb78ba0
> [ 17.605397] R13: 000055beffb71110 R14: 0000000000000000 R15:
> 000055beffb45fe0
> [ 17.613459] </TASK>
>
>
> Regards,
> Jerry
>
Hi Jerry,
On Tue, 2024-03-19 at 17:19 -0500, Tom Zanussi wrote:
> Hi Jerry,
>
> On Tue, 2024-03-19 at 13:51 -0700, Jerry Snitselaar wrote:
> > Hi Tom,
> >
> > While looking at a different issue on a GNR system I noticed that
> > during the boot of the kdump kernel it crashes when probing
> > iaa_crypto
> > due to a divide by zero in rebalance_wq_table. The problem is that
> > the
> > kdump kernel comes up with a single cpu, and if there are multiple
> > iaa
> > devices cpus_per_iaa is going to be calculated to be 0, and then
> > the
> > 'if ((cpu % cpus_per_iaa) == 0)' in rebalance_wq_table results in a
> > divide by zero. I reproduced it with the 6.8 eln kernel, and so far
> > have reproduced it on GNR, EMR, and SRF systems. I'm assuming the
> > same
> > will be the case on and SPR system with IAA devices enabled if I
> > can
> > find one.
> >
>
> Good catch, I've never tested that before. Thanks for reporting it.
>
> > Should save_iaa_wq return an error if the number of iaa devices is
> > greater
> > than the number of cpus?
> >
>
> No, you should still be able to use the driver with just one cpu,
> maybe
> it just always maps to the same device. I'll take a look and come up
> with a fix.
>
> Tom
The patch below fixes it for me. It gets rid of the crash and I was
able to run some basic tests successfully.
Tom
From 37dc97831c9e12c103115cb5fc9866b42cad7bc5 Mon Sep 17 00:00:00 2001
From: Tom Zanussi <[email protected]>
Date: Wed, 20 Mar 2024 05:37:11 -0700
Subject: [PATCH] crypto: iaa - Fix nr_cpus < nr_iaa case
If nr_cpus < nr_iaa, the calculated cpus_per_iaa will be 0, which
causes a divide-by-0 in rebalance_wq_table().
Make sure cpus_per_iaa is 1 in that case, and also in the nr_iaa == 0
case, even though cpus_per_iaa is never used if nr_iaa == 0, for
paranoia.
Reported-by: Jerry Snitselaar <[email protected]>
Signed-off-by: Tom Zanussi <[email protected]>
---
drivers/crypto/intel/iaa/iaa_crypto_main.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 1cd304de5388..b2191ade9011 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -806,6 +806,8 @@ static int save_iaa_wq(struct idxd_wq *wq)
return -EINVAL;
cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
+ if (!cpus_per_iaa)
+ cpus_per_iaa = 1;
out:
return 0;
}
@@ -821,10 +823,12 @@ static void remove_iaa_wq(struct idxd_wq *wq)
}
}
- if (nr_iaa)
+ if (nr_iaa) {
cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
- else
- cpus_per_iaa = 0;
+ if (!cpus_per_iaa)
+ cpus_per_iaa = 1;
+ } else
+ cpus_per_iaa = 1;
}
static int wq_table_add_wqs(int iaa, int cpu)
--
2.34.1
On Wed, Mar 20, 2024 at 01:13:48PM -0500, Tom Zanussi wrote:
> Hi Jerry,
>
> On Tue, 2024-03-19 at 17:19 -0500, Tom Zanussi wrote:
> > Hi Jerry,
> >
> > On Tue, 2024-03-19 at 13:51 -0700, Jerry Snitselaar wrote:
> > > Hi Tom,
> > >
> > > While looking at a different issue on a GNR system I noticed that
> > > during the boot of the kdump kernel it crashes when probing
> > > iaa_crypto
> > > due to a divide by zero in rebalance_wq_table. The problem is that
> > > the
> > > kdump kernel comes up with a single cpu, and if there are multiple
> > > iaa
> > > devices cpus_per_iaa is going to be calculated to be 0, and then
> > > the
> > > 'if ((cpu % cpus_per_iaa) == 0)' in rebalance_wq_table results in a
> > > divide by zero. I reproduced it with the 6.8 eln kernel, and so far
> > > have reproduced it on GNR, EMR, and SRF systems. I'm assuming the
> > > same
> > > will be the case on and SPR system with IAA devices enabled if I
> > > can
> > > find one.
> > >
> >
> > Good catch, I've never tested that before. Thanks for reporting it.
> >
> > > Should save_iaa_wq return an error if the number of iaa devices is
> > > greater
> > > than the number of cpus?
> > >
> >
> > No, you should still be able to use the driver with just one cpu,
> > maybe
> > it just always maps to the same device. I'll take a look and come up
> > with a fix.
> >
> > Tom
>
> The patch below fixes it for me. It gets rid of the crash and I was
> able to run some basic tests successfully.
>
> Tom
>
It avoids the crash for me.
Regards,
Jerry
>
> From 37dc97831c9e12c103115cb5fc9866b42cad7bc5 Mon Sep 17 00:00:00 2001
> From: Tom Zanussi <[email protected]>
> Date: Wed, 20 Mar 2024 05:37:11 -0700
> Subject: [PATCH] crypto: iaa - Fix nr_cpus < nr_iaa case
>
> If nr_cpus < nr_iaa, the calculated cpus_per_iaa will be 0, which
> causes a divide-by-0 in rebalance_wq_table().
>
> Make sure cpus_per_iaa is 1 in that case, and also in the nr_iaa == 0
> case, even though cpus_per_iaa is never used if nr_iaa == 0, for
> paranoia.
>
> Reported-by: Jerry Snitselaar <[email protected]>
> Signed-off-by: Tom Zanussi <[email protected]>
> ---
> drivers/crypto/intel/iaa/iaa_crypto_main.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
> index 1cd304de5388..b2191ade9011 100644
> --- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
> +++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
> @@ -806,6 +806,8 @@ static int save_iaa_wq(struct idxd_wq *wq)
> return -EINVAL;
>
> cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
> + if (!cpus_per_iaa)
> + cpus_per_iaa = 1;
> out:
> return 0;
> }
> @@ -821,10 +823,12 @@ static void remove_iaa_wq(struct idxd_wq *wq)
> }
> }
>
> - if (nr_iaa)
> + if (nr_iaa) {
> cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
> - else
> - cpus_per_iaa = 0;
> + if (!cpus_per_iaa)
> + cpus_per_iaa = 1;
> + } else
> + cpus_per_iaa = 1;
> }
>
> static int wq_table_add_wqs(int iaa, int cpu)
> --
> 2.34.1
>
>