Hello
When running some CI test jobs (targeting crypto tests), I always get the following WARNING:
[ 7.886361] ------------[ cut here ]------------
[ 7.886388] WARNING: CPU: 2 PID: 147 at kernel/workqueue.c:1473 __queue_work+0x3b8/0x3d0
[ 7.886394] Modules linked in: ghash_generic
[ 7.886409] CPU: 2 PID: 147 Comm: modprobe Not tainted 5.6.0-rc1-next-20200214-00068-g166c9264f0b1-dirty #545
[ 7.886414] Hardware name: Pine H64 model A (DT)
[ 7.886422] pstate: a0000085 (NzCv daIf -PAN -UAO)
[ 7.886429] pc : __queue_work+0x3b8/0x3d0
[ 7.886436] lr : __queue_work+0x1dc/0x3d0
[ 7.886440] sp : ffff800012073b50
[ 7.886445] x29: ffff800012073b50 x28: ffff8000117bb590
[ 7.886452] x27: 0000000000000100 x26: ffff80001132d018
[ 7.886460] x25: ffff800011336d58 x24: ffff800011629920
[ 7.886467] x23: ffff80001132d018 x22: 000000000000000e
[ 7.886474] x21: 0000000000000002 x20: ffff0000b9c08000
[ 7.886481] x19: ffff0000bd9b8400 x18: 0000000000000000
[ 7.886488] x17: 0000000000000000 x16: 0000000000000000
[ 7.886495] x15: 0000af8c526b5c68 x14: 02be881212d8d480
[ 7.886503] x13: 0000000000000352 x12: 0000000000000001
[ 7.886510] x11: 0000000000000400 x10: 0000000000000040
[ 7.886517] x9 : ffff80001163f5e8 x8 : ffff80001163f5e0
[ 7.886524] x7 : ffff0000b9800028 x6 : 0000000000000000
[ 7.886531] x5 : ffff0000b9800000 x4 : 0000000000000000
[ 7.886538] x3 : ffff0000bd9b4800 x2 : 0000000000000001
[ 7.886545] x1 : 0000000000000000 x0 : ffff8000117bb598
[ 7.886552] Call trace:
[ 7.886560] __queue_work+0x3b8/0x3d0
[ 7.886567] queue_work_on+0x6c/0x90
[ 7.886576] do_init_module+0x188/0x1f0
[ 7.886582] load_module+0x1d00/0x22b0
[ 7.886589] __do_sys_finit_module+0xd0/0xe8
[ 7.886595] __arm64_sys_finit_module+0x1c/0x28
[ 7.886605] el0_svc_common.constprop.0+0x68/0x160
[ 7.886613] do_el0_svc+0x20/0x80
[ 7.886621] el0_sync_handler+0x10c/0x180
[ 7.886627] el0_sync+0x140/0x180
[ 7.886638] ---[ end trace a44615ff285cb96c ]---
[ 7.886643] WARN for events
For finding what was cause this, I have added the following debug:
@@ -1468,8 +1470,10 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
/* pwq determined, queue */
trace_workqueue_queue_work(req_cpu, pwq, work);
- if (WARN_ON(!list_empty(&work->entry)))
+ if (WARN_ON(!list_empty(&work->entry))) {
+ pr_err("WARN for %s\n", wq->name);
goto out;
+ }
So it seems that it is a "events" workqueue that hit this problem.
Note that classic defconfig do not hit this problem, since it appears with the following config change:
-CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=y
+# CONFIG_CRYPTO_MANAGER_DISABLE_TESTS is not set
+CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y
I dont see any relation between crypto and this problem, but this problem with this config change is reproductible.
Regards
Hi Corentin,
On Mon, Feb 17, 2020 at 09:48:03PM +0100, Corentin Labbe wrote:
> When running some CI test jobs (targeting crypto tests), I always get the following WARNING:
Can you be more specific about which test triggers this? I used the config
option you mention and failed to reproduce after doing the boot-time crypto
tests and running various tcrypt incantations.
> [ 7.886361] ------------[ cut here ]------------
> [ 7.886388] WARNING: CPU: 2 PID: 147 at kernel/workqueue.c:1473 __queue_work+0x3b8/0x3d0
> [ 7.886394] Modules linked in: ghash_generic
> [ 7.886409] CPU: 2 PID: 147 Comm: modprobe Not tainted 5.6.0-rc1-next-20200214-00068-g166c9264f0b1-dirty #545
I was using just plain next-20200214. Can't find 166c9264f0b1, what tag/branch
were you using exactly?
> So it seems that it is a "events" workqueue that hit this problem.
Looks like "schedule_work(&init_free_wq)" in do_init_module(), can't be sure
though.
thanks,
Daniel
On Tue, Feb 18, 2020 at 11:35:04AM -0500, Daniel Jordan wrote:
> Hi Corentin,
>
> On Mon, Feb 17, 2020 at 09:48:03PM +0100, Corentin Labbe wrote:
> > When running some CI test jobs (targeting crypto tests), I always get the following WARNING:
>
> Can you be more specific about which test triggers this? I used the config
> option you mention and failed to reproduce after doing the boot-time crypto
> tests and running various tcrypt incantations.
>
Hello
It appears before any user space start. But according to the "Modules linked", probably ghash is already loaded and perhaps tested.
Removing GHASH lead to:
[ 7.920931] ------------[ cut here ]------------
[ 7.920955] WARNING: CPU: 1 PID: 120 at kernel/workqueue.c:1469 __queue_work+0x370/0x388
[ 7.920960] Modules linked in: ccm
And removing CCM lead to
[ 7.798877] ------------[ cut here ]------------
[ 7.798902] WARNING: CPU: 2 PID: 127 at kernel/workqueue.c:1469 __queue_work+0x370/0x388
[ 7.798907] Modules linked in: ctr
So it confirm that the problem is not related to the tested crypto algorithm.
> > [ 7.886361] ------------[ cut here ]------------
> > [ 7.886388] WARNING: CPU: 2 PID: 147 at kernel/workqueue.c:1473 __queue_work+0x3b8/0x3d0
> > [ 7.886394] Modules linked in: ghash_generic
> > [ 7.886409] CPU: 2 PID: 147 Comm: modprobe Not tainted 5.6.0-rc1-next-20200214-00068-g166c9264f0b1-dirty #545
>
> I was using just plain next-20200214. Can't find 166c9264f0b1, what tag/branch
> were you using exactly?
>
The pasted example has some commit to try to debug it.
But I got the same with plain next (like yesterday 5.6.0-rc2-next-20200219 and tomorow 5.6.0-rc2-next-20200220) and master got the same issue.
But for reproductability on different hardware, I agree it is difficult.
For the moment, I got it only on Allwinner H5, A64, H6 SoCs and imx8q.
[ 6.611449] ------------[ cut here ]------------
[ 6.613234] WARNING: CPU: 1 PID: 157 at /srv/data/clabbe/linux-next/kernel/workqueue.c:1471 __queue_work+0x324/0x3b0
[ 6.623809] Modules linked in: ghash_generic
[ 6.628101] CPU: 1 PID: 157 Comm: modprobe Not tainted 5.6.0-rc2-next-20200220 #82
[ 6.635710] Hardware name: NXP i.MX8MNano DDR4 EVK board (DT)
I tried amlogic boards and some qemu "virt" without success.
(I have added [email protected] to the CC)
On Thu, Feb 20, 2020 at 10:03:50AM +0100, Corentin Labbe wrote:
> But I got the same with plain next (like yesterday 5.6.0-rc2-next-20200219 and tomorow 5.6.0-rc2-next-20200220) and master got the same issue.
Thanks. I've been trying to reproduce this on an arm board but it's taking a
while to get it setup since I've never used it for kernel work.
Hoping to get it up soon, though someone with a working setup may be in a
better position to help with this.
On Fri, Feb 21, 2020 at 12:42:23PM -0500, Daniel Jordan wrote:
> On Thu, Feb 20, 2020 at 10:03:50AM +0100, Corentin Labbe wrote:
> > But I got the same with plain next (like yesterday 5.6.0-rc2-next-20200219 and tomorow 5.6.0-rc2-next-20200220) and master got the same issue.
>
> Thanks. I've been trying to reproduce this on an arm board but it's taking a
> while to get it setup since I've never used it for kernel work.
>
> Hoping to get it up soon, though someone with a working setup may be in a
> better position to help with this.
Any joy with this? It sounded to me like the issue also happens on a
mainline kernel. If this is the case, have you managed to bisect it?
Cheers,
Will
On Sun, Mar 01, 2020 at 06:53:51PM +0100, Corentin Labbe wrote:
> I tried to bisect this problem, but the result is:
...
> # first bad commit: [81ff5d2cba4f86cd850b9ee4a530cd221ee45aa3] Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
>
> The only interesting thing I see in this MR is: "Add fuzz testing to testmgr"
>
> But this wont help.
Hm, that merge commit has only a couple lines of powerpc build change, so maybe
there's something nondeterministic going on.
Does this fix it? I can't verify but figure it's worth trying the simplest
explanation first, which is that the work isn't initialized by the time it's
queued.
thanks,
daniel
---8<---
Subject: [PATCH] module: statically initialize init section freeing data
Signed-off-by: Daniel Jordan <[email protected]>
---
kernel/module.c | 13 +++----------
1 file changed, 3 insertions(+), 10 deletions(-)
diff --git a/kernel/module.c b/kernel/module.c
index 33569a01d6e1..db0cda206167 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -88,8 +88,9 @@ EXPORT_SYMBOL_GPL(module_mutex);
static LIST_HEAD(modules);
/* Work queue for freeing init sections in success case */
-static struct work_struct init_free_wq;
-static struct llist_head init_free_list;
+static void do_free_init(struct work_struct *w);
+static DECLARE_WORK(init_free_wq, do_free_init);
+static LLIST_HEAD(init_free_list);
#ifdef CONFIG_MODULES_TREE_LOOKUP
@@ -3501,14 +3502,6 @@ static void do_free_init(struct work_struct *w)
}
}
-static int __init modules_wq_init(void)
-{
- INIT_WORK(&init_free_wq, do_free_init);
- init_llist_head(&init_free_list);
- return 0;
-}
-module_init(modules_wq_init);
-
/*
* This is where the real work happens.
*
--
2.25.1
On Mon, Mar 02, 2020 at 12:25:10PM -0500, Daniel Jordan wrote:
> On Sun, Mar 01, 2020 at 06:53:51PM +0100, Corentin Labbe wrote:
> > I tried to bisect this problem, but the result is:
> ...
> > # first bad commit: [81ff5d2cba4f86cd850b9ee4a530cd221ee45aa3] Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
> >
> > The only interesting thing I see in this MR is: "Add fuzz testing to testmgr"
> >
> > But this wont help.
>
> Hm, that merge commit has only a couple lines of powerpc build change, so maybe
> there's something nondeterministic going on.
>
> Does this fix it? I can't verify but figure it's worth trying the simplest
> explanation first, which is that the work isn't initialized by the time it's
> queued.
>
> thanks,
> daniel
>
> ---8<---
>
> Subject: [PATCH] module: statically initialize init section freeing data
>
> Signed-off-by: Daniel Jordan <[email protected]>
> ---
> kernel/module.c | 13 +++----------
> 1 file changed, 3 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/module.c b/kernel/module.c
> index 33569a01d6e1..db0cda206167 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -88,8 +88,9 @@ EXPORT_SYMBOL_GPL(module_mutex);
> static LIST_HEAD(modules);
>
> /* Work queue for freeing init sections in success case */
> -static struct work_struct init_free_wq;
> -static struct llist_head init_free_list;
> +static void do_free_init(struct work_struct *w);
> +static DECLARE_WORK(init_free_wq, do_free_init);
> +static LLIST_HEAD(init_free_list);
>
> #ifdef CONFIG_MODULES_TREE_LOOKUP
>
> @@ -3501,14 +3502,6 @@ static void do_free_init(struct work_struct *w)
> }
> }
>
> -static int __init modules_wq_init(void)
> -{
> - INIT_WORK(&init_free_wq, do_free_init);
> - init_llist_head(&init_free_list);
> - return 0;
> -}
> -module_init(modules_wq_init);
> -
> /*
> * This is where the real work happens.
> *
Hello
The patch fix the issue. Thanks!
So you could add:
Reported-by: Corentin Labbe <[email protected]>
Tested-by: Corentin Labbe <[email protected]>
Tested-on: sun50i-h6-pine-h64
Tested-on: imx8mn-ddr4-evk
Tested-on: sun50i-a64-bananapi-m64
Thanks again
Regards
On Tue, Mar 03, 2020 at 08:48:19AM +0100, Corentin Labbe wrote:
> The patch fix the issue. Thanks!
Thanks for trying it!
> So you could add:
> Reported-by: Corentin Labbe <[email protected]>
> Tested-by: Corentin Labbe <[email protected]>
> Tested-on: sun50i-h6-pine-h64
> Tested-on: imx8mn-ddr4-evk
> Tested-on: sun50i-a64-bananapi-m64
I definitely will if the patch turns out to be the right fix.
thanks,
Daniel