Hey All,
In doing some local testing, I noticed I've started to see boot
stalls with CONFIG_WW_MUTEX_SELFTEST with 6.9-rc on a 64cpu qemu
environment.
I've bisected the problem down to:
5797b1c18919 (workqueue: Implement system-wide nr_active enforcement
for unbound workqueues)
+ the fix needed for that change:
15930da42f89 (workqueue: Don't call cpumask_test_cpu() with -1 CPU
in wq_update_node_max_active())
I've seen problems in the past with the ww_mutex selftest code, so
it's likely a problem in the test itself, but I wanted to raise the
issue so folks were aware and see if there were suggestions for a
solution.
It seems to get stuck in __test_cycle() after a few runs when it hits
flush_workqueue()
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/locking/test-ww_mutex.c#n344
That seems to be because, once the various work functions are queued,
not all of them seem to get a chance to run (they use a circular chain
of completions, so the 0th workfunc won't finish until after the
nrthreads-th workfunc runs).
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/locking/test-ww_mutex.c#n295
I'm noticing this happens when the test gets to nrthreads=9 (the test
usually goes up to NR_CPUS), so we queue work items 0 through 8, but
the 9th work function (index 8) never seems to run. Looking at
__queue_work(), I do see pwq_tryinc_nr_active() fail for that 9th work
struct, and we end up inserting the work as inactive.
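To make the accounting described above concrete, here is a toy
userspace model. It is only a sketch and not the kernel's code:
struct toy_wq and the toy_*() helpers are made-up names, and the real
pwq_tryinc_nr_active() is considerably more involved. It assumes a
single cap on concurrently active items and parks anything queued
beyond that cap as inactive.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a workqueue's active-work accounting. */
struct toy_wq {
	int nr_active;   /* items currently allowed to run */
	int max_active;  /* cap on concurrently active items */
	int nr_inactive; /* items parked until a slot frees up */
};

void toy_wq_init(struct toy_wq *wq, int max_active)
{
	assert(max_active > 0);
	wq->nr_active = 0;
	wq->max_active = max_active;
	wq->nr_inactive = 0;
}

/* Take an active slot if one is free (loosely the role of
 * pwq_tryinc_nr_active(); the name and logic here are illustrative). */
bool toy_tryinc_nr_active(struct toy_wq *wq)
{
	if (wq->nr_active >= wq->max_active)
		return false;
	wq->nr_active++;
	return true;
}

/* Queue one item: true if it may run now, false if parked as inactive. */
bool toy_queue_work(struct toy_wq *wq)
{
	if (toy_tryinc_nr_active(wq))
		return true;
	wq->nr_inactive++;
	return false;
}

/* An active item finished: free its slot and promote one parked item. */
void toy_work_done(struct toy_wq *wq)
{
	assert(wq->nr_active > 0);
	wq->nr_active--;
	if (wq->nr_inactive > 0) {
		wq->nr_inactive--;
		wq->nr_active++;
	}
}

/* Queue nthreads items; return how many ended up parked as inactive. */
int toy_queue_cycle(struct toy_wq *wq, int nthreads)
{
	int parked = 0;

	for (int n = 0; n < nthreads; n++)
		if (!toy_queue_work(wq))
			parked++;
	return parked;
}
```

With a cap of 8, toy_queue_cycle(&wq, 9) parks exactly one item, and
that item is only promoted once toy_work_done() is called. If, as
suspected above, none of the running work functions can finish before
the parked one runs, that call never happens and the flush waits
forever. With a cap of 512, all nine items are active at once.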
I notice that the change that uncovers this issue (5797b1c18919) both
tweaks pwq_tryinc_nr_active() and sets WQ_DFL_MIN_ACTIVE to 8, so
maybe that's a hint that the test is abusing the number of queued
work functions? Though that seems odd, because that's the min, not the
max (which seems to be 512).
Anyway, let me know if there's anything further I can share to help
debug this. I'll continue digging here as well.
thanks
-john
Hello, John.
On Fri, May 03, 2024 at 06:01:49PM -0700, John Stultz wrote:
> Hey All,
> In doing some local testing, I noticed I've started to see boot
> stalls with CONFIG_WW_MUTEX_SELFTEST with 6.9-rc on a 64cpu qemu
> environment.
>
> I've bisected the problem down to:
> 5797b1c18919 (workqueue: Implement system-wide nr_active enforcement
> for unbound workqueues)
> + the fix needed for that change:
> 15930da42f89 (workqueue: Don't call cpumask_test_cpu() with -1 CPU
> in wq_update_node_max_active())
This should be fixed by d40f92020c7a ("workqueue: The default node_nr_active
should have its max set to max_active"). Can you please confirm the fix?
Thanks and sorry about the hassle.
--
tejun
On Fri, May 3, 2024 at 7:47 PM Tejun Heo wrote:
> On Fri, May 03, 2024 at 06:01:49PM -0700, John Stultz wrote:
> > Hey All,
> > In doing some local testing, I noticed I've started to see boot
> > stalls with CONFIG_WW_MUTEX_SELFTEST with 6.9-rc on a 64cpu qemu
> > environment.
> >
> > I've bisected the problem down to:
> > 5797b1c18919 (workqueue: Implement system-wide nr_active enforcement
> > for unbound workqueues)
> > + the fix needed for that change:
> > 15930da42f89 (workqueue: Don't call cpumask_test_cpu() with -1 CPU
> > in wq_update_node_max_active())
>
> This should be fixed by d40f92020c7a ("workqueue: The default node_nr_active
> should have its max set to max_active"). Can you please confirm the fix?
Ah! Thank you, that does resolve the issue!
> Thanks and sorry about the hassle.
Thanks so much for the quick response (can't beat a fix already in the tree!)
-john