2022-03-21 08:49:05

by Tetsuo Handa

Subject: [PATCH v3 (repost)] workqueue: Warn flushing of kernel-global workqueues

Since a flush operation synchronously waits for completion, flushing the
kernel-global WQs (e.g. system_wq) might introduce the possibility of a
deadlock due to an unexpected locking dependency. Tejun Heo commented that
it makes no sense at all to call flush_workqueue() on the shared WQs, as
the caller has no idea what it will end up waiting for.
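
To illustrate the kind of dependency in question, here is a minimal sketch
(with a hypothetical my_lock and my_work_fn, not taken from any in-tree
user) where the flusher holds a lock that a work item on the same shared WQ
also needs:

#include <linux/mutex.h>
#include <linux/workqueue.h>

static DEFINE_MUTEX(my_lock);

static void my_work_fn(struct work_struct *work)
{
        mutex_lock(&my_lock);           /* blocks: the flusher below holds my_lock */
        /* ... */
        mutex_unlock(&my_lock);
}
static DECLARE_WORK(my_work, my_work_fn);

static void buggy_shutdown(void)
{
        schedule_work(&my_work);        /* queued on system_wq */
        mutex_lock(&my_lock);
        flush_workqueue(system_wq);     /* waits for my_work_fn, which waits for my_lock */
        mutex_unlock(&my_lock);         /* never reached: deadlock */
}

The same cycle can form even when the offending work item was queued by
completely unrelated code, which is why the caller has no idea what it will
end up waiting for.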

Although flush_scheduled_work(), which flushes the system_wq WQ, carries the
warning "Think twice before calling this function! It's very easy to get
into trouble if you don't take great care.", syzbot found a circular locking
dependency caused by flushing the system_long_wq WQ [1]. Therefore, let's
change direction so that developers use their own local WQs whenever
flush_workqueue() is unavoidable.

To give developers time to update their modules, for now just emit a
rate-limited warning message when flush_workqueue() is called on a
kernel-global WQ. We will eventually convert this warning message into
WARN_ON() and kill flush_scheduled_work().

This patch introduces the __WQ_NO_FLUSH flag for emitting the warning. Do
not set this flag when creating your own local WQs while updating your
module, because destroy_workqueue() involves a flush operation.

Theoretically, flushing a specific work item with flush_work() while it is
queued on a kernel-global WQ (which is !WQ_MEM_RECLAIM) also has a
possibility of deadlock. However, this patch does not emit a warning when
flush_work() is called on work items queued on kernel-global WQs, based on
the assumption that we can create kworker threads as needed and will not
hit the max_active limit.
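
As a sketch (reusing the hypothetical my_work from above), a caller that
only needs to wait for its own work item can therefore keep queueing on a
kernel-global WQ and wait with flush_work() instead of flushing the whole
WQ:

static void my_shutdown(void)
{
        /* Wait only for our own item, not for everything on system_wq. */
        flush_work(&my_work);
        /* or cancel_work_sync(&my_work), if it must not run again */
}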

Link: https://syzkaller.appspot.com/bug?extid=831661966588c802aae9 [1]
Signed-off-by: Tetsuo Handa <[email protected]>
---
This patch has been in linux-next.git as commit b9c20da356db1b39 ("workqueue:
Warn flushing of kernel-global workqueues") since next-20220301, and I have
received no failure reports. I think this patch is now safe to go into
linux.git; developers will then find it and update their modules to use
their own local WQs.

Changes in v3:
  Don't check flush_work() attempt.
  Use a private WQ_ flag.

Changes in v2:
  Removed #ifdef CONFIG_PROVE_LOCKING=y check.
  Also check flush_work() attempt.
  Shorten warning message.
  Introduced a public WQ_ flag, which is initially meant for use by
  only system-wide WQs, but allows private WQs used by built-in modules
  to use this flag for detecting unexpected flush attempts if they want.

include/linux/workqueue.h | 15 +++------------
kernel/workqueue.c | 36 +++++++++++++++++++++++++++++-------
2 files changed, 32 insertions(+), 19 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 7fee9b6cfede..7b13fae377e2 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -339,6 +339,7 @@ enum {
         __WQ_ORDERED            = 1 << 17, /* internal: workqueue is ordered */
         __WQ_LEGACY             = 1 << 18, /* internal: create*_workqueue() */
         __WQ_ORDERED_EXPLICIT   = 1 << 19, /* internal: alloc_ordered_workqueue() */
+        __WQ_NO_FLUSH           = 1 << 20, /* internal: warn flush_workqueue() */

         WQ_MAX_ACTIVE           = 512,    /* I like 512, better ideas? */
         WQ_MAX_UNBOUND_PER_CPU  = 4,      /* 4 * #cpus for unbound wq */
@@ -569,18 +570,8 @@ static inline bool schedule_work(struct work_struct *work)
* Forces execution of the kernel-global workqueue and blocks until its
* completion.
*
- * Think twice before calling this function! It's very easy to get into
- * trouble if you don't take great care. Either of the following situations
- * will lead to deadlock:
- *
- * One of the work items currently on the workqueue needs to acquire
- * a lock held by your code or its caller.
- *
- * Your code is running in the context of a work routine.
- *
- * They will be detected by lockdep when they occur, but the first might not
- * occur very often. It depends on what work items are on the workqueue and
- * what locks they need, which you have no control over.
+ * Please stop calling this function. If you need to flush kernel-global
+ * workqueue, please use your local workqueue.
*
* In most situations flushing the entire workqueue is overkill; you merely
* need to know that a particular work item isn't queued and isn't running.
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 33f1106b4f99..bc271579704f 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2805,6 +2805,25 @@ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq,
         return wait;
 }

+static void warn_flush_attempt(struct workqueue_struct *wq)
+{
+        /*
+         * Since there are known in-tree modules which will emit this warning,
+         * for now don't use WARN_ON() in order not to break kernel testing.
+         *
+         * Print whole traces with ratelimit, in order to make sure that
+         * this warning is not overlooked while this warning does not flood
+         * console and kernel log buffer.
+         */
+        static DEFINE_RATELIMIT_STATE(flush_warn_rs, 600 * HZ, 1);
+
+        ratelimit_set_flags(&flush_warn_rs, RATELIMIT_MSG_ON_RELEASE);
+        if (!__ratelimit(&flush_warn_rs))
+                return;
+        pr_warn("Please do not flush %s WQ.\n", wq->name);
+        dump_stack();
+}
+
 /**
  * flush_workqueue - ensure that any scheduled work has run to completion.
  * @wq: workqueue to flush
@@ -2824,6 +2843,9 @@ void flush_workqueue(struct workqueue_struct *wq)
         if (WARN_ON(!wq_online))
                 return;

+        if (unlikely(wq->flags & __WQ_NO_FLUSH))
+                warn_flush_attempt(wq);
+
         lock_map_acquire(&wq->lockdep_map);
         lock_map_release(&wq->lockdep_map);

@@ -6054,17 +6076,17 @@ void __init workqueue_init_early(void)
                 ordered_wq_attrs[i] = attrs;
         }

-        system_wq = alloc_workqueue("events", 0, 0);
-        system_highpri_wq = alloc_workqueue("events_highpri", WQ_HIGHPRI, 0);
-        system_long_wq = alloc_workqueue("events_long", 0, 0);
-        system_unbound_wq = alloc_workqueue("events_unbound", WQ_UNBOUND,
+        system_wq = alloc_workqueue("events", __WQ_NO_FLUSH, 0);
+        system_highpri_wq = alloc_workqueue("events_highpri", __WQ_NO_FLUSH | WQ_HIGHPRI, 0);
+        system_long_wq = alloc_workqueue("events_long", __WQ_NO_FLUSH, 0);
+        system_unbound_wq = alloc_workqueue("events_unbound", __WQ_NO_FLUSH | WQ_UNBOUND,
                                             WQ_UNBOUND_MAX_ACTIVE);
         system_freezable_wq = alloc_workqueue("events_freezable",
-                                              WQ_FREEZABLE, 0);
+                                              __WQ_NO_FLUSH | WQ_FREEZABLE, 0);
         system_power_efficient_wq = alloc_workqueue("events_power_efficient",
-                                                    WQ_POWER_EFFICIENT, 0);
+                                                    __WQ_NO_FLUSH | WQ_POWER_EFFICIENT, 0);
         system_freezable_power_efficient_wq = alloc_workqueue("events_freezable_power_efficient",
-                                                              WQ_FREEZABLE | WQ_POWER_EFFICIENT,
+                                                              __WQ_NO_FLUSH | WQ_FREEZABLE | WQ_POWER_EFFICIENT,
                                                               0);
BUG_ON(!system_wq || !system_highpri_wq || !system_long_wq ||
!system_unbound_wq || !system_freezable_wq ||
--
2.32.0



2022-03-21 17:37:54

by Linus Torvalds

Subject: Re: [PATCH v3 (repost)] workqueue: Warn flushing of kernel-global workqueues

On Fri, Mar 18, 2022 at 9:43 PM Tetsuo Handa
<[email protected]> wrote:
>
> Since a flush operation synchronously waits for completion, flushing the
> kernel-global WQs (e.g. system_wq) might introduce the possibility of a
> deadlock due to an unexpected locking dependency. Tejun Heo commented that
> it makes no sense at all to call flush_workqueue() on the shared WQs, as
> the caller has no idea what it will end up waiting for.

NAK on this patch for a very simple reason:

static inline void flush_scheduled_work(void)
{
        flush_workqueue(system_wq);
}

and now grep for flush_scheduled_work().

The *other* system workqueue flushes may be rare and "that subsystem
should just be converted to do its own workqueue".

But flush_scheduled_work() is literally exported as an explicit and
separate interface.

The fact that the function has a big warning in the comment above it
doesn't change that fact. At the very least, this patch has to be
preceded by a couple of other patches that fix a couple of subsystems
and document "this is what you should do".

Because suddenly saying "hey, we gave you this interface, now we're
warning about it because it's going to go away" without actually
showing how to do it instead is not acceptable.

And honestly, I don't personally see a good conversion. We literally
have all those "schedule_{delayed_}work{_on}()" etc helper functions
that are *designed* to use this system_wq. People *need* the ability
to flush those things, even if it's only for module unload.
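
The idiom Linus is referring to looks roughly like this (a sketch with
hypothetical names; many in-tree drivers follow this shape):

static void my_work_fn(struct work_struct *work);
static DECLARE_WORK(my_work, my_work_fn);

static void my_event_handler(void)
{
        schedule_work(&my_work);        /* fire and forget onto system_wq */
}

static void __exit my_driver_exit(void)
{
        /* drain system_wq before freeing whatever my_work_fn uses */
        flush_scheduled_work();
}
module_exit(my_driver_exit);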

So I really think this patch on its own is completely broken. It's not
pointing to a way forward, it's just saying "don't do this" with no
really acceptable way to not do it.

Removing flush_scheduled_work() needs to be paired with removing the
"schedule_{delayed_}work{_on}()" helpers too.

And I don't see you having a good alternative.

So until that clear way forward exists, NAK.

Linus

2022-03-21 22:21:09

by Tejun Heo

Subject: Re: [PATCH v3 (repost)] workqueue: Warn flushing of kernel-global workqueues

Hello, Tetsuo.

On Sat, Mar 19, 2022 at 01:42:59PM +0900, Tetsuo Handa wrote:
> Since a flush operation synchronously waits for completion, flushing the
> kernel-global WQs (e.g. system_wq) might introduce the possibility of a
> deadlock due to an unexpected locking dependency. Tejun Heo commented that
> it makes no sense at all to call flush_workqueue() on the shared WQs, as
> the caller has no idea what it will end up waiting for.
>
> Although flush_scheduled_work(), which flushes the system_wq WQ, carries the
> warning "Think twice before calling this function! It's very easy to get
> into trouble if you don't take great care.", syzbot found a circular locking
> dependency caused by flushing the system_long_wq WQ [1]. Therefore, let's
> change direction so that developers use their own local WQs whenever
> flush_workqueue() is unavoidable.
>
> To give developers time to update their modules, for now just emit a
> rate-limited warning message when flush_workqueue() is called on a
> kernel-global WQ. We will eventually convert this warning message into
> WARN_ON() and kill flush_scheduled_work().
>
> This patch introduces the __WQ_NO_FLUSH flag for emitting the warning. Do
> not set this flag when creating your own local WQs while updating your
> module, because destroy_workqueue() involves a flush operation.
>
> Theoretically, flushing a specific work item with flush_work() while it is
> queued on a kernel-global WQ (which is !WQ_MEM_RECLAIM) also has a
> possibility of deadlock. However, this patch does not emit a warning when
> flush_work() is called on work items queued on kernel-global WQs, based on
> the assumption that we can create kworker threads as needed and will not
> hit the max_active limit.

As noted in the other thread, we can float this sort of patch in -next to
suss out the ones which aren't immediately obvious - e.g. something saves a
pointer to one of the system workqueues and flushes that. But if this is to
happen, somebody has to do most of the legwork to convert most if not all
of the existing users, which isn't necessarily difficult but is tedious,
which is why it hasn't been done yet. So, if you wanna take on this, you
gotta take on the conversions too. Just declaring it so doesn't really work.

Thanks.

--
tejun

2022-05-14 01:06:39

by Dmitry Torokhov

Subject: Re: [PATCH v3 (repost)] workqueue: Warn flushing of kernel-global workqueues

On Thu, May 12, 2022 at 08:32:10PM +0900, Tetsuo Handa wrote:
> On 2022/05/12 19:38, Dmitry Torokhov wrote:
> > Hi Tejun,
> >
> > On Mon, Mar 21, 2022 at 07:02:45AM -1000, Tejun Heo wrote:
> >> I'm willing to bet that the majority of the use cases can be converted to
> >> use flush_work() and that'd be the preference. We need a separate workqueue
> >> iff the flush requirement is complex (e.g. there are multiple dynamic work
> >> items in flight which need to be flushed together) or the work items need
> >> some special attributes (such as MEM_RECLAIM or HIGHPRI) which don't apply
> >> to the system_wq users in the first place.
> >
> > This means that now the code has to keep track of all work items that it
> > allocated, instead of being able "fire and forget" works (when dealing
> > with extremely infrequent events) and rely on flush_workqueue() to
> > cleanup.
>
> Yes. Moreover, a patch to catch and refuse at compile time was proposed at
> https://lkml.kernel.org/r/[email protected] .

My comment was not a wholesale endorsement of Tejun's statement, but
rather a note of the fact that it again adds complexity (at least as far
as driver writers are concerned) to the kernel code.

Also as far as I can see the patch was rejected.

>
> > That flush typically happens in module unload path, and I
> > wonder if the restriction on flush_workqueue() could be relaxed to allow
> > calling it on unload.
>
> A patch for drivers/input/mouse/psmouse-smbus.c is waiting for your response at
> https://lkml.kernel.org/r/[email protected] .
> As in many modules, flush_workqueue() is called only on module unload in your case.

Yes, I saw that patch, and that is what prompted my response. I find that
it adds complexity and I was wondering if it could be avoided. It is also
unclear to me whether there is an additional cost coming from allocating a
dedicated workqueue.

>
> We currently don't have a flag to tell whether the caller is inside module unload
> path. And even inside module unload path, flushing the system-wide workqueue is
> problematic under e.g. GFP_NOFS/GFP_NOIO context.

Sorry, I do not follow here. Are there module unloading code that runs
as GFP_NOFS/GFP_NOIO?

> Therefore, I don't think that the caller being inside the module unload
> path is a good exception.
>
> Removing flush_scheduled_work() is for proactively avoiding new problems like
> https://lkml.kernel.org/r/[email protected]
> and https://lkml.kernel.org/r/[email protected] .
>
> Using a local WQ also helps for documentation purposes.
> This change makes it clear where the work's dependencies are.
> Please grep the linux-next.git tree; some users have already been converted.

I understand that for some of them the change makes sense, but it would
be nice to continue using the simple API under limited circumstances.

>
> Any chance you have too many out-of-tree modules to convert?
>

No, we are trying to get everything upstream.

Thanks.

--
Dmitry

2022-05-14 01:41:23

by Tejun Heo

Subject: Re: [PATCH v3 (repost)] workqueue: Warn flushing of kernel-global workqueues

Hello, Dmitry.

On Thu, May 12, 2022 at 06:13:35AM -0700, Dmitry Torokhov wrote:
> > > This means that now the code has to keep track of all work items that it
> > > allocated, instead of being able "fire and forget" works (when dealing
> > > with extremely infrequent events) and rely on flush_workqueue() to
> > > cleanup.
> >
> > Yes. Moreover, a patch to catch and refuse at compile time was proposed at
> > https://lkml.kernel.org/r/[email protected] .
>
> My comment was not a wholesale endorsement of Tejun's statement, but
> rather a note of the fact that it again adds complexity (at least as far
> as driver writers are concerned) to the kernel code.

I was more thinking about cases where there are a small number of static
work items. If there are multiple dynamic work items, creating a workqueue
as a flush domain is the way to go. It does add a bit of complexity but
shouldn't be too bad - e.g. it just adds the alloc_workqueue() during init
and the exit path can simply use destroy_workqueue() instead of flush.
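
Concretely, the conversion Tejun describes amounts to something like the
following (a sketch with a hypothetical driver-private WQ, not a specific
in-tree patch):

static struct workqueue_struct *my_wq;

static int __init my_driver_init(void)
{
        my_wq = alloc_workqueue("my_driver", 0, 0);
        if (!my_wq)
                return -ENOMEM;
        return 0;
}

/* use queue_work(my_wq, &my_work) wherever schedule_work(&my_work) was used */

static void __exit my_driver_exit(void)
{
        destroy_workqueue(my_wq);       /* flushes pending work, then frees the WQ */
}

Since the driver only ever flushes (or destroys) its own my_wq, the flush
can no longer end up waiting on unrelated work items queued by other
subsystems.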

> > > That flush typically happens in module unload path, and I
> > > wonder if the restriction on flush_workqueue() could be relaxed to allow
> > > calling it on unload.
> >
> > A patch for drivers/input/mouse/psmouse-smbus.c is waiting for your response at
> > https://lkml.kernel.org/r/[email protected] .
> > As in many modules, flush_workqueue() is called only on module unload in your case.
>
> Yes, I saw that patch, and that is what prompted my response. I find it
> adding complexity and I was wondering if it could be avoided. It also
> unclear to me if there is an additional cost coming from allocating a
> dedicated workqueue.

A workqueue without WQ_MEM_RECLAIM is really cheap. All it does is track
what's in flight for that particular frontend while interfacing with the
shared worker pool.

> I understand that for some of them the change makes sense, but it would
> be nice to continue using the simple API under limited circumstances.

Hmmm... unfortunately, I can't think of a way to guarantee that a module
unloading path can't get involved in a deadlock scenario through system_wq.
Given that the added complexity should be something like half a dozen lines
of code, switching to separate workqueues feels like the right direction to
me.

Thanks.

--
tejun

2022-05-16 12:05:08

by Tetsuo Handa

Subject: Re: [PATCH v3 (repost)] workqueue: Warn flushing of kernel-global workqueues

On 2022/05/12 22:13, Dmitry Torokhov wrote:
> Also as far as I can see the patch was rejected.

Not rejected. I sent
https://lkml.kernel.org/r/[email protected]
for 5.19.

>> We currently don't have a flag to tell whether the caller is inside module unload
>> path. And even inside module unload path, flushing the system-wide workqueue is
>> problematic under e.g. GFP_NOFS/GFP_NOIO context.
>
> Sorry, I do not follow here. Are there module unloading code that runs
> as GFP_NOFS/GFP_NOIO?

It is GFP_KERNEL context when a module's __exit function is called. But whether
flush_workqueue() is called from a restricted context depends on what locks
the module's __exit function holds.

If request_module() is called from some work function using one of the
system-wide WQs, and flush_workqueue() is called on that WQ from a module's
__exit function, the kernel might deadlock on the module_mutex lock. Making
sure that flush_workqueue() is not called on the system-wide WQs is the
safer choice.