Subject: [PATCH v2 1/2] workqueue: Use rcuwait for wq_manager_wait

The workqueue code has its internal spinlock (pool::lock) and also
implicit spinlock usage in the wq_manager waitqueue. These spinlocks
are converted to 'sleeping' spinlocks on an RT kernel.

Workqueue functions can be invoked from contexts which are truly atomic
even on a PREEMPT_RT enabled kernel. Taking sleeping locks from such
contexts is forbidden.

pool::lock can be converted to a raw spinlock as the lock hold times
are short. But the workqueue manager waitqueue is handled inside
pool::lock held regions, which again violates the lock nesting rules
of raw and regular spinlocks.
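
To illustrate the nesting problem, a simplified sketch (assuming
pool::lock has already been converted to a raw_spinlock_t; this is not
code from this patch):

	raw_spin_lock_irq(&pool->lock);
	/* ... */
	wake_up(&wq_manager_wait);	/* takes wait_queue_head::lock, which
					 * is a sleeping spinlock on RT and
					 * must not nest inside a raw lock */
	raw_spin_unlock_irq(&pool->lock);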

The manager waitqueue has no special requirements like custom wakeup
callbacks or mass wakeups. While it does not use exclusive wait mode
explicitly, there is no strict requirement to queue the waiters in a
particular order because there is only one waiter at a time.

This allows the waitqueue to be replaced with rcuwait, which solves the
locking problem because rcuwait relies on the caller's existing locking.
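
For reference, rcuwait supports a single waiter and its wake-up side
only needs RCU plus the caller's own serialization, so no sleeping lock
is taken. The usage pattern is roughly as follows (a generic sketch with
placeholder names, not the exact hunks below):

	static struct rcuwait my_wait = __RCUWAIT_INITIALIZER(my_wait);

	/* waiter side: sleeps until condition() is true; the condition
	 * is re-evaluated after every wake-up */
	rcuwait_wait_event(&my_wait, condition(), TASK_UNINTERRUPTIBLE);

	/* waker side: wakes the single waiter, if one is queued */
	rcuwait_wake_up(&my_wait);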

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
kernel/workqueue.c | 24 +++++++++++++++++++-----
1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 891ccad5f2716..e1d9af713df4a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -301,7 +301,8 @@ static struct workqueue_attrs *wq_update_unbound_numa_attrs_buf;
 static DEFINE_MUTEX(wq_pool_mutex);	/* protects pools and workqueues list */
 static DEFINE_MUTEX(wq_pool_attach_mutex); /* protects worker attach/detach */
 static DEFINE_SPINLOCK(wq_mayday_lock);	/* protects wq->maydays list */
-static DECLARE_WAIT_QUEUE_HEAD(wq_manager_wait); /* wait for manager to go away */
+/* wait for manager to go away */
+static struct rcuwait manager_wait = __RCUWAIT_INITIALIZER(manager_wait);
 
 static LIST_HEAD(workqueues);		/* PR: list of all workqueues */
 static bool workqueue_freezing;		/* PL: have wqs started freezing? */
@@ -2140,7 +2141,7 @@ static bool manage_workers(struct worker *worker)
 
 	pool->manager = NULL;
 	pool->flags &= ~POOL_MANAGER_ACTIVE;
-	wake_up(&wq_manager_wait);
+	rcuwait_wake_up(&manager_wait);
 	return true;
 }
 
@@ -3504,6 +3505,18 @@ static void rcu_free_pool(struct rcu_head *rcu)
 	kfree(pool);
 }
 
+/* This returns with the lock held on success (pool manager is inactive). */
+static bool wq_manager_inactive(struct worker_pool *pool)
+{
+	spin_lock_irq(&pool->lock);
+
+	if (pool->flags & POOL_MANAGER_ACTIVE) {
+		spin_unlock_irq(&pool->lock);
+		return false;
+	}
+	return true;
+}
+
 /**
  * put_unbound_pool - put a worker_pool
  * @pool: worker_pool to put
@@ -3539,10 +3552,11 @@ static void put_unbound_pool(struct worker_pool *pool)
 	 * Become the manager and destroy all workers. This prevents
 	 * @pool's workers from blocking on attach_mutex. We're the last
 	 * manager and @pool gets freed with the flag set.
+	 * Because of how wq_manager_inactive() works, we will hold the
+	 * spinlock after a successful wait.
 	 */
-	spin_lock_irq(&pool->lock);
-	wait_event_lock_irq(wq_manager_wait,
-			    !(pool->flags & POOL_MANAGER_ACTIVE), pool->lock);
+	rcuwait_wait_event(&manager_wait, wq_manager_inactive(pool),
+			   TASK_UNINTERRUPTIBLE);
 	pool->flags |= POOL_MANAGER_ACTIVE;
 
 	while ((worker = first_idle_worker(pool)))
--
2.27.0.rc0


2020-05-28 03:11:45

by Lai Jiangshan

Subject: [PATCH 1/2] workqueue: pin the pool while it is managing

So that put_unbound_pool() can be sure that all workers are idle and
that no manager is still running. It then doesn't need to wait for any
manager and can go ahead and delete all the idle workers straight away.
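
The pinning works because an unbound pwq holds a reference on its pool
for the pwq's entire lifetime, so keeping the pwq alive keeps the pool
alive. Roughly, the existing helpers look like this (simplified sketch,
error checking and lockdep annotations omitted):

	/* both must be called with pool->lock held */
	static void get_pwq(struct pool_workqueue *pwq)
	{
		pwq->refcnt++;
	}

	static void put_pwq(struct pool_workqueue *pwq)
	{
		if (--pwq->refcnt)
			return;
		/*
		 * Last reference of an unbound pwq: the release is bounced
		 * to a work item which eventually calls put_unbound_pool().
		 */
		schedule_work(&pwq->unbound_release_work);
	}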

Also remove the manager waitqueue, because it is no longer needed. As
Sebastian Andrzej Siewior said:

The workqueue code has its internal spinlock (pool::lock) and also
implicit spinlock usage in the wq_manager waitqueue. These spinlocks
are converted to 'sleeping' spinlocks on an RT kernel.

Workqueue functions can be invoked from contexts which are truly atomic
even on a PREEMPT_RT enabled kernel. Taking sleeping locks from such
contexts is forbidden.

pool::lock can be converted to a raw spinlock as the lock hold times
are short. But the workqueue manager waitqueue is handled inside
pool::lock held regions, which again violates the lock nesting rules
of raw and regular spinlocks.

Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]
Signed-off-by: Lai Jiangshan <[email protected]>
---
kernel/workqueue.c | 23 ++++++++++++-----------
1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 891ccad5f271..fde10a5dba82 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -301,7 +301,6 @@ static struct workqueue_attrs *wq_update_unbound_numa_attrs_buf;
 static DEFINE_MUTEX(wq_pool_mutex);	/* protects pools and workqueues list */
 static DEFINE_MUTEX(wq_pool_attach_mutex); /* protects worker attach/detach */
 static DEFINE_SPINLOCK(wq_mayday_lock);	/* protects wq->maydays list */
-static DECLARE_WAIT_QUEUE_HEAD(wq_manager_wait); /* wait for manager to go away */
 
 static LIST_HEAD(workqueues);		/* PR: list of all workqueues */
 static bool workqueue_freezing;		/* PL: have wqs started freezing? */
@@ -2129,10 +2128,21 @@ __acquires(&pool->lock)
 static bool manage_workers(struct worker *worker)
 {
 	struct worker_pool *pool = worker->pool;
+	struct work_struct *work = list_first_entry(&pool->worklist,
+						struct work_struct, entry);
+	struct pool_workqueue *pwq = get_work_pwq(work);
 
 	if (pool->flags & POOL_MANAGER_ACTIVE)
 		return false;
 
+	/*
+	 * If @pwq is for an unbound wq, its base ref and the @pool's
+	 * ref may be put at any time due to an attribute change.
+	 * Pin @pwq to also pin the @pool until the manager is done
+	 * with it.
+	 */
+	get_pwq(pwq);
+
 	pool->flags |= POOL_MANAGER_ACTIVE;
 	pool->manager = worker;
 
@@ -2140,7 +2150,7 @@ static bool manage_workers(struct worker *worker)
 
 	pool->manager = NULL;
 	pool->flags &= ~POOL_MANAGER_ACTIVE;
-	wake_up(&wq_manager_wait);
+	put_pwq(pwq);
 	return true;
 }
 
@@ -3535,16 +3545,7 @@ static void put_unbound_pool(struct worker_pool *pool)
 	idr_remove(&worker_pool_idr, pool->id);
 	hash_del(&pool->hash_node);
 
-	/*
-	 * Become the manager and destroy all workers. This prevents
-	 * @pool's workers from blocking on attach_mutex. We're the last
-	 * manager and @pool gets freed with the flag set.
-	 */
 	spin_lock_irq(&pool->lock);
-	wait_event_lock_irq(wq_manager_wait,
-			    !(pool->flags & POOL_MANAGER_ACTIVE), pool->lock);
-	pool->flags |= POOL_MANAGER_ACTIVE;
-
 	while ((worker = first_idle_worker(pool)))
 		destroy_worker(worker);
 	WARN_ON(pool->nr_workers || pool->nr_idle);
--
2.20.1

Subject: Re: [PATCH 1/2] workqueue: pin the pool while it is managing

On 2020-05-28 03:06:55 [+0000], Lai Jiangshan wrote:
> So that put_unbound_pool() can be sure that all workers are idle and
> that no manager is still running. It then doesn't need to wait for any
> manager and can go ahead and delete all the idle workers straight away.
>
> Also remove the manager waitqueue, because it is no longer needed. As
> Sebastian Andrzej Siewior said:
>
> The workqueue code has its internal spinlock (pool::lock) and also
> implicit spinlock usage in the wq_manager waitqueue. These spinlocks
> are converted to 'sleeping' spinlocks on an RT kernel.
>
> Workqueue functions can be invoked from contexts which are truly atomic
> even on a PREEMPT_RT enabled kernel. Taking sleeping locks from such
> contexts is forbidden.
>
> pool::lock can be converted to a raw spinlock as the lock hold times
> are short. But the workqueue manager waitqueue is handled inside
> pool::lock held regions, which again violates the lock nesting rules
> of raw and regular spinlocks.

This seems to work for the test case I had. And lockdep didn't
complain, so…

If you prefer this over my 1/2, what do we do about 2/2? Do you want me
to repost it?

Sebastian

2020-05-28 08:40:20

by Lai Jiangshan

Subject: Re: [PATCH 1/2] workqueue: pin the pool while it is managing

On Thu, May 28, 2020 at 4:08 PM Sebastian Andrzej Siewior
<[email protected]> wrote:
>
> On 2020-05-28 03:06:55 [+0000], Lai Jiangshan wrote:
> > So that put_unbound_pool() can be sure that all workers are idle and
> > that no manager is still running. It then doesn't need to wait for any
> > manager and can go ahead and delete all the idle workers straight away.
> >
> > Also remove the manager waitqueue, because it is no longer needed. As
> > Sebastian Andrzej Siewior said:
> >
> > The workqueue code has its internal spinlock (pool::lock) and also
> > implicit spinlock usage in the wq_manager waitqueue. These spinlocks
> > are converted to 'sleeping' spinlocks on an RT kernel.
> >
> > Workqueue functions can be invoked from contexts which are truly atomic
> > even on a PREEMPT_RT enabled kernel. Taking sleeping locks from such
> > contexts is forbidden.
> >
> > pool::lock can be converted to a raw spinlock as the lock hold times
> > are short. But the workqueue manager waitqueue is handled inside
> > pool::lock held regions, which again violates the lock nesting rules
> > of raw and regular spinlocks.
>
> This seems to work for the test case I had. And lockdep didn't
> complain, so…
>
> If you prefer this over my 1/2, what do we do about 2/2? Do you want me
> to repost it?

I think we can just wait until Tejun reviews them.

If there is something wrong that I missed in my patch, your patches
are the best choice.

If I need to update my patch, I will repost the 3 patches
(2 of mine, the 2/2 of yours). At the very least, I forgot to add
"Reported-by: Sebastian Andrzej Siewior <[email protected]>"
to the patch.

If Tejun queues my patches right away, you can rebase the 2/2
of yours and repost it.

Lai

>
> Sebastian

2020-05-28 14:38:17

by Tejun Heo

Subject: Re: [PATCH 1/2] workqueue: pin the pool while it is managing

Hello,

On Thu, May 28, 2020 at 03:06:55AM +0000, Lai Jiangshan wrote:
> @@ -2129,10 +2128,21 @@ __acquires(&pool->lock)
>  static bool manage_workers(struct worker *worker)
>  {
>  	struct worker_pool *pool = worker->pool;
> +	struct work_struct *work = list_first_entry(&pool->worklist,
> +						struct work_struct, entry);

I'm not sure about this. It depends on an external condition (an active
work item) which isn't obvious, and when that condition breaks, the
resulting bug will be one which is difficult to reproduce. Adding to
that, pwq isn't even the object this code path is interested in, which
is the cause of the previous problem too.
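
For context, the external condition is the caller-side invariant in
worker_thread(), roughly (simplified sketch, not the exact code):

	/* worker_thread(), simplified */
	if (!need_more_worker(pool))		/* false if worklist is empty */
		goto sleep;

	if (unlikely(!may_start_working(pool)) && manage_workers(worker))
		goto recheck;

so the list_first_entry() on pool->worklist in manage_workers() only
yields a valid work item as long as that invariant holds.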

> @@ -2140,7 +2150,7 @@ static bool manage_workers(struct worker *worker)
>
>  	pool->manager = NULL;
>  	pool->flags &= ~POOL_MANAGER_ACTIVE;
> -	wake_up(&wq_manager_wait);
> +	put_pwq(pwq);

So, this works only because pwq release bounces through another work
item: even if a worker of the pool which is currently being destroyed
initiates the release of the containing pool, it still works out,
because by the time the async release path kicks in and grabs the pool
lock, everything should be idle.
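
Roughly, that async path looks like this (heavily simplified sketch of
the existing release code, not new code):

	/* in put_pwq(), under pool->lock, when the last unbound pwq ref drops */
	schedule_work(&pwq->unbound_release_work);

	/* later, run from a regular kworker */
	static void pwq_unbound_release_workfn(struct work_struct *work)
	{
		struct pool_workqueue *pwq = container_of(work,
				struct pool_workqueue, unbound_release_work);

		mutex_lock(&wq_pool_mutex);
		put_unbound_pool(pwq->pool);	/* grabs pool->lock and
						 * destroys the now-idle
						 * workers */
		mutex_unlock(&wq_pool_mutex);
	}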

I get that this can work but it's sitting on top of a bunch of subtleties.
The current code is more verbose but also significantly more explicit and
straight-forward. I'd rather keep the current behavior unless we can get rid
of the subtleties.

Thanks.

--
tejun

2020-05-29 05:36:01

by Lai Jiangshan

Subject: Re: [PATCH 1/2] workqueue: pin the pool while it is managing

On Thu, May 28, 2020 at 10:35 PM Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Thu, May 28, 2020 at 03:06:55AM +0000, Lai Jiangshan wrote:
> > @@ -2129,10 +2128,21 @@ __acquires(&pool->lock)
> >  static bool manage_workers(struct worker *worker)
> >  {
> >  	struct worker_pool *pool = worker->pool;
> > +	struct work_struct *work = list_first_entry(&pool->worklist,
> > +						struct work_struct, entry);
>
> I'm not sure about this. It depends on an external condition (an active
> work item) which isn't obvious, and when that condition breaks, the
> resulting bug will be one which is difficult to reproduce. Adding to
> that, pwq isn't even the object this code path is interested in, which
> is the cause of the previous problem too.

Ok, I agree with you.

Thanks
Lai