LinuxLists.cc - [RFC PATCH] workqueue: Unbind workers before sending them to exit()

[permalink] [raw]

Subject: Re: [RFC PATCH] workqueue: Unbind workers before sending them to exit()

2022-07-21 03:46:39

[permalink] [raw]

Subject: Re: [RFC PATCH] workqueue: Unbind workers before sending them to exit()

> +static void unbind_worker(struct worker *worker)
> +{
> + kthread_set_per_cpu(worker->task, -1);
> + WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, wq_unbound_cpumask) < 0);
> +}
> +
> +static void rebind_worker(struct worker *worker, struct worker_pool *pool)
> +{
> + kthread_set_per_cpu(worker->task, pool->cpu);
> + WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask) < 0);
> +}
> +
> /**
> * destroy_worker - destroy a workqueue worker
> * @worker: worker to be destroyed
> @@ -1999,6 +2011,16 @@ static void destroy_worker(struct worker *worker)
>
> list_del_init(&worker->entry);
> worker->flags |= WORKER_DIE;
> +
> + /*
> + * We're sending that thread off to die, so any CPU would do. This is
> + * especially relevant for pcpu kworkers affined to an isolated CPU:
> + * we'd rather not interrupt an isolated CPU just for a kworker to
> + * do_exit().
> + */
> + if (!(worker->flags & WORKER_UNBOUND))
> + unbind_worker(worker);
> +
> wake_up_process(worker->task);
> }

destroy_worker() is called with raw_spin_lock_irq(pool->lock), so
it cannot call the sleepable set_cpus_allowed_ptr().

From __set_cpus_allowed_ptr:
> * NOTE: the caller must have a valid reference to the task, the
> * task must not exit() & deallocate itself prematurely. The
> * call is not atomic; no spinlocks may be held.

I think it needs something like task_set_cpumask_possible() which is
documented as being usable in (raw) spinlocks and set the task's cpumask
to cpu_possible_mask and let the later ttwu help migrate it to a
proper non-isolated CPU or let it keep running.

On Thu, Jul 21, 2022 at 2:03 AM Tejun Heo <[email protected]> wrote:
>
> On Tue, Jul 19, 2022 at 05:57:43PM +0100, Valentin Schneider wrote:
> > It has been reported that isolated CPUs can suffer from interference due to
> > per-CPU kworkers waking up just to die.
> >
> > A surge of workqueue activity (sleeping workfn's exacerbate this) during
> > initial setup can cause extra per-CPU kworkers to be spawned. Then, a
> > latency-sensitive task can be running merrily on an isolated CPU only to be
> > interrupted sometime later by a kworker marked for death (cf.
> > IDLE_WORKER_TIMEOUT, 5 minutes after last kworker activity).
> >
> > Affine kworkers to the wq_unbound_cpumask (which doesn't contain isolated
> > CPUs, cf. HK_TYPE_WQ) before waking them up after marking them with
> > WORKER_DIE.
> >
> > This follows the logic of CPU hot-unplug, which has been packaged into
> > helpers for the occasion.
>
> Idea-wise, seems fine to me, but we have some other issues around twiddling
> cpu affinities right now, so let's wait a bit till Lai chimes in.
>

I think there are some imperfections related to cpu affinities
which need to be fixed too.

Thanks
Lai

2022-07-21 14:13:50

[permalink] [raw]

Subject: Re: [RFC PATCH] workqueue: Unbind workers before sending them to exit()

On 21/07/22 11:35, Lai Jiangshan wrote:
>> @@ -1999,6 +2011,16 @@ static void destroy_worker(struct worker *worker)
>>
>> list_del_init(&worker->entry);
>> worker->flags |= WORKER_DIE;
>> +
>> + /*
>> + * We're sending that thread off to die, so any CPU would do. This is
>> + * especially relevant for pcpu kworkers affined to an isolated CPU:
>> + * we'd rather not interrupt an isolated CPU just for a kworker to
>> + * do_exit().
>> + */
>> + if (!(worker->flags & WORKER_UNBOUND))
>> + unbind_worker(worker);
>> +
>> wake_up_process(worker->task);
>> }
>
> destroy_worker() is called with raw_spin_lock_irq(pool->lock), so
> it cannot call the sleepable set_cpus_allowed_ptr().
>
> From __set_cpus_allowed_ptr:
>> * NOTE: the caller must have a valid reference to the task, the
>> * task must not exit() & deallocate itself prematurely. The
>> * call is not atomic; no spinlocks may be held.
>

Heh, I somehow forgot that this is blocking. Now in this particular case I
think pcpu kworkers are "safe" - they shouldn't be running when
destroy_worker() is invoked on them (though AFAICT that is not a "hard"
guarantee), and it doesn't make any sense for them to use
migrate_disable(). Still, yeah, not ideal.

> I think it needs something like task_set_cpumask_possible() which is
> documented as being usable in (raw) spinlocks and set the task's cpumask
> to cpu_possible_mask and let the later ttwu help migrate it to a
> proper non-isolated CPU or let it keep running.
>

I'll see what I can come up with, thanks for the suggestion.

2022-07-23 06:00:13

[permalink] [raw]

Subject: Re: [RFC PATCH] workqueue: Unbind workers before sending them to exit()

On Thu, Jul 21, 2022 at 02:53:43PM +0100, Valentin Schneider wrote:
> > I think it needs something like task_set_cpumask_possible() which is
> > documented as being usable in (raw) spinlocks and set the task's cpumask
> > to cpu_possible_mask and let the later ttwu help migrate it to a
> > proper non-isolated CPU or let it keep running.
>
> I'll see what I can come up with, thanks for the suggestion.

Alternatively, we can just kill all the idle kworkers on isolated cpus at
the end of the booting process.

Thanks.

--
tejun

2022-07-25 10:45:12

[permalink] [raw]

Subject: Re: [RFC PATCH] workqueue: Unbind workers before sending them to exit()

On 22/07/22 19:16, Tejun Heo wrote:
> On Thu, Jul 21, 2022 at 02:53:43PM +0100, Valentin Schneider wrote:
>> > I think it needs something like task_set_cpumask_possible() which is
>> > documented as being usable in (raw) spinlocks and set the task's cpumask
>> > to cpu_possible_mask and let the later ttwu help migrate it to a
>> > proper non-isolated CPU or let it keep running.
>>
>> I'll see what I can come up with, thanks for the suggestion.
>
> Alternatively, we can just kill all the idle kworkers on isolated cpus at
> the end of the booting process.
>

Hm so my choice of words in the changelog wasn't great - "initial setup"
can be kernel init, but *also* setup of whatever workload is being deployed
onto the system.

So you can be having "normal" background activity (I've seen some IRQs end
up with schedule_work() on isolated CPUs, they're not moved away at boot
time but rather shortly before launching the latency-sensitive app), some
preliminary stats collection / setup to make sure the CPU will be quiet
(e.g. refresh_vm_stats()), and *then* the application starts with
fresh-but-no-longer-required extra pcpu kworkers assigned to its CPU.

2022-07-26 18:08:11

[permalink] [raw]

Subject: Re: [RFC PATCH] workqueue: Unbind workers before sending them to exit()

Hello,

On Mon, Jul 25, 2022 at 11:21:37AM +0100, Valentin Schneider wrote:
> On 22/07/22 19:16, Tejun Heo wrote:
> > On Thu, Jul 21, 2022 at 02:53:43PM +0100, Valentin Schneider wrote:
> >> > I think it needs something like task_set_cpumask_possible() which is
> >> > documented as being usable in (raw) spinlocks and set the task's cpumask
> >> > to cpu_possible_mask and let the later ttwu help migrate it to a
> >> > proper non-isolated CPU or let it keep running.
> >>
> >> I'll see what I can come up with, thanks for the suggestion.
> >
> > Alternatively, we can just kill all the idle kworkers on isolated cpus at
> > the end of the booting process.
>
> Hm so my choice of words in the changelog wasn't great - "initial setup"
> can be kernel init, but *also* setup of whatever workload is being deployed
> onto the system.
>
> So you can be having "normal" background activity (I've seen some IRQs end
> up with schedule_work() on isolated CPUs, they're not moved away at boot
> time but rather shortly before launching the latency-sensitive app), some
> preliminary stats collection / setup to make sure the CPU will be quiet
> (e.g. refresh_vm_stats()), and *then* the application starts with
> fresh-but-no-longer-required extra pcpu kworkers assigned to its CPU.

Ah, I see. I guess we'll need to figure out how to unbind the workers then.

Thanks.

--
tejun

2022-07-26 20:42:57

[permalink] [raw]

Subject: Re: [RFC PATCH] workqueue: Unbind workers before sending them to exit()

On 26/07/22 07:30, Tejun Heo wrote:
> Hello,
>
> On Mon, Jul 25, 2022 at 11:21:37AM +0100, Valentin Schneider wrote:
>> Hm so my choice of words in the changelog wasn't great - "initial setup"
>> can be kernel init, but *also* setup of whatever workload is being deployed
>> onto the system.
>>
>> So you can be having "normal" background activity (I've seen some IRQs end
>> up with schedule_work() on isolated CPUs, they're not moved away at boot
>> time but rather shortly before launching the latency-sensitive app), some
>> preliminary stats collection / setup to make sure the CPU will be quiet
>> (e.g. refresh_vm_stats()), and *then* the application starts with
>> fresh-but-no-longer-required extra pcpu kworkers assigned to its CPU.
>
> Ah, I see. I guess we'll need to figure out how to unbind the workers then.
>

I've been playing with different ways to unbind & wake the workers in a
sleepable context, but so far I haven't been happy with any of my
experiments.

What hasn't changed much between my attempts is transferring to-be-destroyed
kworkers from their pool->idle_list to a reaper_list which is walked by
*something* that does unbind+wakeup. AFAIA as long as the kworker is off
the pool->idle_list we can play with it (i.e. unbind+wake) off the
pool->lock.

It's the *something* that's annoying to get right, I don't want it to be
overly complicated given most users are probably not impacted by what I'm
trying to fix, but I'm getting the feeling it should still be a per-pool
kthread. I toyed with a single reaper kthread but a central synchronization
for all the pools feels like a stupid overhead.

If any of that sounds ludicrous please shout, otherwise I'm going to keep
tinkering :)

> Thanks.
>
> --
> tejun

2022-07-26 23:08:36

[permalink] [raw]

Subject: Re: [RFC PATCH] workqueue: Unbind workers before sending them to exit()

Hello,

On Tue, Jul 26, 2022 at 09:36:06PM +0100, Valentin Schneider wrote:
> It's the *something* that's annoying to get right, I don't want it to be
> overly complicated given most users are probably not impacted by what I'm
> trying to fix, but I'm getting the feeling it should still be a per-pool
> kthread. I toyed with a single reaper kthread but a central synchronization
> for all the pools feels like a stupid overhead.

That sounds like quite a bit of complexity.

> If any of that sounds ludicrous please shout, otherwise I'm going to keep
> tinkering :)

I mean, whatever works works but let's please keep it as minimal as
possible. Why does it need dedicated kthreads in the first place? Wouldn't
scheduling an unbound work item work just as well?

Thanks.

--
tejun

2022-07-27 05:58:53

[permalink] [raw]

Subject: Re: [RFC PATCH] workqueue: Unbind workers before sending them to exit()

On Wed, Jul 27, 2022 at 4:36 AM Valentin Schneider <[email protected]> wrote:
>
> On 26/07/22 07:30, Tejun Heo wrote:
> > Hello,
> >
> > On Mon, Jul 25, 2022 at 11:21:37AM +0100, Valentin Schneider wrote:
> >> Hm so my choice of words in the changelog wasn't great - "initial setup"
> >> can be kernel init, but *also* setup of whatever workload is being deployed
> >> onto the system.
> >>
> >> So you can be having "normal" background activity (I've seen some IRQs end
> >> up with schedule_work() on isolated CPUs, they're not moved away at boot
> >> time but rather shortly before launching the latency-sensitive app), some
> >> preliminary stats collection / setup to make sure the CPU will be quiet
> >> (e.g. refresh_vm_stats()), and *then* the application starts with
> >> fresh-but-no-longer-required extra pcpu kworkers assigned to its CPU.
> >
> > Ah, I see. I guess we'll need to figure out how to unbind the workers then.
> >
>
> I've been playing with different ways to unbind & wake the workers in a
> sleepable context, but so far I haven't been happy with any of my
> experiments.

I'm writing code to handle the problems of cpu affinity and prematurely
waking up of newly created worker.

This work of unbinding the dying worker is also on the list.
I haven't figured out a good solution.

I was planning to add set_cpus_allowed_ptr_off_rq() which only set
cpumasks to the task only if it is sleeping and returns -EBUSY otherwise.
And it is ensured and documented as being usable in an atomic context
and it is recommended to be used for dying tasks only.

I can't really ensure it would be implemented as I'm expecting since
it touches scheduler code.

I'd better back off.

>
> What hasn't changed much between my attempts is transferring to-be-destroyed
> kworkers from their pool->idle_list to a reaper_list which is walked by
> *something* that does unbind+wakeup. AFAIA as long as the kworker is off
> the pool->idle_list we can play with it (i.e. unbind+wake) off the
> pool->lock.
>
> It's the *something* that's annoying to get right, I don't want it to be
> overly complicated given most users are probably not impacted by what I'm
> trying to fix, but I'm getting the feeling it should still be a per-pool
> kthread. I toyed with a single reaper kthread but a central synchronization
> for all the pools feels like a stupid overhead.

I think fixing it in the workqueue.c is complicated.

Nevertheless, I will also try to fix it inside workqueue only to see
what will come up.

>
> If any of that sounds ludicrous please shout, otherwise I'm going to keep
> tinkering :)
>
> > Thanks.
> >
> > --
> > tejun
>

2022-07-27 06:45:13

[permalink] [raw]

Subject: Re: [RFC PATCH] workqueue: Unbind workers before sending them to exit()

On Wed, Jul 27, 2022 at 1:38 PM Lai Jiangshan <[email protected]> wrote:
>
> On Wed, Jul 27, 2022 at 4:36 AM Valentin Schneider <[email protected]> wrote:
> >
> > On 26/07/22 07:30, Tejun Heo wrote:
> > > Hello,
> > >
> > > On Mon, Jul 25, 2022 at 11:21:37AM +0100, Valentin Schneider wrote:
> > >> Hm so my choice of words in the changelog wasn't great - "initial setup"
> > >> can be kernel init, but *also* setup of whatever workload is being deployed
> > >> onto the system.
> > >>
> > >> So you can be having "normal" background activity (I've seen some IRQs end
> > >> up with schedule_work() on isolated CPUs, they're not moved away at boot
> > >> time but rather shortly before launching the latency-sensitive app), some
> > >> preliminary stats collection / setup to make sure the CPU will be quiet
> > >> (e.g. refresh_vm_stats()), and *then* the application starts with
> > >> fresh-but-no-longer-required extra pcpu kworkers assigned to its CPU.
> > >
> > > Ah, I see. I guess we'll need to figure out how to unbind the workers then.
> > >
> >
> > I've been playing with different ways to unbind & wake the workers in a
> > sleepable context, but so far I haven't been happy with any of my
> > experiments.
>
>
> I'm writing code to handle the problems of cpu affinity and prematurely
> waking up of newly created worker.
>
> This work of unbinding the dying worker is also on the list.
> I haven't figured out a good solution.
>
> I was planning to add set_cpus_allowed_ptr_off_rq() which only set
> cpumasks to the task only if it is sleeping and returns -EBUSY otherwise.
> And it is ensured and documented as being usable in an atomic context
> and it is recommended to be used for dying tasks only.
>
> I can't really ensure it would be implemented as I'm expecting since
> it touches scheduler code.
>
> I'd better back off.
>
> >
> > What hasn't changed much between my attempts is transferring to-be-destroyed
> > kworkers from their pool->idle_list to a reaper_list which is walked by
> > *something* that does unbind+wakeup. AFAIA as long as the kworker is off
> > the pool->idle_list we can play with it (i.e. unbind+wake) off the
> > pool->lock.
> >
> > It's the *something* that's annoying to get right, I don't want it to be
> > overly complicated given most users are probably not impacted by what I'm
> > trying to fix, but I'm getting the feeling it should still be a per-pool
> > kthread. I toyed with a single reaper kthread but a central synchronization
> > for all the pools feels like a stupid overhead.
>
> I think fixing it in the workqueue.c is complicated.
>
> Nevertheless, I will also try to fix it inside workqueue only to see
> what will come up.

I'm going to kind of revert 3347fc9f36e7 ("workqueue: destroy worker
directly in the idle timeout handler"), so that we can have a sleepable
destroy_worker().

>
> >
> > If any of that sounds ludicrous please shout, otherwise I'm going to keep
> > tinkering :)
> >
> > > Thanks.
> > >
> > > --
> > > tejun
> >

2022-07-27 09:27:51

[permalink] [raw]

Subject: Re: [RFC PATCH] workqueue: Unbind workers before sending them to exit()

On Wed, Jul 27, 2022 at 2:30 PM Lai Jiangshan <[email protected]> wrote:

> >
> > >
> > > What hasn't changed much between my attempts is transferring to-be-destroyed
> > > kworkers from their pool->idle_list to a reaper_list which is walked by
> > > *something* that does unbind+wakeup. AFAIA as long as the kworker is off
> > > the pool->idle_list we can play with it (i.e. unbind+wake) off the
> > > pool->lock.
> > >
> > > It's the *something* that's annoying to get right, I don't want it to be
> > > overly complicated given most users are probably not impacted by what I'm
> > > trying to fix, but I'm getting the feeling it should still be a per-pool
> > > kthread. I toyed with a single reaper kthread but a central synchronization
> > > for all the pools feels like a stupid overhead.
> >
> > I think fixing it in the workqueue.c is complicated.
> >
> > Nevertheless, I will also try to fix it inside workqueue only to see
> > what will come up.
>
> I'm going to kind of revert 3347fc9f36e7 ("workqueue: destroy worker
> directly in the idle timeout handler"), so that we can have a sleepable
> destroy_worker().
>

It is not a good idea. The woken up manager might still be in
the isolated CPU.

On Wed, Jul 27, 2022 at 6:59 AM Tejun Heo <[email protected]> wrote:
>
> I mean, whatever works works but let's please keep it as minimal as
> possible. Why does it need dedicated kthreads in the first place? Wouldn't
> scheduling an unbound work item work just as well?
>

Scheduling an unbound work item will work well.

2022-07-27 09:59:37