2023-07-29 14:05:37

by Aaron Tomlin

Subject: [RFC PATCH 0/2] workqueue: Introduce PF_WQ_RESCUE_WORKER

The Linux kernel does not provide a way for user-mode to differentiate
between a kworker and a rescue kworker.
From user-mode, one can establish whether a task is a kworker by testing
for PF_WQ_WORKER in the task's flags bit mask via /proc/[PID]/stat. One can
also examine /proc/[PID]/stack and search for the function
"rescuer_thread", but this is only available to the root user.

It can be useful to identify a rescue kworker since its CPU affinity
cannot be modified and its initial CPU assignment can be safely ignored.
Furthermore, for a workqueue that was created with WQ_MEM_RECLAIM and
WQ_SYSFS, the cpumask file is not applicable to the rescue kworker.
By design, a rescue kworker should run anywhere.

This patch series introduces PF_WQ_RESCUE_WORKER and ensures it is set and
cleared appropriately and simplifies current_is_workqueue_rescuer().
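
As a rough illustration of the user-mode check described above, the flags
field of /proc/[PID]/stat can be parsed and tested against PF_* bit values.
This is a minimal sketch, not part of the series: the helper names are
hypothetical, PF_WQ_WORKER is taken from include/linux/sched.h, and
PF_WQ_RESCUE_WORKER uses the value proposed in patch 1/2, so it is only
meaningful on a kernel carrying that patch. PF_* values should always be
checked against the source of the running kernel.

```python
# Value from include/linux/sched.h; check it against the running kernel.
PF_WQ_WORKER = 0x00000020
# Assumption: value proposed by this RFC only, not a mainline flag.
PF_WQ_RESCUE_WORKER = 0x00010000


def parse_stat_flags(stat_line):
    """Return the 'flags' field (9th field) of a /proc/[PID]/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or
    # parentheses, so split only what follows the last ')'.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is the state (field 3), so flags (field 9) is rest[6].
    return int(rest[6])


def classify(flags):
    """Hypothetical helper: classify a task from its flags bit mask."""
    if not flags & PF_WQ_WORKER:
        return "not a kworker"
    if flags & PF_WQ_RESCUE_WORKER:
        return "rescue kworker"
    return "kworker"


def classify_pid(pid):
    """Read /proc/<pid>/stat and classify the task."""
    with open(f"/proc/{pid}/stat") as f:
        return classify(parse_stat_flags(f.read()))
```

Unlike the /proc/[PID]/stack approach, reading /proc/[PID]/stat this way
does not require root.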

Aaron Tomlin (2):
workqueue: Introduce PF_WQ_RESCUE_WORKER
workqueue: Simplify current_is_workqueue_rescuer()

include/linux/sched.h | 2 +-
kernel/workqueue.c | 25 +++++++++++++++----------
2 files changed, 16 insertions(+), 11 deletions(-)

--
2.39.1



2023-07-29 14:12:29

by Aaron Tomlin

Subject: [RFC PATCH 2/2] workqueue: Simplify current_is_workqueue_rescuer()

No functional change.

This patch simplifies current_is_workqueue_rescuer()
due to the addition of PF_WQ_RESCUE_WORKER.

Signed-off-by: Aaron Tomlin <[email protected]>
---
kernel/workqueue.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6d38d714b72b..3b7a4d60cb6a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4890,9 +4890,9 @@ EXPORT_SYMBOL(current_work);
  */
 bool current_is_workqueue_rescuer(void)
 {
-	struct worker *worker = current_wq_worker();
-
-	return worker && worker->rescue_wq;
+	if (in_task() && (current->flags & PF_WQ_RESCUE_WORKER))
+		return true;
+	return false;
 }

/**
--
2.39.1


2023-07-29 14:59:34

by Aaron Tomlin

Subject: [RFC PATCH 1/2] workqueue: Introduce PF_WQ_RESCUE_WORKER

The Linux kernel does not provide a way for user-mode to differentiate
between a kworker and a rescue kworker.
From user-mode, one can establish whether a task is a kworker by testing
for PF_WQ_WORKER in the task's flags bit mask via /proc/[PID]/stat. One can
also examine /proc/[PID]/stack and search for the function
"rescuer_thread", but this is only available to the root user.

It can be useful to identify a rescue kworker since its CPU affinity
cannot be modified and its initial CPU assignment can be safely ignored.
Furthermore, for a workqueue that was created with WQ_MEM_RECLAIM and
WQ_SYSFS, the cpumask file is not applicable to the rescue kworker.
By design, a rescue kworker should run anywhere.

This patch introduces PF_WQ_RESCUE_WORKER and ensures it is set and
cleared appropriately.

Signed-off-by: Aaron Tomlin <[email protected]>
---
include/linux/sched.h | 2 +-
kernel/workqueue.c | 19 ++++++++++++-------
2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 609bde814cb0..039fcf8d9ed6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1734,7 +1734,7 @@ extern struct pid *cad_pid;
#define PF_USED_MATH 0x00002000 /* If unset the fpu must be initialized before use */
#define PF_USER_WORKER 0x00004000 /* Kernel thread cloned from userspace thread */
#define PF_NOFREEZE 0x00008000 /* This thread should not be frozen */
-#define PF__HOLE__00010000 0x00010000
+#define PF_WQ_RESCUE_WORKER 0x00010000 /* I am a rescue workqueue worker */
#define PF_KSWAPD 0x00020000 /* I am kswapd */
#define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */
#define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 02a8f402eeb5..6d38d714b72b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2665,13 +2665,18 @@ static void process_scheduled_works(struct worker *worker)
 	}
 }

-static void set_pf_worker(bool val)
+static void set_pf_worker_and_rescuer(bool worker, bool rescue)
 {
 	mutex_lock(&wq_pool_attach_mutex);
-	if (val)
+	if (worker) {
 		current->flags |= PF_WQ_WORKER;
-	else
+		if (rescue)
+			current->flags |= PF_WQ_RESCUE_WORKER;
+	} else {
 		current->flags &= ~PF_WQ_WORKER;
+		if (rescue)
+			current->flags &= ~PF_WQ_RESCUE_WORKER;
+	}
 	mutex_unlock(&wq_pool_attach_mutex);
 }

@@ -2693,14 +2698,14 @@ static int worker_thread(void *__worker)
 	struct worker_pool *pool = worker->pool;

 	/* tell the scheduler that this is a workqueue worker */
-	set_pf_worker(true);
+	set_pf_worker_and_rescuer(true, false);
 woke_up:
 	raw_spin_lock_irq(&pool->lock);

 	/* am I supposed to die? */
 	if (unlikely(worker->flags & WORKER_DIE)) {
 		raw_spin_unlock_irq(&pool->lock);
-		set_pf_worker(false);
+		set_pf_worker_and_rescuer(false, false);

 		set_task_comm(worker->task, "kworker/dying");
 		ida_free(&pool->worker_ida, worker->id);
@@ -2804,7 +2809,7 @@ static int rescuer_thread(void *__rescuer)
 	 * Mark rescuer as worker too. As WORKER_PREP is never cleared, it
 	 * doesn't participate in concurrency management.
 	 */
-	set_pf_worker(true);
+	set_pf_worker_and_rescuer(true, true);
 repeat:
 	set_current_state(TASK_IDLE);

@@ -2903,7 +2908,7 @@ static int rescuer_thread(void *__rescuer)

 	if (should_stop) {
 		__set_current_state(TASK_RUNNING);
-		set_pf_worker(false);
+		set_pf_worker_and_rescuer(false, true);
 		return 0;
 	}

--
2.39.1
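
The semantics of the new helper may be easier to see in a small user-space
model. The sketch below mirrors the branch structure of
set_pf_worker_and_rescuer() from the diff above on a plain integer. Locking
is omitted, and the signature (taking and returning the flags word) is an
illustrative assumption rather than kernel code.

```python
PF_WQ_WORKER = 0x00000020
PF_WQ_RESCUE_WORKER = 0x00010000  # value proposed in this patch


def set_pf_worker_and_rescuer(flags, worker, rescue):
    """Model of the patched helper: return the updated flags word."""
    if worker:
        flags |= PF_WQ_WORKER
        if rescue:
            flags |= PF_WQ_RESCUE_WORKER
    else:
        flags &= ~PF_WQ_WORKER
        if rescue:
            flags &= ~PF_WQ_RESCUE_WORKER
    return flags


# rescuer_thread(): both bits are set on entry and cleared on exit.
rescuer = set_pf_worker_and_rescuer(0, True, True)
assert rescuer == PF_WQ_WORKER | PF_WQ_RESCUE_WORKER
assert set_pf_worker_and_rescuer(rescuer, False, True) == 0

# worker_thread(): only PF_WQ_WORKER is ever touched.
worker = set_pf_worker_and_rescuer(0, True, False)
assert worker == PF_WQ_WORKER
assert set_pf_worker_and_rescuer(worker, False, False) == 0
```

In this shape the second argument only selects whether the rescuer bit is
touched at all, so a (false, false) call leaves PF_WQ_RESCUE_WORKER as it
was.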


2023-07-29 17:08:19

by Peter Zijlstra

Subject: Re: [RFC PATCH 1/2] workqueue: Introduce PF_WQ_RESCUE_WORKER

On Sat, Jul 29, 2023 at 02:53:33PM +0100, Aaron Tomlin wrote:
> The Linux kernel does not provide a way to differentiate between a
> kworker and a rescue kworker for user-mode.
> From user-mode, one can establish if a task is a kworker by testing for
> PF_WQ_WORKER in a specified task's flags bit mask (or bitmap) via
> /proc/[PID]/stat. Indeed, one can examine /proc/[PID]/stack and search
> for the function namely "rescuer_thread". This is only available to the
> root user.
>
> It can be useful to identify a rescue kworker since their CPU affinity
> cannot be modified and their initial CPU assignment can be safely ignored.
> Furthermore, a workqueue that was created with WQ_MEM_RECLAIM and
> WQ_SYSFS the cpumask file is not applicable to the rescue kworker.
> By design a rescue kworker should run anywhere.
>
> This patch introduces PF_WQ_RESCUE_WORKER and ensures it is set and
> cleared appropriately.

Is the implication that PF_flags are considered ABI? We've been changing
them quite a bit over the years.

Also, while we have a few spare bits atm, we used to be nearly out for a
while, and I just don't think this is sane usage of them. We don't use
PF flags just for userspace.

2023-08-01 00:30:17

by Tejun Heo

Subject: Re: [RFC PATCH 0/2] workqueue: Introduce PF_WQ_RESCUE_WORKER

Hello,

On Sat, Jul 29, 2023 at 02:53:32PM +0100, Aaron Tomlin wrote:
> It can be useful to identify a rescue kworker since their CPU affinity
> cannot be modified and their initial CPU assignment can be safely ignored.

You really shouldn't be setting affinities on kworkers manually. There's no
way of knowing which kworker is going to execute which workqueue. Please use
the attributes API and sysfs interface to modify per-workqueue worker
attributes. If that's not sufficient and you need finer grained control, the
right thing to do is using kthread_worker which gives you a dedicated
kthread that you can manipulate as appropriate.

Thanks.

--
tejun

2023-08-01 10:19:37

by Aaron Tomlin

Subject: Re: [RFC PATCH 1/2] workqueue: Introduce PF_WQ_RESCUE_WORKER

> Is the implication that PF_flags are considered ABI? We've been changing
> them quite a bit over the years.

Hi Peter, Tejun,

I never assumed they were.

In this context, one should always check the Linux kernel source code first,
i.e. do not assume that what is exported via /proc/[PID]/stat will be stable
or consistent between releases.

> Also, while we have a few spare bits atm, we used to be nearly out for a
> while, and I just don't think this is sane usage of them. We don't use PF
> flags just for userspace.

Fair statement.

That said, I suspect it would still be useful for user-mode to easily
differentiate between a kworker and a rescuer kworker. In
create_worker(), we already make the difference between a CPU-specific
and an unbound kworker clear by way of the task's name. Looking at
init_rescuer(), a rescuer kworker is simply given the name of its
workqueue. Would it be acceptable to modify the rescuer task's name so
that it is prefixed with "kworker/r-" followed by the workqueue's name,
e.g. "kworker/r-ext4-rsv-conver"?
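
To illustrate how user-mode could rely on the task name alone, here is a
minimal sketch that classifies a kworker by the prefix of its comm. The
"kworker/<cpu>:<id>" and "kworker/u<pool>:<id>" forms follow
create_worker(); the "kworker/r-" prefix is only the naming proposed in
this message, matched case-insensitively as an assumption, and the function
name is hypothetical.

```python
import re

# Per-CPU workers: "kworker/<cpu>:<id>", possibly followed by a suffix.
_PERCPU = re.compile(r"^kworker/\d+:\d+")
# Unbound workers: "kworker/u<pool>:<id>", possibly followed by a suffix.
_UNBOUND = re.compile(r"^kworker/u\d+:\d+")
# Assumption: rescuer naming as proposed here, not an existing convention.
_RESCUER = re.compile(r"^kworker/r-", re.IGNORECASE)


def classify_comm(comm):
    """Hypothetical helper: classify a task by its comm string."""
    if _RESCUER.match(comm):
        return "rescuer"
    if _UNBOUND.match(comm):
        return "unbound"
    if _PERCPU.match(comm):
        return "percpu"
    return "unknown"
```

With the current naming, a rescuer such as "ext4-rsv-conver" falls into the
"unknown" bucket, which is the ambiguity the rename is meant to remove.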



Kind regards,
--
Aaron Tomlin

2023-08-01 12:04:08

by Aaron Tomlin

Subject: Re: [RFC PATCH 0/2] workqueue: Introduce PF_WQ_RESCUE_WORKER

> You really shouldn't be setting affinities on kworkers manually. There's
> no way of knowing which kworker is going to execute which workqueue.
> Please use the attributes API and sysfs interface to modify per-workqueue
> worker attributes. If that's not sufficient and you need finer grained
> control, the right thing to do is using kthread_worker which gives you a
> dedicated kthread that you can manipulate as appropriate.

Hi Tejun,

I completely agree. Each kworker has PF_NO_SETAFFINITY applied anyway.
If I understand correctly, only an unbound kworker can have their CPU
affinity modified via sysfs. The objective of this series was to easily
identify a rescuer kworker from user-mode.


Kind regards,
--
Aaron Tomlin

2023-08-02 18:52:27

by Tejun Heo

Subject: Re: [RFC PATCH 0/2] workqueue: Introduce PF_WQ_RESCUE_WORKER

Hello,

On Tue, Aug 01, 2023 at 11:53:01AM +0100, Aaron Tomlin wrote:
> > You really shouldn't be setting affinities on kworkers manually. There's
> > no way of knowing which kworker is going to execute which workqueue.
> > Please use the attributes API and sysfs interface to modify per-workqueue
> > worker attributes. If that's not sufficient and you need finer grained
> > control, the right thing to do is using kthread_worker which gives you a
> > dedicated kthread that you can manipulate as appropriate.
>
> I completely agree. Each kworker has PF_NO_SETAFFINITY applied anyway.
> If I understand correctly, only an unbound kworker can have their CPU
> affinity modified via sysfs. The objective of this series was to easily
> identify a rescuer kworker from user-mode.

But why do you need to identify rescue workers? What are you trying to
achieve?

Thanks.

--
tejun

2023-08-03 21:01:42

by Aaron Tomlin

Subject: Re: [RFC PATCH 0/2] workqueue: Introduce PF_WQ_RESCUE_WORKER

> But why do you need to identify rescue workers? What are you trying to
> achieve?

Hi Tejun,

I had a conversation with a colleague of mine. It can be useful to identify
and account for all kernel threads. From the perspective of user-mode, the
name given currently to the rescuer kworker is ambiguous. For instance,
"kworker/u16:9-kcryptd/253:0" is clearly identifiable as an unbound kworker
for the specified workqueue which can have their CPU affinity adjusted as
you mentioned before. I think if we followed the same naming convention
for a rescuer kworker then it would be more consistent. I'll send a patch
so it can be discussed further.


Kind regards,
--
Aaron Tomlin

2023-08-03 21:45:14

by Tejun Heo

Subject: Re: [RFC PATCH 0/2] workqueue: Introduce PF_WQ_RESCUE_WORKER

On Thu, Aug 03, 2023 at 09:19:14PM +0100, Aaron Tomlin wrote:
> > But why do you need to identify rescue workers? What are you trying to
> > achieve?
>
> Hi Tejun,
>
> I had a conversation with a colleague of mine. It can be useful to identify
> and account for all kernel threads. From the perspective of user-mode, the
> name given currently to the rescuer kworker is ambiguous. For instance,
> "kworker/u16:9-kcryptd/253:0" is clearly identifiable as an unbound kworker
> for the specified workqueue which can have their CPU affinity adjusted as

Note that the name changes to the work item the worker is currently
executing. It won't stay that way. Workers are shared across the workqueues,
so I'm not sure "identify and account all kernel threads" is working as well
as you think it is.

> you mentioned before. I think if we followed the same naming convention
> for a rescuer kworker then it would be more consistent. I'll send a patch
> so it can be discussed further.

We can certainly rename them to indicate that they are rescuers - e.g. maybe
krescuer? But, at the moment, the proposed reason seems rather dubious.

Thanks.

--
tejun

2023-08-06 01:55:44

by Aaron Tomlin

Subject: Re: [RFC PATCH 0/2] workqueue: Introduce PF_WQ_RESCUE_WORKER

> Note that the name changes to the work item the worker is currently
> executing. It won't stay that way. Workers are shared across the
> workqueues, so I'm not sure "identify and account all kernel threads" is
> working as well as you think it is.

Hi Tejun,

Indeed. The point is that these kworker kthreads are easily identifiable.

> We can certainly rename them to indicate that they are rescuers - e.g.
> maybe krescuer? But, at the moment, the proposed reason seems rather
> dubious.

Personally, I would prefer "kworker/r-%s" and then include the specified
workqueue's name e.g. "kworker/r-ext4-rsv-conver". So the rescuer task's
name is more consistent with the current naming scheme.
I will send a follow up patch.


Kind regards,

--
Aaron Tomlin

2023-12-11 14:52:32

by Juri Lelli

Subject: Re: [RFC PATCH 0/2] workqueue: Introduce PF_WQ_RESCUE_WORKER

Hi,

Just stumbled upon this series while looking into rescuers myself. :)

On 29/07/23 14:53, Aaron Tomlin wrote:
> The Linux kernel does not provide a way to differentiate between a
> kworker and a rescue kworker for user-mode.
> From user-mode, one can establish if a task is a kworker by testing for
> PF_WQ_WORKER in a specified task's flags bit mask (or bitmap) via
> /proc/[PID]/stat. Indeed, one can examine /proc/[PID]/stack and search
> for the function namely "rescuer_thread". This is only available to the
> root user.
>
> It can be useful to identify a rescue kworker since their CPU affinity
> cannot be modified and their initial CPU assignment can be safely ignored.
> Furthermore, a workqueue that was created with WQ_MEM_RECLAIM and
> WQ_SYSFS the cpumask file is not applicable to the rescue kworker.
> By design a rescue kworker should run anywhere.

Guess this is a requirement because, if workqueue processing is stuck
for some reason, getting rescuers to run on the same set of cpus
workqueues have been restricted to already doesn't really have good
chances of making any progress?

Wonder if we still might need some sort of fail hard/warn mode in case
strict isolation is in place? Or maybe we have that already?

Thanks!
Juri

2023-12-11 18:40:05

by Tejun Heo

Subject: Re: [RFC PATCH 0/2] workqueue: Introduce PF_WQ_RESCUE_WORKER

Hello,

On Mon, Dec 11, 2023 at 03:51:57PM +0100, Juri Lelli wrote:
> Guess this is a requirement because, if workqueue processing is stuck
> for some reason, getting rescuers to run on the same set of cpus
> workqueues have been restricted to already doesn't really have good
> chances of making any progress?

The only problem rescuers try to solve is deadlocks caused by lack of
memory, so on the cpu side, it just follows whatever worker pool it's trying
to help.

> Wonder if we still might need some sort of fail hard/warn mode in case
> strict isolation is in place? Or maybe we have that already?

For both percpu and unbound workqueues, the rescuers just follow whatever
pool it's trying to help at the moment, so it shouldn't cause any surprises
in terms of isolation. It just temporarily joins the already active but
stuck pool.

Thanks.

--
tejun

2023-12-12 09:56:46

by Juri Lelli

Subject: Re: [RFC PATCH 0/2] workqueue: Introduce PF_WQ_RESCUE_WORKER

Hello,

Thanks for the quick reply!

On 11/12/23 08:39, Tejun Heo wrote:
> Hello,
>
> On Mon, Dec 11, 2023 at 03:51:57PM +0100, Juri Lelli wrote:
> > Guess this is a requirement because, if workqueue processing is stuck
> > for some reason, getting rescuers to run on the same set of cpus
> > workqueues have been restricted to already doesn't really have good
> > chances of making any progress?
>
> The only problem rescuers try to solve is deadlocks caused by lack of
> memory, so on the cpu side, it just follows whatever worker pool it's trying
> to help.
>
> > Wonder if we still might need some sort of fail hard/warn mode in case
> > strict isolation is in place? Or maybe we have that already?
>
> For both percpu and unbound workqueues, the rescuers just follow whatever
> pool it's trying to help at the moment, so it shouldn't cause any surprises
> in terms of isolation. It just temporarily joins the already active but
> stuck pool.

Hummm, OK, but in terms of which CPU the rescuer is possibly woken up,
how are we making sure that the wake up is always happening on
housekeeping CPUs (assuming unbound workqueues have been restricted to
those)?

AFAICS, we have

send_mayday ->
wake_up_process(wq->rescuer->task)

which is not affined to the workqueue cpumask it's called to rescue, so
in theory can be woken up anywhere?

Thanks,
Juri

2023-12-12 17:15:38

by Tejun Heo

Subject: Re: [RFC PATCH 0/2] workqueue: Introduce PF_WQ_RESCUE_WORKER

Hello, Juri.

On Tue, Dec 12, 2023 at 10:56:02AM +0100, Juri Lelli wrote:
> Hummm, OK, but in terms of which CPU the rescuer is possibly woken up,
> how are we making sure that the wake up is always happening on
> housekeeping CPUs (assuming unbound workqueues have been restricted to
> those)?
>
> AFAICS, we have
>
> send_mayday ->
> wake_up_process(wq->rescuer->task)
>
> which is not affined to the workqueue cpumask it's called to rescue, so
> in theory can be woken up anywhere?

Ah, was only thinking about work item execution. Yeah, it's not following
the isolation rule there and we probably should affine it as we're waking it
up.

Thanks.

--
tejun

2023-12-12 19:07:13

by Aaron Tomlin

Subject: Re: [RFC PATCH 0/2] workqueue: Introduce PF_WQ_RESCUE_WORKER

On Tue, Dec 12, 2023 at 07:14:48AM -1000, Tejun Heo wrote:
> Hello, Juri.
>
> On Tue, Dec 12, 2023 at 10:56:02AM +0100, Juri Lelli wrote:
> > Hummm, OK, but in terms of which CPU the rescuer is possibly woken up,
> > how are we making sure that the wake up is always happening on
> > housekeeping CPUs (assuming unbound workqueues have been restricted to
> > those)?
> >
> > AFAICS, we have
> >
> > send_mayday ->
> > wake_up_process(wq->rescuer->task)
> >
> > which is not affined to the workqueue cpumask it's called to rescue, so
> > in theory can be woken up anywhere?
>
> Ah, was only thinking about work item execution. Yeah, it's not following
> the isolation rule there and we probably should affine it as we're waking it
> up.

Hi Tejun,

I am confused.

I thought by design we want a rescuer kthread to execute on any CPU, no?


Kind regards,

--
Aaron Tomlin

2023-12-12 20:16:32

by Tejun Heo

Subject: Re: [RFC PATCH 0/2] workqueue: Introduce PF_WQ_RESCUE_WORKER

On Tue, Dec 12, 2023 at 07:06:48PM +0000, Aaron Tomlin wrote:
> I thought by design we want a rescuer kthread to execute on any CPU, no?

Well, it needs to be able to move around because it dynamically attaches to
the worker pool it's rescuing and needs to take on its cpumask, but it
doesn't have to be able to run on all cpus all the time.

Thanks.

--
tejun

2023-12-13 08:59:54

by Juri Lelli

Subject: Re: [RFC PATCH 0/2] workqueue: Introduce PF_WQ_RESCUE_WORKER

On 12/12/23 07:14, Tejun Heo wrote:
> Hello, Juri.
>
> On Tue, Dec 12, 2023 at 10:56:02AM +0100, Juri Lelli wrote:
> > Hummm, OK, but in terms of which CPU the rescuer is possibly woken up,
> > how are we making sure that the wake up is always happening on
> > housekeeping CPUs (assuming unbound workqueues have been restricted to
> > those)?
> >
> > AFAICS, we have
> >
> > send_mayday ->
> > wake_up_process(wq->rescuer->task)
> >
> > which is not affined to the workqueue cpumask it's called to rescue, so
> > in theory can be woken up anywhere?
>
> Ah, was only thinking about work item execution. Yeah, it's not following
> the isolation rule there and we probably should affine it as we're waking it
> up.

Something like the following then maybe?

---
kernel/workqueue.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2989b57e154a7..ed73f7f80d57d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4405,6 +4405,12 @@ static void apply_wqattrs_commit(struct apply_wqattrs_ctx *ctx)
 	link_pwq(ctx->dfl_pwq);
 	swap(ctx->wq->dfl_pwq, ctx->dfl_pwq);

+	/* rescuer needs to respect wq cpumask changes */
+	if (ctx->wq->rescuer) {
+		kthread_bind_mask(ctx->wq->rescuer->task, ctx->attrs->cpumask);
+		wake_up_process(ctx->wq->rescuer->task);
+	}
+
 	mutex_unlock(&ctx->wq->mutex);
 }

2023-12-13 15:36:24

by Tejun Heo

Subject: Re: [RFC PATCH 0/2] workqueue: Introduce PF_WQ_RESCUE_WORKER

Hello,

On Wed, Dec 13, 2023 at 09:59:42AM +0100, Juri Lelli wrote:
> Something like the following then maybe?
>
> ---
> kernel/workqueue.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 2989b57e154a7..ed73f7f80d57d 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -4405,6 +4405,12 @@ static void apply_wqattrs_commit(struct apply_wqattrs_ctx *ctx)
> link_pwq(ctx->dfl_pwq);
> swap(ctx->wq->dfl_pwq, ctx->dfl_pwq);
>
> + /* rescuer needs to respect wq cpumask changes */
> + if (ctx->wq->rescuer) {
> + kthread_bind_mask(ctx->wq->rescuer->task, ctx->attrs->cpumask);
> + wake_up_process(ctx->wq->rescuer->task);
> + }
> +
> mutex_unlock(&ctx->wq->mutex);
> }

I'm not sure kthread_bind_mask() would be safe here. The rescuer might be
running a work item. wait_task_inactive() might fail and we don't want to
change cpumask while the rescuer is active anyway.

Maybe the easiest way to do this is making rescuer_thread() restore the wq's
cpumask right before going to sleep, and making apply_wqattrs_commit() just
wake up the rescuer.

Thanks.

--
tejun

2023-12-13 18:32:28

by Juri Lelli

Subject: Re: [RFC PATCH 0/2] workqueue: Introduce PF_WQ_RESCUE_WORKER

On 13/12/23 05:35, Tejun Heo wrote:
> Hello,
>
> On Wed, Dec 13, 2023 at 09:59:42AM +0100, Juri Lelli wrote:
> > Something like the following then maybe?
> >
> > ---
> > kernel/workqueue.c | 6 ++++++
> > 1 file changed, 6 insertions(+)
> >
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index 2989b57e154a7..ed73f7f80d57d 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -4405,6 +4405,12 @@ static void apply_wqattrs_commit(struct apply_wqattrs_ctx *ctx)
> > link_pwq(ctx->dfl_pwq);
> > swap(ctx->wq->dfl_pwq, ctx->dfl_pwq);
> >
> > + /* rescuer needs to respect wq cpumask changes */
> > + if (ctx->wq->rescuer) {
> > + kthread_bind_mask(ctx->wq->rescuer->task, ctx->attrs->cpumask);
> > + wake_up_process(ctx->wq->rescuer->task);
> > + }
> > +
> > mutex_unlock(&ctx->wq->mutex);
> > }
>
> I'm not sure kthread_bind_mask() would be safe here. The rescuer might be
> running a work item. wait_task_inactive() might fail and we don't want to
> change cpumask while the rescuer is active anyway.
>
> Maybe the easiest way to do this is making rescuer_thread() restore the wq's
> cpumask right before going to sleep, and making apply_wqattrs_commit() just
> wake up the rescuer.

Hummm, don't think we can call that either while the rescuer is actually
running. Maybe we can simply s/kthread_bind_mask/set_cpus_allowed_ptr/
in the above?

Thanks,
Juri

2023-12-13 18:38:51

by Tejun Heo

Subject: Re: [RFC PATCH 0/2] workqueue: Introduce PF_WQ_RESCUE_WORKER

On Wed, Dec 13, 2023 at 07:32:10PM +0100, Juri Lelli wrote:
> > Maybe the easiest way to do this is making rescuer_thread() restore the wq's
> > cpumask right before going to sleep, and making apply_wqattrs_commit() just
> > wake up the rescuer.
>
> Hummm, don't think we can call that either while the rescuer is actually
> running. Maybe we can simply s/kthread_bind_mask/set_cpus_allowed_ptr/
> in the above?

So, we have to use set_cpus_allowed_ptr() but we still don't want to change
the affinity of a rescuer which is already running a task for a pool.

Thanks.

--
tejun

2023-12-14 11:26:00

by Juri Lelli

Subject: Re: [RFC PATCH 0/2] workqueue: Introduce PF_WQ_RESCUE_WORKER

On 13/12/23 08:38, Tejun Heo wrote:
> On Wed, Dec 13, 2023 at 07:32:10PM +0100, Juri Lelli wrote:
> > > Maybe the easiest way to do this is making rescuer_thread() restore the wq's
> > > cpumask right before going to sleep, and making apply_wqattrs_commit() just
> > > wake up the rescuer.
> >
> > Hummm, don't think we can call that either while the rescuer is actually
> > running. Maybe we can simply s/kthread_bind_mask/set_cpus_allowed_ptr/
> > in the above?
>
> So, we have to use set_cpus_allowed_ptr() but we still don't want to change
> the affinity of a rescuer which is already running a task for a pool.

But then, even today, a rescuer might keep handling work on a cpu
outside its wq cpumask if the associated wq cpumask change can proceed
w/o waiting for it to finish the iteration?

BTW, apologies for all the questions, but I'd like to make sure I can
get the implications hopefully right. :)

Thanks,
Juri

2023-12-14 19:48:46

by Tejun Heo

Subject: Re: [RFC PATCH 0/2] workqueue: Introduce PF_WQ_RESCUE_WORKER

Hello,

On Thu, Dec 14, 2023 at 12:25:25PM +0100, Juri Lelli wrote:
> > So, we have to use set_cpus_allowed_ptr() but we still don't want to change
> > the affinity of a rescuer which is already running a task for a pool.
>
> But then, even today, a rescuer might keep handling work on a cpu
> outside its wq cpumask if the associated wq cpumask change can proceed
> w/o waiting for it to finish the iteration?

Yeah, that can happen and pool cpumasks naturally being subsets of the wq's
cpumask that they're serving, your original approach likely isn't broken
either.

> BTW, apologies for all the questions, but I'd like to make sure I can
> get the implications hopefully right. :)

I obviously haven't thought through it very well, so thanks for the
questions. So, yeah, I think we actually need to set the rescuer's cpumask
when wq's cpumask changes and doing it where you were suggesting should
probably work.

Thanks.

--
tejun

2023-12-15 06:51:42

by Juri Lelli

Subject: Re: [RFC PATCH 0/2] workqueue: Introduce PF_WQ_RESCUE_WORKER

On 14/12/23 09:47, Tejun Heo wrote:
> Hello,
>
> On Thu, Dec 14, 2023 at 12:25:25PM +0100, Juri Lelli wrote:
> > > So, we have to use set_cpus_allowed_ptr() but we still don't want to change
> > > the affinity of a rescuer which is already running a task for a pool.
> >
> > But then, even today, a rescuer might keep handling work on a cpu
> > outside its wq cpumask if the associated wq cpumask change can proceed
> > w/o waiting for it to finish the iteration?
>
> Yeah, that can happen and pool cpumasks naturally being subsets of the wq's
> cpumask that they're serving, your original approach likely isn't broken
> either.
>
> > BTW, apologies for all the questions, but I'd like to make sure I can
> > get the implications hopefully right. :)
>
> I obviously haven't thought through it very well, so thanks for the
> questions. So, yeah, I think we actually need to set the rescuer's cpumask
> when wq's cpumask changes and doing it where you were suggesting should
> probably work.

OK. Going to send a proper patch asap.

Thanks!
Juri


2023-12-19 08:55:40

by Juri Lelli

Subject: Re: [RFC PATCH 0/2] workqueue: Introduce PF_WQ_RESCUE_WORKER

Hello again,

On 15/12/23 07:50, Juri Lelli wrote:
> On 14/12/23 09:47, Tejun Heo wrote:
> > Hello,
> >
> > On Thu, Dec 14, 2023 at 12:25:25PM +0100, Juri Lelli wrote:
> > > > So, we have to use set_cpus_allowed_ptr() but we still don't want to change
> > > > the affinity of a rescuer which is already running a task for a pool.
> > >
> > > But then, even today, a rescuer might keep handling work on a cpu
> > > outside its wq cpumask if the associated wq cpumask change can proceed
> > > w/o waiting for it to finish the iteration?
> >
> > Yeah, that can happen and pool cpumasks naturally being subsets of the wq's
> > cpumask that they're serving, your original approach likely isn't broken
> > either.
> >
> > > BTW, apologies for all the questions, but I'd like to make sure I can
> > > get the implications hopefully right. :)
> >
> > I obviously haven't thought through it very well, so thanks for the
> > questions. So, yeah, I think we actually need to set the rescuer's cpumask
> > when wq's cpumask changes and doing it where you were suggesting should
> > probably work.
>
> OK. Going to send a proper patch asap.

I actually didn't do that yet as it turns out the proposed approach
doesn't cover !WQ_SYSFS unbound wqs. Well, I thought those should be
covered as well, since we have (initiated by echo <mask> into
/sys/devices/virtual/workqueue/cpumask)

 workqueue_apply_unbound_cpumask ->
   apply_wqattrs_commit

but for some reason the mask change is not reflected in the rescuers'
affinity.

Trying to dig deeper I went ahead and extended the recent wq_dump.py
addition with the following

---
 tools/workqueue/wq_dump.py | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)

diff --git a/tools/workqueue/wq_dump.py b/tools/workqueue/wq_dump.py
index d0df5833f2c18..6da621989e210 100644
--- a/tools/workqueue/wq_dump.py
+++ b/tools/workqueue/wq_dump.py
@@ -175,3 +175,32 @@ for wq in list_for_each_entry('struct workqueue_struct', workqueues.address_of_(
     if wq.flags & WQ_UNBOUND:
         print(f' {wq.dfl_pwq.pool.id.value_():{max_pool_id_len}}', end='')
     print('')
+
+print('')
+print('Workqueue -> rescuer')
+print('=====================')
+print(f'wq_unbound_cpumask={cpumask_str(wq_unbound_cpumask)}')
+print('')
+print('[ workqueue \ type unbound_cpumask rescuer pid cpumask]')
+
+for wq in list_for_each_entry('struct workqueue_struct', workqueues.address_of_(), 'list'):
+    print(f'{wq.name.string_().decode()[-24:]:24}', end='')
+    if wq.flags & WQ_UNBOUND:
+        if wq.flags & WQ_ORDERED:
+            print(' ordered ', end='')
+        else:
+            print(' unbound', end='')
+            if wq.unbound_attrs.affn_strict:
+                print(',S ', end='')
+            else:
+                print(' ', end='')
+        print(f' {cpumask_str(wq.unbound_attrs.cpumask):24}', end='')
+    else:
+        print(' percpu ', end='')
+        print(' ', end='')
+
+    if wq.flags & WQ_MEM_RECLAIM:
+        print(f' {wq.rescuer.task.comm.string_().decode()[-24:]:24}', end='')
+        print(f' {wq.rescuer.task.pid.value_():5}', end='')
+        print(f' {cpumask_str(wq.rescuer.task.cpus_ptr)}', end='')
+    print('')
---

which shows the following situation after an

# echo 00,00000003 > /sys/devices/virtual/workqueue/cpumask

on the system I'm testing with:

...
Workqueue -> rescuer
=====================
wq_unbound_cpumask=00000003

[ workqueue            \ type     unbound_cpumask      rescuer           pid cpumask]
events                   percpu
events_highpri           percpu
events_long              percpu
events_unbound           unbound  0xffffffff 000000ff
events_freezable         percpu
events_power_efficient   percpu
events_freezable_power_  percpu
rcu_gp                   percpu                        kworker/R-rcu_g     4 0xffffffff 000000ff
rcu_par_gp               percpu                        kworker/R-rcu_p     5 0xffffffff 000000ff
slub_flushwq             percpu                        kworker/R-slub_     6 0xffffffff 000000ff
netns                    ordered  0xffffffff 000000ff  kworker/R-netns     7 0xffffffff 000000ff
mm_percpu_wq             percpu                        kworker/R-mm_pe    13 0xffffffff 000000ff
cpuset_migrate_mm        ordered  0xffffffff 000000ff
inet_frag_wq             percpu                        kworker/R-inet_   300 0xffffffff 000000ff
pm                       percpu
cgroup_destroy           percpu
cgroup_pidlist_destroy   percpu
writeback                unbound  0xffffffff 000000ff  kworker/R-write   308 0xffffffff 000000ff
cgwb_release             percpu
cryptd                   percpu                        kworker/R-crypt   314 0xffffffff 000000ff
kintegrityd              percpu                        kworker/R-kinte   315 0xffffffff 000000ff
kblockd                  percpu                        kworker/R-kbloc   316 0xffffffff 000000ff
kacpid                   percpu
kacpi_notify             percpu
kacpi_hotplug            ordered  0xffffffff 000000ff
kec                      ordered  0xffffffff 000000ff
kec_query                percpu
tpm_dev_wq               percpu                        kworker/R-tpm_d   352 0xffffffff 000000ff
usb_hub_wq               percpu
md                       percpu                        kworker/R-md      353 0xffffffff 000000ff
md_misc                  percpu
md_bitmap                unbound  0xffffffff 000000ff  kworker/R-md_bi   354 0xffffffff 000000ff
edac-poller              ordered  0xffffffff 000000ff  kworker/R-edac-   355 0xffffffff 000000ff
...

I guess I expected wq_unbound_cpumask and unbound_cpumask for each
unbound wq to be kept in sync, so I'm evidently missing details. :)
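
For comparing masks like the ones above, a small helper that decodes the
hex mask format written to the sysfs cpumask file can help. This is a
sketch with hypothetical function names; it assumes the usual kernel bitmap
text format of comma-separated 32-bit hex groups with the most significant
group first, which is not the layout wq_dump.py prints.

```python
def parse_cpumask(text):
    """Decode a sysfs-style cpumask such as '00,00000003' into CPU numbers."""
    value = 0
    for group in text.strip().split(","):
        # Each comma-separated group carries 32 bits, most significant first.
        value = (value << 32) | int(group, 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


def is_subset(mask, allowed):
    """True if every CPU in 'mask' is also present in 'allowed'."""
    return set(parse_cpumask(mask)) <= set(parse_cpumask(allowed))
```

With the mask from the echo above, parse_cpumask('00,00000003') yields CPUs
0 and 1, whereas a 40-CPU mask written as 'ff,ffffffff' (an illustrative
value) is not a subset of it.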

Can you please help me understand what I am missing?

Thanks!
Juri