2024-04-15 05:36:35

by Sven Schnelle

Subject: [PATCH] workqueue: fix selection of wake_cpu in kick_pool()

With cpu_possible_mask=0-63 and cpu_online_mask=0-7 the following
kernel oops was observed:

smp: Bringing up secondary CPUs ...
smp: Brought up 1 node, 8 CPUs
Unable to handle kernel pointer dereference in virtual kernel address space
Failing address: 0000000000000000 TEID: 0000000000000803
[..]
Call Trace:
arch_vcpu_is_preempted+0x12/0x80
select_idle_sibling+0x42/0x560
select_task_rq_fair+0x29a/0x3b0
try_to_wake_up+0x38e/0x6e0
kick_pool+0xa4/0x198
__queue_work.part.0+0x2bc/0x3a8
call_timer_fn+0x36/0x160
__run_timers+0x1e2/0x328
__run_timer_base+0x5a/0x88
run_timer_softirq+0x40/0x78
__do_softirq+0x118/0x388
irq_exit_rcu+0xc0/0xd8
do_ext_irq+0xae/0x168
ext_int_handler+0xbe/0xf0
psw_idle_exit+0x0/0xc
default_idle_call+0x3c/0x110
do_idle+0xd4/0x158
cpu_startup_entry+0x40/0x48
rest_init+0xc6/0xc8
start_kernel+0x3c4/0x5e0
startup_continue+0x3c/0x50

The crash is caused by calling arch_vcpu_is_preempted() for an offline
CPU. To avoid this, select the CPU with cpumask_any_and_distribute(),
which masks __pod_cpumask with cpu_online_mask.

Fixes: 8639ecebc9b1 ("workqueue: Implement non-strict affinity scope for unbound workqueues")
Signed-off-by: Sven Schnelle <[email protected]>
---
kernel/workqueue.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0066c8f6c154..d02b0c02c9e2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1277,7 +1277,8 @@ static bool kick_pool(struct worker_pool *pool)
!cpumask_test_cpu(p->wake_cpu, pool->attrs->__pod_cpumask)) {
struct work_struct *work = list_first_entry(&pool->worklist,
struct work_struct, entry);
- p->wake_cpu = cpumask_any_distribute(pool->attrs->__pod_cpumask);
+ p->wake_cpu = cpumask_any_and_distribute(pool->attrs->__pod_cpumask,
+ cpu_online_mask);
get_work_pwq(work)->stats[PWQ_STAT_REPATRIATED]++;
}
#endif
--
2.40.1



2024-04-16 23:07:05

by Tejun Heo

Subject: Re: [PATCH] workqueue: fix selection of wake_cpu in kick_pool()

Hello,

On Mon, Apr 15, 2024 at 07:35:49AM +0200, Sven Schnelle wrote:
> @@ -1277,7 +1277,8 @@ static bool kick_pool(struct worker_pool *pool)
> !cpumask_test_cpu(p->wake_cpu, pool->attrs->__pod_cpumask)) {
> struct work_struct *work = list_first_entry(&pool->worklist,
> struct work_struct, entry);
> - p->wake_cpu = cpumask_any_distribute(pool->attrs->__pod_cpumask);
> + p->wake_cpu = cpumask_any_and_distribute(pool->attrs->__pod_cpumask,
> + cpu_online_mask);

I think this can still race with the last CPU in the pod going down and
return nr_cpu_ids. Maybe something like the following would be better?

	int wake_cpu;

	wake_cpu = cpumask_any_and_distribute(...);
	if (wake_cpu < nr_cpu_ids) {
		p->wake_cpu = wake_cpu;
		// update stat;
	}

This generally seems like a good idea but isn't this still racy? The CPU may
go down between setting p->wake_cpu and wake_up_process().
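Schematically, the remaining window would look something like this
(illustrative only, not actual code):

	kick_pool()					CPU hotplug
	-----------					-----------
	wake_cpu = cpumask_any_and_distribute(
			__pod_cpumask, cpu_online_mask);
							last online CPU in
							the pod goes offline
	p->wake_cpu = wake_cpu;
	wake_up_process(p)
	  -> select_task_rq_fair()
	  -> arch_vcpu_is_preempted(p->wake_cpu)	/* CPU now offline */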

Thanks.

--
tejun

2024-04-17 15:51:09

by Sven Schnelle

Subject: Re: [PATCH] workqueue: fix selection of wake_cpu in kick_pool()

Tejun Heo <[email protected]> writes:

> On Mon, Apr 15, 2024 at 07:35:49AM +0200, Sven Schnelle wrote:
>> @@ -1277,7 +1277,8 @@ static bool kick_pool(struct worker_pool *pool)
>> !cpumask_test_cpu(p->wake_cpu, pool->attrs->__pod_cpumask)) {
>> struct work_struct *work = list_first_entry(&pool->worklist,
>> struct work_struct, entry);
>> - p->wake_cpu = cpumask_any_distribute(pool->attrs->__pod_cpumask);
>> + p->wake_cpu = cpumask_any_and_distribute(pool->attrs->__pod_cpumask,
>> + cpu_online_mask);
>
> I think this can still race with the last CPU in the pod going down and
> return nr_cpu_ids. Maybe something like the following would be better?
>
> 	int wake_cpu;
>
> 	wake_cpu = cpumask_any_and_distribute(...);
> 	if (wake_cpu < nr_cpu_ids) {
> 		p->wake_cpu = wake_cpu;
> 		// update stat;
> 	}
>
> This generally seems like a good idea but isn't this still racy? The CPU may
> go down between setting p->wake_cpu and wake_up_process().

Don't know without reading the source, but how does this code normally
protect against that?

Thanks
Sven

2024-04-18 01:56:04

by Tejun Heo

Subject: Re: [PATCH] workqueue: fix selection of wake_cpu in kick_pool()

Hello,

On Wed, Apr 17, 2024 at 05:36:38PM +0200, Sven Schnelle wrote:
> > This generally seems like a good idea but isn't this still racy? The CPU may
> > go down between setting p->wake_cpu and wake_up_process().
>
> Don't know without reading the source, but how does this code normally
> protect against that?

Probably by wrapping the selection of the wake_cpu and the wake_up
inside a cpus_read_lock() section.
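
As an illustrative sketch of that pattern (using the names from the
patch above; whether this lock is usable in this context is discussed
below):

	cpus_read_lock();
	p->wake_cpu = cpumask_any_and_distribute(pool->attrs->__pod_cpumask,
						 cpu_online_mask);
	wake_up_process(p);
	cpus_read_unlock();

With the hotplug lock held for reading, the selected CPU cannot go
offline before the wakeup happens.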

Thanks.

--
tejun

2024-04-18 05:55:03

by Sven Schnelle

Subject: Re: [PATCH] workqueue: fix selection of wake_cpu in kick_pool()

Tejun Heo <[email protected]> writes:

> Hello,
>
> On Wed, Apr 17, 2024 at 05:36:38PM +0200, Sven Schnelle wrote:
>> > This generally seems like a good idea but isn't this still racy? The CPU may
>> > go down between setting p->wake_cpu and wake_up_process().
>>
>> Don't know without reading the source, but how does this code normally
>> protect against that?
>
> Probably by wrapping the selection of the wake_cpu and the wake_up
> inside a cpus_read_lock() section.

Thanks. Should I send a v2 and incorporate your additional changes or do
you want to do that?

2024-04-18 15:56:26

by Tejun Heo

Subject: Re: [PATCH] workqueue: fix selection of wake_cpu in kick_pool()

On Thu, Apr 18, 2024 at 07:54:37AM +0200, Sven Schnelle wrote:
> Tejun Heo <[email protected]> writes:
>
> > Hello,
> >
> > On Wed, Apr 17, 2024 at 05:36:38PM +0200, Sven Schnelle wrote:
> >> > This generally seems like a good idea but isn't this still racy? The CPU may
> >> > go down between setting p->wake_cpu and wake_up_process().
> >>
> >> Don't know without reading the source, but how does this code normally
> >> protect against that?
> >
> > Probably by wrapping the selection of the wake_cpu and the wake_up
> > inside a cpus_read_lock() section.
>
> Thanks. Should I send a v2 and incorporate your additional changes or do
> you want to do that?

Yes, please send a v2.

Thanks.

--
tejun

2024-04-19 08:28:09

by Sven Schnelle

Subject: Re: [PATCH] workqueue: fix selection of wake_cpu in kick_pool()

Hi Tejun,

Tejun Heo <[email protected]> writes:

> On Wed, Apr 17, 2024 at 05:36:38PM +0200, Sven Schnelle wrote:
>> > This generally seems like a good idea but isn't this still racy? The CPU may
>> > go down between setting p->wake_cpu and wake_up_process().
>>
>> Don't know without reading the source, but how does this code normally
>> protect against that?
>
> Probably by wrapping the selection of the wake_cpu and the wake_up
> inside a cpus_read_lock() section.

Do you mean rcu_read_lock()? cpus_read_lock() takes a mutex, and the
crash happens in softirq context - so cpus_read_lock() can't be the
correct lock.

If I read the code correctly, cpu hotplug uses stop_machine_cpuslocked()
- so rcu_read_lock() should be sufficient for non-atomic context.

Looking at the backtrace, the crash is actually happening in
arch_vcpu_is_preempted(). I don't know the semantics of that function,
whether it is ok to call it for offline CPUs, or whether the calling
code should make sure that the CPU is online (which would be my guess).

Following the backtrace from my initial mail, I can't find a place that
checks whether p->wake_cpu is actually online. Eventually
available_idle_cpu() calls vcpu_is_preempted(). I wonder whether
available_idle_cpu() should do a cpu_online() check right at the
beginning?
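
Something like the following (an untested sketch; the idle_cpu() and
vcpu_is_preempted() tests are what available_idle_cpu() already does in
kernel/sched/core.c, the cpu_online() test would be the addition):

	int available_idle_cpu(int cpu)
	{
		/* proposed: never probe an offline CPU */
		if (!cpu_online(cpu))
			return 0;
		if (!idle_cpu(cpu))
			return 0;
		if (vcpu_is_preempted(cpu))
			return 0;
		return 1;
	}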

Adding Peter to CC, he probably knows.

Thanks,
Sven

2024-04-19 15:41:01

by Tejun Heo

Subject: Re: [PATCH] workqueue: fix selection of wake_cpu in kick_pool()

Hello, Sven.

On Fri, Apr 19, 2024 at 10:27:05AM +0200, Sven Schnelle wrote:
> > Probably by wrapping the selection of the wake_cpu and the wake_up
> > inside a cpus_read_lock() section.
>
> Do you mean rcu_read_lock()? cpus_read_lock() takes a mutex, and the
> crash happens in softirq context - so cpus_read_lock() can't be the
> correct lock.

I meant cpus_read_lock() but yeah we can't use that here.

> If I read the code correctly, cpu hotplug uses stop_machine_cpuslocked()
> - so rcu_read_lock() should be sufficient for non-atomic context.
>
> Looking at the backtrace, the crash is actually happening in
> arch_vcpu_is_preempted(). I don't know the semantics of that function,
> whether it is ok to call it for offline CPUs, or whether the calling
> code should make sure that the CPU is online (which would be my guess).
>
> Following the backtrace from my initial mail, I can't find a place that
> checks whether p->wake_cpu is actually online. Eventually
> available_idle_cpu() calls vcpu_is_preempted(). I wonder whether
> available_idle_cpu() should do a cpu_online() check right at the
> beginning?

Yeah, adding a cpu_online() test there makes more sense to me.

> Adding Peter to CC, he probably knows.

Peter?

Thanks.

--
tejun

2024-04-22 22:44:53

by Tejun Heo

Subject: Re: [PATCH] workqueue: fix selection of wake_cpu in kick_pool()

Hello,

On Mon, Apr 15, 2024 at 07:35:49AM +0200, Sven Schnelle wrote:
> With cpu_possible_mask=0-63 and cpu_online_mask=0-7 the following
> kernel oops was observed:
>
> smp: Bringing up secondary CPUs ...
> smp: Brought up 1 node, 8 CPUs
> Unable to handle kernel pointer dereference in virtual kernel address space
> Failing address: 0000000000000000 TEID: 0000000000000803
> [..]
> Call Trace:
> arch_vcpu_is_preempted+0x12/0x80
> select_idle_sibling+0x42/0x560
> select_task_rq_fair+0x29a/0x3b0
> try_to_wake_up+0x38e/0x6e0
> kick_pool+0xa4/0x198
> __queue_work.part.0+0x2bc/0x3a8
> call_timer_fn+0x36/0x160
> __run_timers+0x1e2/0x328
> __run_timer_base+0x5a/0x88
> run_timer_softirq+0x40/0x78
> __do_softirq+0x118/0x388
> irq_exit_rcu+0xc0/0xd8
> do_ext_irq+0xae/0x168
> ext_int_handler+0xbe/0xf0
> psw_idle_exit+0x0/0xc
> default_idle_call+0x3c/0x110
> do_idle+0xd4/0x158
> cpu_startup_entry+0x40/0x48
> rest_init+0xc6/0xc8
> start_kernel+0x3c4/0x5e0
> startup_continue+0x3c/0x50
>
> The crash is caused by calling arch_vcpu_is_preempted() for an offline
> CPU. To avoid this, select the CPU with cpumask_any_and_distribute(),
> which masks __pod_cpumask with cpu_online_mask.
>
> Fixes: 8639ecebc9b1 ("workqueue: Implement non-strict affinity scope for unbound workqueues")
> Signed-off-by: Sven Schnelle <[email protected]>
> ---
> kernel/workqueue.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 0066c8f6c154..d02b0c02c9e2 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1277,7 +1277,8 @@ static bool kick_pool(struct worker_pool *pool)
> !cpumask_test_cpu(p->wake_cpu, pool->attrs->__pod_cpumask)) {
> struct work_struct *work = list_first_entry(&pool->worklist,
> struct work_struct, entry);
> - p->wake_cpu = cpumask_any_distribute(pool->attrs->__pod_cpumask);
> + p->wake_cpu = cpumask_any_and_distribute(pool->attrs->__pod_cpumask,
> + cpu_online_mask);

So, this wouldn't necessarily fix the problem completely but regardless of
how that's plugged, this is still something we want to do to avoid picking
offline CPUs. Can you please update the patch description accordingly and
resend?

Thanks.

--
tejun

2024-04-23 06:20:57

by Sven Schnelle

Subject: Re: [PATCH] workqueue: fix selection of wake_cpu in kick_pool()

Tejun Heo <[email protected]> writes:

> Hello,
>
> On Mon, Apr 15, 2024 at 07:35:49AM +0200, Sven Schnelle wrote:
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index 0066c8f6c154..d02b0c02c9e2 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -1277,7 +1277,8 @@ static bool kick_pool(struct worker_pool *pool)
>> !cpumask_test_cpu(p->wake_cpu, pool->attrs->__pod_cpumask)) {
>> struct work_struct *work = list_first_entry(&pool->worklist,
>> struct work_struct, entry);
>> - p->wake_cpu = cpumask_any_distribute(pool->attrs->__pod_cpumask);
>> + p->wake_cpu = cpumask_any_and_distribute(pool->attrs->__pod_cpumask,
>> + cpu_online_mask);
>
> So, this wouldn't necessarily fix the problem completely but regardless of
> how that's plugged, this is still something we want to do to avoid picking
> offline CPUs. Can you please update the patch description accordingly and
> resend?

I just sent v2. I didn't mention the arch_vcpu_is_preempted() issue
in the commit description, as I'm not yet sure whether that's a wrong
assumption in the s390 code or in the common code. Still waiting to see
whether Peter has some insight.