From: Yicong Yang <[email protected]>
During load balancing we don't check whether the candidate task is
migration disabled, which may hit the WARN_ON_ONCE() in set_task_cpu()
since migration-disabled tasks are expected to stay on their current CPU.
We've run into this several times on our server:
------------[ cut here ]------------
WARNING: CPU: 7 PID: 0 at kernel/sched/core.c:3115 set_task_cpu+0x188/0x240
Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT nf_reject_ipv4 <...snip>
CPU: 7 PID: 0 Comm: swapper/7 Kdump: loaded Tainted: G O 6.1.0-rc4+ #1
Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V5.B221.01 12/09/2021
pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : set_task_cpu+0x188/0x240
lr : load_balance+0x5d0/0xc60
sp : ffff80000803bc70
x29: ffff80000803bc70 x28: ffff004089e190e8 x27: ffff004089e19040
x26: ffff007effcabc38 x25: 0000000000000000 x24: 0000000000000001
x23: ffff80000803be84 x22: 000000000000000c x21: ffffb093e79e2a78
x20: 000000000000000c x19: ffff004089e19040 x18: 0000000000000000
x17: 0000000000001fad x16: 0000000000000030 x15: 0000000000000000
x14: 0000000000000003 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000001 x10: 0000000000000400 x9 : ffffb093e4cee530
x8 : 00000000fffffffe x7 : 0000000000ce168a x6 : 000000000000013e
x5 : 00000000ffffffe1 x4 : 0000000000000001 x3 : 0000000000000b2a
x2 : 0000000000000b2a x1 : ffffb093e6d6c510 x0 : 0000000000000001
Call trace:
set_task_cpu+0x188/0x240
load_balance+0x5d0/0xc60
rebalance_domains+0x26c/0x380
_nohz_idle_balance.isra.0+0x1e0/0x370
run_rebalance_domains+0x6c/0x80
__do_softirq+0x128/0x3d8
____do_softirq+0x18/0x24
call_on_irq_stack+0x2c/0x38
do_softirq_own_stack+0x24/0x3c
__irq_exit_rcu+0xcc/0xf4
irq_exit_rcu+0x18/0x24
el1_interrupt+0x4c/0xe4
el1h_64_irq_handler+0x18/0x2c
el1h_64_irq+0x74/0x78
arch_cpu_idle+0x18/0x4c
default_idle_call+0x58/0x194
do_idle+0x244/0x2b0
cpu_startup_entry+0x30/0x3c
secondary_start_kernel+0x14c/0x190
__secondary_switched+0xb0/0xb4
---[ end trace 0000000000000000 ]---
Signed-off-by: Yicong Yang <[email protected]>
---
kernel/sched/fair.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a1b1f855b96..8fe767362d22 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8433,6 +8433,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if (kthread_is_per_cpu(p))
return 0;
+ /* Migration disabled tasks need to be kept on their running CPU. */
+ if (is_migration_disabled(p))
+ return 0;
+
if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
int cpu;
--
2.24.0
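For context, the warning the changelog refers to comes from the sanity checks in
set_task_cpu() in kernel/sched/core.c. A rough sketch of the relevant part, with the
surrounding checks elided and paraphrased (exact layout differs between kernel versions):

void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
{
#ifdef CONFIG_SCHED_DEBUG
        /* ... other sanity checks elided ... */

        /*
         * Migration-disabled tasks must not be retargeted; note that the
         * check fires regardless of whether new_cpu differs from task_cpu(p).
         */
        WARN_ON_ONCE(is_migration_disabled(p));
#endif
        /* ... the actual CPU assignment elided ... */
}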
On 2023-03-13 at 14:57:59 +0800, Yicong Yang wrote:
> From: Yicong Yang <[email protected]>
>
> During load balancing we don't check whether the candidate task is
> migration disabled, which may hit the WARN_ON_ONCE() in set_task_cpu()
> since migration-disabled tasks are expected to stay on their current CPU.
> We've run into this several times on our server:
>
> ------------[ cut here ]------------
> WARNING: CPU: 7 PID: 0 at kernel/sched/core.c:3115 set_task_cpu+0x188/0x240
> Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT nf_reject_ipv4 <...snip>
> CPU: 7 PID: 0 Comm: swapper/7 Kdump: loaded Tainted: G O 6.1.0-rc4+ #1
> Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V5.B221.01 12/09/2021
> pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> pc : set_task_cpu+0x188/0x240
> lr : load_balance+0x5d0/0xc60
> sp : ffff80000803bc70
> x29: ffff80000803bc70 x28: ffff004089e190e8 x27: ffff004089e19040
> x26: ffff007effcabc38 x25: 0000000000000000 x24: 0000000000000001
> x23: ffff80000803be84 x22: 000000000000000c x21: ffffb093e79e2a78
> x20: 000000000000000c x19: ffff004089e19040 x18: 0000000000000000
> x17: 0000000000001fad x16: 0000000000000030 x15: 0000000000000000
> x14: 0000000000000003 x13: 0000000000000000 x12: 0000000000000000
> x11: 0000000000000001 x10: 0000000000000400 x9 : ffffb093e4cee530
> x8 : 00000000fffffffe x7 : 0000000000ce168a x6 : 000000000000013e
> x5 : 00000000ffffffe1 x4 : 0000000000000001 x3 : 0000000000000b2a
> x2 : 0000000000000b2a x1 : ffffb093e6d6c510 x0 : 0000000000000001
> Call trace:
> set_task_cpu+0x188/0x240
> load_balance+0x5d0/0xc60
> rebalance_domains+0x26c/0x380
> _nohz_idle_balance.isra.0+0x1e0/0x370
> run_rebalance_domains+0x6c/0x80
> __do_softirq+0x128/0x3d8
> ____do_softirq+0x18/0x24
> call_on_irq_stack+0x2c/0x38
> do_softirq_own_stack+0x24/0x3c
> __irq_exit_rcu+0xcc/0xf4
> irq_exit_rcu+0x18/0x24
> el1_interrupt+0x4c/0xe4
> el1h_64_irq_handler+0x18/0x2c
> el1h_64_irq+0x74/0x78
> arch_cpu_idle+0x18/0x4c
> default_idle_call+0x58/0x194
> do_idle+0x244/0x2b0
> cpu_startup_entry+0x30/0x3c
> secondary_start_kernel+0x14c/0x190
> __secondary_switched+0xb0/0xb4
> ---[ end trace 0000000000000000 ]---
>
> Signed-off-by: Yicong Yang <[email protected]>
> ---
> kernel/sched/fair.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a1b1f855b96..8fe767362d22 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8433,6 +8433,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> if (kthread_is_per_cpu(p))
> return 0;
>
> + /* Migration disabled tasks need to be kept on their running CPU. */
> + if (is_migration_disabled(p))
> + return 0;
> +
> if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
> int cpu;
>
> --
> 2.24.0
>
Looks reasonable to me. Would it be possible to also update the comment at the beginning of
can_migrate_task() that starts with "We do not migrate tasks that are:"?
Reviewed-by: Chen Yu <[email protected]>
thanks,
Chenyu
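For reference, the comment block Chen Yu is pointing at reads roughly as below at the top of
can_migrate_task(); item 5 sketches one possible shape for the suggested update (the wording
of the addition is only an illustration, not something taken from a merged patch):

        /*
         * We do not migrate tasks that are:
         * 1) throttled_lb_pair, or
         * 2) cannot be migrated to this CPU due to cpus_ptr, or
         * 3) running (obviously), or
         * 4) are cache-hot on their current CPU, or
         * 5) migration disabled (illustrative addition).
         */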
On 2023/3/14 11:08, Chen Yu wrote:
> On 2023-03-13 at 14:57:59 +0800, Yicong Yang wrote:
>> From: Yicong Yang <[email protected]>
>>
>> During load balancing we don't check whether the candidate task is
>> migration disabled, which may hit the WARN_ON_ONCE() in set_task_cpu()
>> since migration-disabled tasks are expected to stay on their current CPU.
>> We've run into this several times on our server:
>>
>> ------------[ cut here ]------------
>> WARNING: CPU: 7 PID: 0 at kernel/sched/core.c:3115 set_task_cpu+0x188/0x240
>> Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT nf_reject_ipv4 <...snip>
>> CPU: 7 PID: 0 Comm: swapper/7 Kdump: loaded Tainted: G O 6.1.0-rc4+ #1
>> Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V5.B221.01 12/09/2021
>> pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> pc : set_task_cpu+0x188/0x240
>> lr : load_balance+0x5d0/0xc60
>> sp : ffff80000803bc70
>> x29: ffff80000803bc70 x28: ffff004089e190e8 x27: ffff004089e19040
>> x26: ffff007effcabc38 x25: 0000000000000000 x24: 0000000000000001
>> x23: ffff80000803be84 x22: 000000000000000c x21: ffffb093e79e2a78
>> x20: 000000000000000c x19: ffff004089e19040 x18: 0000000000000000
>> x17: 0000000000001fad x16: 0000000000000030 x15: 0000000000000000
>> x14: 0000000000000003 x13: 0000000000000000 x12: 0000000000000000
>> x11: 0000000000000001 x10: 0000000000000400 x9 : ffffb093e4cee530
>> x8 : 00000000fffffffe x7 : 0000000000ce168a x6 : 000000000000013e
>> x5 : 00000000ffffffe1 x4 : 0000000000000001 x3 : 0000000000000b2a
>> x2 : 0000000000000b2a x1 : ffffb093e6d6c510 x0 : 0000000000000001
>> Call trace:
>> set_task_cpu+0x188/0x240
>> load_balance+0x5d0/0xc60
>> rebalance_domains+0x26c/0x380
>> _nohz_idle_balance.isra.0+0x1e0/0x370
>> run_rebalance_domains+0x6c/0x80
>> __do_softirq+0x128/0x3d8
>> ____do_softirq+0x18/0x24
>> call_on_irq_stack+0x2c/0x38
>> do_softirq_own_stack+0x24/0x3c
>> __irq_exit_rcu+0xcc/0xf4
>> irq_exit_rcu+0x18/0x24
>> el1_interrupt+0x4c/0xe4
>> el1h_64_irq_handler+0x18/0x2c
>> el1h_64_irq+0x74/0x78
>> arch_cpu_idle+0x18/0x4c
>> default_idle_call+0x58/0x194
>> do_idle+0x244/0x2b0
>> cpu_startup_entry+0x30/0x3c
>> secondary_start_kernel+0x14c/0x190
>> __secondary_switched+0xb0/0xb4
>> ---[ end trace 0000000000000000 ]---
>>
>> Signed-off-by: Yicong Yang <[email protected]>
>> ---
>> kernel/sched/fair.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7a1b1f855b96..8fe767362d22 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8433,6 +8433,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>> if (kthread_is_per_cpu(p))
>> return 0;
>>
>> + /* Migration disabled tasks need to be kept on their running CPU. */
>> + if (is_migration_disabled(p))
>> + return 0;
>> +
>> if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
>> int cpu;
>>
>> --
>> 2.24.0
>>
> Looks reasonable to me. Would it be possible to also update the comment at the beginning of
> can_migrate_task() that starts with "We do not migrate tasks that are:"?
>
Thanks for the suggestion! It seems only the conditions without a comment of their own are
summarized in that list; the others are mentioned right next to their branch, like the
kthread_is_per_cpu(p) case. I can add it in v2 if you think it'll be useful.
> Reviewed-by: Chen Yu <[email protected]>
Thanks,
Yicong
>
> thanks,
> Chenyu
>
> .
>
On 13/03/23 14:57, Yicong Yang wrote:
> kernel/sched/fair.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a1b1f855b96..8fe767362d22 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8433,6 +8433,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> if (kthread_is_per_cpu(p))
> return 0;
>
> + /* Migration disabled tasks need to be kept on their running CPU. */
> + if (is_migration_disabled(p))
> + return 0;
> +
> if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
> int cpu;
That cpumask check should cover migration_disabled tasks, unless they
haven't gone through migrate_disable_switch() yet
(p->migration_disabled == 1, but the cpus_ptr hasn't been touched yet).
Now, if that's the case, the task has to be src_rq's current (since it
hasn't switched out), which means can_migrate_task() should exit via:
if (task_on_cpu(env->src_rq, p)) {
schedstat_inc(p->stats.nr_failed_migrations_running);
return 0;
}
and thus not try to detach_task(). With that in mind, I don't get how your
splat can happen, nor how the change can help (a remote task p could
execute migrate_disable() concurrently with can_migrate_task(p)).
I'm a bit confused here, detach_tasks() happens entirely with src_rq
rq_lock held, so there shouldn't be any surprises.
Can you share any extra context? E.g. exact HEAD of your tree, maybe the
migrate_disable task in question if you have that info.
>
> --
> 2.24.0
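To make the locking point concrete: detach_task() in kernel/sched/fair.c, roughly as of the
v6.1 time frame (paraphrased), reaches set_task_cpu() with the src_rq lock held:

static void detach_task(struct task_struct *p, struct lb_env *env)
{
        lockdep_assert_rq_held(env->src_rq);

        /* Dequeue from the source runqueue and retarget to the destination CPU. */
        deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
        set_task_cpu(p, env->dst_cpu);
}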
On 2023-03-15 at 17:55:13 +0800, Yicong Yang wrote:
> On 2023/3/14 11:08, Chen Yu wrote:
> > On 2023-03-13 at 14:57:59 +0800, Yicong Yang wrote:
> >> From: Yicong Yang <[email protected]>
> >>
> >> During load balancing we don't check whether the candidate task is
> >> migration disabled, which may hit the WARN_ON_ONCE() in set_task_cpu()
> >> since migration-disabled tasks are expected to stay on their current CPU.
> >> We've run into this several times on our server:
> >>
> >> ------------[ cut here ]------------
> >> WARNING: CPU: 7 PID: 0 at kernel/sched/core.c:3115 set_task_cpu+0x188/0x240
> >> Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT nf_reject_ipv4 <...snip>
> >> CPU: 7 PID: 0 Comm: swapper/7 Kdump: loaded Tainted: G O 6.1.0-rc4+ #1
> >> Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V5.B221.01 12/09/2021
> >> pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> >> pc : set_task_cpu+0x188/0x240
> >> lr : load_balance+0x5d0/0xc60
> >> sp : ffff80000803bc70
> >> x29: ffff80000803bc70 x28: ffff004089e190e8 x27: ffff004089e19040
> >> x26: ffff007effcabc38 x25: 0000000000000000 x24: 0000000000000001
> >> x23: ffff80000803be84 x22: 000000000000000c x21: ffffb093e79e2a78
> >> x20: 000000000000000c x19: ffff004089e19040 x18: 0000000000000000
> >> x17: 0000000000001fad x16: 0000000000000030 x15: 0000000000000000
> >> x14: 0000000000000003 x13: 0000000000000000 x12: 0000000000000000
> >> x11: 0000000000000001 x10: 0000000000000400 x9 : ffffb093e4cee530
> >> x8 : 00000000fffffffe x7 : 0000000000ce168a x6 : 000000000000013e
> >> x5 : 00000000ffffffe1 x4 : 0000000000000001 x3 : 0000000000000b2a
> >> x2 : 0000000000000b2a x1 : ffffb093e6d6c510 x0 : 0000000000000001
> >> Call trace:
> >> set_task_cpu+0x188/0x240
> >> load_balance+0x5d0/0xc60
> >> rebalance_domains+0x26c/0x380
> >> _nohz_idle_balance.isra.0+0x1e0/0x370
> >> run_rebalance_domains+0x6c/0x80
> >> __do_softirq+0x128/0x3d8
> >> ____do_softirq+0x18/0x24
> >> call_on_irq_stack+0x2c/0x38
> >> do_softirq_own_stack+0x24/0x3c
> >> __irq_exit_rcu+0xcc/0xf4
> >> irq_exit_rcu+0x18/0x24
> >> el1_interrupt+0x4c/0xe4
> >> el1h_64_irq_handler+0x18/0x2c
> >> el1h_64_irq+0x74/0x78
> >> arch_cpu_idle+0x18/0x4c
> >> default_idle_call+0x58/0x194
> >> do_idle+0x244/0x2b0
> >> cpu_startup_entry+0x30/0x3c
> >> secondary_start_kernel+0x14c/0x190
> >> __secondary_switched+0xb0/0xb4
> >> ---[ end trace 0000000000000000 ]---
> >>
> >> Signed-off-by: Yicong Yang <[email protected]>
> >> ---
> >> kernel/sched/fair.c | 4 ++++
> >> 1 file changed, 4 insertions(+)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 7a1b1f855b96..8fe767362d22 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -8433,6 +8433,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> >> if (kthread_is_per_cpu(p))
> >> return 0;
> >>
> >> + /* Migration disabled tasks need to be kept on their running CPU. */
> >> + if (is_migration_disabled(p))
> >> + return 0;
> >> +
> >> if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
> >> int cpu;
> >>
> >> --
> >> 2.24.0
> >>
> > Looks reasonable to me. Would it be possible to also update the comment at the beginning of
> > can_migrate_task() that starts with "We do not migrate tasks that are:"?
> >
>
> Thanks for the suggestion! It seems only the conditions without a comment of their own are
> summarized in that list; the others are mentioned right next to their branch, like the
> kthread_is_per_cpu(p) case. I can add it in v2 if you think it'll be useful.
>
It seems that I overlooked migrate_disable(): it can only be called on the current task rather than
on an arbitrary task. As Valentin described in his reply, I'm also wondering what kind of race
condition could trigger this. Are you referring to something like this:

cpu1                                    cpu2
load_balance
  rq_lock(cpu2);
  detach_task(cpu2, p)
    can_migrate_task(p) returns true
                                        migrate_disable(current=p)
    set_task_cpu(p, cpu1);
    WARN(p can not migrate)

But can_migrate_task(p) should return 0, because p remains cpu2's current task for as
long as cpu1 holds the rq_lock.
Thanks,
Chenyu
> > Reviewed-by: Chen Yu <[email protected]>
>
> Thanks,
> Yicong
>
> >
> > thanks,
> > Chenyu
> >
> > .
> >
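As background for the "migrate_disable() can only be invoked on the current task" point above,
the helper itself looks roughly like this (paraphrased from kernel/sched/core.c; it never touches
a remote task, it only marks current and lets the next context switch narrow the affinity):

void migrate_disable(void)
{
        struct task_struct *p = current;

        /* Nested migrate_disable() only bumps the count. */
        if (p->migration_disabled) {
                p->migration_disabled++;
                return;
        }

        preempt_disable();
        this_rq()->nr_pinned++;
        p->migration_disabled = 1;
        preempt_enable();
}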
Hi Valentin,
On 2023/3/15 23:34, Valentin Schneider wrote:
> On 13/03/23 14:57, Yicong Yang wrote:
>> kernel/sched/fair.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7a1b1f855b96..8fe767362d22 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8433,6 +8433,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>> if (kthread_is_per_cpu(p))
>> return 0;
>>
>> + /* Migration disabled tasks need to be kept on their running CPU. */
>> + if (is_migration_disabled(p))
>> + return 0;
>> +
>> if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
>> int cpu;
>
> That cpumask check should cover migration_disabled tasks, unless they
> haven't gone through migrate_disable_switch() yet
> (p->migration_disabled == 1, but the cpus_ptr hasn't been touched yet).
>
> Now, if that's the case, the task has to be src_rq's current (since it
> hasn't switched out), which means can_migrate_task() should exit via:
>
> if (task_on_cpu(env->src_rq, p)) {
> schedstat_inc(p->stats.nr_failed_migrations_running);
> return 0;
> }
>
> and thus not try to detach_task(). With that in mind, I don't get how your
> splat can happen, nor how the change can help (a remote task p could
> execute migrate_disable() concurrently with can_migrate_task(p)).
>
I see. For migration-disabled tasks, if !p->on_cpu the migration is avoided by
the cpus_ptr check, and if p->on_cpu it is avoided by the task_on_cpu()
check. So this patch won't help.
> I'm a bit confused here, detach_tasks() happens entirely with src_rq
> rq_lock held, so there shouldn't be any surprises.
>
Since it's an arm64 machine, could the WARN_ON_ONCE() test be a false positive?
I mean, the update of p->migration_disabled might not be observed by the balancing
CPU, triggering this warning incorrectly.
> Can you share any extra context? E.g. exact HEAD of your tree, maybe the
> migrate_disable task in question if you have that info.
>
We met this on our internal kernel based on 6.1-rc4; the scheduler there is not
patched. We also saw it on previous versions like 6.0. This patch has been applied
to our kernel since 6.2, so we haven't seen it recently, but as you point out it doesn't
solve the problem. The tasks in question are some systemd services (udevd, etc.)
and I assume the migration disable comes from seccomp:
seccomp()
...
bpf_prog_run_pin_on_cpu()
migrate_disable()
...
migrate_enable()
Thanks,
Yicong
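The seccomp path above ends up in bpf_prog_run_pin_on_cpu(), which wraps the filter run in a
migrate_disable()/migrate_enable() pair; roughly (paraphrased from include/linux/filter.h,
details vary between versions):

static inline u32 bpf_prog_run_pin_on_cpu(const struct bpf_prog *prog,
                                          const void *ctx)
{
        u32 ret;

        /* Keep the task on this CPU while the program runs. */
        migrate_disable();
        ret = bpf_prog_run(prog, ctx);
        migrate_enable();
        return ret;
}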
Hi Chenyu,
On 2023/3/16 14:43, Chen Yu wrote:
> On 2023-03-15 at 17:55:13 +0800, Yicong Yang wrote:
>> On 2023/3/14 11:08, Chen Yu wrote:
>>> On 2023-03-13 at 14:57:59 +0800, Yicong Yang wrote:
>>>> From: Yicong Yang <[email protected]>
>>>>
>>>> During load balancing we don't check whether the candidate task is
>>>> migration disabled, which may hit the WARN_ON_ONCE() in set_task_cpu()
>>>> since migration-disabled tasks are expected to stay on their current CPU.
>>>> We've run into this several times on our server:
>>>>
>>>> ------------[ cut here ]------------
>>>> WARNING: CPU: 7 PID: 0 at kernel/sched/core.c:3115 set_task_cpu+0x188/0x240
>>>> Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT nf_reject_ipv4 <...snip>
>>>> CPU: 7 PID: 0 Comm: swapper/7 Kdump: loaded Tainted: G O 6.1.0-rc4+ #1
>>>> Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V5.B221.01 12/09/2021
>>>> pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>> pc : set_task_cpu+0x188/0x240
>>>> lr : load_balance+0x5d0/0xc60
>>>> sp : ffff80000803bc70
>>>> x29: ffff80000803bc70 x28: ffff004089e190e8 x27: ffff004089e19040
>>>> x26: ffff007effcabc38 x25: 0000000000000000 x24: 0000000000000001
>>>> x23: ffff80000803be84 x22: 000000000000000c x21: ffffb093e79e2a78
>>>> x20: 000000000000000c x19: ffff004089e19040 x18: 0000000000000000
>>>> x17: 0000000000001fad x16: 0000000000000030 x15: 0000000000000000
>>>> x14: 0000000000000003 x13: 0000000000000000 x12: 0000000000000000
>>>> x11: 0000000000000001 x10: 0000000000000400 x9 : ffffb093e4cee530
>>>> x8 : 00000000fffffffe x7 : 0000000000ce168a x6 : 000000000000013e
>>>> x5 : 00000000ffffffe1 x4 : 0000000000000001 x3 : 0000000000000b2a
>>>> x2 : 0000000000000b2a x1 : ffffb093e6d6c510 x0 : 0000000000000001
>>>> Call trace:
>>>> set_task_cpu+0x188/0x240
>>>> load_balance+0x5d0/0xc60
>>>> rebalance_domains+0x26c/0x380
>>>> _nohz_idle_balance.isra.0+0x1e0/0x370
>>>> run_rebalance_domains+0x6c/0x80
>>>> __do_softirq+0x128/0x3d8
>>>> ____do_softirq+0x18/0x24
>>>> call_on_irq_stack+0x2c/0x38
>>>> do_softirq_own_stack+0x24/0x3c
>>>> __irq_exit_rcu+0xcc/0xf4
>>>> irq_exit_rcu+0x18/0x24
>>>> el1_interrupt+0x4c/0xe4
>>>> el1h_64_irq_handler+0x18/0x2c
>>>> el1h_64_irq+0x74/0x78
>>>> arch_cpu_idle+0x18/0x4c
>>>> default_idle_call+0x58/0x194
>>>> do_idle+0x244/0x2b0
>>>> cpu_startup_entry+0x30/0x3c
>>>> secondary_start_kernel+0x14c/0x190
>>>> __secondary_switched+0xb0/0xb4
>>>> ---[ end trace 0000000000000000 ]---
>>>>
>>>> Signed-off-by: Yicong Yang <[email protected]>
>>>> ---
>>>> kernel/sched/fair.c | 4 ++++
>>>> 1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 7a1b1f855b96..8fe767362d22 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -8433,6 +8433,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>>> if (kthread_is_per_cpu(p))
>>>> return 0;
>>>>
>>>> + /* Migration disabled tasks need to be kept on their running CPU. */
>>>> + if (is_migration_disabled(p))
>>>> + return 0;
>>>> +
>>>> if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
>>>> int cpu;
>>>>
>>>> --
>>>> 2.24.0
>>>>
>>> Looks reasonable to me. Would it be possible to also update the comment at the beginning of
>>> can_migrate_task() that starts with "We do not migrate tasks that are:"?
>>>
>>
>> Thanks for the suggestion! It seems only the conditions without a comment of their own are
>> summarized in that list; the others are mentioned right next to their branch, like the
>> kthread_is_per_cpu(p) case. I can add it in v2 if you think it'll be useful.
>>
> It seems that I overlooked migrate_disable(): it can only be called on the current task rather than
> on an arbitrary task. As Valentin described in his reply, I'm also wondering what kind of race
> condition could trigger this. Are you referring to something like this:
>
> cpu1                                    cpu2
> load_balance
>   rq_lock(cpu2);
>   detach_task(cpu2, p)
>     can_migrate_task(p) returns true
>                                         migrate_disable(current=p)
>     set_task_cpu(p, cpu1);
>     WARN(p can not migrate)
>
> But can_migrate_task(p) should return 0, because p remains cpu2's current task for as
> long as cpu1 holds the rq_lock.
>
Yes, you're right that the current checks should avoid the issue. As I replied to Valentin, there may be
other reasons and this needs further investigation.
Thanks,
Yicong
On 16/03/23 17:13, Yicong Yang wrote:
> Hi Valentin,
>
> On 2023/3/15 23:34, Valentin Schneider wrote:
>> That cpumask check should cover migration_disabled tasks, unless they
>> haven't gone through migrate_disable_switch() yet
>> (p->migration_disabled == 1, but the cpus_ptr hasn't been touched yet).
>>
>> Now, if that's the case, the task has to be src_rq's current (since it
>> hasn't switched out), which means can_migrate_task() should exit via:
>>
>> if (task_on_cpu(env->src_rq, p)) {
>> schedstat_inc(p->stats.nr_failed_migrations_running);
>> return 0;
>> }
>>
>> and thus not try to detach_task(). With that in mind, I don't get how your
>> splat can happen, nor how the change can help (a remote task p could
>> execute migrate_disable() concurrently with can_migrate_task(p)).
>>
>
> I see. For migration-disabled tasks, if !p->on_cpu the migration is avoided by
> the cpus_ptr check, and if p->on_cpu it is avoided by the task_on_cpu()
> check. So this patch won't help.
>
Right.
>> I'm a bit confused here, detach_tasks() happens entirely with src_rq
>> rq_lock held, so there shouldn't be any surprises.
>>
>
> Since it's an arm64 machine, could the WARN_ON_ONCE() test be a false positive?
> I mean, the update of p->migration_disabled might not be observed by the balancing
> CPU, triggering this warning incorrectly.
>
Since the balance CPU holds the src_rq's rq_lock for the entire duration of
detach_tasks(), the actual value of p->migration_disabled shouldn't matter.
Either p has been scheduled out while being migration_disabled, in which
case its ->cpus_ptr has been updated, or p is still running, in which case
can_migrate_task() should return false (p->on_cpu). But from your report,
we're somehow not seeing one of these.
>> Can you share any extra context? E.g. exact HEAD of your tree, maybe the
>> migrate_disable task in question if you have that info.
>>
>
> We met this on our internal kernel based on 6.1-rc4; the scheduler there is not
> patched. We also saw it on previous versions like 6.0. This patch has been applied
> to our kernel since 6.2, so we haven't seen it recently, but as you point out it doesn't
> solve the problem.
Could you try to reproduce this on an unpatched upstream kernel?
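The "scheduled out while being migration_disabled" case is handled at context-switch time by
migrate_disable_switch(), which narrows ->cpus_ptr to the current CPU; a rough paraphrase
(around v6.1 the affinity helper takes a mask and flags directly, newer kernels pass a
struct affinity_context instead):

static void migrate_disable_switch(struct rq *rq, struct task_struct *p)
{
        if (likely(!p->migration_disabled))
                return;

        /* Affinity already narrowed, nothing to do. */
        if (p->cpus_ptr != &p->cpus_mask)
                return;

        /* Point ->cpus_ptr at just this CPU until migrate_enable(). */
        __do_set_cpus_allowed(p, cpumask_of(rq->cpu), SCA_MIGRATE_DISABLE);
}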
Hi Valentin,
On 2023/3/16 20:12, Valentin Schneider wrote:
> On 16/03/23 17:13, Yicong Yang wrote:
>> Hi Valentin,
>>
>> On 2023/3/15 23:34, Valentin Schneider wrote:
>>> That cpumask check should cover migration_disabled tasks, unless they
>>> haven't gone through migrate_disable_switch() yet
>>> (p->migration_disabled == 1, but the cpus_ptr hasn't been touched yet).
>>>
>>> Now, if that's the case, the task has to be src_rq's current (since it
>>> hasn't switched out), which means can_migrate_task() should exit via:
>>>
>>> if (task_on_cpu(env->src_rq, p)) {
>>> schedstat_inc(p->stats.nr_failed_migrations_running);
>>> return 0;
>>> }
>>>
>>> and thus not try to detach_task(). With that in mind, I don't get how your
>>> splat can happen, nor how the change can help (a remote task p could
>>> execute migrate_disable() concurrently with can_migrate_task(p)).
>>>
>>
>> I see. For migration-disabled tasks, if !p->on_cpu the migration is avoided by
>> the cpus_ptr check, and if p->on_cpu it is avoided by the task_on_cpu()
>> check. So this patch won't help.
>>
>
> Right.
>
>>> I'm a bit confused here, detach_tasks() happens entirely with src_rq
>>> rq_lock held, so there shouldn't be any surprises.
>>>
>>
>> Since it's an arm64 machine, could the WARN_ON_ONCE() test be a false positive?
>> I mean, the update of p->migration_disabled might not be observed by the balancing
>> CPU, triggering this warning incorrectly.
>>
>
> Since the balance CPU holds the src_rq's rq_lock for the entire duration of
> detach_tasks(), the actual value of p->migration_disabled shouldn't matter.
>
> Either p has been scheduled out while being migration_disabled, in which
> case its ->cpus_ptr has been updated, or p is still running, in which case
> can_migrate_task() should return false (p->on_cpu). But from your report,
> we're somehow not seeing one of these.
>
>>> Can you share any extra context? E.g. exact HEAD of your tree, maybe the
>>> migrate_disable task in question if you have that info.
>>>
>>
>> We met this on our internal kernel based on 6.1-rc4; the scheduler there is not
>> patched. We also saw it on previous versions like 6.0. This patch has been applied
>> to our kernel since 6.2, so we haven't seen it recently, but as you point out it doesn't
>> solve the problem.
>
> Could you try to reproduce this on an unpatched upstream kernel?
>
Sorry for the late reply. Yes, it can be reproduced on the upstream kernel (tested below on
6.4-rc1). Since it only happens occasionally with the normal setup, I wrote a test kthread
that disables and enables migration periodically:
static int workload_func(void *data)
{
        cpumask_var_t cpumask;
        int i;

        if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
                return -ENOMEM;

        for (i = 0; i < 8; i++)
                cpumask_set_cpu(i, cpumask);

        set_cpus_allowed_ptr(current, cpumask);
        free_cpumask_var(cpumask);

        while (!kthread_should_stop()) {
                migrate_disable();
                mdelay(1000);
                cond_resched();
                migrate_enable();
                mdelay(1000);
        }

        return -1;
}
Launching this and binding another workload to the CPU it's currently running on, like
`taskset -c $workload_cpu stress-ng -c 1`, will trigger the issue. In fact, the problem
is not in the migration disable mechanism, which works well, but in the balancing policy
after all the tasks on the source CPU are found to be pinned. With the debug print below
added:
@@ -8527,6 +8527,20 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if (kthread_is_per_cpu(p))
return 0;
+ if (is_migration_disabled(p)) {
+ if (!p->on_cpu && cpumask_test_cpu(env->dst_cpu, p->cpus_ptr))
+ pr_err("dst_cpu %d on_cpu %d cpus_ptr %*pbl cpus_mask %*pbl",
+ env->dst_cpu, p->on_cpu, cpumask_pr_args(p->cpus_ptr),
+ cpumask_pr_args(&p->cpus_mask));
+ }
+
if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
int cpu;
I got the output below:
[ 686.135619] dst_cpu 1 on_cpu 0 cpus_ptr 1 cpus_mask 0-7
[ 686.148809] ------------[ cut here ]------------
[ 686.169505] WARNING: CPU: 64 PID: 0 at kernel/sched/core.c:3210 set_task_cpu+0x190/0x250
[ 686.186537] Modules linked in: kthread_workload(O) bluetooth rfkill xt_CHECKSUM iptable_mangle xt_conntrack ipt_REJECT nf_reject_ipv4 ip6table_filter ip6_tables iptable_filter ib_isert iscsi_target_mod ib_ipoib ib_umad rpcrdma ib_iser libiscsi scsi_transport_iscsi crct10dif_ce hns_roce_hw_v2 arm_spe_pmu sbsa_gwdt sm4_generic sm4 xts ecb hisi_hpre hisi_uncore_l3c_pmu hisi_uncore_hha_pmu hisi_uncore_ddrc_pmu hisi_trng_v2 rng_core hisi_uncore_pmu spi_dw_mmio hisi_zip hisi_sec2 hisi_qm uacce hclge hns3 hnae3 hisi_sas_v3_hw hisi_sas_main libsas [last unloaded: kthread_workload(O)]
[ 686.293937] CPU: 64 PID: 0 Comm: swapper/64 Tainted: G O 6.4.0-rc1-migration-race-debug+ #24
[ 686.314616] Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V5.B211.01 11/10/2021
[ 686.333285] pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 686.347930] pc : set_task_cpu+0x190/0x250
[...]
It shows that we're about to "balance" the task to its current CPU (CPU 1) rather than
the balancer CPU (CPU 64). That's because we look for a new dst_cpu when the tasks
on the src_cpu are pinned, and the new_dst_cpu happens to be the task's current CPU.
So the right way to solve this may be to avoid selecting the src_cpu as the new_dst_cpu,
and the patch below solves the issue for me.
@@ -8550,7 +8564,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
/* Prevent to re-select dst_cpu via env's CPUs: */
for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {
- if (cpumask_test_cpu(cpu, p->cpus_ptr)) {
+ if (cpumask_test_cpu(cpu, p->cpus_ptr) && cpu != env->src_cpu) {
env->flags |= LBF_DST_PINNED;
env->new_dst_cpu = cpu;
break;
Thanks,
Yicong
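The dst_cpu re-selection described above happens in load_balance() once detach_tasks() gives up
with LBF_DST_PINNED set; the retry logic is roughly the following (paraphrased from
kernel/sched/fair.c; the loop_break constant has changed name across versions):

        if ((env.flags & LBF_DST_PINNED) && env.imbalance > 0) {
                /* Don't pick this dst_cpu again via env->cpus. */
                __cpumask_clear_cpu(env.dst_cpu, env.cpus);

                /* Retry with the alternative CPU found in can_migrate_task(). */
                env.dst_rq      = cpu_rq(env.new_dst_cpu);
                env.dst_cpu     = env.new_dst_cpu;
                env.flags      &= ~LBF_DST_PINNED;
                env.loop        = 0;
                env.loop_break  = SCHED_NR_MIGRATE_BREAK;

                /*
                 * If new_dst_cpu happens to be the task's own CPU, the next
                 * detach_task() calls set_task_cpu() on a migration-disabled
                 * task and trips the warning above.
                 */
                goto more_balance;
        }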
On 16/05/23 19:10, Yicong Yang wrote:
> Hi Valentin,
> Sorry for the late reply. Yes, it can be reproduced on the upstream kernel (tested below on
> 6.4-rc1). Since it only happens occasionally with the normal setup, I wrote a test kthread
> that disables and enables migration periodically:
>
> static int workload_func(void *data)
> {
>         cpumask_var_t cpumask;
>         int i;
>
>         if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
>                 return -ENOMEM;
>
>         for (i = 0; i < 8; i++)
>                 cpumask_set_cpu(i, cpumask);
>
>         set_cpus_allowed_ptr(current, cpumask);
>         free_cpumask_var(cpumask);
>
>         while (!kthread_should_stop()) {
>                 migrate_disable();
>                 mdelay(1000);
>                 cond_resched();
>                 migrate_enable();
>                 mdelay(1000);
>         }
>
>         return -1;
> }
>
> Launching this and binding another workload to the CPU it's currently running on, like
> `taskset -c $workload_cpu stress-ng -c 1`, will trigger the issue. In fact, the problem
> is not in the migration disable mechanism, which works well, but in the balancing policy
> after all the tasks on the source CPU are found to be pinned. With the debug print below
> added:
>
> @@ -8527,6 +8527,20 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> if (kthread_is_per_cpu(p))
> return 0;
>
> + if (is_migration_disabled(p)) {
> + if (!p->on_cpu && cpumask_test_cpu(env->dst_cpu, p->cpus_ptr))
> + pr_err("dst_cpu %d on_cpu %d cpus_ptr %*pbl cpus_mask %*pbl",
> + env->dst_cpu, p->on_cpu, cpumask_pr_args(p->cpus_ptr),
> + cpumask_pr_args(&p->cpus_mask));
> + }
> +
> if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
> int cpu;
>
> I got the output below:
>
> [ 686.135619] dst_cpu 1 on_cpu 0 cpus_ptr 1 cpus_mask 0-7
> [ 686.148809] ------------[ cut here ]------------
> [ 686.169505] WARNING: CPU: 64 PID: 0 at kernel/sched/core.c:3210 set_task_cpu+0x190/0x250
> [ 686.186537] Modules linked in: kthread_workload(O) bluetooth rfkill xt_CHECKSUM iptable_mangle xt_conntrack ipt_REJECT nf_reject_ipv4 ip6table_filter ip6_tables iptable_filter ib_isert iscsi_target_mod ib_ipoib ib_umad rpcrdma ib_iser libiscsi scsi_transport_iscsi crct10dif_ce hns_roce_hw_v2 arm_spe_pmu sbsa_gwdt sm4_generic sm4 xts ecb hisi_hpre hisi_uncore_l3c_pmu hisi_uncore_hha_pmu hisi_uncore_ddrc_pmu hisi_trng_v2 rng_core hisi_uncore_pmu spi_dw_mmio hisi_zip hisi_sec2 hisi_qm uacce hclge hns3 hnae3 hisi_sas_v3_hw hisi_sas_main libsas [last unloaded: kthread_workload(O)]
> [ 686.293937] CPU: 64 PID: 0 Comm: swapper/64 Tainted: G O 6.4.0-rc1-migration-race-debug+ #24
> [ 686.314616] Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V5.B211.01 11/10/2021
> [ 686.333285] pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 686.347930] pc : set_task_cpu+0x190/0x250
> [...]
>
> It shows that we're about to "balance" the task to its current CPU (CPU 1) rather than
> the balancer CPU (CPU 64). That's because we look for a new dst_cpu when the tasks
> on the src_cpu are pinned, and the new_dst_cpu happens to be the task's current CPU.
>
Nicely found! Thanks for having spent time on this. I haven't been able to
retrigger the issue using your reproducer, but I think you have indeed
found the root cause of it.
> So the right way to solve this may be to avoid selecting the src_cpu as the new_dst_cpu,
> and the patch below solves the issue for me.
>
> @@ -8550,7 +8564,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>
> /* Prevent to re-select dst_cpu via env's CPUs: */
> for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {
> - if (cpumask_test_cpu(cpu, p->cpus_ptr)) {
> + if (cpumask_test_cpu(cpu, p->cpus_ptr) && cpu != env->src_cpu) {
> env->flags |= LBF_DST_PINNED;
> env->new_dst_cpu = cpu;
> break;
>
Other than having some better in-place cpumask helpers, I don't think we can
make this look better. Could you send this change as a proper patch, please?
> Thanks,
> Yicong