LinuxLists.cc - [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period

[permalink] [raw]

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On Fri, Jun 09 2023 at 19:24, Xiongfeng Wang wrote:

Cc+ scheduler people, leave context intact

> Hello,
> When I do some low power tests, the following hung task is printed.
>
> Call trace:
> __switch_to+0xd4/0x160
> __schedule+0x38c/0x8c4
> __cond_resched+0x24/0x50
> unmap_kernel_range_noflush+0x210/0x240
> kretprobe_trampoline+0x0/0xc8
> __vunmap+0x70/0x31c
> __vfree+0x34/0x8c
> vfree+0x40/0x58
> free_vm_stack_cache+0x44/0x74
> cpuhp_invoke_callback+0xc4/0x71c
> _cpu_down+0x108/0x284
> kretprobe_trampoline+0x0/0xc8
> suspend_enter+0xd8/0x8ec
> suspend_devices_and_enter+0x1f0/0x360
> pm_suspend.part.1+0x428/0x53c
> pm_suspend+0x3c/0xa0
> devdrv_suspend_proc+0x148/0x248 [drv_devmng]
> devdrv_manager_set_power_state+0x140/0x680 [drv_devmng]
> devdrv_manager_ioctl+0xcc/0x210 [drv_devmng]
> drv_ascend_intf_ioctl+0x84/0x248 [drv_davinci_intf]
> __arm64_sys_ioctl+0xb4/0xf0
> el0_svc_common.constprop.0+0x140/0x374
> do_el0_svc+0x80/0xa0
> el0_svc+0x1c/0x28
> el0_sync_handler+0x90/0xf0
> el0_sync+0x168/0x180
>
> After some analysis, I found it is caused by the following race condition.
>
> 1. A task running on CPU1 is throttled for cfs bandwidth. CPU1 starts the
> hrtimer cfs_bandwidth 'period_timer' and enqueue the hrtimer on CPU1's rbtree.
> 2. Then the task is migrated to CPU2 and starts to offline CPU1. CPU1 starts
> CPUHP AP steps, and then the hrtimer 'period_timer' expires and re-enqueued on CPU1.
> 3. CPU1 runs to take_cpu_down() and disable irq. After CPU1 finished CPUHP AP
> steps, CPU2 starts the rest CPUHP step.
> 4. When CPU2 runs to free_vm_stack_cache(), it is sched out in __vunmap()
> because it run out of CPU quota. start_cfs_bandwidth() does not restart the
> hrtimer because 'cfs_b->period_active' is set.
> 5. The task waits the hrtimer 'period_timer' to expire to wake itself up, but
> CPU1 has disabled irq and the hrtimer won't expire until it is migrated to CPU2
> in hrtimers_dead_cpu(). But the task is blocked and cannot proceed to
> hrtimers_dead_cpu() step. So the task hungs.
>
> CPU1 CPU2
> Task set cfs_quota
> start hrtimer cfs_bandwidth 'period_timer'
> start to offline CPU1
> CPU1 start CPUHP AP step
> ...
> 'period_timer' expired and re-enqueued on CPU1
> ...
> disable irq in take_cpu_down()
> ...
> CPU2 start the rest CPUHP steps
> ...
> sched out in free_vm_stack_cache()
> wait for 'period_timer' expires
>
>
> Appreciate it a lot if anyone can give some suggestion on how fix this problem !
>
> Thanks,
> Xiongfeng

2023-06-12 13:13:10

[permalink] [raw]

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On 2023/6/9 22:55, Thomas Gleixner wrote:
> On Fri, Jun 09 2023 at 19:24, Xiongfeng Wang wrote:
>
> Cc+ scheduler people, leave context intact
>
>> Hello,
>> When I do some low power tests, the following hung task is printed.
>>
>> Call trace:
>> __switch_to+0xd4/0x160
>> __schedule+0x38c/0x8c4
>> __cond_resched+0x24/0x50
>> unmap_kernel_range_noflush+0x210/0x240
>> kretprobe_trampoline+0x0/0xc8
>> __vunmap+0x70/0x31c
>> __vfree+0x34/0x8c
>> vfree+0x40/0x58
>> free_vm_stack_cache+0x44/0x74
>> cpuhp_invoke_callback+0xc4/0x71c
>> _cpu_down+0x108/0x284
>> kretprobe_trampoline+0x0/0xc8
>> suspend_enter+0xd8/0x8ec
>> suspend_devices_and_enter+0x1f0/0x360
>> pm_suspend.part.1+0x428/0x53c
>> pm_suspend+0x3c/0xa0
>> devdrv_suspend_proc+0x148/0x248 [drv_devmng]
>> devdrv_manager_set_power_state+0x140/0x680 [drv_devmng]
>> devdrv_manager_ioctl+0xcc/0x210 [drv_devmng]
>> drv_ascend_intf_ioctl+0x84/0x248 [drv_davinci_intf]
>> __arm64_sys_ioctl+0xb4/0xf0
>> el0_svc_common.constprop.0+0x140/0x374
>> do_el0_svc+0x80/0xa0
>> el0_svc+0x1c/0x28
>> el0_sync_handler+0x90/0xf0
>> el0_sync+0x168/0x180
>>
>> After some analysis, I found it is caused by the following race condition.
>>
>> 1. A task running on CPU1 is throttled for cfs bandwidth. CPU1 starts the
>> hrtimer cfs_bandwidth 'period_timer' and enqueue the hrtimer on CPU1's rbtree.
>> 2. Then the task is migrated to CPU2 and starts to offline CPU1. CPU1 starts
>> CPUHP AP steps, and then the hrtimer 'period_timer' expires and re-enqueued on CPU1.
>> 3. CPU1 runs to take_cpu_down() and disable irq. After CPU1 finished CPUHP AP
>> steps, CPU2 starts the rest CPUHP step.
>> 4. When CPU2 runs to free_vm_stack_cache(), it is sched out in __vunmap()
>> because it run out of CPU quota. start_cfs_bandwidth() does not restart the
>> hrtimer because 'cfs_b->period_active' is set.
>> 5. The task waits the hrtimer 'period_timer' to expire to wake itself up, but
>> CPU1 has disabled irq and the hrtimer won't expire until it is migrated to CPU2
>> in hrtimers_dead_cpu(). But the task is blocked and cannot proceed to
>> hrtimers_dead_cpu() step. So the task hungs.
>>
>> CPU1 CPU2
>> Task set cfs_quota
>> start hrtimer cfs_bandwidth 'period_timer'
>> start to offline CPU1
>> CPU1 start CPUHP AP step
>> ...
>> 'period_timer' expired and re-enqueued on CPU1
>> ...
>> disable irq in take_cpu_down()
>> ...
>> CPU2 start the rest CPUHP steps
>> ...
>> sched out in free_vm_stack_cache()
>> wait for 'period_timer' expires
>>
>>
>> Appreciate it a lot if anyone can give some suggestion on how fix this problem !
>>
>> Thanks,
>> Xiongfeng
> .
>

Test script:
taskset -cp 1 $$
mkdir /sys/fs/cgroup/cpu/test
echo $$ > /sys/fs/cgroup/cpu/test/tasks
echo 80000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
taskset -cp 2 $$
echo 0 > /sys/devices/system/cpu/cpu1/online

Tests show that the following modification can solve the problem of above test
scripts. But I am not sure if there exists some other issues.

diff --cc kernel/sched/fair.c
index d9d6519fae01,bd6624353608..000000000000
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@@ -5411,10 -5411,16 +5411,15 @@@ void start_cfs_bandwidth(struct cfs_ban
{
lockdep_assert_held(&cfs_b->lock);

- if (cfs_b->period_active)
+ if (cfs_b->period_active) {
+ struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
+ int cpu = clock_base->cpu_base->cpu;
+ if (!cpu_active(cpu) && cpu != smp_processor_id())
+ hrtimer_start_expires(&cfs_b->period_timer,
HRTIMER_MODE_ABS_PINNED);
return;
+ }

cfs_b->period_active = 1;
-
hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
}

Thanks,
Xiongfeng

2023-06-26 08:48:37

[permalink] [raw]

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

Hi,

Kindly ping~
Could you please take a look at this issue and the below temporary fix ?

Thanks,
Xiongfeng

On 2023/6/12 20:49, Xiongfeng Wang wrote:
>
>
> On 2023/6/9 22:55, Thomas Gleixner wrote:
>> On Fri, Jun 09 2023 at 19:24, Xiongfeng Wang wrote:
>>
>> Cc+ scheduler people, leave context intact
>>
>>> Hello,
>>> When I do some low power tests, the following hung task is printed.
>>>
>>> Call trace:
>>> __switch_to+0xd4/0x160
>>> __schedule+0x38c/0x8c4
>>> __cond_resched+0x24/0x50
>>> unmap_kernel_range_noflush+0x210/0x240
>>> kretprobe_trampoline+0x0/0xc8
>>> __vunmap+0x70/0x31c
>>> __vfree+0x34/0x8c
>>> vfree+0x40/0x58
>>> free_vm_stack_cache+0x44/0x74
>>> cpuhp_invoke_callback+0xc4/0x71c
>>> _cpu_down+0x108/0x284
>>> kretprobe_trampoline+0x0/0xc8
>>> suspend_enter+0xd8/0x8ec
>>> suspend_devices_and_enter+0x1f0/0x360
>>> pm_suspend.part.1+0x428/0x53c
>>> pm_suspend+0x3c/0xa0
>>> devdrv_suspend_proc+0x148/0x248 [drv_devmng]
>>> devdrv_manager_set_power_state+0x140/0x680 [drv_devmng]
>>> devdrv_manager_ioctl+0xcc/0x210 [drv_devmng]
>>> drv_ascend_intf_ioctl+0x84/0x248 [drv_davinci_intf]
>>> __arm64_sys_ioctl+0xb4/0xf0
>>> el0_svc_common.constprop.0+0x140/0x374
>>> do_el0_svc+0x80/0xa0
>>> el0_svc+0x1c/0x28
>>> el0_sync_handler+0x90/0xf0
>>> el0_sync+0x168/0x180
>>>
>>> After some analysis, I found it is caused by the following race condition.
>>>
>>> 1. A task running on CPU1 is throttled for cfs bandwidth. CPU1 starts the
>>> hrtimer cfs_bandwidth 'period_timer' and enqueue the hrtimer on CPU1's rbtree.
>>> 2. Then the task is migrated to CPU2 and starts to offline CPU1. CPU1 starts
>>> CPUHP AP steps, and then the hrtimer 'period_timer' expires and re-enqueued on CPU1.
>>> 3. CPU1 runs to take_cpu_down() and disable irq. After CPU1 finished CPUHP AP
>>> steps, CPU2 starts the rest CPUHP step.
>>> 4. When CPU2 runs to free_vm_stack_cache(), it is sched out in __vunmap()
>>> because it run out of CPU quota. start_cfs_bandwidth() does not restart the
>>> hrtimer because 'cfs_b->period_active' is set.
>>> 5. The task waits the hrtimer 'period_timer' to expire to wake itself up, but
>>> CPU1 has disabled irq and the hrtimer won't expire until it is migrated to CPU2
>>> in hrtimers_dead_cpu(). But the task is blocked and cannot proceed to
>>> hrtimers_dead_cpu() step. So the task hungs.
>>>
>>> CPU1 CPU2
>>> Task set cfs_quota
>>> start hrtimer cfs_bandwidth 'period_timer'
>>> start to offline CPU1
>>> CPU1 start CPUHP AP step
>>> ...
>>> 'period_timer' expired and re-enqueued on CPU1
>>> ...
>>> disable irq in take_cpu_down()
>>> ...
>>> CPU2 start the rest CPUHP steps
>>> ...
>>> sched out in free_vm_stack_cache()
>>> wait for 'period_timer' expires
>>>
>>>
>>> Appreciate it a lot if anyone can give some suggestion on how fix this problem !
>>>
>>> Thanks,
>>> Xiongfeng
>> .
>>
>
> Test script:
> taskset -cp 1 $$
> mkdir /sys/fs/cgroup/cpu/test
> echo $$ > /sys/fs/cgroup/cpu/test/tasks
> echo 80000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
> echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
> taskset -cp 2 $$
> echo 0 > /sys/devices/system/cpu/cpu1/online
>
>
> Tests show that the following modification can solve the problem of above test
> scripts. But I am not sure if there exists some other issues.
>
> diff --cc kernel/sched/fair.c
> index d9d6519fae01,bd6624353608..000000000000
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@@ -5411,10 -5411,16 +5411,15 @@@ void start_cfs_bandwidth(struct cfs_ban
> {
> lockdep_assert_held(&cfs_b->lock);
>
> - if (cfs_b->period_active)
> + if (cfs_b->period_active) {
> + struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
> + int cpu = clock_base->cpu_base->cpu;
> + if (!cpu_active(cpu) && cpu != smp_processor_id())
> + hrtimer_start_expires(&cfs_b->period_timer,
> HRTIMER_MODE_ABS_PINNED);
> return;
> + }
>
> cfs_b->period_active = 1;
> -
> hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> }
>
> Thanks,
> Xiongfeng
>
> .
>

2023-06-27 17:38:19

[permalink] [raw]

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On Mon, 26 Jun 2023 at 10:23, Xiongfeng Wang <[email protected]> wrote:
>
> Hi,
>
> Kindly ping~
> Could you please take a look at this issue and the below temporary fix ?
>
> Thanks,
> Xiongfeng
>
> On 2023/6/12 20:49, Xiongfeng Wang wrote:
> >
> >
> > On 2023/6/9 22:55, Thomas Gleixner wrote:
> >> On Fri, Jun 09 2023 at 19:24, Xiongfeng Wang wrote:
> >>
> >> Cc+ scheduler people, leave context intact
> >>
> >>> Hello,
> >>> When I do some low power tests, the following hung task is printed.
> >>>
> >>> Call trace:
> >>> __switch_to+0xd4/0x160
> >>> __schedule+0x38c/0x8c4
> >>> __cond_resched+0x24/0x50
> >>> unmap_kernel_range_noflush+0x210/0x240
> >>> kretprobe_trampoline+0x0/0xc8
> >>> __vunmap+0x70/0x31c
> >>> __vfree+0x34/0x8c
> >>> vfree+0x40/0x58
> >>> free_vm_stack_cache+0x44/0x74
> >>> cpuhp_invoke_callback+0xc4/0x71c
> >>> _cpu_down+0x108/0x284
> >>> kretprobe_trampoline+0x0/0xc8
> >>> suspend_enter+0xd8/0x8ec
> >>> suspend_devices_and_enter+0x1f0/0x360
> >>> pm_suspend.part.1+0x428/0x53c
> >>> pm_suspend+0x3c/0xa0
> >>> devdrv_suspend_proc+0x148/0x248 [drv_devmng]
> >>> devdrv_manager_set_power_state+0x140/0x680 [drv_devmng]
> >>> devdrv_manager_ioctl+0xcc/0x210 [drv_devmng]
> >>> drv_ascend_intf_ioctl+0x84/0x248 [drv_davinci_intf]
> >>> __arm64_sys_ioctl+0xb4/0xf0
> >>> el0_svc_common.constprop.0+0x140/0x374
> >>> do_el0_svc+0x80/0xa0
> >>> el0_svc+0x1c/0x28
> >>> el0_sync_handler+0x90/0xf0
> >>> el0_sync+0x168/0x180
> >>>
> >>> After some analysis, I found it is caused by the following race condition.
> >>>
> >>> 1. A task running on CPU1 is throttled for cfs bandwidth. CPU1 starts the
> >>> hrtimer cfs_bandwidth 'period_timer' and enqueue the hrtimer on CPU1's rbtree.
> >>> 2. Then the task is migrated to CPU2 and starts to offline CPU1. CPU1 starts
> >>> CPUHP AP steps, and then the hrtimer 'period_timer' expires and re-enqueued on CPU1.
> >>> 3. CPU1 runs to take_cpu_down() and disable irq. After CPU1 finished CPUHP AP
> >>> steps, CPU2 starts the rest CPUHP step.
> >>> 4. When CPU2 runs to free_vm_stack_cache(), it is sched out in __vunmap()
> >>> because it run out of CPU quota. start_cfs_bandwidth() does not restart the
> >>> hrtimer because 'cfs_b->period_active' is set.
> >>> 5. The task waits the hrtimer 'period_timer' to expire to wake itself up, but
> >>> CPU1 has disabled irq and the hrtimer won't expire until it is migrated to CPU2
> >>> in hrtimers_dead_cpu(). But the task is blocked and cannot proceed to
> >>> hrtimers_dead_cpu() step. So the task hungs.
> >>>
> >>> CPU1 CPU2
> >>> Task set cfs_quota
> >>> start hrtimer cfs_bandwidth 'period_timer'
> >>> start to offline CPU1
> >>> CPU1 start CPUHP AP step
> >>> ...
> >>> 'period_timer' expired and re-enqueued on CPU1
> >>> ...
> >>> disable irq in take_cpu_down()
> >>> ...
> >>> CPU2 start the rest CPUHP steps
> >>> ...
> >>> sched out in free_vm_stack_cache()
> >>> wait for 'period_timer' expires
> >>>
> >>>
> >>> Appreciate it a lot if anyone can give some suggestion on how fix this problem !
> >>>
> >>> Thanks,
> >>> Xiongfeng
> >> .
> >>
> >
> > Test script:
> > taskset -cp 1 $$
> > mkdir /sys/fs/cgroup/cpu/test
> > echo $$ > /sys/fs/cgroup/cpu/test/tasks
> > echo 80000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
> > echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
> > taskset -cp 2 $$
> > echo 0 > /sys/devices/system/cpu/cpu1/online
> >
> >
> > Tests show that the following modification can solve the problem of above test
> > scripts. But I am not sure if there exists some other issues.
> >
> > diff --cc kernel/sched/fair.c
> > index d9d6519fae01,bd6624353608..000000000000
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@@ -5411,10 -5411,16 +5411,15 @@@ void start_cfs_bandwidth(struct cfs_ban
> > {
> > lockdep_assert_held(&cfs_b->lock);
> >
> > - if (cfs_b->period_active)
> > + if (cfs_b->period_active) {
> > + struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
> > + int cpu = clock_base->cpu_base->cpu;
> > + if (!cpu_active(cpu) && cpu != smp_processor_id())
> > + hrtimer_start_expires(&cfs_b->period_timer,
> > HRTIMER_MODE_ABS_PINNED);
> > return;
> > + }

I have been able to reproduce your problem and run your fix on top. I
still wonder if there is a
Could we have a helper from hrtimer to get the cpu of the clock_base ?

> >
> > cfs_b->period_active = 1;
> > -
> > hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> > hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> > }
> >
> > Thanks,
> > Xiongfeng
> >
> > .
> >

2023-06-28 12:25:10

[permalink] [raw]

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On Tue, Jun 27 2023 at 18:46, Vincent Guittot wrote:
> On Mon, 26 Jun 2023 at 10:23, Xiongfeng Wang <[email protected]> wrote:
>> > diff --cc kernel/sched/fair.c
>> > index d9d6519fae01,bd6624353608..000000000000
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@@ -5411,10 -5411,16 +5411,15 @@@ void start_cfs_bandwidth(struct cfs_ban
>> > {
>> > lockdep_assert_held(&cfs_b->lock);
>> >
>> > - if (cfs_b->period_active)
>> > + if (cfs_b->period_active) {
>> > + struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
>> > + int cpu = clock_base->cpu_base->cpu;
>> > + if (!cpu_active(cpu) && cpu != smp_processor_id())
>> > + hrtimer_start_expires(&cfs_b->period_timer,
>> > HRTIMER_MODE_ABS_PINNED);
>> > return;
>> > + }
>
> I have been able to reproduce your problem and run your fix on top. I
> still wonder if there is a
> Could we have a helper from hrtimer to get the cpu of the clock_base ?

No, because this is fundamentally wrong.

If the CPU is on the way out, then the scheduler hotplug machinery
has to handle the period timer so that the problem Xiongfeng analyzed
does not happen in the first place.

sched_cpu_wait_empty() would be the obvious place to cleanup armed CFS
timers, but let me look into whether we can migrate hrtimers early in
general.

Aside of that the above is wrong by itself.

if (cfs_b->period_active)
hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);

This only ends up on the outgoing CPU if either:

1) The code runs on the outgoing CPU

or

2) The hrtimer is concurrently executing the hrtimer callback on the
outgoing CPU.

So this:

if (cfs_b->period_active) {
struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
int cpu = clock_base->cpu_base->cpu;

if (!cpu_active(cpu) && cpu != smp_processor_id())
hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
return;
}

only works, if

1) The code runs _not_ on the outgoing CPU

and

2) The hrtimer is _not_ concurrently executing the hrtimer callback on
the outgoing CPU.

If the callback is executing (it spins on cfs_b->lock), then the
timer is requeued on the outgoing CPU. Not what you want, right?

Plus accessing hrtimer->clock_base->cpu_base->cpu lockless is fragile at
best.

Thanks,

tglx

2023-06-28 12:46:23

[permalink] [raw]

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On Wed, 28 Jun 2023 at 14:03, Thomas Gleixner <[email protected]> wrote:
>
> On Tue, Jun 27 2023 at 18:46, Vincent Guittot wrote:
> > On Mon, 26 Jun 2023 at 10:23, Xiongfeng Wang <[email protected]> wrote:
> >> > diff --cc kernel/sched/fair.c
> >> > index d9d6519fae01,bd6624353608..000000000000
> >> > --- a/kernel/sched/fair.c
> >> > +++ b/kernel/sched/fair.c
> >> > @@@ -5411,10 -5411,16 +5411,15 @@@ void start_cfs_bandwidth(struct cfs_ban
> >> > {
> >> > lockdep_assert_held(&cfs_b->lock);
> >> >
> >> > - if (cfs_b->period_active)
> >> > + if (cfs_b->period_active) {
> >> > + struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
> >> > + int cpu = clock_base->cpu_base->cpu;
> >> > + if (!cpu_active(cpu) && cpu != smp_processor_id())
> >> > + hrtimer_start_expires(&cfs_b->period_timer,
> >> > HRTIMER_MODE_ABS_PINNED);
> >> > return;
> >> > + }
> >
> > I have been able to reproduce your problem and run your fix on top. I
> > still wonder if there is a
> > Could we have a helper from hrtimer to get the cpu of the clock_base ?
>
> No, because this is fundamentally wrong.
>
> If the CPU is on the way out, then the scheduler hotplug machinery
> has to handle the period timer so that the problem Xiongfeng analyzed
> does not happen in the first place.

But the hrtimer was enqueued before it starts to offline the cpu
Then, hrtimers_dead_cpu should take care of migrating the hrtimer out
of the outgoing cpu but :
- it must run on another target cpu to migrate the hrtimer.
- it runs in the context of the caller which can be throttled.

>
> sched_cpu_wait_empty() would be the obvious place to cleanup armed CFS
> timers, but let me look into whether we can migrate hrtimers early in
> general.

but for that we must check if the timer is enqueued on the outgoing
cpu and we then need to choose a target cpu

>
> Aside of that the above is wrong by itself.
>
> if (cfs_b->period_active)
> hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>
> This only ends up on the outgoing CPU if either:
>
> 1) The code runs on the outgoing CPU
>
> or
>
> 2) The hrtimer is concurrently executing the hrtimer callback on the
> outgoing CPU.
>
> So this:
>
> if (cfs_b->period_active) {
> struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
> int cpu = clock_base->cpu_base->cpu;
>
> if (!cpu_active(cpu) && cpu != smp_processor_id())
> hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> return;
> }
>
> only works, if
>
> 1) The code runs _not_ on the outgoing CPU
>
> and
>
> 2) The hrtimer is _not_ concurrently executing the hrtimer callback on
> the outgoing CPU.
>
> If the callback is executing (it spins on cfs_b->lock), then the
> timer is requeued on the outgoing CPU. Not what you want, right?
>
> Plus accessing hrtimer->clock_base->cpu_base->cpu lockless is fragile at
> best.
>
> Thanks,
>
> tglx

2023-06-28 13:43:51

[permalink] [raw]

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On Tue, 27 Jun 2023 at 18:46, Vincent Guittot
<[email protected]> wrote:
>
> On Mon, 26 Jun 2023 at 10:23, Xiongfeng Wang <[email protected]> wrote:
> >
> > Hi,
> >
> > Kindly ping~
> > Could you please take a look at this issue and the below temporary fix ?
> >
> > Thanks,
> > Xiongfeng
> >
> > On 2023/6/12 20:49, Xiongfeng Wang wrote:
> > >
> > >
> > > On 2023/6/9 22:55, Thomas Gleixner wrote:
> > >> On Fri, Jun 09 2023 at 19:24, Xiongfeng Wang wrote:
> > >>
> > >> Cc+ scheduler people, leave context intact
> > >>
> > >>> Hello,
> > >>> When I do some low power tests, the following hung task is printed.
> > >>>
> > >>> Call trace:
> > >>> __switch_to+0xd4/0x160
> > >>> __schedule+0x38c/0x8c4
> > >>> __cond_resched+0x24/0x50
> > >>> unmap_kernel_range_noflush+0x210/0x240
> > >>> kretprobe_trampoline+0x0/0xc8
> > >>> __vunmap+0x70/0x31c
> > >>> __vfree+0x34/0x8c
> > >>> vfree+0x40/0x58
> > >>> free_vm_stack_cache+0x44/0x74
> > >>> cpuhp_invoke_callback+0xc4/0x71c
> > >>> _cpu_down+0x108/0x284
> > >>> kretprobe_trampoline+0x0/0xc8
> > >>> suspend_enter+0xd8/0x8ec
> > >>> suspend_devices_and_enter+0x1f0/0x360
> > >>> pm_suspend.part.1+0x428/0x53c
> > >>> pm_suspend+0x3c/0xa0
> > >>> devdrv_suspend_proc+0x148/0x248 [drv_devmng]
> > >>> devdrv_manager_set_power_state+0x140/0x680 [drv_devmng]
> > >>> devdrv_manager_ioctl+0xcc/0x210 [drv_devmng]
> > >>> drv_ascend_intf_ioctl+0x84/0x248 [drv_davinci_intf]
> > >>> __arm64_sys_ioctl+0xb4/0xf0
> > >>> el0_svc_common.constprop.0+0x140/0x374
> > >>> do_el0_svc+0x80/0xa0
> > >>> el0_svc+0x1c/0x28
> > >>> el0_sync_handler+0x90/0xf0
> > >>> el0_sync+0x168/0x180
> > >>>
> > >>> After some analysis, I found it is caused by the following race condition.
> > >>>
> > >>> 1. A task running on CPU1 is throttled for cfs bandwidth. CPU1 starts the
> > >>> hrtimer cfs_bandwidth 'period_timer' and enqueue the hrtimer on CPU1's rbtree.
> > >>> 2. Then the task is migrated to CPU2 and starts to offline CPU1. CPU1 starts
> > >>> CPUHP AP steps, and then the hrtimer 'period_timer' expires and re-enqueued on CPU1.
> > >>> 3. CPU1 runs to take_cpu_down() and disable irq. After CPU1 finished CPUHP AP
> > >>> steps, CPU2 starts the rest CPUHP step.
> > >>> 4. When CPU2 runs to free_vm_stack_cache(), it is sched out in __vunmap()
> > >>> because it run out of CPU quota. start_cfs_bandwidth() does not restart the
> > >>> hrtimer because 'cfs_b->period_active' is set.
> > >>> 5. The task waits the hrtimer 'period_timer' to expire to wake itself up, but
> > >>> CPU1 has disabled irq and the hrtimer won't expire until it is migrated to CPU2
> > >>> in hrtimers_dead_cpu(). But the task is blocked and cannot proceed to
> > >>> hrtimers_dead_cpu() step. So the task hungs.
> > >>>
> > >>> CPU1 CPU2
> > >>> Task set cfs_quota
> > >>> start hrtimer cfs_bandwidth 'period_timer'
> > >>> start to offline CPU1
> > >>> CPU1 start CPUHP AP step
> > >>> ...
> > >>> 'period_timer' expired and re-enqueued on CPU1
> > >>> ...
> > >>> disable irq in take_cpu_down()
> > >>> ...
> > >>> CPU2 start the rest CPUHP steps
> > >>> ...
> > >>> sched out in free_vm_stack_cache()
> > >>> wait for 'period_timer' expires
> > >>>
> > >>>
> > >>> Appreciate it a lot if anyone can give some suggestion on how fix this problem !
> > >>>
> > >>> Thanks,
> > >>> Xiongfeng
> > >> .
> > >>
> > >
> > > Test script:
> > > taskset -cp 1 $$
> > > mkdir /sys/fs/cgroup/cpu/test
> > > echo $$ > /sys/fs/cgroup/cpu/test/tasks
> > > echo 80000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
> > > echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
> > > taskset -cp 2 $$
> > > echo 0 > /sys/devices/system/cpu/cpu1/online
> > >
> > >
> > > Tests show that the following modification can solve the problem of above test
> > > scripts. But I am not sure if there exists some other issues.
> > >
> > > diff --cc kernel/sched/fair.c
> > > index d9d6519fae01,bd6624353608..000000000000
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@@ -5411,10 -5411,16 +5411,15 @@@ void start_cfs_bandwidth(struct cfs_ban
> > > {
> > > lockdep_assert_held(&cfs_b->lock);
> > >
> > > - if (cfs_b->period_active)
> > > + if (cfs_b->period_active) {
> > > + struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
> > > + int cpu = clock_base->cpu_base->cpu;
> > > + if (!cpu_active(cpu) && cpu != smp_processor_id())
> > > + hrtimer_start_expires(&cfs_b->period_timer,
> > > HRTIMER_MODE_ABS_PINNED);
> > > return;
> > > + }
>
> I have been able to reproduce your problem and run your fix on top. I
> still wonder if there is a

Looks like I have been preempted and never finished the sentence. The
full sentence is:
I still wonder if there is a race condition where the hang can still
happen but i haven't been able to find one so far

> Could we have a helper from hrtimer to get the cpu of the clock_base ?
>
>
> > >
> > > cfs_b->period_active = 1;
> > > -
> > > hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> > > hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> > > }
> > >
> > > Thanks,
> > > Xiongfeng
> > >
> > > .
> > >

2023-06-28 21:16:02

[permalink] [raw]

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On Wed, Jun 28 2023 at 15:30, Vincent Guittot wrote:
> On Tue, 27 Jun 2023 at 18:46, Vincent Guittot
>> > > + struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
>> > > + int cpu = clock_base->cpu_base->cpu;
>> > > + if (!cpu_active(cpu) && cpu != smp_processor_id())
>> > > + hrtimer_start_expires(&cfs_b->period_timer,
>> > > HRTIMER_MODE_ABS_PINNED);
>> > > return;
>> > > + }

Can you please trim your replies?

>> I have been able to reproduce your problem and run your fix on top. I
>> still wonder if there is a
>
> Looks like I have been preempted and never finished the sentence. The
> full sentence is:
> I still wonder if there is a race condition where the hang can still
> happen but i haven't been able to find one so far

As I explained before. Assume the timer fires on the outgoing CPU and
the other CPU tries to rearm it concurrently. It will stay on the
outgoing CPU and not move over.

Thanks,

tglx

2023-06-28 22:44:50

[permalink] [raw]

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On Wed, Jun 28 2023 at 14:35, Vincent Guittot wrote:
> On Wed, 28 Jun 2023 at 14:03, Thomas Gleixner <[email protected]> wrote:
>> No, because this is fundamentally wrong.
>>
>> If the CPU is on the way out, then the scheduler hotplug machinery
>> has to handle the period timer so that the problem Xiongfeng analyzed
>> does not happen in the first place.
>
> But the hrtimer was enqueued before it starts to offline the cpu

It does not really matter when it was enqueued. The important point is
that it was enqueued on that outgoing CPU for whatever reason.

> Then, hrtimers_dead_cpu should take care of migrating the hrtimer out
> of the outgoing cpu but :
> - it must run on another target cpu to migrate the hrtimer.
> - it runs in the context of the caller which can be throttled.

Sure. I completely understand the problem. The hrtimer hotplug callback
does not run because the task is stuck and waits for the timer to
expire. Circular dependency.

>> sched_cpu_wait_empty() would be the obvious place to cleanup armed CFS
>> timers, but let me look into whether we can migrate hrtimers early in
>> general.
>
> but for that we must check if the timer is enqueued on the outgoing
> cpu and we then need to choose a target cpu.

You're right. I somehow assumed that cfs knows where it queued stuff,
but obviously it does not.

I think we can avoid all that by simply taking that user space task out
of the picture completely, which avoids debating whether there are other
possible weird conditions to consider alltogether.

Something like the untested below should just work.

Thanks,

tglx
---
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1490,6 +1490,13 @@ static int cpu_down(unsigned int cpu, en
return err;
}

+static long __cpu_device_down(void *arg)
+{
+ struct device *dev = arg;
+
+ return cpu_down(dev->id, CPUHP_OFFLINE);
+}
+
/**
* cpu_device_down - Bring down a cpu device
* @dev: Pointer to the cpu device to offline
@@ -1502,7 +1509,12 @@ static int cpu_down(unsigned int cpu, en
*/
int cpu_device_down(struct device *dev)
{
- return cpu_down(dev->id, CPUHP_OFFLINE);
+ unsigned int cpu = cpumask_any_but(cpu_online_mask, dev->id);
+
+ if (cpu >= nr_cpu_ids)
+ return -EBUSY;
+
+ return work_on_cpu(cpu, __cpu_device_down, dev);
}

int remove_cpu(unsigned int cpu)

2023-06-29 01:46:24

[permalink] [raw]

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On 2023/6/28 0:46, Vincent Guittot wrote:
> On Mon, 26 Jun 2023 at 10:23, Xiongfeng Wang <[email protected]> wrote:
>>
>> Hi,
>>
>> Kindly ping~
>> Could you please take a look at this issue and the below temporary fix ?
>>
>> Thanks,
>> Xiongfeng
>>
>> On 2023/6/12 20:49, Xiongfeng Wang wrote:
>>>
>>>
>>> On 2023/6/9 22:55, Thomas Gleixner wrote:
>>>> On Fri, Jun 09 2023 at 19:24, Xiongfeng Wang wrote:
>>>>
>>>> Cc+ scheduler people, leave context intact
>>>>
>>>>> Hello,
>>>>> When I do some low power tests, the following hung task is printed.
>>>>>
>>>>> Call trace:
>>>>> __switch_to+0xd4/0x160
>>>>> __schedule+0x38c/0x8c4
>>>>> __cond_resched+0x24/0x50
>>>>> unmap_kernel_range_noflush+0x210/0x240
>>>>> kretprobe_trampoline+0x0/0xc8
>>>>> __vunmap+0x70/0x31c
>>>>> __vfree+0x34/0x8c
>>>>> vfree+0x40/0x58
>>>>> free_vm_stack_cache+0x44/0x74
>>>>> cpuhp_invoke_callback+0xc4/0x71c
>>>>> _cpu_down+0x108/0x284
>>>>> kretprobe_trampoline+0x0/0xc8
>>>>> suspend_enter+0xd8/0x8ec
>>>>> suspend_devices_and_enter+0x1f0/0x360
>>>>> pm_suspend.part.1+0x428/0x53c
>>>>> pm_suspend+0x3c/0xa0
>>>>> devdrv_suspend_proc+0x148/0x248 [drv_devmng]
>>>>> devdrv_manager_set_power_state+0x140/0x680 [drv_devmng]
>>>>> devdrv_manager_ioctl+0xcc/0x210 [drv_devmng]
>>>>> drv_ascend_intf_ioctl+0x84/0x248 [drv_davinci_intf]
>>>>> __arm64_sys_ioctl+0xb4/0xf0
>>>>> el0_svc_common.constprop.0+0x140/0x374
>>>>> do_el0_svc+0x80/0xa0
>>>>> el0_svc+0x1c/0x28
>>>>> el0_sync_handler+0x90/0xf0
>>>>> el0_sync+0x168/0x180
>>>>>
>>>>> After some analysis, I found it is caused by the following race condition.
>>>>>
>>>>> 1. A task running on CPU1 is throttled for cfs bandwidth. CPU1 starts the
>>>>> hrtimer cfs_bandwidth 'period_timer' and enqueue the hrtimer on CPU1's rbtree.
>>>>> 2. Then the task is migrated to CPU2 and starts to offline CPU1. CPU1 starts
>>>>> CPUHP AP steps, and then the hrtimer 'period_timer' expires and re-enqueued on CPU1.
>>>>> 3. CPU1 runs to take_cpu_down() and disable irq. After CPU1 finished CPUHP AP
>>>>> steps, CPU2 starts the rest CPUHP step.
>>>>> 4. When CPU2 runs to free_vm_stack_cache(), it is sched out in __vunmap()
>>>>> because it run out of CPU quota. start_cfs_bandwidth() does not restart the
>>>>> hrtimer because 'cfs_b->period_active' is set.
>>>>> 5. The task waits the hrtimer 'period_timer' to expire to wake itself up, but
>>>>> CPU1 has disabled irq and the hrtimer won't expire until it is migrated to CPU2
>>>>> in hrtimers_dead_cpu(). But the task is blocked and cannot proceed to
>>>>> hrtimers_dead_cpu() step. So the task hungs.
>>>>>
>>>>> CPU1 CPU2
>>>>> Task set cfs_quota
>>>>> start hrtimer cfs_bandwidth 'period_timer'
>>>>> start to offline CPU1
>>>>> CPU1 start CPUHP AP step
>>>>> ...
>>>>> 'period_timer' expired and re-enqueued on CPU1
>>>>> ...
>>>>> disable irq in take_cpu_down()
>>>>> ...
>>>>> CPU2 start the rest CPUHP steps
>>>>> ...
>>>>> sched out in free_vm_stack_cache()
>>>>> wait for 'period_timer' expires
>>>>>
>>>>>
>>>>> Appreciate it a lot if anyone can give some suggestion on how fix this problem !
>>>>>
>>>>> Thanks,
>>>>> Xiongfeng
>>>> .
>>>>
>>>
>>> Test script:
>>> taskset -cp 1 $$
>>> mkdir /sys/fs/cgroup/cpu/test
>>> echo $$ > /sys/fs/cgroup/cpu/test/tasks
>>> echo 80000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
>>> echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
>>> taskset -cp 2 $$
>>> echo 0 > /sys/devices/system/cpu/cpu1/online
>>>
>>>
>>> Tests show that the following modification can solve the problem of above test
>>> scripts. But I am not sure if there exists some other issues.
>>>
>>> diff --cc kernel/sched/fair.c
>>> index d9d6519fae01,bd6624353608..000000000000
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@@ -5411,10 -5411,16 +5411,15 @@@ void start_cfs_bandwidth(struct cfs_ban
>>> {
>>> lockdep_assert_held(&cfs_b->lock);
>>>
>>> - if (cfs_b->period_active)
>>> + if (cfs_b->period_active) {
>>> + struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
>>> + int cpu = clock_base->cpu_base->cpu;
>>> + if (!cpu_active(cpu) && cpu != smp_processor_id())
>>> + hrtimer_start_expires(&cfs_b->period_timer,
>>> HRTIMER_MODE_ABS_PINNED);
>>> return;
>>> + }
>
> I have been able to reproduce your problem and run your fix on top. I
> still wonder if there is a

Sorry, I forgot to provide the kernel modification to help reproduce the issue.
At first, the issue can only be reproduced on the product environment with
product stress testcase. After firguring out the reason, I add the following
modification. It make sure the process ran out cfs quota and can be sched out in
free_vm_stack_cache. Although the real schedule point is in __vunmap(), this can
also show the issue exists.

diff --git a/kernel/fork.c b/kernel/fork.c
index 0fb86b65ae60..3b2d83fb407a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -110,6 +110,8 @@
#define CREATE_TRACE_POINTS
#include <trace/events/task.h>

+#include <linux/delay.h>
+
/*
* Minimum number of threads to boot the kernel
*/
@@ -199,6 +201,9 @@ static int free_vm_stack_cache(unsigned int cpu)
struct vm_struct **cached_vm_stacks = per_cpu_ptr(cached_stacks, cpu);
int i;

+ mdelay(2000);
+ cond_resched();
+
for (i = 0; i < NR_CACHED_STACKS; i++) {
struct vm_struct *vm_stack = cached_vm_stacks[i];

Thanks,
Xiongfeng

> Could we have a helper from hrtimer to get the cpu of the clock_base ?
>
>
>>>
>>> cfs_b->period_active = 1;
>>> -
>>> hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>>> hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>>> }
>>>
>>> Thanks,
>>> Xiongfeng
>>>
>>> .
>>>
> .
>

2023-06-29 02:01:49

[permalink] [raw]

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On 2023/6/29 6:01, Thomas Gleixner wrote:
> On Wed, Jun 28 2023 at 14:35, Vincent Guittot wrote:
>> On Wed, 28 Jun 2023 at 14:03, Thomas Gleixner <[email protected]> wrote:
>>> No, because this is fundamentally wrong.
>>>
>>> If the CPU is on the way out, then the scheduler hotplug machinery
>>> has to handle the period timer so that the problem Xiongfeng analyzed
>>> does not happen in the first place.
>>
>> But the hrtimer was enqueued before it starts to offline the cpu
>
> It does not really matter when it was enqueued. The important point is
> that it was enqueued on that outgoing CPU for whatever reason.
>
>> Then, hrtimers_dead_cpu should take care of migrating the hrtimer out
>> of the outgoing cpu but :
>> - it must run on another target cpu to migrate the hrtimer.
>> - it runs in the context of the caller which can be throttled.
>
> Sure. I completely understand the problem. The hrtimer hotplug callback
> does not run because the task is stuck and waits for the timer to
> expire. Circular dependency.
>
>>> sched_cpu_wait_empty() would be the obvious place to cleanup armed CFS
>>> timers, but let me look into whether we can migrate hrtimers early in
>>> general.
>>
>> but for that we must check if the timer is enqueued on the outgoing
>> cpu and we then need to choose a target cpu.
>
> You're right. I somehow assumed that cfs knows where it queued stuff,
> but obviously it does not.
>
> I think we can avoid all that by simply taking that user space task out
> of the picture completely, which avoids debating whether there are other
> possible weird conditions to consider alltogether.
>
> Something like the untested below should just work.
>
> Thanks,
>
> tglx
> ---
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1490,6 +1490,13 @@ static int cpu_down(unsigned int cpu, en
> return err;
> }
>
> +static long __cpu_device_down(void *arg)
> +{
> + struct device *dev = arg;
> +
> + return cpu_down(dev->id, CPUHP_OFFLINE);
> +}
> +
> /**
> * cpu_device_down - Bring down a cpu device
> * @dev: Pointer to the cpu device to offline
> @@ -1502,7 +1509,12 @@ static int cpu_down(unsigned int cpu, en
> */
> int cpu_device_down(struct device *dev)
> {
> - return cpu_down(dev->id, CPUHP_OFFLINE);
> + unsigned int cpu = cpumask_any_but(cpu_online_mask, dev->id);
> +
> + if (cpu >= nr_cpu_ids)
> + return -EBUSY;
> +
> + return work_on_cpu(cpu, __cpu_device_down, dev);
> }
>
> int remove_cpu(unsigned int cpu)
> .
>

Test with the following kernel modification which helps reproduce the issue. The
hang task does not happen any more. Thanks a lot.

Thanks,
Xiongfeng

--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -110,6 +110,8 @@
#define CREATE_TRACE_POINTS
#include <trace/events/task.h>

+#include <linux/delay.h>
+
/*
* Minimum number of threads to boot the kernel
*/
@@ -199,6 +201,9 @@ static int free_vm_stack_cache(unsigned int cpu)
struct vm_struct **cached_vm_stacks = per_cpu_ptr(cached_stacks, cpu);
int i;

+ mdelay(2000);
+ cond_resched();
+
for (i = 0; i < NR_CACHED_STACKS; i++) {
struct vm_struct *vm_stack = cached_vm_stacks[i];

2023-06-29 08:41:49

[permalink] [raw]

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On Thu, 29 Jun 2023 at 00:01, Thomas Gleixner <[email protected]> wrote:
>
> On Wed, Jun 28 2023 at 14:35, Vincent Guittot wrote:
> > On Wed, 28 Jun 2023 at 14:03, Thomas Gleixner <[email protected]> wrote:
> >> No, because this is fundamentally wrong.
> >>
> >> If the CPU is on the way out, then the scheduler hotplug machinery
> >> has to handle the period timer so that the problem Xiongfeng analyzed
> >> does not happen in the first place.
> >
> > But the hrtimer was enqueued before it starts to offline the cpu
>
> It does not really matter when it was enqueued. The important point is
> that it was enqueued on that outgoing CPU for whatever reason.
>
> > Then, hrtimers_dead_cpu should take care of migrating the hrtimer out
> > of the outgoing cpu but :
> > - it must run on another target cpu to migrate the hrtimer.
> > - it runs in the context of the caller which can be throttled.
>
> Sure. I completely understand the problem. The hrtimer hotplug callback
> does not run because the task is stuck and waits for the timer to
> expire. Circular dependency.
>
> >> sched_cpu_wait_empty() would be the obvious place to cleanup armed CFS
> >> timers, but let me look into whether we can migrate hrtimers early in
> >> general.
> >
> > but for that we must check if the timer is enqueued on the outgoing
> > cpu and we then need to choose a target cpu.
>
> You're right. I somehow assumed that cfs knows where it queued stuff,
> but obviously it does not.

scheduler doesn't know where hrtimer enqueues the timer

>
> I think we can avoid all that by simply taking that user space task out
> of the picture completely, which avoids debating whether there are other
> possible weird conditions to consider alltogether.

yes, the offline sequence should not be impacted by the caller context

>
> Something like the untested below should just work.
>
> Thanks,
>
> tglx
> ---
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1490,6 +1490,13 @@ static int cpu_down(unsigned int cpu, en
> return err;
> }
>
> +static long __cpu_device_down(void *arg)
> +{
> + struct device *dev = arg;
> +
> + return cpu_down(dev->id, CPUHP_OFFLINE);
> +}
> +
> /**
> * cpu_device_down - Bring down a cpu device
> * @dev: Pointer to the cpu device to offline
> @@ -1502,7 +1509,12 @@ static int cpu_down(unsigned int cpu, en
> */
> int cpu_device_down(struct device *dev)
> {
> - return cpu_down(dev->id, CPUHP_OFFLINE);
> + unsigned int cpu = cpumask_any_but(cpu_online_mask, dev->id);
> +
> + if (cpu >= nr_cpu_ids)
> + return -EBUSY;
> +
> + return work_on_cpu(cpu, __cpu_device_down, dev);

The comment for work_on_cpu :

* It is up to the caller to ensure that the cpu doesn't go offline.
* The caller must not hold any locks which would prevent @fn from completing.

make me wonder if this should be done only once the hotplug lock is
taken so the selected cpu will not go offline

> }
>
> int remove_cpu(unsigned int cpu)

2023-06-29 08:53:23