2023-06-09 12:00:55

by Xiongfeng Wang

Subject: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

Hello,
When I run some low power tests, the following hung task report is printed.

Call trace:
__switch_to+0xd4/0x160
__schedule+0x38c/0x8c4
__cond_resched+0x24/0x50
unmap_kernel_range_noflush+0x210/0x240
kretprobe_trampoline+0x0/0xc8
__vunmap+0x70/0x31c
__vfree+0x34/0x8c
vfree+0x40/0x58
free_vm_stack_cache+0x44/0x74
cpuhp_invoke_callback+0xc4/0x71c
_cpu_down+0x108/0x284
kretprobe_trampoline+0x0/0xc8
suspend_enter+0xd8/0x8ec
suspend_devices_and_enter+0x1f0/0x360
pm_suspend.part.1+0x428/0x53c
pm_suspend+0x3c/0xa0
devdrv_suspend_proc+0x148/0x248 [drv_devmng]
devdrv_manager_set_power_state+0x140/0x680 [drv_devmng]
devdrv_manager_ioctl+0xcc/0x210 [drv_devmng]
drv_ascend_intf_ioctl+0x84/0x248 [drv_davinci_intf]
__arm64_sys_ioctl+0xb4/0xf0
el0_svc_common.constprop.0+0x140/0x374
do_el0_svc+0x80/0xa0
el0_svc+0x1c/0x28
el0_sync_handler+0x90/0xf0
el0_sync+0x168/0x180

After some analysis, I found it is caused by the following race condition.

1. A task running on CPU1 is throttled for cfs bandwidth. CPU1 starts the
cfs_bandwidth hrtimer 'period_timer' and enqueues it on CPU1's rbtree.
2. The task is then migrated to CPU2 and starts to offline CPU1. CPU1 runs
the CPUHP AP steps, during which the hrtimer 'period_timer' expires and is
re-enqueued on CPU1.
3. CPU1 reaches take_cpu_down() and disables irqs. After CPU1 finishes the
CPUHP AP steps, CPU2 runs the remaining CPUHP steps.
4. When CPU2 reaches free_vm_stack_cache(), the task is scheduled out in
__vunmap() because it has run out of CPU quota. start_cfs_bandwidth() does
not restart the hrtimer because 'cfs_b->period_active' is set.
5. The task waits for the hrtimer 'period_timer' to expire and wake it up,
but CPU1 has irqs disabled and the hrtimer won't expire until it is migrated
to CPU2 in hrtimers_dead_cpu(). The task, however, is blocked and cannot
proceed to the hrtimers_dead_cpu() step. So the task hangs.

CPU1                                            CPU2
Task set cfs_quota
start hrtimer cfs_bandwidth 'period_timer'
                                                start to offline CPU1
CPU1 start CPUHP AP step
...
'period_timer' expired and re-enqueued on CPU1
...
disable irq in take_cpu_down()
...
                                                CPU2 start the rest CPUHP steps
                                                ...
                                                sched out in free_vm_stack_cache()
                                                wait for 'period_timer' expires
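
For reference, start_cfs_bandwidth() currently looks roughly like the sketch
below (kernel/sched/fair.c); once 'period_active' is set it returns early and
never re-arms the timer on an online CPU:

void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
{
        lockdep_assert_held(&cfs_b->lock);

        /*
         * Timer already armed: return without touching it, even if it is
         * queued on a CPU that is about to go down.
         */
        if (cfs_b->period_active)
                return;

        cfs_b->period_active = 1;
        hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
        hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
}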


I would really appreciate any suggestions on how to fix this problem!

Thanks,
Xiongfeng




2023-06-09 15:06:30

by Thomas Gleixner

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On Fri, Jun 09 2023 at 19:24, Xiongfeng Wang wrote:

Cc+ scheduler people, leave context intact

> Hello,
> When I do some low power tests, the following hung task is printed.
>
> Call trace:
> __switch_to+0xd4/0x160
> __schedule+0x38c/0x8c4
> __cond_resched+0x24/0x50
> unmap_kernel_range_noflush+0x210/0x240
> kretprobe_trampoline+0x0/0xc8
> __vunmap+0x70/0x31c
> __vfree+0x34/0x8c
> vfree+0x40/0x58
> free_vm_stack_cache+0x44/0x74
> cpuhp_invoke_callback+0xc4/0x71c
> _cpu_down+0x108/0x284
> kretprobe_trampoline+0x0/0xc8
> suspend_enter+0xd8/0x8ec
> suspend_devices_and_enter+0x1f0/0x360
> pm_suspend.part.1+0x428/0x53c
> pm_suspend+0x3c/0xa0
> devdrv_suspend_proc+0x148/0x248 [drv_devmng]
> devdrv_manager_set_power_state+0x140/0x680 [drv_devmng]
> devdrv_manager_ioctl+0xcc/0x210 [drv_devmng]
> drv_ascend_intf_ioctl+0x84/0x248 [drv_davinci_intf]
> __arm64_sys_ioctl+0xb4/0xf0
> el0_svc_common.constprop.0+0x140/0x374
> do_el0_svc+0x80/0xa0
> el0_svc+0x1c/0x28
> el0_sync_handler+0x90/0xf0
> el0_sync+0x168/0x180
>
> After some analysis, I found it is caused by the following race condition.
>
> 1. A task running on CPU1 is throttled for cfs bandwidth. CPU1 starts the
> hrtimer cfs_bandwidth 'period_timer' and enqueue the hrtimer on CPU1's rbtree.
> 2. Then the task is migrated to CPU2 and starts to offline CPU1. CPU1 starts
> CPUHP AP steps, and then the hrtimer 'period_timer' expires and re-enqueued on CPU1.
> 3. CPU1 runs to take_cpu_down() and disable irq. After CPU1 finished CPUHP AP
> steps, CPU2 starts the rest CPUHP step.
> 4. When CPU2 runs to free_vm_stack_cache(), it is sched out in __vunmap()
> because it run out of CPU quota. start_cfs_bandwidth() does not restart the
> hrtimer because 'cfs_b->period_active' is set.
> 5. The task waits the hrtimer 'period_timer' to expire to wake itself up, but
> CPU1 has disabled irq and the hrtimer won't expire until it is migrated to CPU2
> in hrtimers_dead_cpu(). But the task is blocked and cannot proceed to
> hrtimers_dead_cpu() step. So the task hungs.
>
> CPU1 CPU2
> Task set cfs_quota
> start hrtimer cfs_bandwidth 'period_timer'
> start to offline CPU1
> CPU1 start CPUHP AP step
> ...
> 'period_timer' expired and re-enqueued on CPU1
> ...
> disable irq in take_cpu_down()
> ...
> CPU2 start the rest CPUHP steps
> ...
> sched out in free_vm_stack_cache()
> wait for 'period_timer' expires
>
>
> Appreciate it a lot if anyone can give some suggestion on how fix this problem !
>
> Thanks,
> Xiongfeng

2023-06-12 13:13:10

by Xiongfeng Wang

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling



On 2023/6/9 22:55, Thomas Gleixner wrote:
> On Fri, Jun 09 2023 at 19:24, Xiongfeng Wang wrote:
>
> Cc+ scheduler people, leave context intact
>
>> Hello,
>> When I do some low power tests, the following hung task is printed.
>>
>> Call trace:
>> __switch_to+0xd4/0x160
>> __schedule+0x38c/0x8c4
>> __cond_resched+0x24/0x50
>> unmap_kernel_range_noflush+0x210/0x240
>> kretprobe_trampoline+0x0/0xc8
>> __vunmap+0x70/0x31c
>> __vfree+0x34/0x8c
>> vfree+0x40/0x58
>> free_vm_stack_cache+0x44/0x74
>> cpuhp_invoke_callback+0xc4/0x71c
>> _cpu_down+0x108/0x284
>> kretprobe_trampoline+0x0/0xc8
>> suspend_enter+0xd8/0x8ec
>> suspend_devices_and_enter+0x1f0/0x360
>> pm_suspend.part.1+0x428/0x53c
>> pm_suspend+0x3c/0xa0
>> devdrv_suspend_proc+0x148/0x248 [drv_devmng]
>> devdrv_manager_set_power_state+0x140/0x680 [drv_devmng]
>> devdrv_manager_ioctl+0xcc/0x210 [drv_devmng]
>> drv_ascend_intf_ioctl+0x84/0x248 [drv_davinci_intf]
>> __arm64_sys_ioctl+0xb4/0xf0
>> el0_svc_common.constprop.0+0x140/0x374
>> do_el0_svc+0x80/0xa0
>> el0_svc+0x1c/0x28
>> el0_sync_handler+0x90/0xf0
>> el0_sync+0x168/0x180
>>
>> After some analysis, I found it is caused by the following race condition.
>>
>> 1. A task running on CPU1 is throttled for cfs bandwidth. CPU1 starts the
>> hrtimer cfs_bandwidth 'period_timer' and enqueue the hrtimer on CPU1's rbtree.
>> 2. Then the task is migrated to CPU2 and starts to offline CPU1. CPU1 starts
>> CPUHP AP steps, and then the hrtimer 'period_timer' expires and re-enqueued on CPU1.
>> 3. CPU1 runs to take_cpu_down() and disable irq. After CPU1 finished CPUHP AP
>> steps, CPU2 starts the rest CPUHP step.
>> 4. When CPU2 runs to free_vm_stack_cache(), it is sched out in __vunmap()
>> because it run out of CPU quota. start_cfs_bandwidth() does not restart the
>> hrtimer because 'cfs_b->period_active' is set.
>> 5. The task waits the hrtimer 'period_timer' to expire to wake itself up, but
>> CPU1 has disabled irq and the hrtimer won't expire until it is migrated to CPU2
>> in hrtimers_dead_cpu(). But the task is blocked and cannot proceed to
>> hrtimers_dead_cpu() step. So the task hungs.
>>
>> CPU1 CPU2
>> Task set cfs_quota
>> start hrtimer cfs_bandwidth 'period_timer'
>> start to offline CPU1
>> CPU1 start CPUHP AP step
>> ...
>> 'period_timer' expired and re-enqueued on CPU1
>> ...
>> disable irq in take_cpu_down()
>> ...
>> CPU2 start the rest CPUHP steps
>> ...
>> sched out in free_vm_stack_cache()
>> wait for 'period_timer' expires
>>
>>
>> Appreciate it a lot if anyone can give some suggestion on how fix this problem !
>>
>> Thanks,
>> Xiongfeng
> .
>

Test script:
taskset -cp 1 $$
mkdir /sys/fs/cgroup/cpu/test
echo $$ > /sys/fs/cgroup/cpu/test/tasks
echo 80000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
taskset -cp 2 $$
echo 0 > /sys/devices/system/cpu/cpu1/online


Tests show that the following modification solves the problem triggered by the
above test script, but I am not sure whether other issues remain.

diff --cc kernel/sched/fair.c
index d9d6519fae01,bd6624353608..000000000000
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@@ -5411,10 -5411,16 +5411,15 @@@ void start_cfs_bandwidth(struct cfs_ban
  {
          lockdep_assert_held(&cfs_b->lock);
  
-         if (cfs_b->period_active)
+         if (cfs_b->period_active) {
+                 struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
+                 int cpu = clock_base->cpu_base->cpu;
+                 if (!cpu_active(cpu) && cpu != smp_processor_id())
+                         hrtimer_start_expires(&cfs_b->period_timer,
+                                               HRTIMER_MODE_ABS_PINNED);
                  return;
+         }
  
          cfs_b->period_active = 1;
 -
          hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
          hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
  }

Thanks,
Xiongfeng

2023-06-26 08:48:37

by Xiongfeng Wang

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

Hi,

Kindly ping~
Could you please take a look at this issue and the temporary fix below?

Thanks,
Xiongfeng

On 2023/6/12 20:49, Xiongfeng Wang wrote:
>
>
> On 2023/6/9 22:55, Thomas Gleixner wrote:
>> On Fri, Jun 09 2023 at 19:24, Xiongfeng Wang wrote:
>>
>> Cc+ scheduler people, leave context intact
>>
>>> Hello,
>>> When I do some low power tests, the following hung task is printed.
>>>
>>> Call trace:
>>> __switch_to+0xd4/0x160
>>> __schedule+0x38c/0x8c4
>>> __cond_resched+0x24/0x50
>>> unmap_kernel_range_noflush+0x210/0x240
>>> kretprobe_trampoline+0x0/0xc8
>>> __vunmap+0x70/0x31c
>>> __vfree+0x34/0x8c
>>> vfree+0x40/0x58
>>> free_vm_stack_cache+0x44/0x74
>>> cpuhp_invoke_callback+0xc4/0x71c
>>> _cpu_down+0x108/0x284
>>> kretprobe_trampoline+0x0/0xc8
>>> suspend_enter+0xd8/0x8ec
>>> suspend_devices_and_enter+0x1f0/0x360
>>> pm_suspend.part.1+0x428/0x53c
>>> pm_suspend+0x3c/0xa0
>>> devdrv_suspend_proc+0x148/0x248 [drv_devmng]
>>> devdrv_manager_set_power_state+0x140/0x680 [drv_devmng]
>>> devdrv_manager_ioctl+0xcc/0x210 [drv_devmng]
>>> drv_ascend_intf_ioctl+0x84/0x248 [drv_davinci_intf]
>>> __arm64_sys_ioctl+0xb4/0xf0
>>> el0_svc_common.constprop.0+0x140/0x374
>>> do_el0_svc+0x80/0xa0
>>> el0_svc+0x1c/0x28
>>> el0_sync_handler+0x90/0xf0
>>> el0_sync+0x168/0x180
>>>
>>> After some analysis, I found it is caused by the following race condition.
>>>
>>> 1. A task running on CPU1 is throttled for cfs bandwidth. CPU1 starts the
>>> hrtimer cfs_bandwidth 'period_timer' and enqueue the hrtimer on CPU1's rbtree.
>>> 2. Then the task is migrated to CPU2 and starts to offline CPU1. CPU1 starts
>>> CPUHP AP steps, and then the hrtimer 'period_timer' expires and re-enqueued on CPU1.
>>> 3. CPU1 runs to take_cpu_down() and disable irq. After CPU1 finished CPUHP AP
>>> steps, CPU2 starts the rest CPUHP step.
>>> 4. When CPU2 runs to free_vm_stack_cache(), it is sched out in __vunmap()
>>> because it run out of CPU quota. start_cfs_bandwidth() does not restart the
>>> hrtimer because 'cfs_b->period_active' is set.
>>> 5. The task waits the hrtimer 'period_timer' to expire to wake itself up, but
>>> CPU1 has disabled irq and the hrtimer won't expire until it is migrated to CPU2
>>> in hrtimers_dead_cpu(). But the task is blocked and cannot proceed to
>>> hrtimers_dead_cpu() step. So the task hungs.
>>>
>>> CPU1 CPU2
>>> Task set cfs_quota
>>> start hrtimer cfs_bandwidth 'period_timer'
>>> start to offline CPU1
>>> CPU1 start CPUHP AP step
>>> ...
>>> 'period_timer' expired and re-enqueued on CPU1
>>> ...
>>> disable irq in take_cpu_down()
>>> ...
>>> CPU2 start the rest CPUHP steps
>>> ...
>>> sched out in free_vm_stack_cache()
>>> wait for 'period_timer' expires
>>>
>>>
>>> Appreciate it a lot if anyone can give some suggestion on how fix this problem !
>>>
>>> Thanks,
>>> Xiongfeng
>> .
>>
>
> Test script:
> taskset -cp 1 $$
> mkdir /sys/fs/cgroup/cpu/test
> echo $$ > /sys/fs/cgroup/cpu/test/tasks
> echo 80000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
> echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
> taskset -cp 2 $$
> echo 0 > /sys/devices/system/cpu/cpu1/online
>
>
> Tests show that the following modification can solve the problem of above test
> scripts. But I am not sure if there exists some other issues.
>
> diff --cc kernel/sched/fair.c
> index d9d6519fae01,bd6624353608..000000000000
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@@ -5411,10 -5411,16 +5411,15 @@@ void start_cfs_bandwidth(struct cfs_ban
> {
> lockdep_assert_held(&cfs_b->lock);
>
> - if (cfs_b->period_active)
> + if (cfs_b->period_active) {
> + struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
> + int cpu = clock_base->cpu_base->cpu;
> + if (!cpu_active(cpu) && cpu != smp_processor_id())
> + hrtimer_start_expires(&cfs_b->period_timer,
> HRTIMER_MODE_ABS_PINNED);
> return;
> + }
>
> cfs_b->period_active = 1;
> -
> hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> }
>
> Thanks,
> Xiongfeng
>
> .
>

2023-06-27 17:38:19

by Vincent Guittot

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On Mon, 26 Jun 2023 at 10:23, Xiongfeng Wang <[email protected]> wrote:
>
> Hi,
>
> Kindly ping~
> Could you please take a look at this issue and the below temporary fix ?
>
> Thanks,
> Xiongfeng
>
> On 2023/6/12 20:49, Xiongfeng Wang wrote:
> >
> >
> > On 2023/6/9 22:55, Thomas Gleixner wrote:
> >> On Fri, Jun 09 2023 at 19:24, Xiongfeng Wang wrote:
> >>
> >> Cc+ scheduler people, leave context intact
> >>
> >>> Hello,
> >>> When I do some low power tests, the following hung task is printed.
> >>>
> >>> Call trace:
> >>> __switch_to+0xd4/0x160
> >>> __schedule+0x38c/0x8c4
> >>> __cond_resched+0x24/0x50
> >>> unmap_kernel_range_noflush+0x210/0x240
> >>> kretprobe_trampoline+0x0/0xc8
> >>> __vunmap+0x70/0x31c
> >>> __vfree+0x34/0x8c
> >>> vfree+0x40/0x58
> >>> free_vm_stack_cache+0x44/0x74
> >>> cpuhp_invoke_callback+0xc4/0x71c
> >>> _cpu_down+0x108/0x284
> >>> kretprobe_trampoline+0x0/0xc8
> >>> suspend_enter+0xd8/0x8ec
> >>> suspend_devices_and_enter+0x1f0/0x360
> >>> pm_suspend.part.1+0x428/0x53c
> >>> pm_suspend+0x3c/0xa0
> >>> devdrv_suspend_proc+0x148/0x248 [drv_devmng]
> >>> devdrv_manager_set_power_state+0x140/0x680 [drv_devmng]
> >>> devdrv_manager_ioctl+0xcc/0x210 [drv_devmng]
> >>> drv_ascend_intf_ioctl+0x84/0x248 [drv_davinci_intf]
> >>> __arm64_sys_ioctl+0xb4/0xf0
> >>> el0_svc_common.constprop.0+0x140/0x374
> >>> do_el0_svc+0x80/0xa0
> >>> el0_svc+0x1c/0x28
> >>> el0_sync_handler+0x90/0xf0
> >>> el0_sync+0x168/0x180
> >>>
> >>> After some analysis, I found it is caused by the following race condition.
> >>>
> >>> 1. A task running on CPU1 is throttled for cfs bandwidth. CPU1 starts the
> >>> hrtimer cfs_bandwidth 'period_timer' and enqueue the hrtimer on CPU1's rbtree.
> >>> 2. Then the task is migrated to CPU2 and starts to offline CPU1. CPU1 starts
> >>> CPUHP AP steps, and then the hrtimer 'period_timer' expires and re-enqueued on CPU1.
> >>> 3. CPU1 runs to take_cpu_down() and disable irq. After CPU1 finished CPUHP AP
> >>> steps, CPU2 starts the rest CPUHP step.
> >>> 4. When CPU2 runs to free_vm_stack_cache(), it is sched out in __vunmap()
> >>> because it run out of CPU quota. start_cfs_bandwidth() does not restart the
> >>> hrtimer because 'cfs_b->period_active' is set.
> >>> 5. The task waits the hrtimer 'period_timer' to expire to wake itself up, but
> >>> CPU1 has disabled irq and the hrtimer won't expire until it is migrated to CPU2
> >>> in hrtimers_dead_cpu(). But the task is blocked and cannot proceed to
> >>> hrtimers_dead_cpu() step. So the task hungs.
> >>>
> >>> CPU1 CPU2
> >>> Task set cfs_quota
> >>> start hrtimer cfs_bandwidth 'period_timer'
> >>> start to offline CPU1
> >>> CPU1 start CPUHP AP step
> >>> ...
> >>> 'period_timer' expired and re-enqueued on CPU1
> >>> ...
> >>> disable irq in take_cpu_down()
> >>> ...
> >>> CPU2 start the rest CPUHP steps
> >>> ...
> >>> sched out in free_vm_stack_cache()
> >>> wait for 'period_timer' expires
> >>>
> >>>
> >>> Appreciate it a lot if anyone can give some suggestion on how fix this problem !
> >>>
> >>> Thanks,
> >>> Xiongfeng
> >> .
> >>
> >
> > Test script:
> > taskset -cp 1 $$
> > mkdir /sys/fs/cgroup/cpu/test
> > echo $$ > /sys/fs/cgroup/cpu/test/tasks
> > echo 80000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
> > echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
> > taskset -cp 2 $$
> > echo 0 > /sys/devices/system/cpu/cpu1/online
> >
> >
> > Tests show that the following modification can solve the problem of above test
> > scripts. But I am not sure if there exists some other issues.
> >
> > diff --cc kernel/sched/fair.c
> > index d9d6519fae01,bd6624353608..000000000000
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@@ -5411,10 -5411,16 +5411,15 @@@ void start_cfs_bandwidth(struct cfs_ban
> > {
> > lockdep_assert_held(&cfs_b->lock);
> >
> > - if (cfs_b->period_active)
> > + if (cfs_b->period_active) {
> > + struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
> > + int cpu = clock_base->cpu_base->cpu;
> > + if (!cpu_active(cpu) && cpu != smp_processor_id())
> > + hrtimer_start_expires(&cfs_b->period_timer,
> > HRTIMER_MODE_ABS_PINNED);
> > return;
> > + }

I have been able to reproduce your problem and run your fix on top. I
still wonder if there is a
Could we have a helper from hrtimer to get the cpu of the clock_base?
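
Something like the purely illustrative sketch below is what I have in mind;
such a helper does not exist in the hrtimer API today:

/* Hypothetical accessor, only to illustrate the question above. */
static inline int hrtimer_base_cpu(const struct hrtimer *timer)
{
        /* Lockless read of the timer's current clock base. */
        return timer->base->cpu_base->cpu;
}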


> >
> > cfs_b->period_active = 1;
> > -
> > hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> > hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> > }
> >
> > Thanks,
> > Xiongfeng
> >
> > .
> >

2023-06-28 12:25:10

by Thomas Gleixner

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On Tue, Jun 27 2023 at 18:46, Vincent Guittot wrote:
> On Mon, 26 Jun 2023 at 10:23, Xiongfeng Wang <[email protected]> wrote:
>> > diff --cc kernel/sched/fair.c
>> > index d9d6519fae01,bd6624353608..000000000000
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@@ -5411,10 -5411,16 +5411,15 @@@ void start_cfs_bandwidth(struct cfs_ban
>> > {
>> > lockdep_assert_held(&cfs_b->lock);
>> >
>> > - if (cfs_b->period_active)
>> > + if (cfs_b->period_active) {
>> > + struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
>> > + int cpu = clock_base->cpu_base->cpu;
>> > + if (!cpu_active(cpu) && cpu != smp_processor_id())
>> > + hrtimer_start_expires(&cfs_b->period_timer,
>> > HRTIMER_MODE_ABS_PINNED);
>> > return;
>> > + }
>
> I have been able to reproduce your problem and run your fix on top. I
> still wonder if there is a
> Could we have a helper from hrtimer to get the cpu of the clock_base ?

No, because this is fundamentally wrong.

If the CPU is on the way out, then the scheduler hotplug machinery
has to handle the period timer so that the problem Xiongfeng analyzed
does not happen in the first place.

sched_cpu_wait_empty() would be the obvious place to cleanup armed CFS
timers, but let me look into whether we can migrate hrtimers early in
general.

Aside from that, the above is wrong by itself.

  if (cfs_b->period_active)
          hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);

This only ends up on the outgoing CPU if either:

1) The code runs on the outgoing CPU

or

2) The hrtimer is concurrently executing the hrtimer callback on the
outgoing CPU.

So this:

  if (cfs_b->period_active) {
          struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
          int cpu = clock_base->cpu_base->cpu;

          if (!cpu_active(cpu) && cpu != smp_processor_id())
                  hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
          return;
  }

only works if

1) The code runs _not_ on the outgoing CPU

and

2) The hrtimer is _not_ concurrently executing the hrtimer callback on
the outgoing CPU.

If the callback is executing (it spins on cfs_b->lock), then the
timer is requeued on the outgoing CPU. Not what you want, right?

Plus accessing hrtimer->clock_base->cpu_base->cpu lockless is fragile at
best.

Thanks,

tglx

2023-06-28 12:46:23

by Vincent Guittot

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On Wed, 28 Jun 2023 at 14:03, Thomas Gleixner <[email protected]> wrote:
>
> On Tue, Jun 27 2023 at 18:46, Vincent Guittot wrote:
> > On Mon, 26 Jun 2023 at 10:23, Xiongfeng Wang <[email protected]> wrote:
> >> > diff --cc kernel/sched/fair.c
> >> > index d9d6519fae01,bd6624353608..000000000000
> >> > --- a/kernel/sched/fair.c
> >> > +++ b/kernel/sched/fair.c
> >> > @@@ -5411,10 -5411,16 +5411,15 @@@ void start_cfs_bandwidth(struct cfs_ban
> >> > {
> >> > lockdep_assert_held(&cfs_b->lock);
> >> >
> >> > - if (cfs_b->period_active)
> >> > + if (cfs_b->period_active) {
> >> > + struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
> >> > + int cpu = clock_base->cpu_base->cpu;
> >> > + if (!cpu_active(cpu) && cpu != smp_processor_id())
> >> > + hrtimer_start_expires(&cfs_b->period_timer,
> >> > HRTIMER_MODE_ABS_PINNED);
> >> > return;
> >> > + }
> >
> > I have been able to reproduce your problem and run your fix on top. I
> > still wonder if there is a
> > Could we have a helper from hrtimer to get the cpu of the clock_base ?
>
> No, because this is fundamentally wrong.
>
> If the CPU is on the way out, then the scheduler hotplug machinery
> has to handle the period timer so that the problem Xiongfeng analyzed
> does not happen in the first place.

But the hrtimer was enqueued before the CPU started to go offline.
Then hrtimers_dead_cpu() should take care of migrating the hrtimer out
of the outgoing cpu, but:
- it must run on another target cpu to migrate the hrtimer.
- it runs in the context of the caller, which can be throttled.
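
Roughly, the hotplug callback looks like the condensed sketch below (paraphrased,
not verbatim; locking/irq details omitted): it runs in the context of the task
driving the hotplug operation, on whatever online CPU that task is on, and pulls
the dead CPU's timers over to that CPU:

int hrtimers_dead_cpu(unsigned int dead_cpu)
{
        /* Executed as the CPUHP_HRTIMERS_DEAD teardown, in the caller's context. */
        struct hrtimer_cpu_base *old_base = &per_cpu(hrtimer_bases, dead_cpu);
        struct hrtimer_cpu_base *new_base = this_cpu_ptr(&hrtimer_bases);
        int i;

        BUG_ON(cpu_online(dead_cpu));

        raw_spin_lock(&new_base->lock);
        raw_spin_lock_nested(&old_base->lock, SINGLE_DEPTH_NESTING);

        /* Move every timer queued on the dead CPU onto the current CPU. */
        for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++)
                migrate_hrtimer_list(&old_base->clock_base[i],
                                     &new_base->clock_base[i]);

        raw_spin_unlock(&old_base->lock);
        raw_spin_unlock(&new_base->lock);
        return 0;
}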

>
> sched_cpu_wait_empty() would be the obvious place to cleanup armed CFS
> timers, but let me look into whether we can migrate hrtimers early in
> general.

but for that we must check whether the timer is enqueued on the outgoing
cpu, and we then need to choose a target cpu.

>
> Aside of that the above is wrong by itself.
>
> if (cfs_b->period_active)
> hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>
> This only ends up on the outgoing CPU if either:
>
> 1) The code runs on the outgoing CPU
>
> or
>
> 2) The hrtimer is concurrently executing the hrtimer callback on the
> outgoing CPU.
>
> So this:
>
> if (cfs_b->period_active) {
> struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
> int cpu = clock_base->cpu_base->cpu;
>
> if (!cpu_active(cpu) && cpu != smp_processor_id())
> hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> return;
> }
>
> only works, if
>
> 1) The code runs _not_ on the outgoing CPU
>
> and
>
> 2) The hrtimer is _not_ concurrently executing the hrtimer callback on
> the outgoing CPU.
>
> If the callback is executing (it spins on cfs_b->lock), then the
> timer is requeued on the outgoing CPU. Not what you want, right?
>
> Plus accessing hrtimer->clock_base->cpu_base->cpu lockless is fragile at
> best.
>
> Thanks,
>
> tglx

2023-06-28 13:43:51

by Vincent Guittot

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On Tue, 27 Jun 2023 at 18:46, Vincent Guittot
<[email protected]> wrote:
>
> On Mon, 26 Jun 2023 at 10:23, Xiongfeng Wang <[email protected]> wrote:
> >
> > Hi,
> >
> > Kindly ping~
> > Could you please take a look at this issue and the below temporary fix ?
> >
> > Thanks,
> > Xiongfeng
> >
> > On 2023/6/12 20:49, Xiongfeng Wang wrote:
> > >
> > >
> > > On 2023/6/9 22:55, Thomas Gleixner wrote:
> > >> On Fri, Jun 09 2023 at 19:24, Xiongfeng Wang wrote:
> > >>
> > >> Cc+ scheduler people, leave context intact
> > >>
> > >>> Hello,
> > >>> When I do some low power tests, the following hung task is printed.
> > >>>
> > >>> Call trace:
> > >>> __switch_to+0xd4/0x160
> > >>> __schedule+0x38c/0x8c4
> > >>> __cond_resched+0x24/0x50
> > >>> unmap_kernel_range_noflush+0x210/0x240
> > >>> kretprobe_trampoline+0x0/0xc8
> > >>> __vunmap+0x70/0x31c
> > >>> __vfree+0x34/0x8c
> > >>> vfree+0x40/0x58
> > >>> free_vm_stack_cache+0x44/0x74
> > >>> cpuhp_invoke_callback+0xc4/0x71c
> > >>> _cpu_down+0x108/0x284
> > >>> kretprobe_trampoline+0x0/0xc8
> > >>> suspend_enter+0xd8/0x8ec
> > >>> suspend_devices_and_enter+0x1f0/0x360
> > >>> pm_suspend.part.1+0x428/0x53c
> > >>> pm_suspend+0x3c/0xa0
> > >>> devdrv_suspend_proc+0x148/0x248 [drv_devmng]
> > >>> devdrv_manager_set_power_state+0x140/0x680 [drv_devmng]
> > >>> devdrv_manager_ioctl+0xcc/0x210 [drv_devmng]
> > >>> drv_ascend_intf_ioctl+0x84/0x248 [drv_davinci_intf]
> > >>> __arm64_sys_ioctl+0xb4/0xf0
> > >>> el0_svc_common.constprop.0+0x140/0x374
> > >>> do_el0_svc+0x80/0xa0
> > >>> el0_svc+0x1c/0x28
> > >>> el0_sync_handler+0x90/0xf0
> > >>> el0_sync+0x168/0x180
> > >>>
> > >>> After some analysis, I found it is caused by the following race condition.
> > >>>
> > >>> 1. A task running on CPU1 is throttled for cfs bandwidth. CPU1 starts the
> > >>> hrtimer cfs_bandwidth 'period_timer' and enqueue the hrtimer on CPU1's rbtree.
> > >>> 2. Then the task is migrated to CPU2 and starts to offline CPU1. CPU1 starts
> > >>> CPUHP AP steps, and then the hrtimer 'period_timer' expires and re-enqueued on CPU1.
> > >>> 3. CPU1 runs to take_cpu_down() and disable irq. After CPU1 finished CPUHP AP
> > >>> steps, CPU2 starts the rest CPUHP step.
> > >>> 4. When CPU2 runs to free_vm_stack_cache(), it is sched out in __vunmap()
> > >>> because it run out of CPU quota. start_cfs_bandwidth() does not restart the
> > >>> hrtimer because 'cfs_b->period_active' is set.
> > >>> 5. The task waits the hrtimer 'period_timer' to expire to wake itself up, but
> > >>> CPU1 has disabled irq and the hrtimer won't expire until it is migrated to CPU2
> > >>> in hrtimers_dead_cpu(). But the task is blocked and cannot proceed to
> > >>> hrtimers_dead_cpu() step. So the task hungs.
> > >>>
> > >>> CPU1 CPU2
> > >>> Task set cfs_quota
> > >>> start hrtimer cfs_bandwidth 'period_timer'
> > >>> start to offline CPU1
> > >>> CPU1 start CPUHP AP step
> > >>> ...
> > >>> 'period_timer' expired and re-enqueued on CPU1
> > >>> ...
> > >>> disable irq in take_cpu_down()
> > >>> ...
> > >>> CPU2 start the rest CPUHP steps
> > >>> ...
> > >>> sched out in free_vm_stack_cache()
> > >>> wait for 'period_timer' expires
> > >>>
> > >>>
> > >>> Appreciate it a lot if anyone can give some suggestion on how fix this problem !
> > >>>
> > >>> Thanks,
> > >>> Xiongfeng
> > >> .
> > >>
> > >
> > > Test script:
> > > taskset -cp 1 $$
> > > mkdir /sys/fs/cgroup/cpu/test
> > > echo $$ > /sys/fs/cgroup/cpu/test/tasks
> > > echo 80000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
> > > echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
> > > taskset -cp 2 $$
> > > echo 0 > /sys/devices/system/cpu/cpu1/online
> > >
> > >
> > > Tests show that the following modification can solve the problem of above test
> > > scripts. But I am not sure if there exists some other issues.
> > >
> > > diff --cc kernel/sched/fair.c
> > > index d9d6519fae01,bd6624353608..000000000000
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@@ -5411,10 -5411,16 +5411,15 @@@ void start_cfs_bandwidth(struct cfs_ban
> > > {
> > > lockdep_assert_held(&cfs_b->lock);
> > >
> > > - if (cfs_b->period_active)
> > > + if (cfs_b->period_active) {
> > > + struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
> > > + int cpu = clock_base->cpu_base->cpu;
> > > + if (!cpu_active(cpu) && cpu != smp_processor_id())
> > > + hrtimer_start_expires(&cfs_b->period_timer,
> > > HRTIMER_MODE_ABS_PINNED);
> > > return;
> > > + }
>
> I have been able to reproduce your problem and run your fix on top. I
> still wonder if there is a

Looks like I have been preempted and never finished the sentence. The
full sentence is:
I still wonder if there is a race condition where the hang can still
happen, but I haven't been able to find one so far.

> Could we have a helper from hrtimer to get the cpu of the clock_base ?
>
>
> > >
> > > cfs_b->period_active = 1;
> > > -
> > > hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> > > hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> > > }
> > >
> > > Thanks,
> > > Xiongfeng
> > >
> > > .
> > >

2023-06-28 21:16:02

by Thomas Gleixner

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On Wed, Jun 28 2023 at 15:30, Vincent Guittot wrote:
> On Tue, 27 Jun 2023 at 18:46, Vincent Guittot
>> > > + struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
>> > > + int cpu = clock_base->cpu_base->cpu;
>> > > + if (!cpu_active(cpu) && cpu != smp_processor_id())
>> > > + hrtimer_start_expires(&cfs_b->period_timer,
>> > > HRTIMER_MODE_ABS_PINNED);
>> > > return;
>> > > + }

Can you please trim your replies?

>> I have been able to reproduce your problem and run your fix on top. I
>> still wonder if there is a
>
> Looks like I have been preempted and never finished the sentence. The
> full sentence is:
> I still wonder if there is a race condition where the hang can still
> happen but i haven't been able to find one so far

As I explained before. Assume the timer fires on the outgoing CPU and
the other CPU tries to rearm it concurrently. It will stay on the
outgoing CPU and not move over.

Thanks,

tglx

2023-06-28 22:44:50

by Thomas Gleixner

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On Wed, Jun 28 2023 at 14:35, Vincent Guittot wrote:
> On Wed, 28 Jun 2023 at 14:03, Thomas Gleixner <[email protected]> wrote:
>> No, because this is fundamentally wrong.
>>
>> If the CPU is on the way out, then the scheduler hotplug machinery
>> has to handle the period timer so that the problem Xiongfeng analyzed
>> does not happen in the first place.
>
> But the hrtimer was enqueued before it starts to offline the cpu

It does not really matter when it was enqueued. The important point is
that it was enqueued on that outgoing CPU for whatever reason.

> Then, hrtimers_dead_cpu should take care of migrating the hrtimer out
> of the outgoing cpu but :
> - it must run on another target cpu to migrate the hrtimer.
> - it runs in the context of the caller which can be throttled.

Sure. I completely understand the problem. The hrtimer hotplug callback
does not run because the task is stuck and waits for the timer to
expire. Circular dependency.

>> sched_cpu_wait_empty() would be the obvious place to cleanup armed CFS
>> timers, but let me look into whether we can migrate hrtimers early in
>> general.
>
> but for that we must check if the timer is enqueued on the outgoing
> cpu and we then need to choose a target cpu.

You're right. I somehow assumed that cfs knows where it queued stuff,
but obviously it does not.

I think we can avoid all that by simply taking that user space task out
of the picture completely, which avoids debating whether there are other
possible weird conditions to consider altogether.

Something like the untested below should just work.

Thanks,

tglx
---
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1490,6 +1490,13 @@ static int cpu_down(unsigned int cpu, en
         return err;
 }
 
+static long __cpu_device_down(void *arg)
+{
+        struct device *dev = arg;
+
+        return cpu_down(dev->id, CPUHP_OFFLINE);
+}
+
 /**
  * cpu_device_down - Bring down a cpu device
  * @dev: Pointer to the cpu device to offline
@@ -1502,7 +1509,12 @@ static int cpu_down(unsigned int cpu, en
  */
 int cpu_device_down(struct device *dev)
 {
-        return cpu_down(dev->id, CPUHP_OFFLINE);
+        unsigned int cpu = cpumask_any_but(cpu_online_mask, dev->id);
+
+        if (cpu >= nr_cpu_ids)
+                return -EBUSY;
+
+        return work_on_cpu(cpu, __cpu_device_down, dev);
 }
 
 int remove_cpu(unsigned int cpu)

2023-06-29 01:46:24

by Xiongfeng Wang

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling



On 2023/6/28 0:46, Vincent Guittot wrote:
> On Mon, 26 Jun 2023 at 10:23, Xiongfeng Wang <[email protected]> wrote:
>>
>> Hi,
>>
>> Kindly ping~
>> Could you please take a look at this issue and the below temporary fix ?
>>
>> Thanks,
>> Xiongfeng
>>
>> On 2023/6/12 20:49, Xiongfeng Wang wrote:
>>>
>>>
>>> On 2023/6/9 22:55, Thomas Gleixner wrote:
>>>> On Fri, Jun 09 2023 at 19:24, Xiongfeng Wang wrote:
>>>>
>>>> Cc+ scheduler people, leave context intact
>>>>
>>>>> Hello,
>>>>> When I do some low power tests, the following hung task is printed.
>>>>>
>>>>> Call trace:
>>>>> __switch_to+0xd4/0x160
>>>>> __schedule+0x38c/0x8c4
>>>>> __cond_resched+0x24/0x50
>>>>> unmap_kernel_range_noflush+0x210/0x240
>>>>> kretprobe_trampoline+0x0/0xc8
>>>>> __vunmap+0x70/0x31c
>>>>> __vfree+0x34/0x8c
>>>>> vfree+0x40/0x58
>>>>> free_vm_stack_cache+0x44/0x74
>>>>> cpuhp_invoke_callback+0xc4/0x71c
>>>>> _cpu_down+0x108/0x284
>>>>> kretprobe_trampoline+0x0/0xc8
>>>>> suspend_enter+0xd8/0x8ec
>>>>> suspend_devices_and_enter+0x1f0/0x360
>>>>> pm_suspend.part.1+0x428/0x53c
>>>>> pm_suspend+0x3c/0xa0
>>>>> devdrv_suspend_proc+0x148/0x248 [drv_devmng]
>>>>> devdrv_manager_set_power_state+0x140/0x680 [drv_devmng]
>>>>> devdrv_manager_ioctl+0xcc/0x210 [drv_devmng]
>>>>> drv_ascend_intf_ioctl+0x84/0x248 [drv_davinci_intf]
>>>>> __arm64_sys_ioctl+0xb4/0xf0
>>>>> el0_svc_common.constprop.0+0x140/0x374
>>>>> do_el0_svc+0x80/0xa0
>>>>> el0_svc+0x1c/0x28
>>>>> el0_sync_handler+0x90/0xf0
>>>>> el0_sync+0x168/0x180
>>>>>
>>>>> After some analysis, I found it is caused by the following race condition.
>>>>>
>>>>> 1. A task running on CPU1 is throttled for cfs bandwidth. CPU1 starts the
>>>>> hrtimer cfs_bandwidth 'period_timer' and enqueue the hrtimer on CPU1's rbtree.
>>>>> 2. Then the task is migrated to CPU2 and starts to offline CPU1. CPU1 starts
>>>>> CPUHP AP steps, and then the hrtimer 'period_timer' expires and re-enqueued on CPU1.
>>>>> 3. CPU1 runs to take_cpu_down() and disable irq. After CPU1 finished CPUHP AP
>>>>> steps, CPU2 starts the rest CPUHP step.
>>>>> 4. When CPU2 runs to free_vm_stack_cache(), it is sched out in __vunmap()
>>>>> because it run out of CPU quota. start_cfs_bandwidth() does not restart the
>>>>> hrtimer because 'cfs_b->period_active' is set.
>>>>> 5. The task waits the hrtimer 'period_timer' to expire to wake itself up, but
>>>>> CPU1 has disabled irq and the hrtimer won't expire until it is migrated to CPU2
>>>>> in hrtimers_dead_cpu(). But the task is blocked and cannot proceed to
>>>>> hrtimers_dead_cpu() step. So the task hungs.
>>>>>
>>>>> CPU1 CPU2
>>>>> Task set cfs_quota
>>>>> start hrtimer cfs_bandwidth 'period_timer'
>>>>> start to offline CPU1
>>>>> CPU1 start CPUHP AP step
>>>>> ...
>>>>> 'period_timer' expired and re-enqueued on CPU1
>>>>> ...
>>>>> disable irq in take_cpu_down()
>>>>> ...
>>>>> CPU2 start the rest CPUHP steps
>>>>> ...
>>>>> sched out in free_vm_stack_cache()
>>>>> wait for 'period_timer' expires
>>>>>
>>>>>
>>>>> Appreciate it a lot if anyone can give some suggestion on how fix this problem !
>>>>>
>>>>> Thanks,
>>>>> Xiongfeng
>>>> .
>>>>
>>>
>>> Test script:
>>> taskset -cp 1 $$
>>> mkdir /sys/fs/cgroup/cpu/test
>>> echo $$ > /sys/fs/cgroup/cpu/test/tasks
>>> echo 80000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
>>> echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
>>> taskset -cp 2 $$
>>> echo 0 > /sys/devices/system/cpu/cpu1/online
>>>
>>>
>>> Tests show that the following modification can solve the problem of above test
>>> scripts. But I am not sure if there exists some other issues.
>>>
>>> diff --cc kernel/sched/fair.c
>>> index d9d6519fae01,bd6624353608..000000000000
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@@ -5411,10 -5411,16 +5411,15 @@@ void start_cfs_bandwidth(struct cfs_ban
>>> {
>>> lockdep_assert_held(&cfs_b->lock);
>>>
>>> - if (cfs_b->period_active)
>>> + if (cfs_b->period_active) {
>>> + struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
>>> + int cpu = clock_base->cpu_base->cpu;
>>> + if (!cpu_active(cpu) && cpu != smp_processor_id())
>>> + hrtimer_start_expires(&cfs_b->period_timer,
>>> HRTIMER_MODE_ABS_PINNED);
>>> return;
>>> + }
>
> I have been able to reproduce your problem and run your fix on top. I
> still wonder if there is a

Sorry, I forgot to provide the kernel modification that helps reproduce the issue.
At first, the issue could only be reproduced in the product environment with a
product stress testcase. After figuring out the reason, I added the following
modification. It makes sure the process runs out of cfs quota and gets scheduled
out in free_vm_stack_cache(). Although the real scheduling point is in __vunmap(),
this still shows that the issue exists.

diff --git a/kernel/fork.c b/kernel/fork.c
index 0fb86b65ae60..3b2d83fb407a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -110,6 +110,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/task.h>
 
+#include <linux/delay.h>
+
 /*
  * Minimum number of threads to boot the kernel
  */
@@ -199,6 +201,9 @@ static int free_vm_stack_cache(unsigned int cpu)
         struct vm_struct **cached_vm_stacks = per_cpu_ptr(cached_stacks, cpu);
         int i;
 
+        mdelay(2000);
+        cond_resched();
+
         for (i = 0; i < NR_CACHED_STACKS; i++) {
                 struct vm_struct *vm_stack = cached_vm_stacks[i];

Thanks,
Xiongfeng

> Could we have a helper from hrtimer to get the cpu of the clock_base ?
>
>
>>>
>>> cfs_b->period_active = 1;
>>> -
>>> hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>>> hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>>> }
>>>
>>> Thanks,
>>> Xiongfeng
>>>
>>> .
>>>
> .
>

2023-06-29 02:01:49

by Xiongfeng Wang

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling



On 2023/6/29 6:01, Thomas Gleixner wrote:
> On Wed, Jun 28 2023 at 14:35, Vincent Guittot wrote:
>> On Wed, 28 Jun 2023 at 14:03, Thomas Gleixner <[email protected]> wrote:
>>> No, because this is fundamentally wrong.
>>>
>>> If the CPU is on the way out, then the scheduler hotplug machinery
>>> has to handle the period timer so that the problem Xiongfeng analyzed
>>> does not happen in the first place.
>>
>> But the hrtimer was enqueued before it starts to offline the cpu
>
> It does not really matter when it was enqueued. The important point is
> that it was enqueued on that outgoing CPU for whatever reason.
>
>> Then, hrtimers_dead_cpu should take care of migrating the hrtimer out
>> of the outgoing cpu but :
>> - it must run on another target cpu to migrate the hrtimer.
>> - it runs in the context of the caller which can be throttled.
>
> Sure. I completely understand the problem. The hrtimer hotplug callback
> does not run because the task is stuck and waits for the timer to
> expire. Circular dependency.
>
>>> sched_cpu_wait_empty() would be the obvious place to cleanup armed CFS
>>> timers, but let me look into whether we can migrate hrtimers early in
>>> general.
>>
>> but for that we must check if the timer is enqueued on the outgoing
>> cpu and we then need to choose a target cpu.
>
> You're right. I somehow assumed that cfs knows where it queued stuff,
> but obviously it does not.
>
> I think we can avoid all that by simply taking that user space task out
> of the picture completely, which avoids debating whether there are other
> possible weird conditions to consider alltogether.
>
> Something like the untested below should just work.
>
> Thanks,
>
> tglx
> ---
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1490,6 +1490,13 @@ static int cpu_down(unsigned int cpu, en
> return err;
> }
>
> +static long __cpu_device_down(void *arg)
> +{
> + struct device *dev = arg;
> +
> + return cpu_down(dev->id, CPUHP_OFFLINE);
> +}
> +
> /**
> * cpu_device_down - Bring down a cpu device
> * @dev: Pointer to the cpu device to offline
> @@ -1502,7 +1509,12 @@ static int cpu_down(unsigned int cpu, en
> */
> int cpu_device_down(struct device *dev)
> {
> - return cpu_down(dev->id, CPUHP_OFFLINE);
> + unsigned int cpu = cpumask_any_but(cpu_online_mask, dev->id);
> +
> + if (cpu >= nr_cpu_ids)
> + return -EBUSY;
> +
> + return work_on_cpu(cpu, __cpu_device_down, dev);
> }
>
> int remove_cpu(unsigned int cpu)
> .
>

Tested with the following kernel modification, which helps reproduce the issue.
The hung task does not happen any more. Thanks a lot.

Thanks,
Xiongfeng

--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -110,6 +110,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/task.h>
 
+#include <linux/delay.h>
+
 /*
  * Minimum number of threads to boot the kernel
  */
@@ -199,6 +201,9 @@ static int free_vm_stack_cache(unsigned int cpu)
         struct vm_struct **cached_vm_stacks = per_cpu_ptr(cached_stacks, cpu);
         int i;
 
+        mdelay(2000);
+        cond_resched();
+
         for (i = 0; i < NR_CACHED_STACKS; i++) {
                 struct vm_struct *vm_stack = cached_vm_stacks[i];


2023-06-29 08:41:49

by Vincent Guittot

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On Thu, 29 Jun 2023 at 00:01, Thomas Gleixner <[email protected]> wrote:
>
> On Wed, Jun 28 2023 at 14:35, Vincent Guittot wrote:
> > On Wed, 28 Jun 2023 at 14:03, Thomas Gleixner <[email protected]> wrote:
> >> No, because this is fundamentally wrong.
> >>
> >> If the CPU is on the way out, then the scheduler hotplug machinery
> >> has to handle the period timer so that the problem Xiongfeng analyzed
> >> does not happen in the first place.
> >
> > But the hrtimer was enqueued before it starts to offline the cpu
>
> It does not really matter when it was enqueued. The important point is
> that it was enqueued on that outgoing CPU for whatever reason.
>
> > Then, hrtimers_dead_cpu should take care of migrating the hrtimer out
> > of the outgoing cpu but :
> > - it must run on another target cpu to migrate the hrtimer.
> > - it runs in the context of the caller which can be throttled.
>
> Sure. I completely understand the problem. The hrtimer hotplug callback
> does not run because the task is stuck and waits for the timer to
> expire. Circular dependency.
>
> >> sched_cpu_wait_empty() would be the obvious place to cleanup armed CFS
> >> timers, but let me look into whether we can migrate hrtimers early in
> >> general.
> >
> > but for that we must check if the timer is enqueued on the outgoing
> > cpu and we then need to choose a target cpu.
>
> You're right. I somehow assumed that cfs knows where it queued stuff,
> but obviously it does not.

The scheduler doesn't know where hrtimer enqueues the timer.

>
> I think we can avoid all that by simply taking that user space task out
> of the picture completely, which avoids debating whether there are other
> possible weird conditions to consider alltogether.

Yes, the offline sequence should not be impacted by the caller's context.

>
> Something like the untested below should just work.
>
> Thanks,
>
> tglx
> ---
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1490,6 +1490,13 @@ static int cpu_down(unsigned int cpu, en
> return err;
> }
>
> +static long __cpu_device_down(void *arg)
> +{
> + struct device *dev = arg;
> +
> + return cpu_down(dev->id, CPUHP_OFFLINE);
> +}
> +
> /**
> * cpu_device_down - Bring down a cpu device
> * @dev: Pointer to the cpu device to offline
> @@ -1502,7 +1509,12 @@ static int cpu_down(unsigned int cpu, en
> */
> int cpu_device_down(struct device *dev)
> {
> - return cpu_down(dev->id, CPUHP_OFFLINE);
> + unsigned int cpu = cpumask_any_but(cpu_online_mask, dev->id);
> +
> + if (cpu >= nr_cpu_ids)
> + return -EBUSY;
> +
> + return work_on_cpu(cpu, __cpu_device_down, dev);

The comment for work_on_cpu():

* It is up to the caller to ensure that the cpu doesn't go offline.
* The caller must not hold any locks which would prevent @fn from completing.

makes me wonder if this should be done only once the hotplug lock is
taken, so that the selected cpu cannot go offline.
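
For reference, a rough sketch of the current mainline calling context (assuming
the usual structure of kernel/cpu.c): the serialization against other hotplug
operations is only taken inside cpu_down(), i.e. after the work_on_cpu() above
has already picked a target CPU:

/* Sketch of the existing mainline helper, not part of the proposed patch. */
static int cpu_down(unsigned int cpu, enum cpuhp_state target)
{
        int err;

        /* Takes cpu_add_remove_lock, which serializes CPU add/remove. */
        cpu_maps_update_begin();
        err = cpu_down_maps_locked(cpu, target);
        cpu_maps_update_done();
        return err;
}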

> }
>
> int remove_cpu(unsigned int cpu)

2023-06-29 08:53:23

by Vincent Guittot

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On Thu, 29 Jun 2023 at 03:26, Xiongfeng Wang <[email protected]> wrote:
>
>
>
> On 2023/6/28 0:46, Vincent Guittot wrote:
> > On Mon, 26 Jun 2023 at 10:23, Xiongfeng Wang <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> Kindly ping~
> >> Could you please take a look at this issue and the below temporary fix ?
> >>
> >> Thanks,
> >> Xiongfeng
> >>
> >> On 2023/6/12 20:49, Xiongfeng Wang wrote:
> >>>
> >>>
> >>> On 2023/6/9 22:55, Thomas Gleixner wrote:
> >>>> On Fri, Jun 09 2023 at 19:24, Xiongfeng Wang wrote:
> >>>>
> >>>> Cc+ scheduler people, leave context intact
> >>>>
> >>>>> Hello,
> >>>>> When I do some low power tests, the following hung task is printed.

[...]

> >>> diff --cc kernel/sched/fair.c
> >>> index d9d6519fae01,bd6624353608..000000000000
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@@ -5411,10 -5411,16 +5411,15 @@@ void start_cfs_bandwidth(struct cfs_ban
> >>> {
> >>> lockdep_assert_held(&cfs_b->lock);
> >>>
> >>> - if (cfs_b->period_active)
> >>> + if (cfs_b->period_active) {
> >>> + struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
> >>> + int cpu = clock_base->cpu_base->cpu;
> >>> + if (!cpu_active(cpu) && cpu != smp_processor_id())
> >>> + hrtimer_start_expires(&cfs_b->period_timer,
> >>> HRTIMER_MODE_ABS_PINNED);
> >>> return;
> >>> + }
> >
> > I have been able to reproduce your problem and run your fix on top. I
> > still wonder if there is a
>
> Sorry, I forgot to provide the kernel modification to help reproduce the issue.
> At first, the issue can only be reproduced on the product environment with
> product stress testcase. After firguring out the reason, I add the following
> modification. It make sure the process ran out cfs quota and can be sched out in
> free_vm_stack_cache. Although the real schedule point is in __vunmap(), this can
> also show the issue exists.

I have been able to reproduce the problem (or at least something similar)
without your change below, using a shorter cfs_quota_us and other tasks
always running in the cgroup.

>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0fb86b65ae60..3b2d83fb407a 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -110,6 +110,8 @@
> #define CREATE_TRACE_POINTS
> #include <trace/events/task.h>
>
> +#include <linux/delay.h>
> +
> /*
> * Minimum number of threads to boot the kernel
> */
> @@ -199,6 +201,9 @@ static int free_vm_stack_cache(unsigned int cpu)
> struct vm_struct **cached_vm_stacks = per_cpu_ptr(cached_stacks, cpu);
> int i;
>
> + mdelay(2000);
> + cond_resched();
> +
> for (i = 0; i < NR_CACHED_STACKS; i++) {
> struct vm_struct *vm_stack = cached_vm_stacks[i];
>
> Thanks,
> Xiongfeng
>
> > Could we have a helper from hrtimer to get the cpu of the clock_base ?
> >
> >
> >>>
> >>> cfs_b->period_active = 1;
> >>> -
> >>> hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> >>> hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> >>> }
> >>>
> >>> Thanks,
> >>> Xiongfeng
> >>>
> >>> .
> >>>
> > .
> >

2023-08-24 09:19:32

by Yu Liao

Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling

On 2023/8/23 18:14, Thomas Gleixner wrote:
> Subject: cpu/hotplug: Prevent self deadlock on CPU hot-unplug
> From: Thomas Gleixner <[email protected]>
> Date: Wed, 23 Aug 2023 10:47:02 +0200
>
> Xiongfeng reported and debugged a self deadlock of the task which initiates
> and controls a CPU hot-unplug operation vs. the CFS bandwidth timer.
>
> CPU1                                CPU2
>
> T1 sets cfs_quota
> starts hrtimer cfs_bandwidth 'period_timer'
> T1 is migrated to CPU2
>                                     T1 initiates offlining of CPU1
> Hotplug operation starts
> ...
> 'period_timer' expires and is re-enqueued on CPU1
> ...
> take_cpu_down()
> CPU1 shuts down and does not handle timers
> anymore. They have to be migrated in the
> post dead hotplug steps by the control task.
>
>                                     T1 runs the post dead offline operation
>                                     T1 is scheduled out
>                                     T1 waits for 'period_timer' to expire
>
> T1 waits there forever if it is scheduled out before it can execute the hrtimer
> offline callback hrtimers_dead_cpu().
>
> Cure this by delegating the hotplug control operation to a worker thread on
> an online CPU. This takes the initiating user space task, which might be
> affected by the bandwidth timer, completely out of the picture.
>
> Reported-by: Xiongfeng Wang <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
> Link: https://lore.kernel.org/lkml/[email protected]
> ---
> kernel/cpu.c | 24 +++++++++++++++++++++++-
> 1 file changed, 23 insertions(+), 1 deletion(-)
>
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1467,8 +1467,22 @@ static int __ref _cpu_down(unsigned int
>          return ret;
>  }
>  
> +struct cpu_down_work {
> +        unsigned int cpu;
> +        enum cpuhp_state target;
> +};
> +
> +static long __cpu_down_maps_locked(void *arg)
> +{
> +        struct cpu_down_work *work = arg;
> +
> +        return _cpu_down(work->cpu, 0, work->target);
> +}
> +
>  static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
>  {
> +        struct cpu_down_work work = { .cpu = cpu, .target = target, };
> +
>          /*
>           * If the platform does not support hotplug, report it explicitly to
>           * differentiate it from a transient offlining failure.
> @@ -1477,7 +1491,15 @@ static int cpu_down_maps_locked(unsigned
>                  return -EOPNOTSUPP;
>          if (cpu_hotplug_disabled)
>                  return -EBUSY;
> -        return _cpu_down(cpu, 0, target);
> +
> +        /*
> +         * Ensure that the control task does not run on the to be offlined
> +         * CPU to prevent a deadlock against cfs_b->period_timer.
> +         */
> +        cpu = cpumask_any_but(cpu_online_mask, cpu);
> +        if (cpu >= nr_cpu_ids)
> +                return -EBUSY;
> +        return work_on_cpu(cpu, __cpu_down_maps_locked, &work);
>  }
>  
>  static int cpu_down(unsigned int cpu, enum cpuhp_state target)

Thanks for the patch. Tested on v6.5-rc5 with the test script provided by
Xiongfeng; this patch works.

Tested-by: Yu Liao <[email protected]>

Best regards,
Yu