From: Vincent Guittot
Date: Thu, 29 Jun 2023 10:33:49 +0200
Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling
To: Xiongfeng Wang
Cc: Thomas Gleixner, vschneid@redhat.com, Phil Auld, vdonnefort@google.com,
    Linux Kernel Mailing List, Wei Li, "liaoyu (E)", zhangqiao22@huawei.com,
    Peter Zijlstra, Dietmar Eggemann, Ingo Molnar
X-Mailing-List: linux-kernel@vger.kernel.org
On Thu, 29 Jun 2023 at 03:26, Xiongfeng Wang wrote:
>
>
> On 2023/6/28 0:46, Vincent Guittot wrote:
> > On Mon, 26 Jun 2023 at 10:23, Xiongfeng Wang wrote:
> >>
> >> Hi,
> >>
> >> Kindly ping~
> >> Could you please take a look at this issue and the below temporary fix?
> >>
> >> Thanks,
> >> Xiongfeng
> >>
> >> On 2023/6/12 20:49, Xiongfeng Wang wrote:
> >>>
> >>>
> >>> On 2023/6/9 22:55, Thomas Gleixner wrote:
> >>>> On Fri, Jun 09 2023 at 19:24, Xiongfeng Wang wrote:
> >>>>
> >>>> Cc+ scheduler people, leave context intact
> >>>>
> >>>>> Hello,
> >>>>> When I do some low power tests, the following hung task is printed.

[...]

> >>> diff --cc kernel/sched/fair.c
> >>> index d9d6519fae01,bd6624353608..000000000000
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@@ -5411,10 -5411,16 +5411,15 @@@ void start_cfs_bandwidth(struct cfs_ban
> >>>   {
> >>>   	lockdep_assert_held(&cfs_b->lock);
> >>>
> >>> - 	if (cfs_b->period_active)
> >>> + 	if (cfs_b->period_active) {
> >>> + 		struct hrtimer_clock_base *clock_base = cfs_b->period_timer.base;
> >>> + 		int cpu = clock_base->cpu_base->cpu;
> >>> + 		if (!cpu_active(cpu) && cpu != smp_processor_id())
> >>> + 			hrtimer_start_expires(&cfs_b->period_timer,
> >>> + 					      HRTIMER_MODE_ABS_PINNED);
> >>>   		return;
> >>> + 	}
> >
> > I have been able to reproduce your problem and run your fix on top. I
> > still wonder if there is a
>
> Sorry, I forgot to provide the kernel modification that helps reproduce the
> issue. At first, the issue could only be reproduced in the production
> environment with a production stress testcase. After figuring out the
> reason, I added the following modification. It makes sure the process runs
> out of its cfs quota and can be scheduled out in free_vm_stack_cache().
> Although the real schedule point is in __vunmap(), this still shows that
> the issue exists.

I have been able to reproduce the problem (or at least something
similar) without your change below, with a shorter cfs_quota_us and
other tasks always running in the cgroup.

>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0fb86b65ae60..3b2d83fb407a 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -110,6 +110,8 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/task.h>
>
> +#include <linux/delay.h>
> +
>  /*
>   * Minimum number of threads to boot the kernel
>   */
> @@ -199,6 +201,9 @@ static int free_vm_stack_cache(unsigned int cpu)
>  	struct vm_struct **cached_vm_stacks = per_cpu_ptr(cached_stacks, cpu);
>  	int i;
>
> +	mdelay(2000);
> +	cond_resched();
> +
>  	for (i = 0; i < NR_CACHED_STACKS; i++) {
>  		struct vm_struct *vm_stack = cached_vm_stacks[i];
>
> Thanks,
> Xiongfeng
>
> > Could we have a helper from hrtimer to get the cpu of the clock_base?
> >
> >
> >>>
> >>>   	cfs_b->period_active = 1;
> >>> -
> >>>   	hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> >>>   	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> >>>   }
> >>>
> >>> Thanks,
> >>> Xiongfeng
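
To illustrate the helper Vincent asks about above: a minimal sketch, with an
invented name (hrtimer_clock_base_cpu() is not an existing hrtimer API), that
simply wraps the same pointer chase the temporary fix open-codes:

/*
 * Hypothetical helper (name made up for illustration): return the CPU
 * whose hrtimer_cpu_base this timer is currently attached to. Same
 * timer->base->cpu_base->cpu chain as in the start_cfs_bandwidth() diff
 * above; untested sketch only.
 */
static inline int hrtimer_clock_base_cpu(const struct hrtimer *timer)
{
	return timer->base->cpu_base->cpu;
}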
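
With such a helper, the temporary fix would reduce to something like the
sketch below, assembled from the two quoted diff hunks; this is not a merged
patch, just an untested illustration of the resulting function:

void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
{
	lockdep_assert_held(&cfs_b->lock);

	if (cfs_b->period_active) {
		int cpu = hrtimer_clock_base_cpu(&cfs_b->period_timer);

		/*
		 * The period timer is still queued, but possibly on a CPU
		 * that is going down; re-arm it so it is requeued on an
		 * active CPU instead of being lost with the dying one.
		 */
		if (!cpu_active(cpu) && cpu != smp_processor_id())
			hrtimer_start_expires(&cfs_b->period_timer,
					      HRTIMER_MODE_ABS_PINNED);
		return;
	}

	cfs_b->period_active = 1;
	hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
}

Whether such a helper belongs in the hrtimer core at all, or the hotplug race
should instead be fixed on the hrtimer side, is exactly the question left
open in this thread.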