From: Vincent Guittot
Date: Thu, 29 Jun 2023 10:30:44 +0200
Subject: Re: [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling
To: Thomas Gleixner
Cc: Xiongfeng Wang, vschneid@redhat.com, Phil Auld, vdonnefort@google.com, Linux Kernel Mailing List, Wei Li, "liaoyu (E)", zhangqiao22@huawei.com, Peter Zijlstra, Dietmar Eggemann, Ingo Molnar
In-Reply-To: <87pm5f2qm2.ffs@tglx>
References: <8e785777-03aa-99e1-d20e-e956f5685be6@huawei.com> <87mt18it1y.ffs@tglx> <68baeac9-9fa7-5594-b5e7-4baf8ac86b77@huawei.com> <875y774wvp.ffs@tglx> <87pm5f2qm2.ffs@tglx>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 29 Jun 2023 at 00:01, Thomas Gleixner wrote:
>
> On Wed, Jun 28 2023 at 14:35, Vincent Guittot wrote:
> > On Wed, 28 Jun 2023 at 14:03, Thomas Gleixner wrote:
> >> No, because this is fundamentally wrong.
> >>
> >> If the CPU is on the way out, then the scheduler hotplug machinery
> >> has to handle the period timer so that the problem Xiongfeng analyzed
> >> does not happen in the first place.
> >
> > But the hrtimer was enqueued before it starts to offline the cpu
>
> It does not really matter when it was enqueued. The important point is
> that it was enqueued on that outgoing CPU for whatever reason.
>
> > Then, hrtimers_dead_cpu should take care of migrating the hrtimer out
> > of the outgoing cpu but:
> > - it must run on another target cpu to migrate the hrtimer.
> > - it runs in the context of the caller, which can be throttled.
>
> Sure. I completely understand the problem. The hrtimer hotplug callback
> does not run because the task is stuck and waits for the timer to
> expire. Circular dependency.
>
> >> sched_cpu_wait_empty() would be the obvious place to clean up armed CFS
> >> timers, but let me look into whether we can migrate hrtimers early in
> >> general.
> >
> > but for that we must check if the timer is enqueued on the outgoing
> > cpu and we then need to choose a target cpu.
>
> You're right. I somehow assumed that cfs knows where it queued stuff,
> but obviously it does not.

The scheduler doesn't know on which CPU the hrtimer core enqueued the timer.

> I think we can avoid all that by simply taking that user space task out
> of the picture completely, which avoids debating whether there are other
> possible weird conditions to consider altogether.

Yes, the offline sequence should not be impacted by the caller's context.

> Something like the untested below should just work.
>
> Thanks,
>
>         tglx
> ---
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1490,6 +1490,13 @@ static int cpu_down(unsigned int cpu, en
>  	return err;
>  }
>
> +static long __cpu_device_down(void *arg)
> +{
> +	struct device *dev = arg;
> +
> +	return cpu_down(dev->id, CPUHP_OFFLINE);
> +}
> +
>  /**
>   * cpu_device_down - Bring down a cpu device
>   * @dev: Pointer to the cpu device to offline
> @@ -1502,7 +1509,12 @@ static int cpu_down(unsigned int cpu, en
>   */
>  int cpu_device_down(struct device *dev)
>  {
> -	return cpu_down(dev->id, CPUHP_OFFLINE);
> +	unsigned int cpu = cpumask_any_but(cpu_online_mask, dev->id);
> +
> +	if (cpu >= nr_cpu_ids)
> +		return -EBUSY;
> +
> +	return work_on_cpu(cpu, __cpu_device_down, dev);

The comment for work_on_cpu():

 * It is up to the caller to ensure that the cpu doesn't go offline.
 * The caller must not hold any locks which would prevent @fn from completing.

makes me wonder whether this should be done only once the hotplug lock is
taken, so that the selected CPU cannot go offline in the meantime.

>  }
>
>  int remove_cpu(unsigned int cpu)
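
[Editorial note, for context on why the period timer ends up parked on the
outgoing CPU in the first place: the CFS bandwidth code arms
sched_cfs_period_timer pinned to the CPU it is started on. A simplified
sketch, loosely based on start_cfs_bandwidth() in kernel/sched/fair.c;
exact details vary by kernel version:]

	/*
	 * Sketch (not the verbatim kernel code): the period timer is
	 * armed in HRTIMER_MODE_ABS_PINNED, so it stays on whichever
	 * CPU happened to (re)start bandwidth accounting -- possibly
	 * the CPU that is later taken offline.
	 */
	void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
	{
		lockdep_assert_held(&cfs_b->lock);

		if (cfs_b->period_active)
			return;

		cfs_b->period_active = 1;
		/* Advance the timer to the next period boundary ... */
		hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
		/* ... and arm it pinned to the current CPU. */
		hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
	}

[Because the timer is pinned, migrating it off the dying CPU relies on
hrtimers_dead_cpu() later in the offline sequence, which is exactly the
step the throttled caller ends up waiting for in the scenario discussed
above.]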