References: <20240105222014.1025040-1-qyousef@layalina.io> <20240105222014.1025040-2-qyousef@layalina.io>
 <213f94df-cc36-4281-805d-9f56cbfef796@arm.com>
In-Reply-To: <213f94df-cc36-4281-805d-9f56cbfef796@arm.com>
From: Vincent Guittot <vincent.guittot@linaro.org>
Date: Tue, 23 Jan 2024 09:32:17 +0100
Message-ID: <CAKfTPtCfYcD_zPr7PqgL5hRYny=n3KW8hr6GY8q7zkoyRN7gQg@mail.gmail.com>
Subject: Re: [PATCH v4 1/2] sched/fair: Check a task has a fitting cpu when
 updating misfit
To: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Qais Yousef <qyousef@layalina.io>, Ingo Molnar <mingo@kernel.org>, 
	Peter Zijlstra <peterz@infradead.org>, linux-kernel@vger.kernel.org, 
	Pierre Gondois <Pierre.Gondois@arm.com>

On Mon, 22 Jan 2024 at 10:59, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 05/01/2024 23:20, Qais Yousef wrote:
> > From: Qais Yousef <qais.yousef@arm.com>
> >
> > If a misfit task is affined to a subset of the possible cpus, we need to
> > verify that one of these cpus can fit it. Otherwise the load balancer
> > code will continuously trigger needlessly leading the balance_interval
> > to increase in return and eventually end up with a situation where real
> > imbalances take a long time to address because of this impossible
> > imbalance situation.
> >
> > This can happen in Android world where it's common for background tasks
> > to be restricted to little cores.
> >
> > Similarly if we can't fit the biggest core, triggering misfit is
> > pointless as it is the best we can ever get on this system.
> >
> > To be able to detect that, we use asym_cap_list to iterate through
> > capacities in the system to see if the task is able to run at a higher
> > capacity level based on its p->cpus_ptr. To do so safely, we convert the
> > list to be RCU protected.
> >
> > To be able to iterate through capacity levels, export asym_cap_list to
> > allow for fast traversal of all available capacity levels in the system.
> >
> > Test:
> > =====
> >
> > Add
> >
> >       trace_printk("balance_interval = %lu\n", interval)
> >
> > in get_sd_balance_interval().
> >
> > run
> >       if [ "$MASK" != "0" ]; then
> >               adb shell "taskset -a $MASK cat /dev/zero > /dev/null"
> >       fi
> >       sleep 10
> >       // parse ftrace buffer counting the occurrence of each value
> >
> > Where MASK is either:
> >
> >       * 0: no busy task running
>
> ... no busy task stands for no misfit scenario?
>
> >       * 1: busy task is pinned to 1 cpu; handled today to not cause
> >         misfit
> >       * f: busy task pinned to little cores, simulates busy background
> >         task, demonstrates the problem to be fixed
> >
>
> [...]
>
> > +     /*
> > +      * If the task affinity is not set to default, make sure it is not
> > +      * restricted to a subset where no CPU can ever fit it. Triggering
> > +      * misfit in this case is pointless as it has no where better to move
> > +      * to. And it can lead to balance_interval to grow too high as we'll
> > +      * continuously fail to move it anywhere.
> > +      */
> > +     if (!cpumask_equal(p->cpus_ptr, cpu_possible_mask)) {
>
> Shouldn't this be cpu_active_mask ?
>
> include/linux/cpumask.h
>
>  * cpu_possible_mask- has bit 'cpu' set iff cpu is populatable
>  * cpu_present_mask - has bit 'cpu' set iff cpu is populated
>  * cpu_online_mask  - has bit 'cpu' set iff cpu available to scheduler
>  * cpu_active_mask  - has bit 'cpu' set iff cpu available to migration
>
>
> > +             unsigned long clamped_util = clamp(util, uclamp_min, uclamp_max);
> > +             bool has_fitting_cpu = false;
> > +             struct asym_cap_data *entry;
> > +
> > +             rcu_read_lock();
> > +             list_for_each_entry_rcu(entry, &asym_cap_list, link) {
> > +                     if (entry->capacity > cpu_cap) {
> > +                             cpumask_t *cpumask;
> > +
> > +                             if (clamped_util > entry->capacity)
> > +                                     continue;
> > +
> > +                             cpumask = cpu_capacity_span(entry);
> > +                             if (!cpumask_intersects(p->cpus_ptr, cpumask))
> > +                                     continue;
> > +
> > +                             has_fitting_cpu = true;
> > +                             break;
> > +                     }
> > +             }
>
> What happens when we hotplug out all CPUs of one CPU capacity value?
> IMHO, we don't call asym_cpu_capacity_scan() with !new_topology
> (partition_sched_domains_locked()).
>
> > +             rcu_read_unlock();
> > +
> > +             if (!has_fitting_cpu)
> > +                     goto out;
> >       }
> >
> >       /*
> > @@ -5083,6 +5127,9 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> >        * task_h_load() returns 0.
> >        */
> >       rq->misfit_task_load = max_t(unsigned long, task_h_load(p), 1);
> > +     return;
> > +out:
> > +     rq->misfit_task_load = 0;
> >  }
> >
> >  #else /* CONFIG_SMP */
> > @@ -9583,9 +9630,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
> >   */
> >  static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
> >  {
> > -     return rq->misfit_task_load &&
> > -             (arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity ||
> > -              check_cpu_capacity(rq, sd));
> > +     return rq->misfit_task_load && check_cpu_capacity(rq, sd);
>
> You removed 'arch_scale_cpu_capacity(rq->cpu) <
> rq->rd->max_cpu_capacity' here. Why? I can see that with the standard
> setup (max CPU capacity equal to 1024), which is what we probably use 100%
> of the time now. It might become useful again when Vincent introduces
> his 'user space system pressure' implementation?

That's interesting because I'm doing the opposite in the user space
system pressure work that I'm preparing:
I keep something similar to (arch_scale_cpu_capacity(rq->cpu) <
rq->rd->max_cpu_capacity) but I remove check_cpu_capacity(rq, sd), which
seems to be useless because it's already used earlier in
nohz_balancer_kick().
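
To make the difference concrete, here is a rough userspace sketch of the
three variants of the check discussed above. It is only a model, not
kernel code and not the actual patch: struct rq_model, its fields and the
three misfit_*() helpers are hypothetical names, and the capacity test
only approximates what check_cpu_capacity() does.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for the rq / root_domain fields involved. */
struct rq_model {
	unsigned long misfit_task_load;  /* non-zero when a misfit task is queued */
	unsigned long cpu_capacity_orig; /* models arch_scale_cpu_capacity(rq->cpu) */
	unsigned long cpu_capacity;      /* capacity left once pressure is removed */
	unsigned long max_cpu_capacity;  /* models rq->rd->max_cpu_capacity */
	unsigned int imbalance_pct;      /* models sd->imbalance_pct */
};

/* Approximates check_cpu_capacity(): capacity is noticeably reduced. */
static bool model_check_cpu_capacity(const struct rq_model *rq)
{
	return rq->cpu_capacity * rq->imbalance_pct <
	       rq->cpu_capacity_orig * 100;
}

/* Check as it reads before this patch (the removed lines in the hunk). */
static bool misfit_original(const struct rq_model *rq)
{
	return rq->misfit_task_load &&
	       (rq->cpu_capacity_orig < rq->max_cpu_capacity ||
		model_check_cpu_capacity(rq));
}

/* Check as proposed in this patch (the added line in the hunk). */
static bool misfit_patch_v4(const struct rq_model *rq)
{
	return rq->misfit_task_load && model_check_cpu_capacity(rq);
}

/* Assumed shape of the opposite choice: keep only the max capacity test. */
static bool misfit_max_capacity_only(const struct rq_model *rq)
{
	return rq->misfit_task_load &&
	       rq->cpu_capacity_orig < rq->max_cpu_capacity;
}
```

With a little CPU that is not under pressure (capacity 512 out of a
maximum of 1024), the first and third helpers report a misfit while the
second does not. With a big CPU whose capacity is reduced by pressure,
the first and second report it while the third does not.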

>
> >  }
>
> [...]
>
> > @@ -1423,8 +1418,8 @@ static void asym_cpu_capacity_scan(void)
> >
> >       list_for_each_entry_safe(entry, next, &asym_cap_list, link) {
> >               if (cpumask_empty(cpu_capacity_span(entry))) {
> > -                     list_del(&entry->link);
> > -                     kfree(entry);
> > +                     list_del_rcu(&entry->link);
> > +                     call_rcu(&entry->rcu, free_asym_cap_entry);
>
> Looks like there could be brief moments in which one CPU capacity group
> of CPUs could appear twice in asym_cap_list. I'm thinking about initial
> startup + max CPU frequency related adjustment of CPU capacity
> (init_cpu_capacity_callback()) for instance. Not sure if this is really
> an issue?
>
> [...]
>