On Tue, 1 Oct 2019 at 19:47, Valentin Schneider
<[email protected]> wrote:
>
> On 19/09/2019 08:33, Vincent Guittot wrote:
>
> [...]
>
> > @@ -8283,69 +8363,133 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > */
> > static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
> > {
> > - unsigned long max_pull, load_above_capacity = ~0UL;
> > struct sg_lb_stats *local, *busiest;
> >
> > local = &sds->local_stat;
> > busiest = &sds->busiest_stat;
> >
> > - if (busiest->group_asym_packing) {
> > + if (busiest->group_type == group_misfit_task) {
> > + /* Set imbalance to allow misfit task to be balanced. */
> > + env->balance_type = migrate_misfit;
> > + env->imbalance = busiest->group_misfit_task_load;
> > + return;
> > + }
> > +
> > + if (busiest->group_type == group_asym_packing) {
> > + /*
> > + * In case of asym capacity, we will try to migrate all load to
> > + * the preferred CPU.
> > + */
> > + env->balance_type = migrate_load;
> > env->imbalance = busiest->group_load;
> > return;
> > }
> >
> > + if (busiest->group_type == group_imbalanced) {
> > + /*
> > + * In the group_imb case we cannot rely on group-wide averages
> > + * to ensure CPU-load equilibrium, try to move any task to fix
> > + * the imbalance. The next load balance will take care of
> > + * balancing back the system.
> > + */
> > + env->balance_type = migrate_task;
> > + env->imbalance = 1;
> > + return;
> > + }
> > +
> > /*
> > - * Avg load of busiest sg can be less and avg load of local sg can
> > - * be greater than avg load across all sgs of sd because avg load
> > - * factors in sg capacity and sgs with smaller group_type are
> > - * skipped when updating the busiest sg:
> > + * Try to use spare capacity of local group without overloading it or
> > + * emptying busiest
> > */
> > - if (busiest->group_type != group_misfit_task &&
> > - (busiest->avg_load <= sds->avg_load ||
> > - local->avg_load >= sds->avg_load)) {
> > - env->imbalance = 0;
> > + if (local->group_type == group_has_spare) {
> > + if (busiest->group_type > group_fully_busy) {
> > + /*
> > + * If busiest is overloaded, try to fill spare
> > + * capacity. This might end up creating spare capacity
> > + * in busiest or busiest still being overloaded but
> > + * there is no simple way to directly compute the
> > + * amount of load to migrate in order to balance the
> > + * system.
> > + */
> > + env->balance_type = migrate_util;
> > + env->imbalance = max(local->group_capacity, local->group_util) -
> > + local->group_util;
> > + return;
> > + }
> > +
> > + if (busiest->group_weight == 1 || sds->prefer_sibling) {
> > + /*
> > + * When prefer sibling, evenly spread running tasks on
> > + * groups.
> > + */
> > + env->balance_type = migrate_task;
> > + env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
>
> Isn't that one somewhat risky?
>
> Say both groups are classified group_has_spare and we do prefer_sibling.
> We'd select busiest as the one with the maximum number of busy CPUs, but it
> could be so that busiest.sum_h_nr_running < local.sum_h_nr_running (because
> pinned tasks or wakeup failed to properly spread stuff).
>
> The thing should be unsigned so at least we save ourselves from right
> shifting a negative value, but we still end up with a gygornous imbalance
> (which we then store into env.imbalance which *is* signed... Urgh).

so it's not clear what happen with a right shift on negative signed
value and this seems to be compiler dependent so even
max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1) might be wrong

I'm going to update it

>
> [...]

2019-10-02 08:58:46

by Vincent Guittot

[permalink] [raw]

On 10/16/19 5:26 PM, Vincent Guittot wrote:
> On Wed, 16 Oct 2019 at 09:21, Parth Shah <[email protected]> wrote:
>>
>>
>>
>> On 9/19/19 1:03 PM, Vincent Guittot wrote:
>>
>> [...]
>>
>>> Signed-off-by: Vincent Guittot <[email protected]>
>>> ---
>>> kernel/sched/fair.c | 585 ++++++++++++++++++++++++++++++++++------------------
>>> 1 file changed, 380 insertions(+), 205 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 017aad0..d33379c 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -7078,11 +7078,26 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
>>>
>>> enum fbq_type { regular, remote, all };
>>>
>>> +/*
>>> + * group_type describes the group of CPUs at the moment of the load balance.
>>> + * The enum is ordered by pulling priority, with the group with lowest priority
>>> + * first so the groupe_type can be simply compared when selecting the busiest
>>> + * group. see update_sd_pick_busiest().
>>> + */
>>> enum group_type {
>>> - group_other = 0,
>>> + group_has_spare = 0,
>>> + group_fully_busy,
>>> group_misfit_task,
>>> + group_asym_packing,
>>> group_imbalanced,
>>> - group_overloaded,
>>> + group_overloaded
>>> +};
>>> +
>>> +enum migration_type {
>>> + migrate_load = 0,
>>> + migrate_util,
>>> + migrate_task,
>>> + migrate_misfit
>>> };
>>>
>>> #define LBF_ALL_PINNED 0x01
>>> @@ -7115,7 +7130,7 @@ struct lb_env {
>>> unsigned int loop_max;
>>>
>>> enum fbq_type fbq_type;
>>> - enum group_type src_grp_type;
>>> + enum migration_type balance_type;
>>> struct list_head tasks;
>>> };
>>>
>>> @@ -7347,7 +7362,7 @@ static int detach_tasks(struct lb_env *env)
>>> {
>>> struct list_head *tasks = &env->src_rq->cfs_tasks;
>>> struct task_struct *p;
>>> - unsigned long load;
>>> + unsigned long util, load;
>>> int detached = 0;
>>>
>>> lockdep_assert_held(&env->src_rq->lock);
>>> @@ -7380,19 +7395,53 @@ static int detach_tasks(struct lb_env *env)
>>> if (!can_migrate_task(p, env))
>>> goto next;
>>>
>>> - load = task_h_load(p);
>>> + switch (env->balance_type) {
>>> + case migrate_load:
>>> + load = task_h_load(p);
>>>
>>> - if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
>>> - goto next;
>>> + if (sched_feat(LB_MIN) &&
>>> + load < 16 && !env->sd->nr_balance_failed)
>>> + goto next;
>>>
>>> - if ((load / 2) > env->imbalance)
>>> - goto next;
>>> + if ((load / 2) > env->imbalance)
>>> + goto next;
>>> +
>>> + env->imbalance -= load;
>>> + break;
>>> +
>>> + case migrate_util:
>>> + util = task_util_est(p);
>>> +
>>> + if (util > env->imbalance)
>>
>> Can you please explain what would happen for
>> `if (util/2 > env->imbalance)` ?
>> just like when migrating load, even util shouldn't be migrated if
>> env->imbalance is just near the utilization of the task being moved, isn't it?
>
> I have chosen uti and not util/2 to be conservative because
> migrate_util is used to fill spare capacity.
> With `if (util/2 > env->imbalance)`, we can more easily overload the
> local group or pick too much utilization from the overloaded group.
>

fair enough. I missed the point that unlike migrate_load, with
migrate_util, env->imbalance is just spare capacity of the local group.

Thanks,
Parth

>>
>>> + goto next;
>>> +
>>> + env->imbalance -= util;
>>> + break;
>>> +[ ... ]
>>
>> Thanks,
>> Parth
>>