We've seen cases while running Geekbench where an idle little core never
pulls a task from a bigger, overloaded cluster for hundreds of ms and
sometimes over a second.
It turned out that load balance identifies this as a migrate_util type
since the local group (the little cluster) has spare capacity and will
try to pull a task. But the little cluster capacity is very small
nowadays (around 200 or less), and if two busy tasks are stuck on a mid
core, which has a capacity of over 700, the util of each task will be in
the 350+ range, which is always bigger than the spare capacity of the
little group with its single idle core.
When detach_tasks() runs, we then bail out because of this comparison:
if (util > env->imbalance)
goto next;
In calculate_imbalance() we convert a migrate_util type into a
migrate_task type if the CPU trying to do the pull is idle. But we only
do this if env->imbalance is 0, which I can't understand. AFAICT
env->imbalance contains the local group's spare capacity; if it is 0,
this means the group is fully busy.
Removing this condition fixes the problem, but since I can't fully
understand why it checks for 0, I'm sending this as an RFC. Could it be
a typo that was meant to check for
env->imbalance != 0
instead?
Signed-off-by: Qais Yousef (Google) <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..682d9d6a8691 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10288,7 +10288,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
* waiting task in this overloaded busiest group. Let's
* try to pull it.
*/
- if (env->idle != CPU_NOT_IDLE && env->imbalance == 0) {
+ if (env->idle != CPU_NOT_IDLE) {
env->migration_type = migrate_task;
env->imbalance = 1;
}
--
2.25.1
Le dimanche 16 juil. 2023 à 02:41:25 (+0100), Qais Yousef a écrit :
> [...]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..682d9d6a8691 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10288,7 +10288,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> * waiting task in this overloaded busiest group. Let's
> * try to pull it.
> */
> - if (env->idle != CPU_NOT_IDLE && env->imbalance == 0) {
> + if (env->idle != CPU_NOT_IDLE) {
With this change you completely skip migrate_util for the idle and newly
idle cases, and that would be too aggressive.
We can do something similar to migrate_load in detach_tasks():
---
kernel/sched/fair.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d3df5b1642a6..64111ac7e137 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8834,7 +8834,13 @@ static int detach_tasks(struct lb_env *env)
case migrate_util:
util = task_util_est(p);
- if (util > env->imbalance)
+ /*
+ * Make sure that we don't migrate too much utilization.
+ * Nevertheless, let's relax the constraint if the
+ * scheduler fails to find a good waiting task to
+ * migrate.
+ */
+ if (shr_bound(util, env->sd->nr_balance_failed) > env->imbalance)
goto next;
env->imbalance -= util;
--
> env->migration_type = migrate_task;
> env->imbalance = 1;
> }
> --
> 2.25.1
>
On 07/18/23 14:48, Vincent Guittot wrote:
> Le dimanche 16 juil. 2023 à 02:41:25 (+0100), Qais Yousef a écrit :
> > [...]
> > - if (env->idle != CPU_NOT_IDLE && env->imbalance == 0) {
> > + if (env->idle != CPU_NOT_IDLE) {
>
> With this change you completely skip migrate_util for idle and newly idle case
> and this would be too aggressive.
Yeah I didn't have great confidence in it to be honest.
Could you help me understand the meaning of env->imbalance == 0 though? At this
stage its value is
env->imbalance = max(local->group_capacity, local->group_util) - local->group_util;
which AFAICT is calculating the _spare_ capacity, right? So when we check
env->imbalance == 0, we are saying: this_cpu is (idle OR newly idle) AND the
local group is fully utilized. Why must it be fully utilized to do the pull?
It's counter-intuitive to me. I'm probably misinterpreting something but can't
see it.
>
> We can do something similar to migrate_load in detach_tasks():
>
> ---
> kernel/sched/fair.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d3df5b1642a6..64111ac7e137 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8834,7 +8834,13 @@ static int detach_tasks(struct lb_env *env)
> case migrate_util:
> util = task_util_est(p);
>
> - if (util > env->imbalance)
> + /*
> + * Make sure that we don't migrate too much utilization.
> + * Nevertheless, let relax the constraint if
> + * scheduler fails to find a good waiting task to
> + * migrate.
> + */
> + if (shr_bound(util, env->sd->nr_balance_failed) > env->imbalance)
> goto next;
Thanks! This looks better, but I still see a 100 or 200 ms delay sometimes.
Still debugging it, but I _think_ it's a combination of two things:
1. nr_balance_failed doesn't increment that fast - I see a lot of 0s with
occasional 1s and less frequent 2s.
2. something might wake up briefly on that cpu in between load balances, and
given how small the littles are, that pushes the nr_balance_failed required
to tip the scale even higher.
Thanks
--
Qais Yousef
On Tue, 18 Jul 2023 at 18:18, Qais Yousef <[email protected]> wrote:
>
> On 07/18/23 14:48, Vincent Guittot wrote:
> > Le dimanche 16 juil. 2023 à 02:41:25 (+0100), Qais Yousef a écrit :
> > > [...]
> >
> > With this change you completely skip migrate_util for idle and newly idle case
> > and this would be too aggressive.
>
> Yeah I didn't have great confidence in it to be honest.
>
> Could you help me understand the meaning of env->imbalance == 0 though? At this
> stage its value is
>
> env->imbalance = max(local->group_capacity, local->group_util) - local->group_util;
>
> which AFAICT is calculating the _spare_ capacity, right? So when we check
> env->imbalance == 0 we say if this_cpu is (idle OR newly idle) AND the local
> group is fully utilized? Why it must be fully utilized to do the pull? It's
> counter intuitive to me. I'm probably misinterpreting something but can't see
This is a special case. We have some situations where group_util is
higher than capacity, because of tasks newly migrated to this group for
example, so the spare capacity is null but one cpu is idle or newly
idle. In this case we try to pull a task, at the risk that this group
becomes overloaded. That's why we do not try to pull a task every time.
But that might be a good choice all the time.
> it.
On 07/18/23 18:31, Vincent Guittot wrote:
> On Tue, 18 Jul 2023 at 18:18, Qais Yousef <[email protected]> wrote:
> >
> > On 07/18/23 14:48, Vincent Guittot wrote:
> > > Le dimanche 16 juil. 2023 à 02:41:25 (+0100), Qais Yousef a écrit :
> > > > [...]
> > >
> > > With this change you completely skip migrate_util for idle and newly idle case
> > > and this would be too aggressive.
> >
> > Yeah I didn't have great confidence in it to be honest.
> >
> > Could you help me understand the meaning of env->imbalance == 0 though? At this
> > stage its value is
> >
> > env->imbalance = max(local->group_capacity, local->group_util) - local->group_util;
> >
> > which AFAICT is calculating the _spare_ capacity, right? So when we check
> > env->imbalance == 0 we say if this_cpu is (idle OR newly idle) AND the local
> > group is fully utilized? Why it must be fully utilized to do the pull? It's
> > counter intuitive to me. I'm probably misinterpreting something but can't see
>
> This is a special case. We have some situations where group_util is
> higher than capacity because of tasks newly migrated to this group for
> example so the spare capacity is null but one cpu is idle or newly
> idle. In this case we try to pull a task with the risk that this group
> becomes overloaded. That's why we do not try to pull a task every
> time.
> But that might be good choice all the time
So for misfit, I do see that a bigger cpu will pull the task quickly as soon
as a bigger cpu gets idle.
This scenario is the opposite. Maybe the exception in my case is that the
little cpu has spare capacity, as it's mostly idle all the time. It's just
that unlucky circumstances at wake-up ended up putting two tasks on a bigger
core.
Specifically, at the start of some of the sub-tests, there's a good chance
that we have simultaneous wake-ups, and there's a limitation/race in EAS
because of the gap between select_task_rq_fair() and enqueue_task_fair(). If
two tasks wake up simultaneously, select_task_rq_fair() could be called twice
before enqueue_task_fair() and end up selecting the same CPU for both tasks,
not realizing one of them is just waiting to be enqueued. IOW, EAS will not
take into account the updated util of one of the CPUs because of the (short)
delay to enqueue it.
This should be fixed (the wake up race), but this is a different story and
a bit trickier.
The risks of always pulling are:
1. We risk force-migrating the prev task if it wakes up shortly after
   the pull. Which is no worse IMHO than a misfit task moving almost
   immediately to a bigger core.
2. I'm not sure we're being smart about which task to pull. I can
   envisage other scenarios where one of the two tasks is the better
   one to pull. In Geekbench both tasks are equal, but in other use
   cases one of them might be less impactful; for example, one of them
   could have a low uclamp_max while the other doesn't. That case is an
   unsupported feature at the moment, though. My plan (hope) is to
   treat these uclamp_max cases as misfit migration, which I think is
   the better path in general for handling special cases. So for
   migrate_util this behavior might be sensible all the time: we are
   working too hard, so let's use all of our resources and make use of
   all idle cpus. If prev_task wakes up, there's no harm; I doubt cache
   hotness is a problem given the two tasks are busy all the time
   thrashing L1 anyway.
3. I'm not sure what will happen in cases where we have nr_running >
   nr_cpus and some tasks happen to sleep for brief periods of time.
   Two tasks stuck on a little core is worse than two tasks stuck on a
   mid or big core. But maybe migrate_util will pull one back again,
   given how little capacity the littles have?
Cheers
--
Qais Yousef
On Tue, 18 Jul 2023 at 19:25, Qais Yousef <[email protected]> wrote:
>
> On 07/18/23 18:31, Vincent Guittot wrote:
> > On Tue, 18 Jul 2023 at 18:18, Qais Yousef <[email protected]> wrote:
> > >
> > > On 07/18/23 14:48, Vincent Guittot wrote:
> > > > Le dimanche 16 juil. 2023 à 02:41:25 (+0100), Qais Yousef a écrit :
> > > > > [...]
> > > >
> > > > With this change you completely skip migrate_util for idle and newly idle case
> > > > and this would be too aggressive.
> > >
> > > Yeah I didn't have great confidence in it to be honest.
> > >
> > > Could you help me understand the meaning of env->imbalance == 0 though? At this
> > > stage its value is
> > >
> > > env->imbalance = max(local->group_capacity, local->group_util) - local->group_util;
> > >
> > > which AFAICT is calculating the _spare_ capacity, right? So when we check
> > > env->imbalance == 0 we say if this_cpu is (idle OR newly idle) AND the local
> > > group is fully utilized? Why it must be fully utilized to do the pull? It's
> > > counter intuitive to me. I'm probably misinterpreting something but can't see
> >
> > This is a special case. We have some situations where group_util is
> > higher than capacity because of tasks newly migrated to this group for
> > example so the spare capacity is null but one cpu is idle or newly
> > idle. In this case we try to pull a task with the risk that this group
> > becomes overloaded. That's why we do not try to pull a task every
> > time.
> > But that might be good choice all the time
>
> So on misfit, I do see that a bigger cpu will pull the task quickly as soon as
> a bigger cpu gets idle.
>
> This scenario is the opposite. Maybe the exception in my case is that the
> little cpu has spare capacity as it's mostly idle all the time. It's just
> unlucky circumstances at wake up ended up putting two tasks on bigger core.
I was trying to reproduce the behavior but kept failing, until I
realized that this code path is used when the 2 groups are not sharing
their cache. Which topology do you use? I thought that DynamIQ, with
cache shared between all 8 cpus, was the norm for arm64 embedded
devices now.
Also, when you say "the little cluster capacity is very small nowadays
(around 200 or less)", is that the capacity of 1 core or of the cluster?
On 07/20/23 14:31, Vincent Guittot wrote:
> I was trying to reproduce the behavior but I was failing until I
> realized that this code path is used when the 2 groups are not sharing
> their cache. Which topology do you use ? I thought that dynamiQ and
> shares cache between all 8 cpus was the norm for arm64 embedded device
> now
Hmm, good question. Phantom domains didn't die, which I think is what's
causing this. I can look into whether that's for a good reason or just a
historical artifact.
>
> Also when you say "the little cluster capacity is very small nowadays
> (around 200 or less)", it is the capacity of 1 core or the cluster ?
I meant one core. So in my case all the littles were busy except for one
that was mostly idle and never pulled a task from mid, where two tasks were
stuck on one CPU. The logs I added were showing that env->imbalance was in
the 150+ range but the task we'd pull was in the 350+ range.
I should have mentioned that I'm on 5.15 - sorry, with Android it's hard to
run mainline on products :( But this code, as far as I can tell, hasn't
changed much.
I can try to find something that runs mainline and reproduce there if you
think my description of the problem is not clear or applicable.
Thanks
--
Qais Yousef
Le vendredi 21 juil. 2023 à 11:57:11 (+0100), Qais Yousef a écrit :
> On 07/20/23 14:31, Vincent Guittot wrote:
>
> > I was trying to reproduce the behavior but I was failing until I
> > realized that this code path is used when the 2 groups are not sharing
> > their cache. Which topology do you use ? I thought that dynamiQ and
> > shares cache between all 8 cpus was the norm for arm64 embedded device
> > now
>
> Hmm good question. phantom domains didn't die which I think is what causing
> this. I can look if this is for a good reason or just historical artifact.
>
> >
> > Also when you say "the little cluster capacity is very small nowadays
> > (around 200 or less)", it is the capacity of 1 core or the cluster ?
>
> I meant one core. So in my case all the littles were busy except for one that
> was mostly idle and never pulled a task from mid where two tasks were stuck on
> a CPU there. And the logs I have added were showing me that the env->imbalance
> was on 150+ range but the task we pull was in the 350+ range.
I'm not able to reproduce your problem with v6.5-rc2 and without a phantom
domain, which is expected: we share cache and the group weight is 1, so we
take this path:
if (busiest->group_weight == 1 || sds->prefer_sibling) {
/*
* When prefer sibling, evenly spread running tasks on
* groups.
*/
env->migration_type = migrate_task;
env->imbalance = sibling_imbalance(env, sds, busiest, local);
} else {
>
> I should have mentioned that I'm on 5.15 - sorry with Android it's hard to run
> mainline on products :( But this code as far as I can tell hasn't changed much.
>
> I can try to find something that runs mainline and reproduce there if you think
> my description of the problem is not clear or applicable.
>
>
> Thanks
>
> --
> Qais Yousef
On 07/21/23 15:52, Vincent Guittot wrote:
> Le vendredi 21 juil. 2023 à 11:57:11 (+0100), Qais Yousef a écrit :
> > On 07/20/23 14:31, Vincent Guittot wrote:
> >
> > > I was trying to reproduce the behavior but I was failing until I
> > > realized that this code path is used when the 2 groups are not sharing
> > > their cache. Which topology do you use ? I thought that dynamiQ and
> > > shares cache between all 8 cpus was the norm for arm64 embedded device
> > > now
> >
> > Hmm good question. phantom domains didn't die which I think is what causing
> > this. I can look if this is for a good reason or just historical artifact.
> >
> > >
> > > Also when you say "the little cluster capacity is very small nowadays
> > > (around 200 or less)", it is the capacity of 1 core or the cluster ?
> >
> > I meant one core. So in my case all the littles were busy except for one that
> > was mostly idle and never pulled a task from mid where two tasks were stuck on
> > a CPU there. And the logs I have added were showing me that the env->imbalance
> > was on 150+ range but the task we pull was in the 350+ range.
>
> I'm not able to reproduce your problem with v6.5-rc2 and without phantom domain,
> which is expected because we share cache and weight is 1 so we use the path
>
> if (busiest->group_weight == 1 || sds->prefer_sibling) {
> /*
> * When prefer sibling, evenly spread running tasks on
> * groups.
> */
> env->migration_type = migrate_task;
> env->imbalance = sibling_imbalance(env, sds, busiest, local);
> } else {
>
I missed the dependency on topology. So yes you're right, this needs to be
addressed first. I seem to remember Sudeep merged some stuff that flattens
these topologies.
Let me chase this topology thing down first.
Thanks!
--
Qais Yousef
> >
> > I should have mentioned that I'm on 5.15 - sorry with Android it's hard to run
> > mainline on products :( But this code as far as I can tell hasn't changed much.
> >
> > I can try to find something that runs mainline and reproduce there if you think
> > my description of the problem is not clear or applicable.
> >
> >
> > Thanks
> >
> > --
> > Qais Yousef
On 22/07/2023 00:04, Qais Yousef wrote:
> On 07/21/23 15:52, Vincent Guittot wrote:
>> Le vendredi 21 juil. 2023 à 11:57:11 (+0100), Qais Yousef a écrit :
>>> On 07/20/23 14:31, Vincent Guittot wrote:
>>>
>>>> I was trying to reproduce the behavior but I was failing until I
>>>> realized that this code path is used when the 2 groups are not sharing
>>>> their cache. Which topology do you use ? I thought that dynamiQ and
>>>> shares cache between all 8 cpus was the norm for arm64 embedded device
>>>> now
>>>
>>> Hmm good question. phantom domains didn't die which I think is what causing
>>> this. I can look if this is for a good reason or just historical artifact.
>>>
>>>>
>>>> Also when you say "the little cluster capacity is very small nowadays
>>>> (around 200 or less)", it is the capacity of 1 core or the cluster ?
>>>
>>> I meant one core. So in my case all the littles were busy except for one that
>>> was mostly idle and never pulled a task from mid where two tasks were stuck on
>>> a CPU there. And the logs I have added were showing me that the env->imbalance
>>> was on 150+ range but the task we pull was in the 350+ range.
>>
>> I'm not able to reproduce your problem with v6.5-rc2 and without phantom domain,
>> which is expected because we share cache and weight is 1 so we use the path
>>
>> if (busiest->group_weight == 1 || sds->prefer_sibling) {
>> /*
>> * When prefer sibling, evenly spread running tasks on
>> * groups.
>> */
>> env->migration_type = migrate_task;
>> env->imbalance = sibling_imbalance(env, sds, busiest, local);
>> } else {
>>
>
> I missed the deps on topology. So yes you're right, this needs to be addressed
> first. I seem to remember Sudeep merged some stuff that will flatten these
> topologies.
>
> Let me chase this topology thing out first.
Sudeep's patches align topology cpumasks with cache cpumasks.
tip/sched/core:
root@juno:~# cat /sys/devices/system/cpu/cpu*/topology/package_cpus
3f
3f
3f
3f
3f
3f
v5.9:
root@juno:~# cat /sys/devices/system/cpu/cpu*/topology/package_cpus
39
06
06
39
39
39
So Android userspace won't be able to detect uArch boundaries via
`package_cpus` any longer.
The phantom domain (DIE) in Android is a legacy decision from within
Android. The pre-mainline Energy Model was attached to the sched domain
topology hierarchy, and then IMHO other Android functionality started to
rely on this. It could be removed regardless of Sudeep's patches in case
Android would be OK with it.
The phantom domain is probably set up via the DT cpu-map entry:
cpu-map {
cluster0 { <-- enforce phantom domain
core0 {
cpu = <&CPU0>;
};
...
core3 {
cpu = <&CPU3>;
};
cluster1 {
...
Juno (arch/arm64/boot/dts/arm/juno.dts) also uses cpu-map so that the DIE
group boundaries are congruent with the uarch boundaries.
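By contrast, a flattened cpu-map that avoids the phantom DIE domain would list
every core under a single cluster — an illustrative sketch only, not copied
from any mainline DT:

```dts
cpu-map {
	cluster0 {		/* single cluster: no phantom DIE domain */
		core0 {
			cpu = <&CPU0>;
		};
		core1 {
			cpu = <&CPU1>;
		};
		/* ... all remaining cores, big and little, listed here ... */
	};
};
```

With this shape, on a DynamIQ system where all CPUs share cache, the scheduler
should end up with just an MC domain spanning all CPUs.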
tip/sched/core:
# cat /proc/schedstat | awk '{ print $1 " " $2}' | head -5
...
cpu0 0
domain0 39
domain1 3f
v5.9:
# cat /proc/schedstat | awk '{ print $1 " " $2}' | head -5
...
cpu0 0
domain0 39
domain1 3f
We had a talk at LPC '22 about the influence of the patch-set and the
phantom domain legacy issue:
https://lpc.events/event/16/contributions/1342/attachments/962/1883/LPC-2022-Android-MC-Phantom-Domains.pdf
[...]
On 07/24/23 14:58, Dietmar Eggemann wrote:
> On 22/07/2023 00:04, Qais Yousef wrote:
> > On 07/21/23 15:52, Vincent Guittot wrote:
> >> Le vendredi 21 juil. 2023 à 11:57:11 (+0100), Qais Yousef a écrit :
> >>> On 07/20/23 14:31, Vincent Guittot wrote:
> >>>
> >>>> I was trying to reproduce the behavior but I was failing until I
> >>>> realized that this code path is used when the 2 groups are not sharing
> >>>> their cache. Which topology do you use ? I thought that dynamiQ and
> >>>> shares cache between all 8 cpus was the norm for arm64 embedded device
> >>>> now
> >>>
> >>> Hmm good question. phantom domains didn't die which I think is what causing
> >>> this. I can look if this is for a good reason or just historical artifact.
> >>>
> >>>>
> >>>> Also when you say "the little cluster capacity is very small nowadays
> >>>> (around 200 or less)", it is the capacity of 1 core or the cluster ?
> >>>
> >>> I meant one core. So in my case all the littles were busy except for one that
> >>> was mostly idle and never pulled a task from mid where two tasks were stuck on
> >>> a CPU there. And the logs I have added were showing me that the env->imbalance
> >>> was on 150+ range but the task we pull was in the 350+ range.
> >>
> >> I'm not able to reproduce your problem with v6.5-rc2 and without phantom domain,
> >> which is expected because we share cache and weight is 1 so we use the path
> >>
> >> if (busiest->group_weight == 1 || sds->prefer_sibling) {
> >> /*
> >> * When prefer sibling, evenly spread running tasks on
> >> * groups.
> >> */
> >> env->migration_type = migrate_task;
> >> env->imbalance = sibling_imbalance(env, sds, busiest, local);
> >> } else {
> >>
> >
> > I missed the deps on topology. So yes you're right, this needs to be addressed
> > first. I seem to remember Sudeep merged some stuff that will flatten these
> > topologies.
> >
> > Let me chase this topology thing out first.
>
> Sudeep's patches align topology cpumasks with cache cpumasks.
>
> tip/sched/core:
>
> root@juno:~# cat /sys/devices/system/cpu/cpu*/topology/package_cpus
> 3f
> 3f
> 3f
> 3f
> 3f
> 3f
>
> v5.9:
>
> root@juno:~# cat /sys/devices/system/cpu/cpu*/topology/package_cpus
> 39
> 06
> 06
> 39
> 39
> 39
>
> So Android userspace won't be able to detect uArch boundaries via
> `package_cpus` any longer.
>
> The phantom domain (DIE) in Android is a legacy decision from within
> Android. Pre-mainline Energy Model was attached to the sched domain
> topology hierarchy. And then IMHO other Android functionality start to
> rely on this. It could be removed regardless of Sudeeps patches in case
> Android would be OK with it.
>
> The phantom domain is probably set up via DT cpu_map entry:
>
> cpu-map {
> cluster0 { <-- enforce phantom domain
> core0 {
> cpu = <&CPU0>;
> };
> ...
> core3 {
> cpu = <&CPU3>;
> };
> cluster1 {
> ...
>
> Juno (arch/arm64/boot/dts/arm/juno.dts) also uses cpu-map to enforce
> uarch boundaries on DIE group boundary congruence.
>
> tip/sched/core:
>
> # cat /proc/schedstat | awk '{ print $1 " " $2}' | head -5
> ...
> cpu0 0
> domain0 39
> domain1 3f
>
> v5.9:
>
> # cat /proc/schedstat | awk '{ print $1 " " $2}' | head -5
> ...
> cpu0 0
> domain0 39
> domain1 3f
>
> We had a talk at LPC '22 about the influence of the patch-set and the
> phantom domain legacy issue:
>
> https://lpc.events/event/16/contributions/1342/attachments/962/1883/LPC-2022-Android-MC-Phantom-Domains.pdf
>
> [...]
Thanks Dietmar!
So I actually moved everything to a single cluster and this indeed solves the
lb() issue. But then when I looked at the mainline DTs I saw that they still
define a separate cluster for each uArch, which got me confused about whether
I did the right thing or not. And it made me wonder whether the fix is to
change the DT or to port Sudeep's/Ionela's patches?
I did some digging and I think the DTs, like the ones in mainline by the look
of it, stayed the way they were historically defined.
So IIUC the impacts are on systems with the pre-simplified EM (which should
have been phased out AFAIK), and on a different presentation of the sysfs
topology, which can potentially break userspace deps, right? I think this is
not a problem either, but those can be famous last words as usual :-)
Thanks
--
Qais Yousef
On 24/07/2023 18:10, Qais Yousef wrote:
> On 07/24/23 14:58, Dietmar Eggemann wrote:
>> On 22/07/2023 00:04, Qais Yousef wrote:
>>> On 07/21/23 15:52, Vincent Guittot wrote:
>>>> Le vendredi 21 juil. 2023 à 11:57:11 (+0100), Qais Yousef a écrit :
>>>>> On 07/20/23 14:31, Vincent Guittot wrote:
[...]
> So I actually moved everything to a single cluster and this indeed solves the
> lb() issue. But then when I tried to look at DT mainline I saw that the DTs
> still define separate cluster for each uArch, and this got me confused whether
> I did the right thing or not. And made me wonder whether the fix is to change
> DT or port Sudeep's/Ionela's patch?
IMHO, you have to change the DT.
> I did some digging and I think the DT, like the ones in mainline by the look of
> it, stayed the way it was historically defined.
This would be a "mistake" for Arm DynamIQ based systems. We use a QC RB5
in our testing and this board schedules only within an MC sched domain (I
guess it's: arch/arm64/boot/dts/qcom/qrb5165-rb5.dts -> sm8250.dtsi).
> So IIUC the impacts are on system pre-simplified EM (should have been phased
> out AFAIK). And on different presentation on sysfs topology which can
> potentially break userspace deps, right? I think this is not a problem too, but
> can be famous last words as usual :-)
The only thing I remember is that when we hinted at this issue to Android
folks a couple of years ago, they said they have to stay with the phantom
domain due to dependencies in vendor-specific code beyond those related to
the EM.
IMHO, for Pixel6 the DT cpu-map information is in:
private/gs-google/arch/arm64/boot/dts/google/gs101-cpu.dtsi
of the android-kernel.
[...]
On 07/24/23 19:54, Dietmar Eggemann wrote:
> On 24/07/2023 18:10, Qais Yousef wrote:
> > On 07/24/23 14:58, Dietmar Eggemann wrote:
> >> On 22/07/2023 00:04, Qais Yousef wrote:
> >>> On 07/21/23 15:52, Vincent Guittot wrote:
> >>>> Le vendredi 21 juil. 2023 à 11:57:11 (+0100), Qais Yousef a écrit :
> >>>>> On 07/20/23 14:31, Vincent Guittot wrote:
>
> [...]
>
> > So I actually moved everything to a single cluster and this indeed solves the
> > lb() issue. But then when I tried to look at DT mainline I saw that the DTs
> > still define separate cluster for each uArch, and this got me confused whether
> > I did the right thing or not. And made me wonder whether the fix is to change
> > DT or port Sudeep's/Ionela's patch?
>
> IMHO, you have to change DT.
>
> > I did some digging and I think the DT, like the ones in mainline by the look of
> > it, stayed the way it was historically defined.
>
> This would be a "mistake" for Arm DynamIQ based systems. We use QC RB5
> in our testing and this board schedules only within a MC sched domain (I
> guess it's: arch/arm64/boot/dts/qcom/qrb5165-rb5.dts -> sm8250.dtsi)
After I sent my email I did find some qcom platforms that define a single
cluster. Not sure whether the ones I looked at had wrong definitions or not.
Anyway, I hope this discussion will enlighten others too ;-)
Cheers
--
Qais Yousef