2023-12-15 05:27:42

by Imran Khan

Subject: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.

It has been found that sometimes a task_group has some residual
load_avg even though the load average at each of its owned queues,
i.e. task_group.cfs_rq[cpu].avg.load_avg and
task_group.cfs_rq[cpu].tg_load_avg_contrib, has been 0 for a long time.
Under this scenario, if another task starts running in this task_group,
it does not get its proper share of CPU time, since the pre-existing
load average of the task group dilutes the new task's CPU share
on each CPU.

This change looks for the condition where a task_group has no running
tasks and sets the task_group's load average to 0 in such cases, so
that tasks that run in the future under this task_group get CPU time
in accordance with the current load.

Signed-off-by: Imran Khan <[email protected]>
---

So far I have encountered only one way of reproducing this issue, which is
as follows:

1. Take a KVM guest (booted with 4 CPUs, scalable up to 124 CPUs) and
create 2 custom cgroups: /sys/fs/cgroup/cpu/test_group_1 and
/sys/fs/cgroup/cpu/test_group_2

2. Assign a CPU-intensive workload to each of these cgroups and start the
workload (one way to launch a hogger inside a given cgroup is sketched after
the test app below).

For my tests I am using the following app:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
        unsigned long count, i, val;

        if (argc != 2) {
                printf("usage: ./a.out <number of random nums to generate> \n");
                return 0;
        }

        count = strtoul(argv[1], NULL, 10);

        printf("Generating %lu random numbers \n", count);
        for (i = 0; i < count; i++) {
                val = rand();
                val = val % 2;
                //usleep(1);
        }
        printf("Generated %lu random numbers \n", count);
        return 0;
}
Also, since the system is booted with 4 CPUs, in order to fully load the
system I am launching 4 more instances of the same test app under the root
cpu cgroup: /sys/fs/cgroup/cpu/
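
For completeness, one way to start a hogger inside a given cgroup is a small
launcher like the one below (a sketch; it assumes the cgroup v1 cpu hierarchy
used above and simply writes its own PID to cgroup.procs before exec'ing the
hogger):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        char path[4096];
        FILE *f;

        if (argc < 3) {
                fprintf(stderr, "usage: %s <cgroup dir> <hogger> [args...]\n", argv[0]);
                return 1;
        }

        /* e.g. /sys/fs/cgroup/cpu/test_group_1/cgroup.procs */
        snprintf(path, sizeof(path), "%s/cgroup.procs", argv[1]);

        f = fopen(path, "w");
        if (!f) {
                perror(path);
                return 1;
        }
        fprintf(f, "%d\n", getpid());   /* move ourselves into the cgroup */
        fclose(f);

        execvp(argv[2], &argv[2]);      /* run the hogger inside the cgroup */
        perror("execvp");
        return 1;
}

e.g. ./launcher /sys/fs/cgroup/cpu/test_group_1 ./a.out 100000000000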

3. We can see that both of the cgroups get similar CPU time:

# systemd-cgtop --depth 1
Path Tasks %CPU Memory Input/s Output/s
/ 659 - 5.5G - -
/system.slice - - 5.7G - -
/test_group_1 4 - - - -
/test_group_2 3 - - - -
/user.slice 31 - 56.5M - -

Path Tasks %CPU Memory Input/s Output/s
/ 659 394.6 5.5G - -
/test_group_2 3 65.7 - - -
/user.slice 29 55.1 48.0M - -
/test_group_1 4 47.3 - - -
/system.slice - 2.2 5.7G - -

Path Tasks %CPU Memory Input/s Output/s
/ 659 394.8 5.5G - -
/test_group_1 4 62.9 - - -
/user.slice 28 44.9 54.2M - -
/test_group_2 3 44.7 - - -
/system.slice - 0.9 5.7G - -

Path Tasks %CPU Memory Input/s Output/s
/ 659 394.4 5.5G - -
/test_group_2 3 58.8 - - -
/test_group_1 4 51.9 - - -
/user.slice 30 39.3 59.6M - -
/system.slice - 1.9 5.7G - -

Path Tasks %CPU Memory Input/s Output/s
/ 659 394.7 5.5G - -
/test_group_1 4 60.9 - - -
/test_group_2 3 57.9 - - -
/user.slice 28 43.5 36.9M - -
/system.slice - 3.0 5.7G - -

Path Tasks %CPU Memory Input/s Output/s
/ 659 395.0 5.5G - -
/test_group_1 4 66.8 - - -
/test_group_2 3 56.3 - - -
/user.slice 29 43.1 51.8M - -
/system.slice - 0.7 5.7G - -

4. Now move systemd-udevd to one of these test groups, say test_group_1, and
perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the
host side.

5. Run the same workload, i.e. 4 instances of the CPU hogger under
/sys/fs/cgroup/cpu and one instance of the CPU hogger each in
/sys/fs/cgroup/cpu/test_group_1 and /sys/fs/cgroup/cpu/test_group_2.
It can be seen that test_group_1 (the one where systemd-udevd was moved) is
getting much less CPU than test_group_2, even though at this point both of
these groups only have the CPU hogger running.

# systemd-cgtop --depth 1
Path Tasks %CPU Memory Input/s Output/s
/ 1219 - 5.4G - -
/system.slice - - 5.6G - -
/test_group_1 4 - - - -
/test_group_2 3 - - - -
/user.slice 26 - 91.3M - -

Path Tasks %CPU Memory Input/s Output/s
/ 1221 394.3 5.4G - -
/test_group_2 3 82.7 - - -
/test_group_1 4 14.3 - - -
/system.slice - 0.8 5.6G - -
/user.slice 26 0.4 91.2M - -

Path Tasks %CPU Memory Input/s Output/s
/ 1221 394.6 5.4G - -
/test_group_2 3 67.4 - - -
/system.slice - 24.6 5.6G - -
/test_group_1 4 12.5 - - -
/user.slice 26 0.4 91.2M - -

Path Tasks %CPU Memory Input/s Output/s
/ 1221 395.2 5.4G - -
/test_group_2 3 60.9 - - -
/system.slice - 27.9 5.6G - -
/test_group_1 4 12.2 - - -
/user.slice 26 0.4 91.2M - -

Path Tasks %CPU Memory Input/s Output/s
/ 1221 395.2 5.4G - -
/test_group_2 3 69.4 - - -
/test_group_1 4 13.9 - - -
/user.slice 28 1.6 92.0M - -
/system.slice - 1.0 5.6G - -

Path Tasks %CPU Memory Input/s Output/s
/ 1221 395.6 5.4G - -
/test_group_2 3 59.3 - - -
/test_group_1 4 14.1 - - -
/user.slice 28 1.3 92.2M - -
/system.slice - 0.7 5.6G - -

Path Tasks %CPU Memory Input/s Output/s
/ 1221 395.5 5.4G - -
/test_group_2 3 67.2 - - -
/test_group_1 4 11.5 - - -
/user.slice 28 1.3 92.5M - -
/system.slice - 0.6 5.6G - -

Path Tasks %CPU Memory Input/s Output/s
/ 1221 395.1 5.4G - -
/test_group_2 3 76.8 - - -
/test_group_1 4 12.9 - - -
/user.slice 28 1.3 92.8M - -
/system.slice - 1.2 5.6G - -

From the sched_debug data it can be seen that in the bad case the load.weight
of the per-CPU sched entities corresponding to test_group_1 has dropped
significantly, and the load_avg of test_group_1 remains much higher than that
of test_group_2, even though systemd-udevd stopped running a long time ago and
at this point both cgroups have only the CPU hogger app as a running entity.

I put some traces in update_tg_load_avg to check why load_avg was much higher
for test_group_1, and from those I saw that even after systemd-udevd has
stopped, and before the CPU hoggers are launched, test_group_1 still has some
residual load average, as shown in the following snippet (I have changed the
trace format to limit the number of columns here):

<idle>-0 [000] d.s. 1824.372593: update_tg_load_avg.constprop.124:
cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
<idle>-0 [000] d.s. 1824.380572: update_tg_load_avg.constprop.124:
tg_path = /test_group_1
<idle>-0 [000] d.s. 1824.380573: update_tg_load_avg.constprop.124:
cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
<idle>-0 [000] d.s. 1824.384563: update_tg_load_avg.constprop.124:
tg_path = /test_group_1
<idle>-0 [000] d.s. 1824.384563: update_tg_load_avg.constprop.124:
cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
<idle>-0 [000] d.s. 1824.391564: update_tg_load_avg.constprop.124:
tg_path = /test_group_1
<idle>-0 [000] d.s. 1824.391564: update_tg_load_avg.constprop.124:
cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
<idle>-0 [000] d.s. 1824.398569: update_tg_load_avg.constprop.124:
tg_path = /test_group_1
<idle>-0 [000] d.s. 1824.398569: update_tg_load_avg.constprop.124:
cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
.....................
.....................
xfsaild/dm-3-1623 [001] d... 1824.569615: update_tg_load_avg.constprop.124:
tg_path = /test_group_1
xfsaild/dm-3-1623 [001] d... 1824.569617: update_tg_load_avg.constprop.124:
cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
...................
...................
sh-195050 [000] d.s. 1824.756563: update_tg_load_avg.constprop.124:
cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
<idle>-0 [000] d.s. 1824.757570: update_tg_load_avg.constprop.124:
tg_path = /test_group_1
<idle>-0 [000] d.s. 1824.757571: update_tg_load_avg.constprop.124:
cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
<idle>-0 [000] d.s. 1824.761566: update_tg_load_avg.constprop.124:
tg_path = /test_group_1
<idle>-0 [000] d.s. 1824.761566: update_tg_load_avg.constprop.124:
cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
<idle>-0 [000] d.s. 1824.765567: update_tg_load_avg.constprop.124:
tg_path = /test_group_1
<idle>-0 [000] d.s. 1824.765568: update_tg_load_avg.constprop.124:
cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664
<idle>-0 [000] d.s. 1824.769562: update_tg_load_avg.constprop.124:
tg_path = /test_group_1
<idle>-0 [000] d.s. 1824.769563: update_tg_load_avg.constprop.124:
cfs_rq->avg.load_avg=0, cfs_rq->tg_load_avg_contrib=0 tg->load_avg=4664

Now when the CPU hogger is launched again in test_group_1, this pre-existing
tg->load_avg gets added to the load contribution of the CPU hogger. The result
is a situation where both cgroups are running the same workload but have very
different task_group.load_avg values, which in turn causes these cgroups to
get very different amounts of CPU time.
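
To put rough numbers on this effect, here is a small userspace model of the
group-share calculation (it mirrors the shares = tg->shares * load / tg_weight
arithmetic of calc_group_shares() in kernel/sched/fair.c, ignoring fixed-point
scaling and the load.weight/load_avg max(); the 4664 residual comes from the
trace above, everything else is an illustrative assumption):

#include <stdio.h>

/* shares ~= tg->shares * load / (tg->load_avg - contrib + load) */
static long group_shares(long tg_shares, long tg_load_avg,
                         long cfs_rq_contrib, long cfs_rq_load)
{
        long tg_weight = tg_load_avg - cfs_rq_contrib + cfs_rq_load;

        return tg_weight ? tg_shares * cfs_rq_load / tg_weight : tg_shares;
}

int main(void)
{
        long shares = 1024;     /* tg->shares (NICE_0_LOAD, scaled down) */
        long load   = 1024;     /* one CPU hogger on this cfs_rq */

        /* test_group_2: tg->load_avg only reflects its own running hogger */
        printf("test_group_2 weight: %ld\n",
               group_shares(shares, 1024, 1024, load));         /* -> 1024 */

        /* test_group_1: same hogger plus the residual 4664 from the
         * offlined CPUs that was never subtracted */
        printf("test_group_1 weight: %ld\n",
               group_shares(shares, 4664 + 1024, 1024, load));  /* -> ~184 */

        return 0;
}

The resulting ~1024 vs ~184 entity weights give roughly the 5x CPU-time skew
visible in the systemd-cgtop output above.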

Apologies for the long description of the issue, but I have found only one way
of reproducing it, and with that method the issue reproduces with a 100% hit
rate, so I thought it best to provide as much info as possible.

I have kept the change as an RFC because I was not sure whether the load_avg
should be dragged down to 0 in one go (as done in this change) or pulled back
in steps, e.g. halving the previous value each time.


kernel/sched/fair.c | 17 +++++++++++++++++
kernel/sched/sched.h | 1 +
2 files changed, 18 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d7a3c63a2171..c34d9e7abb96 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3572,6 +3572,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)

account_numa_enqueue(rq, task_of(se));
list_add(&se->group_node, &rq->cfs_tasks);
+ atomic_long_inc(&task_group(task_of(se))->cfs_nr_running);
}
#endif
cfs_rq->nr_running++;
@@ -3587,6 +3588,7 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
if (entity_is_task(se)) {
account_numa_dequeue(rq_of(cfs_rq), task_of(se));
list_del_init(&se->group_node);
+ atomic_long_dec(&task_group(task_of(se))->cfs_nr_running);
}
#endif
cfs_rq->nr_running--;
@@ -4104,6 +4106,20 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
return;

+ /*
+ * If number of running tasks for a task_group is 0, pull back its
+ * load_avg to 0 as well.
+ * This avoids the situation where a freshly forked task in a cgroup,
+ * with some residual load_avg but with no running tasks, gets less
+ * CPU time because of pre-existing load_avg of task_group.
+ */
+ if (!atomic_long_read(&cfs_rq->tg->cfs_nr_running)
+ && atomic_long_read(&cfs_rq->tg->load_avg)) {
+ atomic_long_set(&cfs_rq->tg->load_avg, 0);
+ cfs_rq->last_update_tg_load_avg = now;
+ return;
+ }
+
delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
atomic_long_add(delta, &cfs_rq->tg->load_avg);
@@ -12808,6 +12824,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
goto err;

tg->shares = NICE_0_LOAD;
+ atomic_long_set(&tg->cfs_nr_running, 0);

init_cfs_bandwidth(tg_cfs_bandwidth(tg), tg_cfs_bandwidth(parent));

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2e5a95486a42..c3390bdb8400 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -370,6 +370,7 @@ struct task_group {
*/
atomic_long_t load_avg ____cacheline_aligned;
#endif
+ atomic_long_t cfs_nr_running ____cacheline_aligned;
#endif

#ifdef CONFIG_RT_GROUP_SCHED
--
2.34.1



2023-12-15 09:12:35

by Vincent Guittot

Subject: Re: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.

On Fri, 15 Dec 2023 at 06:27, Imran Khan <[email protected]> wrote:
>
> It has been found that sometimes a task_group has some residual
> load_avg even though the load average at each of its owned queues
> i.e task_group.cfs_rq[cpu].avg.load_avg and task_group.cfs_rq[cpu].
> tg_load_avg_contrib have become 0 for a long time.
> Under this scenario if another task starts running in this task_group,
> it does not get proper time share on CPU since pre-existing
> load average of task group inversely impacts the new task's CPU share
> on each CPU.
>
> This change looks for the condition when a task_group has no running
> tasks and sets the task_group's load average to 0 in such cases, so
> that tasks that run in future under this task_group get the CPU time
> in accordance with the current load.
>
> Signed-off-by: Imran Khan <[email protected]>
> ---
>

[...]

>
> 4. Now move systemd-udevd to one of these test groups, say test_group_1, and
> perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the
> host side.

Could it be the root cause of your problem?

The cfs_rq->tg_load_avg_contrib of the 120 CPUs that have been plugged
and then unplugged has not been correctly removed from tg->load_avg. If
the cfs_rq->tg_load_avg_contrib of the 4 remaining CPUs is 0, then
tg->load_avg should be 0 too.

Could you track whether the cfs_rq->tg_load_avg_contrib is correctly
removed from tg->load_avg when you unplug the CPUs? I can easily
imagine that the rate limit can skip some updates of tg->load_avg
while offlining the CPU.

> 5. Run the same workload i.e 4 instances of CPU hogger under /sys/fs/cgroup/cpu
> and one instance of CPU hogger each in /sys/fs/cgroup/cpu/test_group_1 and
> /sys/fs/cgroup/test_group_2.
> It is seen that test_group_1 (the one where systemd-udevd was moved) is getting
> much less CPU than the test_group_2, even though at this point of time both of
> these groups have only CPU hogger running.
>

[...]

> Apologies for this long description of the issue, but I have found only one way
> of reproducing this issue and using this way issue can be reproduced with 100%
> strike rate, so I thought of providing as much info as possible.
>
> I have kept the change as RFC as I was not sure if the load_avg should be
> dragged down to 0 in one go (as being done in this change) or should it be
> pulled back in steps like making it half of the previous value each time.
>
>
> kernel/sched/fair.c | 17 +++++++++++++++++
> kernel/sched/sched.h | 1 +
> 2 files changed, 18 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d7a3c63a2171..c34d9e7abb96 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3572,6 +3572,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>
> account_numa_enqueue(rq, task_of(se));
> list_add(&se->group_node, &rq->cfs_tasks);
> + atomic_long_inc(&task_group(task_of(se))->cfs_nr_running);

We have rate limited access to the atomic tg->load_avg because it
was significantly impacting performance. We don't want to add a new
atomic access at every enqueue/dequeue.


> }
> #endif
> cfs_rq->nr_running++;
> @@ -3587,6 +3588,7 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
> if (entity_is_task(se)) {
> account_numa_dequeue(rq_of(cfs_rq), task_of(se));
> list_del_init(&se->group_node);
> + atomic_long_dec(&task_group(task_of(se))->cfs_nr_running);
> }
> #endif
> cfs_rq->nr_running--;
> @@ -4104,6 +4106,20 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
> return;
>
> + /*
> + * If number of running tasks for a task_group is 0, pull back its
> + * load_avg to 0 as well.
> + * This avoids the situation where a freshly forked task in a cgroup,
> + * with some residual load_avg but with no running tasks, gets less
> + * CPU time because of pre-existing load_avg of task_group.
> + */
> + if (!atomic_long_read(&cfs_rq->tg->cfs_nr_running)
> + && atomic_long_read(&cfs_rq->tg->load_avg)) {
> + atomic_long_set(&cfs_rq->tg->load_avg, 0);
> + cfs_rq->last_update_tg_load_avg = now;
> + return;
> + }
> +
> delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> atomic_long_add(delta, &cfs_rq->tg->load_avg);
> @@ -12808,6 +12824,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
> goto err;
>
> tg->shares = NICE_0_LOAD;
> + atomic_long_set(&tg->cfs_nr_running, 0);
>
> init_cfs_bandwidth(tg_cfs_bandwidth(tg), tg_cfs_bandwidth(parent));
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 2e5a95486a42..c3390bdb8400 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -370,6 +370,7 @@ struct task_group {
> */
> atomic_long_t load_avg ____cacheline_aligned;
> #endif
> + atomic_long_t cfs_nr_running ____cacheline_aligned;
> #endif
>
> #ifdef CONFIG_RT_GROUP_SCHED
> --
> 2.34.1
>

2023-12-15 10:00:45

by Imran Khan

Subject: Re: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.

Hello Vincent,
Thanks a lot for having a look and getting back.

On 15/12/2023 7:11 pm, Vincent Guittot wrote:
> On Fri, 15 Dec 2023 at 06:27, Imran Khan <[email protected]> wrote:
>>
>> It has been found that sometimes a task_group has some residual
>> load_avg even though the load average at each of its owned queues
>> i.e task_group.cfs_rq[cpu].avg.load_avg and task_group.cfs_rq[cpu].
>> tg_load_avg_contrib have become 0 for a long time.
>> Under this scenario if another task starts running in this task_group,
>> it does not get proper time share on CPU since pre-existing
>> load average of task group inversely impacts the new task's CPU share
>> on each CPU.
>>
>> This change looks for the condition when a task_group has no running
>> tasks and sets the task_group's load average to 0 in such cases, so
>> that tasks that run in future under this task_group get the CPU time
>> in accordance with the current load.
>>
>> Signed-off-by: Imran Khan <[email protected]>
>> ---
>>
>
> [...]
>
>>
>> 4. Now move systemd-udevd to one of these test groups, say test_group_1, and
>> perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the
>> host side.
>
> Could it be the root cause of your problem ?
>
> The cfs_rq->tg_load_avg_contrib of the 120 CPUs that have been plugged
> then unplugged, have not been correctly removed from tg->load_avg. If
> the cfs_rq->tg_load_avg_contrib of the 4 remaining CPUs is 0 then
> tg->load_avg should be 0 too.
>
Agreed, and this was my understanding as well. The issue only happens
with a large number of CPUs. For example, if I go from 4 to 8 and back to
4, the issue does not happen, and even if it does, the residual load
avg is very small.

> Could you track that the cfs_rq->tg_load_avg_contrib is correctly
> removed from tg->load_avg when you unplug the CPUs ? I can easily
> imagine that the rate limit can skip some update of tg- >load_avg
> while offlining the cpu
>

I will try to trace it, but just so you know, this issue is happening on
other kernel versions (which don't have the rate-limit feature) as well. I
started with v4.14.x but have tested and found it on v5.4.x and v5.15.x as
well.

Thanks,
Imran


2023-12-19 06:42:34

by Imran Khan

Subject: Re: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.

Hello Vincent,


On 15/12/2023 8:59 pm, Imran Khan wrote:
> Hello Vincent,
> Thanks a lot for having a look and getting back.
>
> On 15/12/2023 7:11 pm, Vincent Guittot wrote:
>> On Fri, 15 Dec 2023 at 06:27, Imran Khan <[email protected]> wrote:
>>>
>>> It has been found that sometimes a task_group has some residual
>>> load_avg even though the load average at each of its owned queues
>>> i.e task_group.cfs_rq[cpu].avg.load_avg and task_group.cfs_rq[cpu].
>>> tg_load_avg_contrib have become 0 for a long time.
>>> Under this scenario if another task starts running in this task_group,
>>> it does not get proper time share on CPU since pre-existing
>>> load average of task group inversely impacts the new task's CPU share
>>> on each CPU.
>>>
>>> This change looks for the condition when a task_group has no running
>>> tasks and sets the task_group's load average to 0 in such cases, so
>>> that tasks that run in future under this task_group get the CPU time
>>> in accordance with the current load.
>>>
>>> Signed-off-by: Imran Khan <[email protected]>
>>> ---
>>>
>>
>> [...]
>>
>>>
>>> 4. Now move systemd-udevd to one of these test groups, say test_group_1, and
>>> perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the
>>> host side.
>>
>> Could it be the root cause of your problem ?
>>
>> The cfs_rq->tg_load_avg_contrib of the 120 CPUs that have been plugged
>> then unplugged, have not been correctly removed from tg->load_avg. If
>> the cfs_rq->tg_load_avg_contrib of the 4 remaining CPUs is 0 then
>> tg->load_avg should be 0 too.
>>
> Agree and this was my understanding as well. The issue only happens
> with large number of CPUs. For example if I go from 4 to 8 and back to
> 4 , the issue does not happen and even if it happens the residual load
> avg is very little.
>
>> Could you track that the cfs_rq->tg_load_avg_contrib is correctly
>> removed from tg->load_avg when you unplug the CPUs ? I can easily
>> imagine that the rate limit can skip some update of tg- >load_avg
>> while offlining the cpu
>>
>
> I will try to trace it but just so you know this issue is happening on other
> kernel versions (which don't have rate limit feature) as well. I started
> with v4.14.x but have tested and found it on v5.4.x and v5.15.x as well.
>
I collected some debug traces to understand the missing load avg
better. From the traces it looks like during scale down,
task_group.cfs_rq[cpu].avg.load_avg is not getting updated
properly for the CPU(s) being hotplugged out.

For example, if we look at the following snippet (I have kept
only the relevant portion of the trace in the mail), we can see that
in the last invocation of update_tg_load_avg for task_group.cfs_rq[11],
both the load avg and the contribution of this cfs_rq were 1024.
So the delta was zero and this contribution eventually remains undeducted.
In this case the scale down was from 16 to 8 CPUs, so CPU 11 has been
offlined.


cpuhp/15-131605 [015] d... 6112.350658: update_tg_load_avg.constprop.124:
cfs_of_cpu=5 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
tg->load_avg = 0 delta = 0 ###
systemd-udevd-894 [005] d... 6112.351096: update_tg_load_avg.constprop.124:
cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
tg->load_avg = 0 delta = 1024 ###
systemd-udevd-894 [005] d... 6112.351165: update_tg_load_avg.constprop.124:
cfs_of_cpu=5 cfs_rq->avg.load_avg = 10, cfs_rq->tg_load_avg_contrib = 0
tg->load_avg = 1024 delta = 10 ###

.........................
.........................
cat-128667 [006] d... 6112.504633: update_tg_load_avg.constprop.124:
cfs_of_cpu=11 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
tg->load_avg = 3085 delta = 0 ###
.........................
sh-142414 [006] d... 6112.505392: update_tg_load_avg.constprop.124:
cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
tg->load_avg = 4041 delta = 1024 ###
.........................
cat-128667 [006] d... 6112.504633: update_tg_load_avg.constprop.124:
cfs_of_cpu=11 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
tg->load_avg = 3085 delta = 0 ###
..........................
sh-142414 [006] d... 6112.505392: update_tg_load_avg.constprop.124:
cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
tg->load_avg = 4041 delta = 1024 ###
..........................
systemd-run-142416 [011] d.h. 6112.506547: update_tg_load_avg.constprop.124:
cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 1024
tg->load_avg = 3010 delta = 0 ###
..........................
systemd-run-142416 [011] d.h. 6112.507546: update_tg_load_avg.constprop.124:
cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 1024
tg->load_avg = 3010 delta = 0 ### <-- last invocation for cfs_rq[11]

..........................
..........................
<idle>-0 [001] d.s. 6113.868542: update_tg_load_avg.constprop.124:
cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
tg->load_avg = 1027 delta = 0 ###
<idle>-0 [001] d.s. 6113.869542: update_tg_load_avg.constprop.124:
cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
tg->load_avg = 1027 delta = 0 ###
<idle>-0 [001] d.s. 6113.870541: update_tg_load_avg.constprop.124:
cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
tg->load_avg = 1027 delta = 0 ###


If I understand correctly, when CPU 11 is offlined, the task(s) on its cfs_rq
will be migrated and its cfs_rq.avg.load_avg will be updated accordingly. This
drop in cfs_rq.avg.load_avg would then be detected by update_tg_load_avg and
the contribution of this cfs_rq would get deducted from tg->load_avg.
It looks like during hotplug the load of one or more tasks being migrated is
not getting accounted in the source cfs_rq, and this ends up as residual
load_avg at the task_group (if these tasks are members of a task_group).

Moreover this looks racy and dependent on the number of CPUs or some timing.
For example, for a scale down from 124 to 4 CPUs I always hit this issue, but
for a scale down from 16 to 4 CPUs I hit it 8-9 times out of 10.
Also, in the cases where the residual load_avg at the task group is low
(like < 10), I can see that both of my test cgroups get similar CPU time,
which further shows that the unaccounted load avg ending up in a task_group
eventually leads to uneven CPU allotment between task groups.
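
As a sanity check on the arithmetic, here is a toy userspace replay of the
update_tg_load_avg() delta rule with the cfs_rq[11] values from the trace
(the 1/64 filter mirrors the one in fair.c; this is only an illustration of
how a contribution can get stranded, not kernel code):

#include <stdio.h>
#include <stdlib.h>

static long tg_load_avg;        /* task_group::load_avg */
static long contrib;            /* cfs_rq::tg_load_avg_contrib */

static void update_tg_load_avg(long cfs_rq_load_avg)
{
        long delta = cfs_rq_load_avg - contrib;

        if (labs(delta) > contrib / 64) {
                tg_load_avg += delta;
                contrib = cfs_rq_load_avg;
        }
        printf("load_avg=%ld contrib=%ld tg->load_avg=%ld\n",
               cfs_rq_load_avg, contrib, tg_load_avg);
}

int main(void)
{
        update_tg_load_avg(1024);       /* hogger running on CPU 11 */
        update_tg_load_avg(1024);       /* last call before hotplug: delta 0 */
        /*
         * CPU 11 goes offline here; if this cfs_rq never gets another
         * update_tg_load_avg() call with the post-migration value, the
         * 1024 added earlier is never subtracted from tg->load_avg.
         */
        printf("residual tg->load_avg=%ld\n", tg_load_avg);
        return 0;
}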


I am debugging it further but in the mean time if you have some suggestions or
need traces from some specific portion of sched code, please let me know.

Thanks,
Imran

> Thanks,
> Imran
>

2023-12-19 18:50:00

by Vincent Guittot

Subject: Re: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.

Hi Imran,

On Tue, 19 Dec 2023 at 07:42, Imran Khan <[email protected]> wrote:
>
> Hello Vincent,
>
>
> On 15/12/2023 8:59 pm, Imran Khan wrote:
> > Hello Vincent,
> > Thanks a lot for having a look and getting back.
> >
> > On 15/12/2023 7:11 pm, Vincent Guittot wrote:
> >> On Fri, 15 Dec 2023 at 06:27, Imran Khan <[email protected]> wrote:
> >>>
> >>> It has been found that sometimes a task_group has some residual
> >>> load_avg even though the load average at each of its owned queues
> >>> i.e task_group.cfs_rq[cpu].avg.load_avg and task_group.cfs_rq[cpu].
> >>> tg_load_avg_contrib have become 0 for a long time.
> >>> Under this scenario if another task starts running in this task_group,
> >>> it does not get proper time share on CPU since pre-existing
> >>> load average of task group inversely impacts the new task's CPU share
> >>> on each CPU.
> >>>
> >>> This change looks for the condition when a task_group has no running
> >>> tasks and sets the task_group's load average to 0 in such cases, so
> >>> that tasks that run in future under this task_group get the CPU time
> >>> in accordance with the current load.
> >>>
> >>> Signed-off-by: Imran Khan <[email protected]>
> >>> ---
> >>>
> >>
> >> [...]
> >>
> >>>
> >>> 4. Now move systemd-udevd to one of these test groups, say test_group_1, and
> >>> perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the
> >>> host side.
> >>
> >> Could it be the root cause of your problem ?
> >>
> >> The cfs_rq->tg_load_avg_contrib of the 120 CPUs that have been plugged
> >> then unplugged, have not been correctly removed from tg->load_avg. If
> >> the cfs_rq->tg_load_avg_contrib of the 4 remaining CPUs is 0 then
> >> tg->load_avg should be 0 too.
> >>
> > Agree and this was my understanding as well. The issue only happens
> > with large number of CPUs. For example if I go from 4 to 8 and back to
> > 4 , the issue does not happen and even if it happens the residual load
> > avg is very little.
> >
> >> Could you track that the cfs_rq->tg_load_avg_contrib is correctly
> >> removed from tg->load_avg when you unplug the CPUs ? I can easily
> >> imagine that the rate limit can skip some update of tg- >load_avg
> >> while offlining the cpu
> >>
> >
> > I will try to trace it but just so you know this issue is happening on other
> > kernel versions (which don't have rate limit feature) as well. I started
> > with v4.14.x but have tested and found it on v5.4.x and v5.15.x as well.
> >
> I collected some debug trace to understand the missing load avg
> context better. From the traces it looks like during scale down,
> the task_group.cfs_rq[cpu].avg.load_avg is not getting updated
> properly for CPU(s) being hotplugged out.

Your traces are interesting and I think that
task_group.cfs_rq[cpu].avg.load_avg is updated correctly but we don't
call update_tg_load_avg() to reflect that in tg->load_avg.

Could you try the patch below? It forces the scheduler to clear the
contribution of every cfs_rq of a CPU that becomes offline.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 466e01b2131f..e5da5eaab6ce 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4087,6 +4087,11 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
if (cfs_rq->tg == &root_task_group)
return;

+
+ /* rq has been offline and doesn't contribute anymore to the share */
+ if (!cpu_active(cpu_of(rq_of(cfs_rq))))
+ return;
+
/*
* For migration heavy workloads, access to tg->load_avg can be
* unbound. Limit the update rate to at most once per ms.
@@ -4103,6 +4108,48 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
}
}

+static inline void clear_tg_load_avg(struct cfs_rq *cfs_rq)
+{
+ long delta;
+ u64 now;
+
+ /*
+ * No need to update load_avg for root_task_group as it is not used.
+ */
+ if (cfs_rq->tg == &root_task_group)
+ return;
+
+ now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+ delta = 0 - cfs_rq->tg_load_avg_contrib;
+ atomic_long_add(delta, &cfs_rq->tg->load_avg);
+ cfs_rq->tg_load_avg_contrib = 0;
+ cfs_rq->last_update_tg_load_avg = now;
+}
+
+/* cpu offline callback */
+static void __maybe_unused clear_tg_offline_cfs_rqs(struct rq *rq)
+{
+ struct task_group *tg;
+
+ lockdep_assert_rq_held(rq);
+
+ /*
+ * The rq clock has already been updated in the
+ * set_rq_offline(), so we should skip updating
+ * the rq clock again in unthrottle_cfs_rq().
+ */
+ rq_clock_start_loop_update(rq);
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(tg, &task_groups, list) {
+ struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+ clear_tg_load_avg(cfs_rq);
+ }
+ rcu_read_unlock();
+
+ rq_clock_stop_loop_update(rq);
+}
+
/*
* Called within set_task_rq() right before setting a task's CPU. The
* caller only guarantees p->pi_lock is held; no other assumptions,
@@ -12414,6 +12461,9 @@ static void rq_offline_fair(struct rq *rq)

/* Ensure any throttled groups are reachable by pick_next_task */
unthrottle_offline_cfs_rqs(rq);
+
+ /* Ensure that we remove rq contribution to group share */
+ clear_tg_offline_cfs_rqs(rq);
}

#endif /* CONFIG_SMP */


>
> For example if we look at following snippet (I have kept
> only the relevant portion of trace in the mail), we can see that,
> in the last invocation of update_tg_load_avg for task_group.cfs_rq[11]
> both the load avg and contribution of this cfs_rq were 1024.
> So delta was zero and this contribution eventually remains undeducted.
> In this case scale down was done from 16 to 8 CPUs, so CPU 11 has been
> offlined.
>
>
> cpuhp/15-131605 [015] d... 6112.350658: update_tg_load_avg.constprop.124:
> cfs_of_cpu=5 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
> tg->load_avg = 0 delta = 0 ###
> systemd-udevd-894 [005] d... 6112.351096: update_tg_load_avg.constprop.124:
> cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
> tg->load_avg = 0 delta = 1024 ###
> systemd-udevd-894 [005] d... 6112.351165: update_tg_load_avg.constprop.124:
> cfs_of_cpu=5 cfs_rq->avg.load_avg = 10, cfs_rq->tg_load_avg_contrib = 0
> tg->load_avg = 1024 delta = 10 ###
>
> .........................
> .........................
> cat-128667 [006] d... 6112.504633: update_tg_load_avg.constprop.124:
> cfs_of_cpu=11 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
> tg->load_avg = 3085 delta = 0 ###
> .........................
> sh-142414 [006] d... 6112.505392: update_tg_load_avg.constprop.124:
> cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
> tg->load_avg = 4041 delta = 1024 ###
> .........................
> cat-128667 [006] d... 6112.504633: update_tg_load_avg.constprop.124:
> cfs_of_cpu=11 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
> tg->load_avg = 3085 delta = 0 ###
> ..........................
> sh-142414 [006] d... 6112.505392: update_tg_load_avg.constprop.124:
> cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
> tg->load_avg = 4041 delta = 1024 ###
> ..........................
> systemd-run-142416 [011] d.h. 6112.506547: update_tg_load_avg.constprop.124:
> cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 1024
> tg->load_avg = 3010 delta = 0 ###
> ..........................
> systemd-run-142416 [011] d.h. 6112.507546: update_tg_load_avg.constprop.124:
> cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 1024
> tg->load_avg = 3010 delta = 0 ### <-- last invocation for cfs_rq[11]
>
> ..........................
> ..........................
> <idle>-0 [001] d.s. 6113.868542: update_tg_load_avg.constprop.124:
> cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
> tg->load_avg = 1027 delta = 0 ###
> <idle>-0 [001] d.s. 6113.869542: update_tg_load_avg.constprop.124:
> cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
> tg->load_avg = 1027 delta = 0 ###
> <idle>-0 [001] d.s. 6113.870541: update_tg_load_avg.constprop.124:
> cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
> tg->load_avg = 1027 delta = 0 ###
>
>
> If I understand correctly, when CPU 11 is offlined the task(s) on its cfs_rq
> will be migrated and its cfs_rq.avg.load_avg will be updated accordingly. This
> drop is cfs_rq.avg.load_avg will be detected by update_tg_load_avg and hence
> the contribution of this cfs_rq will get deducted from tg->load_avg.
> It looks like during hotplug load of one or more tasks, being migrated are
> not getting accounted in the source cfs_rq and this is ending up as residual
> load_avg at task_group (if these tasks are members of a task_group).
>
> Moreover this looks racy and dependent on number of CPUs or some delay.
> For example for scale down from 124 to 4 CPUs I always hit this issue but
> for scale down from 16 to 4 CPUs I hit this issue 8-9 out of 10 times.
> Also for the cases when residual load_avg at task group is low (like < 10),
> I can see that both of my test cgroups get similar CPU times which further
> proves that the unaccounted load avg ending up in a task_group is eventually
> leading to uneven CPU allotment between task groups.
>
>
> I am debugging it further but in the mean time if you have some suggestions or
> need traces from some specific portion of sched code, please let me know.
>
> Thanks,
> Imran
>
> > Thanks,
> > Imran
> >

2023-12-20 02:01:51

by Imran Khan

Subject: Re: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.

Hello Vincent,

On 20/12/2023 5:49 am, Vincent Guittot wrote:
> Hi Imram,
>
> On Tue, 19 Dec 2023 at 07:42, Imran Khan <[email protected]> wrote:
>>
>> Hello Vincent,
>>
>>
>> On 15/12/2023 8:59 pm, Imran Khan wrote:
>>> Hello Vincent,
>>> Thanks a lot for having a look and getting back.
>>>
>>> On 15/12/2023 7:11 pm, Vincent Guittot wrote:
>>>> On Fri, 15 Dec 2023 at 06:27, Imran Khan <[email protected]> wrote:
>>>>>
>>>>> It has been found that sometimes a task_group has some residual
>>>>> load_avg even though the load average at each of its owned queues
>>>>> i.e task_group.cfs_rq[cpu].avg.load_avg and task_group.cfs_rq[cpu].
>>>>> tg_load_avg_contrib have become 0 for a long time.
>>>>> Under this scenario if another task starts running in this task_group,
>>>>> it does not get proper time share on CPU since pre-existing
>>>>> load average of task group inversely impacts the new task's CPU share
>>>>> on each CPU.
>>>>>
>>>>> This change looks for the condition when a task_group has no running
>>>>> tasks and sets the task_group's load average to 0 in such cases, so
>>>>> that tasks that run in future under this task_group get the CPU time
>>>>> in accordance with the current load.
>>>>>
>>>>> Signed-off-by: Imran Khan <[email protected]>
>>>>> ---
>>>>>
>>>>
>>>> [...]
>>>>
>>>>>
>>>>> 4. Now move systemd-udevd to one of these test groups, say test_group_1, and
>>>>> perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the
>>>>> host side.
>>>>
>>>> Could it be the root cause of your problem ?
>>>>
>>>> The cfs_rq->tg_load_avg_contrib of the 120 CPUs that have been plugged
>>>> then unplugged, have not been correctly removed from tg->load_avg. If
>>>> the cfs_rq->tg_load_avg_contrib of the 4 remaining CPUs is 0 then
>>>> tg->load_avg should be 0 too.
>>>>
>>> Agree and this was my understanding as well. The issue only happens
>>> with large number of CPUs. For example if I go from 4 to 8 and back to
>>> 4 , the issue does not happen and even if it happens the residual load
>>> avg is very little.
>>>
>>>> Could you track that the cfs_rq->tg_load_avg_contrib is correctly
>>>> removed from tg->load_avg when you unplug the CPUs ? I can easily
>>>> imagine that the rate limit can skip some update of tg- >load_avg
>>>> while offlining the cpu
>>>>
>>>
>>> I will try to trace it but just so you know this issue is happening on other
>>> kernel versions (which don't have rate limit feature) as well. I started
>>> with v4.14.x but have tested and found it on v5.4.x and v5.15.x as well.
>>>
>> I collected some debug trace to understand the missing load avg
>> context better. From the traces it looks like during scale down,
>> the task_group.cfs_rq[cpu].avg.load_avg is not getting updated
>> properly for CPU(s) being hotplugged out.
>
> Your traces are interesting and I think that
> task_group.cfs_rq[cpu].avg.load_avg is updated correctly but we don't
> call update_tg_load_avg() to reflect that in tg->load_avg.
>
> Could you try the patch below ? It forces the scheduler to clear the
> contribution of all cfs_rq of a CPU that becomes offline.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 466e01b2131f..e5da5eaab6ce 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4087,6 +4087,11 @@ static inline void update_tg_load_avg(struct
> cfs_rq *cfs_rq)
> if (cfs_rq->tg == &root_task_group)
> return;
>
> +
> + /* rq has been offline and doesn't contribute anymore to the share */
> + if (!cpu_active(cpu_of(rq_of(cfs_rq))))
> + return;
> +
> /*
> * For migration heavy workloads, access to tg->load_avg can be
> * unbound. Limit the update rate to at most once per ms.
> @@ -4103,6 +4108,48 @@ static inline void update_tg_load_avg(struct
> cfs_rq *cfs_rq)
> }
> }
>
> +static inline void clear_tg_load_avg(struct cfs_rq *cfs_rq)
> +{
> + long delta;
> + u64 now;
> +
> + /*
> + * No need to update load_avg for root_task_group as it is not used.
> + */
> + if (cfs_rq->tg == &root_task_group)
> + return;
> +
> + now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> + delta = 0 - cfs_rq->tg_load_avg_contrib;
> + atomic_long_add(delta, &cfs_rq->tg->load_avg);
> + cfs_rq->tg_load_avg_contrib = 0;
> + cfs_rq->last_update_tg_load_avg = now;
> +}
> +
> +/* cpu offline callback */
> +static void __maybe_unused clear_tg_offline_cfs_rqs(struct rq *rq)
> +{
> + struct task_group *tg;
> +
> + lockdep_assert_rq_held(rq);
> +
> + /*
> + * The rq clock has already been updated in the
> + * set_rq_offline(), so we should skip updating
> + * the rq clock again in unthrottle_cfs_rq().
> + */
> + rq_clock_start_loop_update(rq);
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(tg, &task_groups, list) {
> + struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> + clear_tg_load_avg(cfs_rq);
> + }
> + rcu_read_unlock();
> +
> + rq_clock_stop_loop_update(rq);
> +}
> +
> /*
> * Called within set_task_rq() right before setting a task's CPU. The
> * caller only guarantees p->pi_lock is held; no other assumptions,
> @@ -12414,6 +12461,9 @@ static void rq_offline_fair(struct rq *rq)
>
> /* Ensure any throttled groups are reachable by pick_next_task */
> unthrottle_offline_cfs_rqs(rq);
> +
> + /* Ensure that we remove rq contribution to group share */
> + clear_tg_offline_cfs_rqs(rq);
> }
>
> #endif /* CONFIG_SMP */
>
>

Thanks a lot for this suggestion. I have tested it in my local setup and it
fixes the issue. With a little tweak (removing the usage of
last_update_tg_load_avg), it works well for an older kernel (v5.4.x) as well.
I have not yet tested v4.14.x, but it should be fine there too.
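
Roughly, that tweak amounts to a helper along these lines (a sketch derived
from the clear_tg_load_avg() above, minus the rate-limit bookkeeping that
older kernels don't have):

static inline void clear_tg_load_avg(struct cfs_rq *cfs_rq)
{
        long delta;

        /* root_task_group's load_avg is not used */
        if (cfs_rq->tg == &root_task_group)
                return;

        delta = 0 - cfs_rq->tg_load_avg_contrib;
        atomic_long_add(delta, &cfs_rq->tg->load_avg);
        cfs_rq->tg_load_avg_contrib = 0;
}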

Thanks,
Imran

>>
>> For example if we look at following snippet (I have kept
>> only the relevant portion of trace in the mail), we can see that,
>> in the last invocation of update_tg_load_avg for task_group.cfs_rq[11]
>> both the load avg and contribution of this cfs_rq were 1024.
>> So delta was zero and this contribution eventually remains undeducted.
>> In this case scale down was done from 16 to 8 CPUs, so CPU 11 has been
>> offlined.
>>
>>
>> cpuhp/15-131605 [015] d... 6112.350658: update_tg_load_avg.constprop.124:
>> cfs_of_cpu=5 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
>> tg->load_avg = 0 delta = 0 ###
>> systemd-udevd-894 [005] d... 6112.351096: update_tg_load_avg.constprop.124:
>> cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
>> tg->load_avg = 0 delta = 1024 ###
>> systemd-udevd-894 [005] d... 6112.351165: update_tg_load_avg.constprop.124:
>> cfs_of_cpu=5 cfs_rq->avg.load_avg = 10, cfs_rq->tg_load_avg_contrib = 0
>> tg->load_avg = 1024 delta = 10 ###
>>
>> .........................
>> .........................
>> cat-128667 [006] d... 6112.504633: update_tg_load_avg.constprop.124:
>> cfs_of_cpu=11 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
>> tg->load_avg = 3085 delta = 0 ###
>> .........................
>> sh-142414 [006] d... 6112.505392: update_tg_load_avg.constprop.124:
>> cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
>> tg->load_avg = 4041 delta = 1024 ###
>> .........................
>> cat-128667 [006] d... 6112.504633: update_tg_load_avg.constprop.124:
>> cfs_of_cpu=11 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
>> tg->load_avg = 3085 delta = 0 ###
>> ..........................
>> sh-142414 [006] d... 6112.505392: update_tg_load_avg.constprop.124:
>> cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
>> tg->load_avg = 4041 delta = 1024 ###
>> ..........................
>> systemd-run-142416 [011] d.h. 6112.506547: update_tg_load_avg.constprop.124:
>> cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 1024
>> tg->load_avg = 3010 delta = 0 ###
>> ..........................
>> systemd-run-142416 [011] d.h. 6112.507546: update_tg_load_avg.constprop.124:
>> cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 1024
>> tg->load_avg = 3010 delta = 0 ### <-- last invocation for cfs_rq[11]
>>
>> ..........................
>> ..........................
>> <idle>-0 [001] d.s. 6113.868542: update_tg_load_avg.constprop.124:
>> cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
>> tg->load_avg = 1027 delta = 0 ###
>> <idle>-0 [001] d.s. 6113.869542: update_tg_load_avg.constprop.124:
>> cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
>> tg->load_avg = 1027 delta = 0 ###
>> <idle>-0 [001] d.s. 6113.870541: update_tg_load_avg.constprop.124:
>> cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
>> tg->load_avg = 1027 delta = 0 ###
>>
>>
>> If I understand correctly, when CPU 11 is offlined the task(s) on its cfs_rq
>> will be migrated and its cfs_rq.avg.load_avg will be updated accordingly. This
>> drop is cfs_rq.avg.load_avg will be detected by update_tg_load_avg and hence
>> the contribution of this cfs_rq will get deducted from tg->load_avg.
>> It looks like during hotplug load of one or more tasks, being migrated are
>> not getting accounted in the source cfs_rq and this is ending up as residual
>> load_avg at task_group (if these tasks are members of a task_group).
>>
>> Moreover this looks racy and dependent on number of CPUs or some delay.
>> For example for scale down from 124 to 4 CPUs I always hit this issue but
>> for scale down from 16 to 4 CPUs I hit this issue 8-9 out of 10 times.
>> Also for the cases when residual load_avg at task group is low (like < 10),
>> I can see that both of my test cgroups get similar CPU times which further
>> proves that the unaccounted load avg ending up in a task_group is eventually
>> leading to uneven CPU allotment between task groups.
>>
>>
>> I am debugging it further but in the mean time if you have some suggestions or
>> need traces from some specific portion of sched code, please let me know.
>>
>> Thanks,
>> Imran
>>
>>> Thanks,
>>> Imran
>>>

2023-12-20 17:31:57

by Vincent Guittot

Subject: Re: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.

On Wed, 20 Dec 2023 at 03:01, Imran Khan <[email protected]> wrote:
>
> Hello Vincent,
>
> On 20/12/2023 5:49 am, Vincent Guittot wrote:
> > Hi Imram,
> >
> > On Tue, 19 Dec 2023 at 07:42, Imran Khan <[email protected]> wrote:
> >>
> >> Hello Vincent,
> >>
> >>
> >> On 15/12/2023 8:59 pm, Imran Khan wrote:
> >>> Hello Vincent,
> >>> Thanks a lot for having a look and getting back.
> >>>
> >>> On 15/12/2023 7:11 pm, Vincent Guittot wrote:
> >>>> On Fri, 15 Dec 2023 at 06:27, Imran Khan <[email protected]> wrote:
> >>>>>
> >>>>> It has been found that sometimes a task_group has some residual
> >>>>> load_avg even though the load average at each of its owned queues
> >>>>> i.e task_group.cfs_rq[cpu].avg.load_avg and task_group.cfs_rq[cpu].
> >>>>> tg_load_avg_contrib have become 0 for a long time.
> >>>>> Under this scenario if another task starts running in this task_group,
> >>>>> it does not get proper time share on CPU since pre-existing
> >>>>> load average of task group inversely impacts the new task's CPU share
> >>>>> on each CPU.
> >>>>>
> >>>>> This change looks for the condition when a task_group has no running
> >>>>> tasks and sets the task_group's load average to 0 in such cases, so
> >>>>> that tasks that run in future under this task_group get the CPU time
> >>>>> in accordance with the current load.
> >>>>>
> >>>>> Signed-off-by: Imran Khan <[email protected]>
> >>>>> ---
> >>>>>
> >>>>
> >>>> [...]
> >>>>
> >>>>>
> >>>>> 4. Now move systemd-udevd to one of these test groups, say test_group_1, and
> >>>>> perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the
> >>>>> host side.
> >>>>
> >>>> Could it be the root cause of your problem ?
> >>>>
> >>>> The cfs_rq->tg_load_avg_contrib of the 120 CPUs that have been plugged
> >>>> then unplugged, have not been correctly removed from tg->load_avg. If
> >>>> the cfs_rq->tg_load_avg_contrib of the 4 remaining CPUs is 0 then
> >>>> tg->load_avg should be 0 too.
> >>>>
> >>> Agree and this was my understanding as well. The issue only happens
> >>> with large number of CPUs. For example if I go from 4 to 8 and back to
> >>> 4 , the issue does not happen and even if it happens the residual load
> >>> avg is very little.
> >>>
> >>>> Could you track that the cfs_rq->tg_load_avg_contrib is correctly
> >>>> removed from tg->load_avg when you unplug the CPUs ? I can easily
> >>>> imagine that the rate limit can skip some update of tg- >load_avg
> >>>> while offlining the cpu
> >>>>
> >>>
> >>> I will try to trace it but just so you know this issue is happening on other
> >>> kernel versions (which don't have rate limit feature) as well. I started
> >>> with v4.14.x but have tested and found it on v5.4.x and v5.15.x as well.
> >>>
> >> I collected some debug trace to understand the missing load avg
> >> context better. From the traces it looks like during scale down,
> >> the task_group.cfs_rq[cpu].avg.load_avg is not getting updated
> >> properly for CPU(s) being hotplugged out.
> >
> > Your traces are interesting and I think that
> > task_group.cfs_rq[cpu].avg.load_avg is updated correctly but we don't
> > call update_tg_load_avg() to reflect that in tg->load_avg.
> >
> > Could you try the patch below ? It forces the scheduler to clear the
> > contribution of all cfs_rq of a CPU that becomes offline.
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 466e01b2131f..e5da5eaab6ce 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4087,6 +4087,11 @@ static inline void update_tg_load_avg(struct
> > cfs_rq *cfs_rq)
> > if (cfs_rq->tg == &root_task_group)
> > return;
> >
> > +
> > + /* rq has been offline and doesn't contribute anymore to the share */
> > + if (!cpu_active(cpu_of(rq_of(cfs_rq))))
> > + return;
> > +
> > /*
> > * For migration heavy workloads, access to tg->load_avg can be
> > * unbound. Limit the update rate to at most once per ms.
> > @@ -4103,6 +4108,48 @@ static inline void update_tg_load_avg(struct
> > cfs_rq *cfs_rq)
> > }
> > }
> >
> > +static inline void clear_tg_load_avg(struct cfs_rq *cfs_rq)
> > +{
> > + long delta;
> > + u64 now;
> > +
> > + /*
> > + * No need to update load_avg for root_task_group as it is not used.
> > + */
> > + if (cfs_rq->tg == &root_task_group)
> > + return;
> > +
> > + now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> > + delta = 0 - cfs_rq->tg_load_avg_contrib;
> > + atomic_long_add(delta, &cfs_rq->tg->load_avg);
> > + cfs_rq->tg_load_avg_contrib = 0;
> > + cfs_rq->last_update_tg_load_avg = now;
> > +}
> > +
> > +/* cpu offline callback */
> > +static void __maybe_unused clear_tg_offline_cfs_rqs(struct rq *rq)
> > +{
> > + struct task_group *tg;
> > +
> > + lockdep_assert_rq_held(rq);
> > +
> > + /*
> > + * The rq clock has already been updated in the
> > + * set_rq_offline(), so we should skip updating
> > + * the rq clock again in unthrottle_cfs_rq().
> > + */
> > + rq_clock_start_loop_update(rq);
> > +
> > + rcu_read_lock();
> > + list_for_each_entry_rcu(tg, &task_groups, list) {
> > + struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> > + clear_tg_load_avg(cfs_rq);
> > + }
> > + rcu_read_unlock();
> > +
> > + rq_clock_stop_loop_update(rq);
> > +}
> > +
> > /*
> > * Called within set_task_rq() right before setting a task's CPU. The
> > * caller only guarantees p->pi_lock is held; no other assumptions,
> > @@ -12414,6 +12461,9 @@ static void rq_offline_fair(struct rq *rq)
> >
> > /* Ensure any throttled groups are reachable by pick_next_task */
> > unthrottle_offline_cfs_rqs(rq);
> > +
> > + /* Ensure that we remove rq contribution to group share */
> > + clear_tg_offline_cfs_rqs(rq);
> > }
> >
> > #endif /* CONFIG_SMP */
> >
> >
>
> Thanks a lot for this suggestion. I have tested it in my local setup and it is
> fixing the issue. With a little tweak (remove usage of
> last_update_tg_load_avg), it works well for older kernel (v5.4.x) as well.
> I have not yet tested for v4.14.x but should be fine there as well.

Ok, so we have the root cause. My only concern with this fix is that
we have recently rate limited access to the "global" tg->load_avg, which
was creating a perf regression because of contention when it was accessed
too often, and this patch adds a test on the "global" cpu_active_mask in
the same place. The cpu_active_mask is almost read-only in this case,
whereas tg->load_avg is written, so that's maybe not a problem.

Aaron,
Could you run the tests that you ran for testing "Ratelimit update to
tg->load_avg" and check whether this patch impacts your performance?

Vincent

>
> Thanks,
> Imran
>
> >>
> >> For example if we look at following snippet (I have kept
> >> only the relevant portion of trace in the mail), we can see that,
> >> in the last invocation of update_tg_load_avg for task_group.cfs_rq[11]
> >> both the load avg and contribution of this cfs_rq were 1024.
> >> So delta was zero and this contribution eventually remains undeducted.
> >> In this case scale down was done from 16 to 8 CPUs, so CPU 11 has been
> >> offlined.
> >>
> >>
> >> cpuhp/15-131605 [015] d... 6112.350658: update_tg_load_avg.constprop.124:
> >> cfs_of_cpu=5 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
> >> tg->load_avg = 0 delta = 0 ###
> >> systemd-udevd-894 [005] d... 6112.351096: update_tg_load_avg.constprop.124:
> >> cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
> >> tg->load_avg = 0 delta = 1024 ###
> >> systemd-udevd-894 [005] d... 6112.351165: update_tg_load_avg.constprop.124:
> >> cfs_of_cpu=5 cfs_rq->avg.load_avg = 10, cfs_rq->tg_load_avg_contrib = 0
> >> tg->load_avg = 1024 delta = 10 ###
> >>
> >> .........................
> >> .........................
> >> cat-128667 [006] d... 6112.504633: update_tg_load_avg.constprop.124:
> >> cfs_of_cpu=11 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
> >> tg->load_avg = 3085 delta = 0 ###
> >> .........................
> >> sh-142414 [006] d... 6112.505392: update_tg_load_avg.constprop.124:
> >> cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
> >> tg->load_avg = 4041 delta = 1024 ###
> >> .........................
> >> cat-128667 [006] d... 6112.504633: update_tg_load_avg.constprop.124:
> >> cfs_of_cpu=11 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
> >> tg->load_avg = 3085 delta = 0 ###
> >> ..........................
> >> sh-142414 [006] d... 6112.505392: update_tg_load_avg.constprop.124:
> >> cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 0
> >> tg->load_avg = 4041 delta = 1024 ###
> >> ..........................
> >> systemd-run-142416 [011] d.h. 6112.506547: update_tg_load_avg.constprop.124:
> >> cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 1024
> >> tg->load_avg = 3010 delta = 0 ###
> >> ..........................
> >> systemd-run-142416 [011] d.h. 6112.507546: update_tg_load_avg.constprop.124:
> >> cfs_of_cpu=11 cfs_rq->avg.load_avg = 1024, cfs_rq->tg_load_avg_contrib = 1024
> >> tg->load_avg = 3010 delta = 0 ### <-- last invocation for cfs_rq[11]
> >>
> >> ..........................
> >> ..........................
> >> <idle>-0 [001] d.s. 6113.868542: update_tg_load_avg.constprop.124:
> >> cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
> >> tg->load_avg = 1027 delta = 0 ###
> >> <idle>-0 [001] d.s. 6113.869542: update_tg_load_avg.constprop.124:
> >> cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
> >> tg->load_avg = 1027 delta = 0 ###
> >> <idle>-0 [001] d.s. 6113.870541: update_tg_load_avg.constprop.124:
> >> cfs_of_cpu=2 cfs_rq->avg.load_avg = 0, cfs_rq->tg_load_avg_contrib = 0
> >> tg->load_avg = 1027 delta = 0 ###
> >>
> >>
> >> If I understand correctly, when CPU 11 is offlined the task(s) on its cfs_rq
> >> will be migrated and its cfs_rq.avg.load_avg will be updated accordingly. This
> >> drop is cfs_rq.avg.load_avg will be detected by update_tg_load_avg and hence
> >> the contribution of this cfs_rq will get deducted from tg->load_avg.
> >> It looks like during hotplug load of one or more tasks, being migrated are
> >> not getting accounted in the source cfs_rq and this is ending up as residual
> >> load_avg at task_group (if these tasks are members of a task_group).
> >>
> >> Moreover this looks racy and dependent on number of CPUs or some delay.
> >> For example for scale down from 124 to 4 CPUs I always hit this issue but
> >> for scale down from 16 to 4 CPUs I hit this issue 8-9 out of 10 times.
> >> Also for the cases when residual load_avg at task group is low (like < 10),
> >> I can see that both of my test cgroups get similar CPU times which further
> >> proves that the unaccounted load avg ending up in a task_group is eventually
> >> leading to uneven CPU allotment between task groups.
> >>
> >>
> >> I am debugging it further but in the mean time if you have some suggestions or
> >> need traces from some specific portion of sched code, please let me know.
> >>
> >> Thanks,
> >> Imran
> >>
> >>> Thanks,
> >>> Imran
> >>>

2023-12-21 08:21:27

by Aaron Lu

Subject: Re: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.

Hi Vincent,

Thanks for the heads up.

On Wed, Dec 20, 2023 at 06:31:08PM +0100, Vincent Guittot wrote:
> Aaron,
> Could you run the tests that you run for testing "Ratelimit update to
> tg->load_avg" and check if this patch impacts your performance ?

I ran hackbench/netperf/postgres_sysbench with nr_thread=nr_cpu on a
2-socket/120-core/240-CPU Intel server and didn't notice any performance
change after applying your diff, so I think it's not a problem.

Regards,
Aaron

2023-12-21 15:28:36

by Vincent Guittot

Subject: Re: [RFC PATCH] sched: fair: reset task_group.load_avg when there are no running tasks.

Hi Aaron,

On Thu, 21 Dec 2023 at 09:21, Aaron Lu <[email protected]> wrote:
>
> Hi Vincent,
>
> Thanks for the heads up.
>
> On Wed, Dec 20, 2023 at 06:31:08PM +0100, Vincent Guittot wrote:
> > Aaron,
> > Could you run the tests that you run for testing "Ratelimit update to
> > tg->load_avg" and check if this patch impacts your performance ?
>
> I run hackbench/netperf/postgres_sysbench with nr_thread=nr_cpu on a 2
> sockets/120cores/240cpus Intel server and didn't notice any performance
> change after applying your diff so I think it's not a problem.

Thanks for the tests.

I'm going to prepare a proper patch.

>
> Regards,
> Aaron