2021-04-25 08:12:36

by Odin Ugedal

Subject: [PATCH 1/1] sched/fair: Fix unfairness caused by missing load decay

This fixes an issue where old load on a cfs_rq is not properly decayed,
resulting in strange behavior where fairness can decrease drastically.
Real workloads with equally weighted control groups have ended up
getting 99% and 1%(!!) of cpu time, respectively.

When an idle task is attached to a cfs_rq by attaching a pid to a cgroup,
the old load of the task is attached to the new cfs_rq and sched_entity by
attach_entity_cfs_rq. If the task is then moved to another cpu (and
therefore cfs_rq) before being enqueued/woken up, the load will be moved
from the sched_entity to cfs_rq->removed. Such a move will happen when
enforcing a cpuset on the task (e.g. via a cgroup) that forces it to move.

The load will however not be removed from the task_group itself, making
it look like there is a constant load on that cfs_rq. This causes the
vruntime of tasks on other sibling cfs_rq's to increase faster than it
should, causing severe fairness issues. If no other task is ever started
on the given cfs_rq (and due to the cpuset, none will be), this load is
never properly decayed. With this patch the load will be properly removed
inside update_blocked_averages. The same applies to tasks switched to the
fair scheduling class and then moved to another cpu; this patch fixes that
path as well. For fork, the entity is queued right away, so this problem
does not affect that case.

For a simple cgroup hierarchy (as seen below) with two equally weighted
groups, which in theory should get 50/50 of cpu time each, it often
results in a split of 60/40 or 70/30 (a set-up sketch follows the
hierarchy).

parent/
cg-1/
cpu.weight: 100
cpuset.cpus: 1
cg-2/
cpu.weight: 100
cpuset.cpus: 1
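
This hierarchy can be set up roughly as follows (a minimal sketch, not
from the original report; it assumes cgroup v2 mounted at /sys/fs/cgroup
with the cpu and cpuset controllers available):

  cd /sys/fs/cgroup
  echo "+cpu +cpuset" > cgroup.subtree_control    # enable controllers for children
  mkdir parent
  echo "+cpu +cpuset" > parent/cgroup.subtree_control
  mkdir parent/cg-1 parent/cg-2
  echo 100 > parent/cg-1/cpu.weight
  echo 1 > parent/cg-1/cpuset.cpus                # pin both groups to cpu 1
  echo 100 > parent/cg-2/cpu.weight
  echo 1 > parent/cg-2/cpuset.cpus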

If the hierarchy is deeper (as seen below), while keeping cg-1 and cg-2
equally weighted, they should still get a 50/50 balance of cpu time.
This however sometimes results in a balance of 10/90 or 1/99(!!) between
the task groups.

$ ps u -C stress
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18568 1.1 0.0 3684 100 pts/12 R+ 13:36 0:00 stress --cpu 1
root 18580 99.3 0.0 3684 100 pts/12 R+ 13:36 0:09 stress --cpu 1

parent/
cg-1/
cpu.weight: 100
sub-group/
cpu.weight: 1
cpuset.cpus: 1
cg-2/
cpu.weight: 100
sub-group/
cpu.weight: 10000
cpuset.cpus: 1

This can be reproduced by attaching an idle process to a cgroup and
moving it to a given cpuset before it wakes up. The issue is evident in
many (if not most) container runtimes, and has been reproduced
with both crun and runc (and therefore docker and all its "derivatives"),
and with both cgroup v1 and v2.
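
A rough reproduction sketch (illustrative only, not the exact reproducer
behind the numbers above; it assumes the cgroup v2 hierarchy from the
first example and a shell running on a cpu other than cpu 1):

  sh -c 'while :; do :; done' &    # a plain cpu hog, started outside the hierarchy
  pid=$!
  kill -STOP $pid                  # park the task so it is idle (not enqueued)
  # attach while idle; the cgroup's cpuset.cpus forces a migration to cpu 1
  echo $pid > /sys/fs/cgroup/parent/cg-1/cgroup.procs
  kill -CONT $pid                  # wakes on cpu 1; the load attached on the old cpu never decays

The stale contribution then shows up in /proc/sched_debug as a tg_load_avg
that stays much larger than the sum of the tg_load_avg_contrib values,
similar to the sched_debug output later in this thread.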

Fixes: 3d30544f0212 ("sched/fair: Apply more PELT fixes")
Signed-off-by: Odin Ugedal <[email protected]>
---
kernel/sched/fair.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb945f8..ad7556f99b4a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10916,6 +10916,19 @@ static void attach_task_cfs_rq(struct task_struct *p)

if (!vruntime_normalized(p))
se->vruntime += cfs_rq->min_vruntime;
+
+ /*
+ * Make sure the attached load will decay properly
+ * in case the task is moved to another cpu before
+ * being queued.
+ */
+ if (!task_on_rq_queued(p)) {
+ for_each_sched_entity(se) {
+ if (se->on_rq)
+ break;
+ list_add_leaf_cfs_rq(cfs_rq_of(se));
+ }
+ }
}

static void switched_from_fair(struct rq *rq, struct task_struct *p)
--
2.31.1


2021-04-27 14:28:07

by Vincent Guittot

Subject: Re: [PATCH 1/1] sched/fair: Fix unfairness caused by missing load decay

On Sunday, 25 April 2021 at 10:09:02 (+0200), Odin Ugedal wrote:
> This fixes an issue where old load on a cfs_rq is not properly decayed,
> resulting in strange behavior where fairness can decrease drastically.
> Real workloads with equally weighted control groups have ended up
> getting a respective 99% and 1%(!!) of cpu time.
>
> When an idle task is attached to a cfs_rq by attaching a pid to a cgroup,
> the old load of the task is attached to the new cfs_rq and sched_entity by
> attach_entity_cfs_rq. If the task is then moved to another cpu (and
> therefore cfs_rq) before being enqueued/woken up, the load will be moved
> to cfs_rq->removed from the sched_entity. Such a move will happen when
> enforcing a cpuset on the task (eg. via a cgroup) that force it to move.

Would be good to mention that the problem happens only if the new cfs_rq has
been removed from the leaf_cfs_rq_list because its PELT metrics were already
null. In such a case, __update_blocked_fair() never updates the blocked load
of the new cfs_rq and never propagates the removed load in the hierarchy.

>
> The load will however not be removed from the task_group itself, making
> it look like there is a constant load on that cfs_rq. This causes the
> vruntime of tasks on other sibling cfs_rq's to increase faster than they
> are supposed to; causing severe fairness issues. If no other task is
> started on the given cfs_rq, and due to the cpuset it would not happen,
> this load would never be properly unloaded. With this patch the load
> will be properly removed inside update_blocked_averages. This also
> applies to tasks moved to the fair scheduling class and moved to another
> cpu, and this path will also fix that. For fork, the entity is queued
> right away, so this problem does not affect that.
>
> For a simple cgroup hierarchy (as seen below) with two equally weighted
> groups, that in theory should get 50/50 of cpu time each, it often leads
> to a load of 60/40 or 70/30.
>
> parent/
> cg-1/
> cpu.weight: 100
> cpuset.cpus: 1
> cg-2/
> cpu.weight: 100
> cpuset.cpus: 1
>
> If the hierarchy is deeper (as seen below), while keeping cg-1 and cg-2
> equally weighted, they should still get a 50/50 balance of cpu time.
> This however sometimes results in a balance of 10/90 or 1/99(!!) between
> the task groups.
>
> $ ps u -C stress
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> root 18568 1.1 0.0 3684 100 pts/12 R+ 13:36 0:00 stress --cpu 1
> root 18580 99.3 0.0 3684 100 pts/12 R+ 13:36 0:09 stress --cpu 1
>
> parent/
> cg-1/
> cpu.weight: 100
> sub-group/
> cpu.weight: 1
> cpuset.cpus: 1
> cg-2/
> cpu.weight: 100
> sub-group/
> cpu.weight: 10000
> cpuset.cpus: 1
>
> This can be reproduced by attaching an idle process to a cgroup and
> moving it to a given cpuset before it wakes up. The issue is evident in
> many (if not most) container runtimes, and has been reproduced
> with both crun and runc (and therefore docker and all its "derivatives"),
> and with both cgroup v1 and v2.
>
> Fixes: 3d30544f0212 ("sched/fair: Apply more PELT fixes")

The Fixes tag should be:
Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")

That patch re-introduced the removal of idle cfs_rqs from leaf_cfs_rq_list in
order to skip useless updates of blocked load.


> Signed-off-by: Odin Ugedal <[email protected]>
> ---
> kernel/sched/fair.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 794c2cb945f8..ad7556f99b4a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10916,6 +10916,19 @@ static void attach_task_cfs_rq(struct task_struct *p)
>
> if (!vruntime_normalized(p))
> se->vruntime += cfs_rq->min_vruntime;
> +
> + /*
> + * Make sure the attached load will decay properly
> + * in case the task is moved to another cpu before
> + * being queued.
> + */
> + if (!task_on_rq_queued(p)) {
> + for_each_sched_entity(se) {
> + if (se->on_rq)
> + break;
> + list_add_leaf_cfs_rq(cfs_rq_of(se));
> + }
> + }

propagate_entity_cfs_rq() already goes across the tg tree to
propagate the attach/detach.

It would be better to call list_add_leaf_cfs_rq(cfs_rq) inside this function
instead of looping over the tg tree twice. Something like:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 33b1ee31ae0f..18441ce7316c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11026,10 +11026,10 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);

- if (cfs_rq_throttled(cfs_rq))
- break;
+ if (!cfs_rq_throttled(cfs_rq))
+ update_load_avg(cfs_rq, se, UPDATE_TG);

- update_load_avg(cfs_rq, se, UPDATE_TG);
+ list_add_leaf_cfs_rq(cfs_rq);
}
}
#else


>
> }
>
> static void switched_from_fair(struct rq *rq, struct task_struct *p)
> --
> 2.31.1
>

2021-04-28 17:24:38

by Odin Ugedal

Subject: Re: [PATCH 1/1] sched/fair: Fix unfairness caused by missing load decay

Hi,

> Would be good to mention that the problem happens only if the new cfs_rq has
> been removed from the leaf_cfs_rq_list because its PELT metrics were already
> null. In such case __update_blocked_fair() never updates the blocked load of
> the new cfs_rq and never propagate the removed load in the hierarchy.

Well, it does technically occur when the PELT metrics are null and the cfs_rq
has therefore been removed from the leaf_cfs_rq_list, that is correct. We do
however not add newly created cfs_rq's to the leaf_cfs_rq_list, so that is
also a reason for it to occur. Most users of cgroups probably create a new
cgroup and then attach a process to it, so I think that will be the _biggest_
issue.

> The fix tag should be :
> Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
>
> This patch re-introduced the del of idle cfs_rq from leaf_cfs_rq_list in order to
> skip useless update of blocked load.

Thanks for pointing me at that patch! A quick look makes me think that that
commit caused the issue to occur _more often_, but was not the one that
introduced it. I should probably investigate a bit more though, since I
didn't dig that deep into it. It is not a clean revert of that patch on
v5.12, but I did apply the diff below to test; it essentially undoes what
039ae8bcf7a5 does, as far as I can see. There might however have been more
commits in between, so I might look further back to check.

Doing this does make the problem less severe, resulting in a ~90/10 load
split in the example that gives ~99/1 without the diff. So even with this
diff/with 039ae8bcf7a5 reverted, there is still an issue.

Should I keep two "Fixes", or should I just take one of them?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb945f8..5fac4fbf6f84 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7941,8 +7941,8 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
* There can be a lot of idle CPU cgroups. Don't let fully
* decayed cfs_rqs linger on the list.
*/
- if (cfs_rq_is_decayed(cfs_rq))
- list_del_leaf_cfs_rq(cfs_rq);
+ // if (cfs_rq_is_decayed(cfs_rq))
+ // list_del_leaf_cfs_rq(cfs_rq);

/* Don't need periodic decay once load/util_avg are null */
if (cfs_rq_has_blocked(cfs_rq))

> propagate_entity_cfs_rq() already goes across the tg tree to
> propagate the attach/detach.
>
> would be better to call list_add_leaf_cfs_rq(cfs_rq) inside this function
> instead of looping twice the tg tree. Something like:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 33b1ee31ae0f..18441ce7316c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11026,10 +11026,10 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
> for_each_sched_entity(se) {
> cfs_rq = cfs_rq_of(se);
>
> - if (cfs_rq_throttled(cfs_rq))
> - break;
> + if (!cfs_rq_throttled(cfs_rq))
> + update_load_avg(cfs_rq, se, UPDATE_TG);
>
> - update_load_avg(cfs_rq, se, UPDATE_TG);
> + list_add_leaf_cfs_rq(cfs_rq);
> }
> }
> #else


Thanks for that feedback!

I did think about that, but was not sure what would be the best one.
If it is "safe" to always run list_add_leaf_cfs_rq there (since it is used in
more places than just on cgroup change and move to fair class), I do agree
that that is a better solution. Will test that, and post a new patch
if it works as expected.

Also, the current code will exit from the loop in case a cfs_rq is throttled,
while your suggestion will keep looping. For list_add_leaf_cfs_rq that is fine
(and required), but should we keep running update_load_avg? I do think it is ok,
and the likelihood of a cfs_rq being throttled is not that high after all, so
I guess it doesn't really matter.

Thanks
Odin

2021-04-28 19:40:08

by Vincent Guittot

Subject: Re: [PATCH 1/1] sched/fair: Fix unfairness caused by missing load decay

On Wed, 28 Apr 2021 at 15:10, Odin Ugedal <[email protected]> wrote:
>
> Hi,
>
> > Would be good to mention that the problem happens only if the new cfs_rq has
> > been removed from the leaf_cfs_rq_list because its PELT metrics were already
> > null. In such case __update_blocked_fair() never updates the blocked load of
> > the new cfs_rq and never propagate the removed load in the hierarchy.
>
> Well, it does technically occur when PELT metrics were null and therefore
> removed from this leaf_cfs_rq_list, that is correct. We do however not add
> newly created cfs_rq's to leaf_cfs_rq_list, so that is also a reason for it

You're right that we wait for the 1st task to be enqueued to add the
cfs_rq to the list.

> to occur. Most users of cgroups are probably creating a new cgroup and then
> attaching a process to it, so I think that will be the _biggest_ issue.

Yes, I agree that according to your sequence, your problem mainly
comes from this and not the commit below

>
> > The fix tag should be :
> > Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
> >
> > This patch re-introduced the del of idle cfs_rq from leaf_cfs_rq_list in order to
> > skip useless update of blocked load.
>
> Thanks for pointing me at that patch! A quick look makes me think that that
> commit caused the issue to occur _more often_, but was not the one that
> introduced it. I should probably investigate a bit more tho., since I didn't
> dig that deep in it. It is not a clean revert for that patch on v5.12,
> but I did apply the diff below to test. It is essentially what the patch
> 039ae8bcf7a5 does, as far as I see. There might however been more commits
> beteen those, so I might take a look further behind to see.
>
> Doing this does make the problem less severe, resulting in ~90/10 load on the
> example that without the diff results in ~99/1. So with this diff/reverting
> 039ae8bcf7a5, there is still an issue.
>
> Should I keep two "Fixes", or should I just take one of them?

You can keep both fixes tags

>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 794c2cb945f8..5fac4fbf6f84 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7941,8 +7941,8 @@ static bool __update_blocked_fair(struct rq *rq,
> bool *done)
> * There can be a lot of idle CPU cgroups. Don't let fully
> * decayed cfs_rqs linger on the list.
> */
> - if (cfs_rq_is_decayed(cfs_rq))
> - list_del_leaf_cfs_rq(cfs_rq);
> + // if (cfs_rq_is_decayed(cfs_rq))
> + // list_del_leaf_cfs_rq(cfs_rq);
>
> /* Don't need periodic decay once load/util_avg are null */
> if (cfs_rq_has_blocked(cfs_rq))
>
> > propagate_entity_cfs_rq() already goes across the tg tree to
> > propagate the attach/detach.
> >
> > would be better to call list_add_leaf_cfs_rq(cfs_rq) inside this function
> > instead of looping twice the tg tree. Something like:
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 33b1ee31ae0f..18441ce7316c 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -11026,10 +11026,10 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
> > for_each_sched_entity(se) {
> > cfs_rq = cfs_rq_of(se);
> >
> > - if (cfs_rq_throttled(cfs_rq))
> > - break;
> > + if (!cfs_rq_throttled(cfs_rq))
> > + update_load_avg(cfs_rq, se, UPDATE_TG);
> >
> > - update_load_avg(cfs_rq, se, UPDATE_TG);
> > + list_add_leaf_cfs_rq(cfs_rq);
> > }
> > }
> > #else
>
>
> Thanks for that feedback!
>
> I did think about that, but was not sure what would be the best one.
> If it is "safe" to always run list_add_leaf_cfs_rq there (since it is used in

If the cfs_rq is already in the list, list_add_leaf_cfs_rq() will exit
early, but if it's not, we have to make sure that the whole branch is
added to the list.

In fact, we can break as soon as both list_add_leaf_cfs_rq() and
cfs_rq_throttled() return true.

> more places than just on cgroup change and move to fair class), I do agree
> that that is a better solution. Will test that, and post a new patch
> if it works as expected.
>
> Also, the current code will exit from the loop in case a cfs_rq is throttled,
> while your suggestion will keep looping. For list_add_leaf_cfs_rq that is fine
> (and required), but should we keep running update_load_avg? I do think it is ok,

When a cfs_rq is throttled, it is not accounted in its parent anymore
so we don't have to update and propagate the load down.

> and the likelihood of a cfs_rq being throttled is not that high after all, so
> I guess it doesn't really matter.
>
> Thanks
> Odin

2021-04-28 20:50:05

by Dietmar Eggemann

Subject: Re: [PATCH 1/1] sched/fair: Fix unfairness caused by missing load decay

On 28/04/2021 17:35, Vincent Guittot wrote:
> On Wed, 28 Apr 2021 at 15:10, Odin Ugedal <[email protected]> wrote:
>>
>> Hi,
>>
>>> Would be good to mention that the problem happens only if the new cfs_rq has
>>> been removed from the leaf_cfs_rq_list because its PELT metrics were already
>>> null. In such case __update_blocked_fair() never updates the blocked load of
>>> the new cfs_rq and never propagate the removed load in the hierarchy.
>>
>> Well, it does technically occur when PELT metrics were null and therefore
>> removed from this leaf_cfs_rq_list, that is correct. We do however not add
>> newly created cfs_rq's to leaf_cfs_rq_list, so that is also a reason for it
>
> You're right that we wait for the 1st task to be enqueued to add the
> cfs_rq in the list
>
>> to occur. Most users of cgroups are probably creating a new cgroup and then
>> attaching a process to it, so I think that will be the _biggest_ issue.
>
> Yes, I agree that according to your sequence, your problem mainly
> comes from this and not the commit below
>
>>
>>> The fix tag should be :
>>> Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
>>>
>>> This patch re-introduced the del of idle cfs_rq from leaf_cfs_rq_list in order to
>>> skip useless update of blocked load.
>>
>> Thanks for pointing me at that patch! A quick look makes me think that that
>> commit caused the issue to occur _more often_, but was not the one that
>> introduced it. I should probably investigate a bit more tho., since I didn't
>> dig that deep in it. It is not a clean revert for that patch on v5.12,
>> but I did apply the diff below to test. It is essentially what the patch
>> 039ae8bcf7a5 does, as far as I see. There might however been more commits
>> beteen those, so I might take a look further behind to see.
>>
>> Doing this does make the problem less severe, resulting in ~90/10 load on the
>> example that without the diff results in ~99/1. So with this diff/reverting
>> 039ae8bcf7a5, there is still an issue.
>>
>> Should I keep two "Fixes", or should I just take one of them?
>
> You can keep both fixes tags
>
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 794c2cb945f8..5fac4fbf6f84 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7941,8 +7941,8 @@ static bool __update_blocked_fair(struct rq *rq,
>> bool *done)
>> * There can be a lot of idle CPU cgroups. Don't let fully
>> * decayed cfs_rqs linger on the list.
>> */
>> - if (cfs_rq_is_decayed(cfs_rq))
>> - list_del_leaf_cfs_rq(cfs_rq);
>> + // if (cfs_rq_is_decayed(cfs_rq))
>> + // list_del_leaf_cfs_rq(cfs_rq);
>>
>> /* Don't need periodic decay once load/util_avg are null */
>> if (cfs_rq_has_blocked(cfs_rq))
>>
>>> propagate_entity_cfs_rq() already goes across the tg tree to
>>> propagate the attach/detach.
>>>
>>> would be better to call list_add_leaf_cfs_rq(cfs_rq) inside this function
>>> instead of looping twice the tg tree. Something like:
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 33b1ee31ae0f..18441ce7316c 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -11026,10 +11026,10 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
>>> for_each_sched_entity(se) {
>>> cfs_rq = cfs_rq_of(se);
>>>
>>> - if (cfs_rq_throttled(cfs_rq))
>>> - break;
>>> + if (!cfs_rq_throttled(cfs_rq))
>>> + update_load_avg(cfs_rq, se, UPDATE_TG);
>>>
>>> - update_load_avg(cfs_rq, se, UPDATE_TG);
>>> + list_add_leaf_cfs_rq(cfs_rq);
>>> }
>>> }
>>> #else
>>
>>
>> Thanks for that feedback!
>>
>> I did think about that, but was not sure what would be the best one.
>> If it is "safe" to always run list_add_leaf_cfs_rq there (since it is used in
>
> If the cfs_rq is already in the list list_add_leaf_cfs_rq() will exit
> early but if it's not, we don't have to make sure that the whole
> branch in the list
>
> In fact, we can break as soon as list_add_leaf_cfs_rq() and
> cfs_rq_throttled() return true
>
>> more places than just on cgroup change and move to fair class), I do agree
>> that that is a better solution. Will test that, and post a new patch
>> if it works as expected.
>>
>> Also, the current code will exit from the loop in case a cfs_rq is throttled,
>> while your suggestion will keep looping. For list_add_leaf_cfs_rq that is fine
>> (and required), but should we keep running update_load_avg? I do think it is ok,
>
> When a cfs_rq is throttled, it is not accounted in its parent anymore
> so we don't have to update and propagate the load down.
>
>> and the likelihood of a cfs_rq being throttled is not that high after all, so
>> I guess it doesn't really matter.

I think what I see on my Juno running the unfairness_missing_load_decay.sh
script is in sync with what you discussed here:

[ 91.458079] update_tg_load_avg: from=update_load_avg_2 cpu0 cgroup=/slice/cg-1/sub delta=-769 cfs_rq->tg->load_avg=858->89 [sh 1690]
[ 91.474596] update_tg_load_avg: from=update_load_avg_2 cpu0 cgroup=/slice/cg-1 delta=-32551 cfs_rq->tg->load_avg=32915->364 [ -1]
[ 91.492881] update_tg_load_avg: from=update_load_avg_2 cpu0 cgroup=/slice delta=-768 cfs_rq->tg->load_avg=776->8 [ -1]

...
[ 91.514164] dump_stack+0xd0/0x12c
[ 91.517577] attach_entity_cfs_rq+0xe4/0xec
[ 91.521773] task_change_group_fair+0x98/0x190 <---- !!!
[ 91.526228] sched_move_task+0x78/0x1e0
[ 91.530078] cpu_cgroup_attach+0x34/0x70
[ 91.534013] cgroup_migrate_execute+0x32c/0x3e4
[ 91.538558] cgroup_migrate+0x78/0x90
[ 91.542231] cgroup_attach_task+0x110/0x11c
[ 91.546425] __cgroup1_procs_write.constprop.0+0x128/0x170
...

[ 91.597490] update_tg_load_avg: from=attach_entity_cfs_rq *cpu2* cgroup=/slice/cg-2/sub delta=28 cfs_rq->tg->load_avg=0->28 [sh 1701]
[ 91.609437] update_tg_load_avg: from=update_load_avg_2 cpu2 cgroup=/slice/cg-2 delta=27 cfs_rq->tg->load_avg=0->27 [ -1]
[ 91.621033] update_tg_load_avg: from=update_load_avg_2 cpu2 cgroup=/slice delta=27 cfs_rq->tg->load_avg=8->35 [ -1]
[ 91.632763] update_tg_load_avg: from=__update_blocked_fair cpu0 cgroup=/slice/cg-1/sub delta=-7 cfs_rq->tg->load_avg=89->82 [ -1]
[ 91.644470] update_tg_load_avg: from=update_load_avg_2 cpu0 cgroup=/slice delta=-7 cfs_rq->tg->load_avg=35->28 [ -1]
[ 91.656178] update_tg_load_avg: from=update_load_avg_2 cpu0 cgroup=/slice/cg-1 delta=-355 cfs_rq->tg->load_avg=364->9 [ -1]
[ 91.656233] update_tg_load_avg: from=update_load_avg_2 cpu2 cgroup=/slice/cg-2 delta=-27 cfs_rq->tg->load_avg=27->0 [ -1]
[ 91.668272] update_tg_load_avg: from=attach_entity_cfs_rq cpu0 cgroup=/slice/cg-1/sub delta=1024 cfs_rq->tg->load_avg=82->1106 [stress 1706]
[ 91.679707] update_tg_load_avg: from=update_load_avg_2 cpu2 cgroup=/slice delta=-27 cfs_rq->tg->load_avg=28->1 [ -1]
[ 91.692419] update_tg_load_avg: from=update_load_avg_2 cpu0 cgroup=/slice/cg-1 delta=46330 cfs_rq->tg->load_avg=9->46339 [ -1]
[ 91.716193] update_tg_load_avg: from=update_load_avg_2 cpu0 cgroup=/slice delta=1022 cfs_rq->tg->load_avg=1->1023 [ -1]

[ 91.749936] dump_stack+0xd0/0x12c
[ 91.753348] update_load_avg+0x460/0x490
[ 91.757283] enqueue_task_fair+0xe8/0x7fc
[ 91.761303] ttwu_do_activate+0x68/0x160
[ 91.765239] try_to_wake_up+0x1f4/0x594
...

[ 91.833275] update_tg_load_avg: from=update_load_avg_1 *cpu0* cgroup=/slice/cg-2/sub delta=28 cfs_rq->tg->load_avg=28->56 [sh 1701]
[ 91.845307] list_add_leaf_cfs_rq: cpu0 cgroup=/slice/cg-2/sub
[ 91.851068] list_add_leaf_cfs_rq: cpu0 cgroup=/slice/cg-2

*cpu2* cgroup=/slice/cg-2 is not on the leaf_cfs_rq list.


root@juno:~# cat /proc/sched_debug | grep ":/slice" -A 28 | egrep "(:/slice)|load_avg"

...
cfs_rq[0]:/slice/cg-2
.load_avg : 1
.removed.load_avg : 0
.tg_load_avg_contrib : 1
.tg_load_avg : 89 <--- !!!
.se->avg.load_avg : 21
...

with the fix:

root@juno:~# cat /proc/sched_debug | grep ":/slice" -A 28 | egrep "(:/slice)|load_avg"

...
cfs_rq[0]:/slice/cg-2
.load_avg : 2
.removed.load_avg : 0
.tg_load_avg_contrib : 2
.tg_load_avg : 2 <--- !!!
.se->avg.load_avg : 1023
...

Snippet I used:


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb945f8..7214e6e89820 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10854,6 +10854,8 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
break;

update_load_avg(cfs_rq, se, UPDATE_TG);
+ if (!cfs_rq_is_decayed(cfs_rq))
+ list_add_leaf_cfs_rq(cfs_rq);
}
}

2021-05-01 14:35:49

by Odin Ugedal

Subject: Re: [PATCH 1/1] sched/fair: Fix unfairness caused by missing load decay

On Wed, 28 Apr 2021 at 17:36, Vincent Guittot <[email protected]> wrote:
> You can keep both fixes tags

ACK

> If the cfs_rq is already in the list list_add_leaf_cfs_rq() will exit
> early but if it's not, we don't have to make sure that the whole
> branch in the list

Yeah, that's right. Calling list_add_leaf_cfs_rq once "too much" doesn't
hurt after all.

> In fact, we can break as soon as list_add_leaf_cfs_rq() and
> cfs_rq_throttled() return true

ACK, that makes sense.

> When a cfs_rq is throttled, it is not accounted in its parent anymore
> so we don't have to update and propagate the load down.

Okay. I still need to wrap my head around this a bit more, I guess. I have
looked around a bit, and there is actually a similar issue to "this one" for
the case when a throttled cgroup is "moved" via cpuset. It is however waaay
harder to reproduce, but it is doable, and it _will_ happen in real life
systems if the timing is "correct". I will dig deeper and finish the patch
for that case some time next week (hopefully). I think that however deserves
a separate patchset, so I will come back with that later.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 33b1ee31ae0f..18441ce7316c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11026,10 +11026,10 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
> for_each_sched_entity(se) {
> cfs_rq = cfs_rq_of(se);
>
> - if (cfs_rq_throttled(cfs_rq))
> - break;
> + if (!cfs_rq_throttled(cfs_rq))
> + update_load_avg(cfs_rq, se, UPDATE_TG);
>
> - update_load_avg(cfs_rq, se, UPDATE_TG);
> + list_add_leaf_cfs_rq(cfs_rq);
> }
> }

Sent a v2 with something like this now, exiting the loop if
(list_add_leaf_cfs_rq(cfs_rq) && throttled). Since this loop starts at the
parent of the cfs_rq of the supplied se, I added a list_add_leaf_cfs_rq call
at the top in order to insert the leaf cfs_rq itself as well.

Thanks
Odin

2021-05-01 14:44:42

by Odin Ugedal

Subject: Re: [PATCH 1/1] sched/fair: Fix unfairness caused by missing load decay

Hi,

> I think what I see on my Juno running the unfairness_missing_load_decay.sh script is
> in sync which what you discussed here:

Thanks for taking a look!

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 794c2cb945f8..7214e6e89820 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10854,6 +10854,8 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
> break;
>
> update_load_avg(cfs_rq, se, UPDATE_TG);
> + if (!cfs_rq_is_decayed(cfs_rq))
> + list_add_leaf_cfs_rq(cfs_rq);
> }
> }

This might however lead to "loss" at /slice/cg-2/sub and /slice/cg-1/sub in
this particular case though, since propagate_entity_cfs_rq skips one cfs_rq
by starting at the parent of the provided se. The loss in that case would
not be as big, but it will still often contribute to some unfairness.


Thanks
Odin

2021-05-05 09:45:13

by Dietmar Eggemann

Subject: Re: [PATCH 1/1] sched/fair: Fix unfairness caused by missing load decay

On 01/05/2021 16:41, Odin Ugedal wrote:
> Hi,
>
>> I think what I see on my Juno running the unfairness_missing_load_decay.sh script is
>> in sync which what you discussed here:
>
> Thanks for taking a look!
>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 794c2cb945f8..7214e6e89820 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -10854,6 +10854,8 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
>> break;
>>
>> update_load_avg(cfs_rq, se, UPDATE_TG);
>> + if (!cfs_rq_is_decayed(cfs_rq))
>> + list_add_leaf_cfs_rq(cfs_rq);
>> }
>> }
>
> This might however lead to "loss" at /slice/cg-2/sub and
> slice/cg-1/sub in this particular case tho, since
> propagate_entity_cfs_rq skips one cfs_rq
> by taking the parent of the provided se. The loss in that case would
> however not be equally big, but will still often contribute to some
> unfairness.

Yeah, that's true.

By moving stopped `stress` tasks into

/sys/fs/cgroup/cpu/slice/cg-{1,2}/sub

and then into

/sys/fs/cgroup/cpuset/A

which has a cpuset.cpus of {0-1,4-5}, i.e. not containing the cpus {2,3} the
`stress` tasks were attached to, and then restarting the `stress` tasks, I
get the following (a rough sketch of this sequence is at the end of this
message):

cfs_rq[1]:/slice/cg-1/sub
.load_avg : 1024
.removed.load_avg : 0
.tg_load_avg_contrib : 1024 <---
.tg_load_avg : 2047 <---
.se->avg.load_avg : 511
cfs_rq[1]:/slice/cg-1
.load_avg : 512
.removed.load_avg : 0
.tg_load_avg_contrib : 512 <---
.tg_load_avg : 1022 <---
.se->avg.load_avg : 512
cfs_rq[1]:/slice
.load_avg : 513
.removed.load_avg : 0
.tg_load_avg_contrib : 513
.tg_load_avg : 1024
.se->avg.load_avg : 512
cfs_rq[5]:/slice/cg-1/sub
.load_avg : 1024
.removed.load_avg : 0
.tg_load_avg_contrib : 1023 <---
.tg_load_avg : 2047 <---
.se->avg.load_avg : 511
cfs_rq[5]:/slice/cg-1
.load_avg : 512
.removed.load_avg : 0
.tg_load_avg_contrib : 510 <---
.tg_load_avg : 1022 <---
.se->avg.load_avg : 511
cfs_rq[5]:/slice
.load_avg : 512
.removed.load_avg : 0
.tg_load_avg_contrib : 511
.tg_load_avg : 1024
.se->avg.load_avg : 510

I saw that your v2 patch takes care of that.
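
The move sequence above could look roughly like this (a sketch only, not
the exact commands used; it assumes cgroup v1 with separate cpu and cpuset
hierarchies as in the paths above, with $pid being one of the `stress`
tasks):

  kill -STOP $pid                                   # stop the stress task
  echo $pid > /sys/fs/cgroup/cpu/slice/cg-1/sub/cgroup.procs
  echo $pid > /sys/fs/cgroup/cpuset/A/cgroup.procs  # A's cpuset.cpus (0-1,4-5) excludes cpus 2-3
  kill -CONT $pid                                   # restart the task (assuming SIGCONT here)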