TL;DR: Read the patch message first; hopefully it makes sense. Feel free
to cc more relevant people, and (as I am pretty new to this) feedback
is highly appreciated!

I thought it was best to split the discussion of the patch from the discussion
of the issue, so here is some (wall of text, sorry) discussion around it:
This patch fixes an extremely strange edge case, discovered more or less by
accident (while testing container runtimes for my thesis), where the fairness
of the cfs scheduler gets skewed quite extensively. This often leads to two
cgroups that are supposed to share cpu time 50/50 instead getting something
like 90/10, or even 99/1 (!!) in some rare cases (see below for examples). I
find this edge case quite interesting, especially since it is so easy to
reproduce on all machines and all (4.8+) kernels.
The issue can be easily reproduced on all kernels 4.8+, and afaik. most
"container runtimes" are affected. I do believe that I have wondered about
this issue earlier, but never actually understood what was causing it. There
is really (afaik.) no way to find the issue from userspace without using
a tracing tool like bpftrace. I am _sure_ this affects real production
environments _everywhere_, at least those running the usual container/cgroup
tooling together with cpu pinning; but verifying that the fairness is skewed
is quite hard, or more or less impossible, in such cases, especially when
running multiple workloads simultaneously. I did inspect some of my production
servers and found instances of this on multiple cgroups.
It _is_ possible to read '/proc/sched_debug', but since that list only prints
the cfs_rq's from "leaf_cfs_rq_list", the cfs_rq's we are _really_ looking for
are missing, making it look like the problem isn't caused by cfs. One can
however use the information to infer that there is an issue (look at the bottom).
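As a rough heuristic (not a definitive check - tg_load_avg is the sum over all
cpus, so small gaps are normal), one can sum the per-cpu contrib values that
sched_debug does print and compare the sum against tg_load_avg; a large gap
that persists on an otherwise idle hierarchy hints at stale load sitting on
cfs_rq's that are missing from leaf_cfs_rq_list. A sketch of that (field names
taken from the sched_debug output further down, threshold chosen arbitrarily):

--- bash start
# Heuristic sketch only: for each task_group, sum the per-cpu
# .tg_load_avg_contrib values printed in sched_debug and compare against
# .tg_load_avg. tg_load_avg is the sum over all cpus, so the interesting
# case is a large gap that persists on an otherwise idle hierarchy.
awk '
/^cfs_rq\[/ {
        path = $1
        sub(/^cfs_rq\[[0-9]+\]:/, "", path)   # strip the "cfs_rq[N]:" prefix
}
/\.tg_load_avg_contrib/ { contrib[path] += $NF }
/\.tg_load_avg /        { tg[path] = $NF }
END {
        for (p in tg)
                if (tg[p] - contrib[p] > 100)   # arbitrary threshold
                        printf "%-30s sum(contrib)=%-8d tg_load_avg=%d\n", p, contrib[p], tg[p]
}' /proc/sched_debug
--- bash end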
Also, another solution may be to not add the load before the task is actually
enqueued, avoiding this issue altogether. But is that better? (I assume that
would be more in the "hot path" and require quite a bit more code.) There may
definitely be a better solution to this that I just haven't found, and that
other, more experienced kernel devs will see right away. :)
Also feel free to suggest additions to or rewording of the patch message and
title, and/or the comment in the code, to make them clearer. This issue was
introduced in 4.8, so all stable (4.9+) releases should probably get this (the
final solution, at least), or?
Below are various ways/scripts to reproduce - sorry this is so long (and
sorry for bash in general), but I thought people might be interested in them.

Note: due to the nature of the issue, the "lost" load is different each time,
so the values change from run to run and sometimes/often end up at ~50/50;
but my testing shows that the skew keeps happening almost every time.

Example on cgroup v1. Often results in a 60/40 load split:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18026 63.0 0.0 3676 100 pts/7 R+ 13:09 0:06 stress --cpu 1
root 18036 36.6 0.0 3676 100 pts/7 R+ 13:09 0:04 stress --cpu 1
--- bash start
CGROUP_CPU=/sys/fs/cgroup/cpu/slice
CGROUP_CPUSET=/sys/fs/cgroup/cpuset/slice
CPU=0

function run_sandbox {
  local CG_CPUSET="$1"
  local CG_CPU="$2"
  local CMD="$3"

  # Start the command blocked on a fifo so it stays idle until released.
  local PIPE="$(mktemp -u)"
  mkfifo "$PIPE"
  sh -c "read < $PIPE ; exec $CMD" &
  local TASK="$!"
  sleep .1

  mkdir -p "$CG_CPUSET"
  mkdir -p "$CG_CPU"
  tee "$CG_CPU"/cgroup.procs <<< "$TASK"
  tee "$CG_CPUSET"/cgroup.procs <<< "$TASK"

  # Release the task so it wakes up after it has been moved.
  tee "$PIPE" <<< sandbox_done
  rm "$PIPE"
}

mkdir -p "$CGROUP_CPU"
mkdir -p "$CGROUP_CPUSET"
tee "$CGROUP_CPUSET"/cpuset.cpus <<< "0"
tee "$CGROUP_CPUSET"/cpuset.mems <<< "0"

run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-1" "stress --cpu 1"
run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-2" "stress --cpu 1"

read # press enter to clean up
killall stress
sleep .2
rmdir /sys/fs/cgroup/cpuset/slice/
rmdir /sys/fs/cgroup/cpu/slice/{cg-1,cg-2,}
--- bash end
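To watch the resulting split while the example runs, a plain ps in a loop is
enough:

--- bash start
# Refresh the per-process cpu usage once a second.
watch -n1 'ps u -C stress'
--- bash end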
Example on cgroup v2 with a sub-cgroup (same setup as described in the commit
message), where both should get 50/50 but instead get 99% and 1% (!!):
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18568 1.1 0.0 3684 100 pts/12 R+ 13:36 0:00 stress --cpu 1
root 18580 99.3 0.0 3684 100 pts/12 R+ 13:36 0:09 stress --cpu 1
--- bash start (in case of systemd: make sure some slice/scope is delegated(?), or use cpusets)
CGROUP_CPU=/sys/fs/cgroup/cpu/slice
CGROUP_CPUSET=/sys/fs/cgroup/cpuset/slice
CPU=0

function run_sandbox {
  local CG_CPUSET="$1"
  local CG_CPU="$2"
  local CMD="$3"

  # Start the command blocked on a fifo so it stays idle until released.
  local PIPE="$(mktemp -u)"
  mkfifo "$PIPE"
  sh -c "read < $PIPE ; exec $CMD" &
  local TASK="$!"
  sleep .01

  mkdir -p "$CG_CPUSET"
  mkdir -p "$CG_CPU"
  tee "$CG_CPU"/cgroup.procs <<< "$TASK"
  tee "$CG_CPUSET"/cgroup.procs <<< "$TASK"

  # Release the task so it wakes up after it has been moved.
  tee "$PIPE" <<< sandbox_done
  rm "$PIPE"
}

mkdir -p "$CGROUP_CPU"
mkdir -p "$CGROUP_CPUSET"
tee "$CGROUP_CPUSET"/cpuset.cpus <<< "0"
tee "$CGROUP_CPUSET"/cpuset.mems <<< "0"

run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-1" "stress --cpu 1"
run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-2" "stress --cpu 1"

read # press enter to clean up
killall stress
sleep .2
rmdir /sys/fs/cgroup/cpuset/slice/
rmdir /sys/fs/cgroup/cpu/slice/{cg-1,cg-2,}
--- bash end
For those who only want to run docker stuff
(on cgroup v1, you must use the systemd cgroup driver):
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 9291 60.1 0.0 7320 96 pts/0 R+ 13:18 0:07 /usr/bin/stress --verbose --cpu 1
root 9388 45.5 0.0 7320 96 pts/0 R+ 13:18 0:04 /usr/bin/stress --verbose --cpu 1
--- bash start
# Run each command in a separate terminal; both containers are pinned to cpu 0.
docker run --cpuset-cpus=0 --rm -it an-image-with-stress
docker run --cpuset-cpus=0 --rm -it an-image-with-stress
--- bash end
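The split can also be observed from the docker side; this is just a
convenience and assumes a docker cli recent enough to support the stats
format options:

--- bash start
# One snapshot of per-container cpu usage; re-run a few times, the values fluctuate.
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}"
--- bash end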
Example for cgroup v1 where the tasks start pinned to cpu 1 (via the parent
shell's cpuset) and are then moved to cpu 0. Often results in a 99/1 load split.
--- bash start
CGROUP_CPU=/sys/fs/cgroup/cpu/slice
CGROUP_CPUSET=/sys/fs/cgroup/cpuset/slice
CGROUP_CPUSET_ME=/sys/fs/cgroup/cpuset/me
CPU=0
CPU_ME=1

function run_sandbox {
  local CG_CPUSET="$1"
  local CG_CPU="$2"
  local INNER_SHARES="$3"
  local CMD="$4"

  # Start the command blocked on a fifo so it stays idle until released.
  local PIPE="$(mktemp -u)"
  mkfifo "$PIPE"
  sh -c "read < $PIPE ; exec $CMD" &
  local TASK="$!"
  sleep .1

  mkdir -p "$CG_CPUSET"
  mkdir -p "$CG_CPU"/sub
  tee "$CG_CPU"/sub/cgroup.procs <<< "$TASK"
  tee "$CG_CPU"/sub/cpu.shares <<< "$INNER_SHARES"
  tee "$CG_CPUSET"/cgroup.procs <<< "$TASK"

  # Release the task so it wakes up after it has been moved.
  tee "$PIPE" <<< sandbox_done
  rm "$PIPE"
}

mkdir -p "$CGROUP_CPU"
mkdir -p "$CGROUP_CPUSET"
mkdir -p "$CGROUP_CPUSET_ME"

tee "$CGROUP_CPUSET"/cpuset.cpus <<< "$CPU"
tee "$CGROUP_CPUSET"/cpuset.mems <<< "$CPU"
tee "$CGROUP_CPUSET_ME"/cpuset.cpus <<< "$CPU_ME"
# cpuset needs mems set before tasks can be attached
tee "$CGROUP_CPUSET_ME"/cpuset.mems <<< "0"
echo $$ | tee "$CGROUP_CPUSET_ME"/cgroup.procs

run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-1" 50000 "stress --cpu 1"
run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-2" 2 "stress --cpu 1"

read # press enter to clean up and stop all stress procs
killall stress
sleep .2
rmdir /sys/fs/cgroup/cpuset/slice/
rmdir /sys/fs/cgroup/cpu/slice/{cg-{1,2}{/sub,},}
--- bash end
To verify what is happening, one can take a close look at sched_debug:
$ cat /proc/sched_debug | grep ":/slice" -A 28 | egrep "(tg_load_avg|slice)"
cfs_rq[0]:/slice/cg-2/sub
.tg_load_avg_contrib : 1026
.tg_load_avg : 1757
cfs_rq[0]:/slice/cg-1/sub
.tg_load_avg_contrib : 1027
.tg_load_avg : 1473
cfs_rq[0]:/slice/cg-1
.tg_load_avg_contrib : 5
.tg_load_avg : 398
cfs_rq[0]:/slice/cg-2
.tg_load_avg_contrib : 60603
.tg_load_avg : 61264
cfs_rq[0]:/slice
.tg_load_avg_contrib : 1024
.tg_load_avg : 1864
We can clearly see that .tg_load_avg_contrib != .tg_load_avg for all the
cfs_rq's. Since on_list == 0 for the other cfs_rq's (the ones not shown here),
the load on those will never be "cleaned up" unless a new task starts on that
cpu, and due to the cpuset that will not happen. slice/cg-1 and slice/cg-2 also
have load attached to cpu 3 that is never removed (but
the patch will properly decay the load on them as well).
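As a side note, this observation also suggests a crude workaround on
unpatched kernels: get a task in the affected cpu cgroup to run briefly on the
cpu holding the stale load, so the cfs_rq is put back on leaf_cfs_rq_list and
update_blocked_averages can decay it. The sketch below is untested and only
illustrative; it assumes cgroup v1 (where cpu and cpuset are separate
hierarchies, so the helper task is not confined by the cpuset) and uses cpu 3
and slice/cg-1 from the example above as hypothetical values:

--- bash start
STALE_CPU=3                        # cpu holding the stale load (hypothetical)
CG=/sys/fs/cgroup/cpu/slice/cg-1   # affected cpu-controller cgroup

taskset -c "$STALE_CPU" sleep 2 &  # start a task pinned to that cpu
tee "$CG"/cgroup.procs <<< "$!"    # move it into the affected cgroup while it sleeps
wait                               # when it wakes up it runs briefly on $STALE_CPU,
                                   # putting the cfs_rq back on the leaf list
--- bash end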
With this patch, all of these examples just end up sharing cpu time in a fair 50/50 way, as expected.
Thanks to Vincent and Dietmar for feedback!
Changes since v1:
- Moved code to avoid looping se hierarchy twice.
- Replaced the bpftrace code with a simpler example using sched_debug in cover letter
- Updated cover letter
Odin Ugedal (1):
sched/fair: Fix unfairness caused by missing load decay
kernel/sched/fair.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
--
2.31.1
This fixes an issue where old load on a cfs_rq is not properly decayed,
resulting in strange behavior where fairness can decrease drastically.
Real workloads with equally weighted control groups have ended up
getting a respective 99% and 1%(!!) of cpu time.
When an idle task is attached to a cfs_rq by attaching a pid to a cgroup,
the old load of the task is attached to the new cfs_rq and sched_entity by
attach_entity_cfs_rq. If the task is then moved to another cpu (and
therefore cfs_rq) before being enqueued/woken up, the load will be moved
to cfs_rq->removed from the sched_entity. Such a move will happen when
enforcing a cpuset on the task (e.g. via a cgroup), forcing it to move.

The load will however not be removed from the task_group itself, making
it look like there is a constant load on that cfs_rq. This causes the
vruntime of tasks on other sibling cfs_rq's to increase faster than they
are supposed to, causing severe fairness issues. If no other task is
started on the given cfs_rq (and due to the cpuset none will be), this
load would never be properly removed. With this patch the load will be
properly removed inside update_blocked_averages. This also applies to
tasks that are moved into the fair scheduling class and then moved to
another cpu; this patch fixes that path as well. For fork, the entity is
queued right away, so this problem does not affect that.

This applies to cases where the new process is the first in the cfs_rq
(an issue introduced in 3d30544f0212 ("sched/fair: Apply more PELT
fixes")), and to cases where there has previously been load on the cgroup
but the cgroup was removed from the leaf list due to having null PELT
load (introduced in 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the
load balancing path")).
For a simple cgroup hierarchy (as seen below) with two equally weighted
groups, which in theory should each get 50% of cpu time, this often leads
to a 60/40 or 70/30 split.

parent/
  cg-1/
    cpu.weight: 100
    cpuset.cpus: 1
  cg-2/
    cpu.weight: 100
    cpuset.cpus: 1
If the hierarchy is deeper (as seen below), while keeping cg-1 and cg-2
equally weighted, they should still get a 50/50 balance of cpu time.
This however sometimes results in a balance of 10/90 or 1/99(!!) between
the task groups.
$ ps u -C stress
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18568 1.1 0.0 3684 100 pts/12 R+ 13:36 0:00 stress --cpu 1
root 18580 99.3 0.0 3684 100 pts/12 R+ 13:36 0:09 stress --cpu 1
parent/
  cg-1/
    cpu.weight: 100
    sub-group/
      cpu.weight: 1
      cpuset.cpus: 1
  cg-2/
    cpu.weight: 100
    sub-group/
      cpu.weight: 10000
      cpuset.cpus: 1
This can be reproduced by attaching an idle process to a cgroup and
moving it to a given cpuset before it wakes up. The issue is evident in
many (if not most) container runtimes, and has been reproduced
with both crun and runc (and therefore docker and all its "derivatives"),
and with both cgroup v1 and v2.
Fixes: 3d30544f0212 ("sched/fair: Apply more PELT fixes")
Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
Signed-off-by: Odin Ugedal <[email protected]>
---
kernel/sched/fair.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb945f8..9e189727a457 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10844,16 +10844,22 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
{
struct cfs_rq *cfs_rq;
+ list_add_leaf_cfs_rq(cfs_rq_of(se));
+
/* Start to propagate at parent */
se = se->parent;
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
- if (cfs_rq_throttled(cfs_rq))
- break;
+ if (!cfs_rq_throttled(cfs_rq)){
+ update_load_avg(cfs_rq, se, UPDATE_TG);
+ list_add_leaf_cfs_rq(cfs_rq);
+ continue;
+ }
- update_load_avg(cfs_rq, se, UPDATE_TG);
+ if (list_add_leaf_cfs_rq(cfs_rq))
+ break;
}
}
#else
--
2.31.1
On Sat, 1 May 2021 at 16:22, Odin Ugedal <[email protected]> wrote:
> [...]
Reviewed-by: Vincent Guittot <[email protected]>
Thanks
On Tue, May 04, 2021 at 09:14:12AM +0200, Vincent Guittot wrote:
> On Sat, 1 May 2021 at 16:22, Odin Ugedal <[email protected]> wrote:
> > [...]
>
> Reviewed-by: Vincent Guittot <[email protected]>
Thanks!
The following commit has been merged into the sched/urgent branch of tip:
Commit-ID: 0258bdfaff5bd13c4d2383150b7097aecd6b6d82
Gitweb: https://git.kernel.org/tip/0258bdfaff5bd13c4d2383150b7097aecd6b6d82
Author: Odin Ugedal <[email protected]>
AuthorDate: Sat, 01 May 2021 16:19:50 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 06 May 2021 15:33:27 +02:00
sched/fair: Fix unfairness caused by missing load decay
This fixes an issue where old load on a cfs_rq is not properly decayed,
resulting in strange behavior where fairness can decrease drastically.
Real workloads with equally weighted control groups have ended up
getting a respective 99% and 1%(!!) of cpu time.
When an idle task is attached to a cfs_rq by attaching a pid to a cgroup,
the old load of the task is attached to the new cfs_rq and sched_entity by
attach_entity_cfs_rq. If the task is then moved to another cpu (and
therefore cfs_rq) before being enqueued/woken up, the load will be moved
to cfs_rq->removed from the sched_entity. Such a move will happen when
enforcing a cpuset on the task (e.g. via a cgroup), forcing it to move.

The load will however not be removed from the task_group itself, making
it look like there is a constant load on that cfs_rq. This causes the
vruntime of tasks on other sibling cfs_rq's to increase faster than they
are supposed to, causing severe fairness issues. If no other task is
started on the given cfs_rq (and due to the cpuset none will be), this
load would never be properly removed. With this patch the load will be
properly removed inside update_blocked_averages. This also applies to
tasks that are moved into the fair scheduling class and then moved to
another cpu; this patch fixes that path as well. For fork, the entity is
queued right away, so this problem does not affect that.

This applies to cases where the new process is the first in the cfs_rq
(an issue introduced in 3d30544f0212 ("sched/fair: Apply more PELT
fixes")), and to cases where there has previously been load on the cgroup
but the cgroup was removed from the leaf list due to having null PELT
load (introduced in 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the
load balancing path")).
For a simple cgroup hierarchy (as seen below) with two equally weighted
groups, which in theory should each get 50% of cpu time, this often leads
to a 60/40 or 70/30 split.

parent/
  cg-1/
    cpu.weight: 100
    cpuset.cpus: 1
  cg-2/
    cpu.weight: 100
    cpuset.cpus: 1
If the hierarchy is deeper (as seen below), while keeping cg-1 and cg-2
equally weighted, they should still get a 50/50 balance of cpu time.
This however sometimes results in a balance of 10/90 or 1/99(!!) between
the task groups.
$ ps u -C stress
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18568 1.1 0.0 3684 100 pts/12 R+ 13:36 0:00 stress --cpu 1
root 18580 99.3 0.0 3684 100 pts/12 R+ 13:36 0:09 stress --cpu 1
parent/
  cg-1/
    cpu.weight: 100
    sub-group/
      cpu.weight: 1
      cpuset.cpus: 1
  cg-2/
    cpu.weight: 100
    sub-group/
      cpu.weight: 10000
      cpuset.cpus: 1
This can be reproduced by attaching an idle process to a cgroup and
moving it to a given cpuset before it wakes up. The issue is evident in
many (if not most) container runtimes, and has been reproduced
with both crun and runc (and therefore docker and all its "derivatives"),
and with both cgroup v1 and v2.
Fixes: 3d30544f0212 ("sched/fair: Apply more PELT fixes")
Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
Signed-off-by: Odin Ugedal <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d75af1..20aa234 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10878,16 +10878,22 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
{
struct cfs_rq *cfs_rq;
+ list_add_leaf_cfs_rq(cfs_rq_of(se));
+
/* Start to propagate at parent */
se = se->parent;
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
- if (cfs_rq_throttled(cfs_rq))
- break;
+ if (!cfs_rq_throttled(cfs_rq)){
+ update_load_avg(cfs_rq, se, UPDATE_TG);
+ list_add_leaf_cfs_rq(cfs_rq);
+ continue;
+ }
- update_load_avg(cfs_rq, se, UPDATE_TG);
+ if (list_add_leaf_cfs_rq(cfs_rq))
+ break;
}
}
#else