2021-04-25 08:12:36

by Odin Ugedal

Subject: [PATCH 0/1] sched/fair: Fix unfairness caused by missing load decay

TL;DR: Read the patch msg first; hopefully it makes sense. Feel free
to cc more relevant people, and (as I am pretty new to this) feedback
is highly appreciated!

I thought it was smart to split the discussion of the patch and the issue, so
here is some (wall of text, sorry) discussion around it:

This patch fixes an extremely strange edge case, discovered more or less by
accident, where the fairness of the cfs scheduler gets skewed quite
extensively. This often leads to two cgroups that are supposed to
share cpu time 50/50 instead getting something like 90/10 or
even 99/1 (!!) in some rare cases (see below for examples). I find this
edge case quite interesting, especially since I am able to reproduce it
so easily on all machines and all (4.8+) kernels.

The issue can be easily reproduced on all kernels 4.8+, and afaik most
"container runtimes" are affected. I believe I have wondered about
this issue earlier, but without actually understanding what was causing it.
There is really (afaik) no way to find the issue from userspace without using
a tracing tool like bpftrace. I am _sure_ this affects real production
environments _everywhere_, at least those running the usual container/cgroup
tooling together with cpu pinning; but verifying that the fairness is
skewed is quite hard, or more or less impossible in such cases,
especially when running multiple workloads simultaneously. I did inspect some
of my production servers, and found instances of this on multiple cgroups.

It _is_ possible to read '/proc/sched_debug', but since that list only prints
the cfs_rq's from "leaf_cfs_rq_list", the cfs_rq's we are _really_ looking for
are missing, making it look like the problem isn't caused by cfs.

I think the current patch is the best way to solve the issue, but I will happily
discuss other possible solutions. One could also simply not propagate the
load to the task_group, but I think the current implementation of that part
is the correct way to do it. This is at least my best solution to avoid adding
logic to a "hotter" path that would also require more logic.

The only thing I am thinking a bit about is what happens if the cfs_rq (or one of
its ancestors) is throttled. In case cfs_rq->throttled==1, I don't think we can
currently skip it, since a cfs_rq is only added back to the leaf_list when
(cfs_rq->nr_running >= 1), and the same applies for (cfs_rq->throttle_count >= 1),
since we cannot guarantee that it will be added again. Will this code interfere
with the throttling mechanism, though? I have tested inserting new procs into
throttled cgroups (or children of throttled ones) with the patch applied,
and it works as I expect it to (but other people might have more insight).
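
For reference, the kind of test I did for that is roughly the following sketch
(my own illustration, with made-up cgroup names, using the cgroup v1 cpu
controller's quota files):

--- bash start
# Throttle a cgroup hard (1ms quota per 100ms period), put a busy loop in it,
# and then move a second process into a child of the throttled cgroup while it
# is throttled, to check that the leaf_cfs_rq_list handling still behaves.
CG=/sys/fs/cgroup/cpu/throttle-test
mkdir -p "$CG"/child
echo 100000 | tee "$CG"/cpu.cfs_period_us
echo 1000   | tee "$CG"/cpu.cfs_quota_us

stress --cpu 1 &
echo $! | tee "$CG"/cgroup.procs

sleep 1   # let the group hit its quota and get throttled

stress --cpu 1 &
echo $! | tee "$CG"/child/cgroup.procs

# cleanup
killall stress; sleep .2
rmdir "$CG"/child "$CG"
--- bash end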

Also, another solution may be to not add the load before the task is actually
enqueued, avoiding this issue altogether. But is that better (I assume it would
be more in the "hot path", and require quite a bit more code)?

There may definitely be a better solution to this that I still haven't found, and
that other, more experienced kernel devs see right away. :)


Also feel free to suggest additions to or rewording of the patch
message and title, and/or the comment in the code, to make it clearer.

This issue was introduced in 4.8, so all stable (4.9+) releases should probably
get this (the final solution, at least), shouldn't they?



Below are various ways/scripts to reproduce - sorry this is so long (and
sorry for bash in general), but I thought people might be interested in them:

note: due to the nature of the issue, the "lost" load is different each time, so
the values change each time, and sometimes/often end up at ~50/50; but my testing
shows that it keeps happening almost every time:


Example on cgroup v1. Often results in a 60/40 load split:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18026 63.0 0.0 3676 100 pts/7 R+ 13:09 0:06 stress --cpu 1
root 18036 36.6 0.0 3676 100 pts/7 R+ 13:09 0:04 stress --cpu 1
--- bash start
CGROUP_CPU=/sys/fs/cgroup/cpu/slice
CGROUP_CPUSET=/sys/fs/cgroup/cpuset/slice
CPU=0

function run_sandbox {
local CG_CPUSET="$1"
local CG_CPU="$2"
local CMD="$3"

local PIPE="$(mktemp -u)"
mkfifo "$PIPE"
sh -c "read < $PIPE ; exec $CMD" &
local TASK="$!"
sleep .1
mkdir -p "$CG_CPUSET"
mkdir -p "$CG_CPU"
tee "$CG_CPU"/cgroup.procs <<< "$TASK"

tee "$CG_CPUSET"/cgroup.procs <<< "$TASK"

tee "$PIPE" <<< sandox_done
rm "$PIPE"
}

mkdir -p "$CGROUP_CPU"
mkdir -p "$CGROUP_CPUSET"
tee "$CGROUP_CPUSET"/cpuset.cpus <<< "0"
tee "$CGROUP_CPUSET"/cpuset.mems <<< "0"

run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-1" "stress --cpu 1"
run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-2" "stress --cpu 1"

read # click enter to cleanup
killall stress
sleep .2
rmdir /sys/fs/cgroup/cpuset/slice/
rmdir /sys/fs/cgroup/cpu/slice/{cg-1,cg-2,}
--- bash end

Example on cgroup v2 with a sub cgroup (same as described in the commit message),
where both should get 50/50, but instead get 99% and 1% (!!).

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18568 1.1 0.0 3684 100 pts/12 R+ 13:36 0:00 stress --cpu 1
root 18580 99.3 0.0 3684 100 pts/12 R+ 13:36 0:09 stress --cpu 1

(in case of systemd on cgroup v2, make sure some slice/scope is delegated and/or
use cpusets, otherwise systemd will fight you)
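
One way to get a delegated piece of the hierarchy to play in (just a sketch,
assuming a recent systemd where Delegate= is accepted for transient scopes) is:

--- bash start
# Start a shell in a transient scope with controller delegation, so systemd
# leaves the cgroups we create below it alone.
sudo systemd-run --scope -p "Delegate=yes" bash
--- bash end

The test cgroups would then be created under that scope's own cgroup directory
(see /proc/self/cgroup) instead of at the top of the hierarchy.
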
--- bash start
CGROUP_CPU=/sys/fs/cgroup/cpu/slice
CGROUP_CPUSET=/sys/fs/cgroup/cpuset/slice
CPU=0

function run_sandbox {
local CG_CPUSET="$1"
local CG_CPU="$2"
local CMD="$3"

local PIPE="$(mktemp -u)"
mkfifo "$PIPE"
sh -c "read < $PIPE ; exec $CMD" &
local TASK="$!"
sleep .01
mkdir -p "$CG_CPUSET"
mkdir -p "$CG_CPU"
tee "$CG_CPU"/cgroup.procs <<< "$TASK"

tee "$CG_CPUSET"/cgroup.procs <<< "$TASK"

tee "$PIPE" <<< sandox_done
rm "$PIPE"
}

mkdir -p "$CGROUP_CPU"
mkdir -p "$CGROUP_CPUSET"
tee "$CGROUP_CPUSET"/cpuset.cpus <<< "0"
tee "$CGROUP_CPUSET"/cpuset.mems <<< "0"

run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-1" "stress --cpu 1"
run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-2" "stress --cpu 1"

read # click enter to cleanup
killall stress
sleep .2
rmdir /sys/fs/cgroup/cpuset/slice/
rmdir /sys/fs/cgroup/cpu/slice/{cg-1,cg-2,}
--- bash end



For those who only want to run docker stuff:
(on cgroup v1, you must use the systemd cgroup driver)
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 9291 60.1 0.0 7320 96 pts/0 R+ 13:18 0:07 /usr/bin/stress --verbose --cpu 1
root 9388 45.5 0.0 7320 96 pts/0 R+ 13:18 0:04 /usr/bin/stress --verbose --cpu 1
--- bash start
docker run --cpuset-cpus=0 --rm -it an-image-with-stress
docker run --cpuset-cpus=0 --rm -it an-image-with-stress
--- bash end
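
To check the resulting split without any tracing (my own addition; assumes
procps ps with -C/-o and the usual watch utility), something like this is
enough, since both stress workers are pinned to the same cpu:

--- bash start
# Show the cpu usage of the stress workers once per second; a fair setup
# should settle around 50/50.
watch -n1 'ps -C stress -o pid,%cpu,args'
--- bash end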


Here is a bpftrace script that can show what is happening (again, sorry for bash,
but bpftrace only allows constants as array indexes as of now, so this is the best
I can do). This script also needs bpftrace v0.13 (not released as of now, so you
currently have to compile master) or newer.
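
A quick way to check that the installed bpftrace is new enough (my own addition;
assumes bpftrace's -l probe listing works and that the kernel exposes BTF for
kfunc probes):

--- bash start
# Print the version and verify that the kfunc probe used below is visible.
bpftrace --version
bpftrace -l 'kfunc:sched_group_set_shares'
--- bash end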


--- bash start
PROBE='kfunc:sched_group_set_shares{
printf("cgroup: %s/%s\n",
str(args->tg->css.cgroup->kn->parent->name),
str(args->tg->css.cgroup->kn->name));
printf(
"cpu load.weight avg.load_avg removed.load_avg removed.nr removed.on_list tg_load_avg_contrib tg->load_avg\n"
);
}'
for i in $(seq 0 $(($(nproc)-1))); do
PROBE="$PROBE""$(sed "s/cpu_nr/$i/" <<<' kfunc:sched_group_set_shares{
printf("%-4d %-12llu %-13llu %-17llu %-11d %-16d %-20llu %d\n",
cpu_nr,
(args->tg->cfs_rq[cpu_nr])->load.weight,
(args->tg->cfs_rq[cpu_nr])->avg.load_avg,
(args->tg->cfs_rq[cpu_nr])->removed.load_avg,
(args->tg->cfs_rq[cpu_nr])->removed.nr,
(args->tg->cfs_rq[cpu_nr])->on_list,
(args->tg->cfs_rq[cpu_nr])->tg_load_avg_contrib,
args->tg->load_avg.counter
);
}')"
done
PROBE="$PROBE"'kfunc:sched_group_set_shares{
printf("\n");
}'

bpftrace -e "$PROBE"
--- bash end


When running the bpftrace script while the sub cgroup example is running, and
executing the following (just setting the weight of each cgroup to the same value
as before, i.e. no change):

--- bash start
tee /sys/fs/cgroup/slice/cg-1/sub/cpu.weight <<< 1
tee /sys/fs/cgroup/slice/cg-2/sub/cpu.weight <<< 10000
tee /sys/fs/cgroup/slice/cg-1/cpu.weight <<< 100
tee /sys/fs/cgroup/slice/cg-2/cpu.weight <<< 100
--- bash end

the output is:

--- output start
Attaching 6 probes...
cgroup: cg-1/sub
cpu load.weight avg.load_avg removed.load_avg removed.nr removed.on_list tg_load_avg_contrib tg->load_avg
0 1048576 1023 0 0 1 1034 1662
1 0 0 0 0 0 0 1662
2 0 0 0 0 0 0 1662
3 0 628 628 1 0 628 1662

cgroup: cg-2/sub
cpu load.weight avg.load_avg removed.load_avg removed.nr removed.on_list tg_load_avg_contrib tg->load_avg
0 1048576 1023 0 0 1 1023 1830
1 0 0 0 0 0 0 1830
2 0 0 0 0 0 0 1830
3 0 807 807 1 0 807 1830

cgroup: slice/cg-1
cpu load.weight avg.load_avg removed.load_avg removed.nr removed.on_list tg_load_avg_contrib tg->load_avg
0 6347 5 0 0 1 5 593
1 0 0 0 0 0 0 593
2 0 0 0 0 0 0 593
3 0 5 0 0 0 588 593

cgroup: slice/cg-2
cpu load.weight avg.load_avg removed.load_avg removed.nr removed.on_list tg_load_avg_contrib tg->load_avg
0 58642371 57263 0 0 1 57909 58665
1 0 0 0 0 0 0 58665
2 0 0 0 0 0 0 58665
3 0 75615 0 0 0 756 58665

--- output end

We can clearly see that both cg-1/sub and cg-2/sub have removed.nr==1 on cpu 3,
and therefore still contribute to the tg->load_avg. Since removed.on_list==0, the load would
never be "cleaned" up unless a new task starts on that cpu, but due to the cpuset, that would not
be the case. slice/cg-1 and slice/cg-2 also have load attached to cpu 3 that isn't removed (but
the patch will properly decay the load on them as well).
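
As a rough illustration of why this skews the fairness (my own back-of-the-envelope
numbers; the real computation is calc_group_shares() in fair.c, which adds clamping
and further approximations), the weight of a group's sched_entity on a cpu is
approximately:

  se->load.weight ~= tg->shares * cfs_rq->avg.load_avg / tg->load_avg

Plugging in the cpu 0 numbers above:

  slice/cg-1: ~ 1024 * 5     / 593   ~= 9
  slice/cg-2: ~ 1024 * 57263 / 58665 ~= 999

So even though both cgroups are configured with the same weight, the cg-2 entity
ends up roughly 100 times heavier than the cg-1 entity on the contended cpu,
which matches the ~99/1 split seen in ps.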

With this patch, all of these examples just end up sharing cpu time in a fair 50/50 way, as expected.

Odin Ugedal (1):
sched/fair: Fix unfairness caused by missing load decay

kernel/sched/fair.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

--
2.31.1


2021-04-26 14:59:51

by Vincent Guittot

Subject: Re: [PATCH 0/1] sched/fair: Fix unfairness caused by missing load decay

On Sun, 25 Apr 2021 at 10:11, Odin Ugedal <[email protected]> wrote:
>
> TL;DR: Read the patch msg first; hopefully it makes sense. Feel free
> to cc more relevant people, and (as I am pretty new to this) feedback
> is highly appreciated!
>
> I thought it was smart to split the discussion of the patch and the issue, so
> here is some (wall of text, sorry) discussion around it:
>
> This patch fixes an extremely strange edge case, discovered more or less by
> accident, where the fairness of the cfs scheduler gets skewed quite
> extensively. This often leads to two cgroups that are supposed to
> share cpu time 50/50 instead getting something like 90/10 or
> even 99/1 (!!) in some rare cases (see below for examples). I find this
> edge case quite interesting, especially since I am able to reproduce it
> so easily on all machines and all (4.8+) kernels.
>
> The issue can be easily reproduced on all kernels 4.8+, and afaik most
> "container runtimes" are affected. I believe I have wondered about
> this issue earlier, but without actually understanding what was causing it.
> There is really (afaik) no way to find the issue from userspace without using
> a tracing tool like bpftrace. I am _sure_ this affects real production
> environments _everywhere_, at least those running the usual container/cgroup
> tooling together with cpu pinning; but verifying that the fairness is
> skewed is quite hard, or more or less impossible in such cases,
> especially when running multiple workloads simultaneously. I did inspect some
> of my production servers, and found instances of this on multiple cgroups.
>
> It _is_ possible to read '/proc/sched_debug', but since that list only prints
> the cfs_rq's from "leaf_cfs_rq_list", the cfs_rq's we are _really_ looking for
> are missing, making it look like the problem isn't caused by cfs.
>
> I think the current patch is the best way to solve the issue, but I will happily
> discuss other possible solutions. One could also simply not propagate the
> load to the task_group, but I think the current implementation of that part
> is the correct way to do it. This is at least my best solution to avoid adding
> logic to a "hotter" path that would also require more logic.
>
> The only thing I am thinking a bit about is what happens if the cfs_rq (or one of
> its ancestors) is throttled. In case cfs_rq->throttled==1, I don't think we can
> currently skip it, since a cfs_rq is only added back to the leaf_list when
> (cfs_rq->nr_running >= 1), and the same applies for (cfs_rq->throttle_count >= 1),
> since we cannot guarantee that it will be added again. Will this code interfere
> with the throttling mechanism, though? I have tested inserting new procs into
> throttled cgroups (or children of throttled ones) with the patch applied,
> and it works as I expect it to (but other people might have more insight).
>
> Also, another solution may be to not add the load before the task is actually
> enqueued, avoiding this issue altogether. But is that better (I assume it would
> be more in the "hot path", and require quite a bit more code)?
>
> There may definitely be a better solution to this that I still haven't found, and
> that other, more experienced kernel devs see right away. :)
>
>
> Also feel free to suggest additions to or rewording of the patch
> message and title, and/or the comment in the code, to make it clearer.
>
> This issue was introduced in 4.8, so all stable (4.9+) releases should probably
> get this (the final solution, at least), shouldn't they?
>

Have you been able to reproduce this on mainline ?

>
>
> Below are various ways/scripts to reproduce - sorry this is so long (and
> sorry for bash in general), but I thought people might be interested in them:
>
> note: due to the nature of the issue, the "lost" load is different each time, so
> the values change each time, and sometimes/often end up at ~50/50; but my testing
> shows that it keeps happening almost every time:
>
>
> Example on cgroup v1. Often results in a 60/40 load split:
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> root 18026 63.0 0.0 3676 100 pts/7 R+ 13:09 0:06 stress --cpu 1
> root 18036 36.6 0.0 3676 100 pts/7 R+ 13:09 0:04 stress --cpu 1

When running the script below on v5.12, I'm not able to reproduce your problem

> --- bash start
> CGROUP_CPU=/sys/fs/cgroup/cpu/slice
> CGROUP_CPUSET=/sys/fs/cgroup/cpuset/slice
> CPU=0
>
> function run_sandbox {
> local CG_CPUSET="$1"
> local CG_CPU="$2"
> local CMD="$3"
>
> local PIPE="$(mktemp -u)"
> mkfifo "$PIPE"
> sh -c "read < $PIPE ; exec $CMD" &
> local TASK="$!"
> sleep .1
> mkdir -p "$CG_CPUSET"
> mkdir -p "$CG_CPU"
> tee "$CG_CPU"/cgroup.procs <<< "$TASK"
>
> tee "$CG_CPUSET"/cgroup.procs <<< "$TASK"
>
> tee "$PIPE" <<< sandox_done
> rm "$PIPE"
> }
>
> mkdir -p "$CGROUP_CPU"
> mkdir -p "$CGROUP_CPUSET"
> tee "$CGROUP_CPUSET"/cpuset.cpus <<< "0"
> tee "$CGROUP_CPUSET"/cpuset.mems <<< "0"
>
> run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-1" "stress --cpu 1"
> run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-2" "stress --cpu 1"
>
> read # click enter to cleanup
> killall stress
> sleep .2
> rmdir /sys/fs/cgroup/cpuset/slice/
> rmdir /sys/fs/cgroup/cpu/slice/{cg-1,cg-2,}
> --- bash end
>
> Example on cgroup v2 with a sub cgroup (same as described in the commit message),
> where both should get 50/50, but instead get 99% and 1% (!!).
>
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> root 18568 1.1 0.0 3684 100 pts/12 R+ 13:36 0:00 stress --cpu 1
> root 18580 99.3 0.0 3684 100 pts/12 R+ 13:36 0:09 stress --cpu 1
>
> (in case of systemd on cgroup v2, make sure some slice/scope is delegated and/or
> use cpusets, otherwise systemd will fight you)
> --- bash start
> CGROUP_CPU=/sys/fs/cgroup/cpu/slice
> CGROUP_CPUSET=/sys/fs/cgroup/cpuset/slice
> CPU=0
>
> function run_sandbox {
> local CG_CPUSET="$1"
> local CG_CPU="$2"
> local CMD="$3"
>
> local PIPE="$(mktemp -u)"
> mkfifo "$PIPE"
> sh -c "read < $PIPE ; exec $CMD" &
> local TASK="$!"
> sleep .01
> mkdir -p "$CG_CPUSET"
> mkdir -p "$CG_CPU"
> tee "$CG_CPU"/cgroup.procs <<< "$TASK"
>
> tee "$CG_CPUSET"/cgroup.procs <<< "$TASK"
>
> tee "$PIPE" <<< sandox_done
> rm "$PIPE"
> }
>
> mkdir -p "$CGROUP_CPU"
> mkdir -p "$CGROUP_CPUSET"
> tee "$CGROUP_CPUSET"/cpuset.cpus <<< "0"
> tee "$CGROUP_CPUSET"/cpuset.mems <<< "0"
>
> run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-1" "stress --cpu 1"
> run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-2" "stress --cpu 1"
>
> read # click enter to cleanup
> killall stress
> sleep .2
> rmdir /sys/fs/cgroup/cpuset/slice/
> rmdir /sys/fs/cgroup/cpu/slice/{cg-1,cg-2,}
> --- bash end
>
>
>
> For those who only want to run docker stuff:
> (on cgroup v1, you must use the systemd cgroup driver)
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> root 9291 60.1 0.0 7320 96 pts/0 R+ 13:18 0:07 /usr/bin/stress --verbose --cpu 1
> root 9388 45.5 0.0 7320 96 pts/0 R+ 13:18 0:04 /usr/bin/stress --verbose --cpu 1
> --- bash start
> docker run --cpuset-cpus=0 --rm -it an-image-with-stress
> docker run --cpuset-cpus=0 --rm -it an-image-with-stress
> --- bash end
>
>
> Here is a bpftrace script that can show what is happening (again, sorry for bash,
> but bpftrace only allows constants as array indexes as of now, so this is the best
> I can do). This script also needs bpftrace v0.13 (not released as of now, so you
> currently have to compile master) or newer.
>
>
> --- bash start
> PROBE='kfunc:sched_group_set_shares{
> printf("cgroup: %s/%s\n",
> str(args->tg->css.cgroup->kn->parent->name),
> str(args->tg->css.cgroup->kn->name));
> printf(
> "cpu load.weight avg.load_avg removed.load_avg removed.nr removed.on_list tg_load_avg_contrib tg->load_avg\n"
> );
> }'
> for i in $(seq 0 $(($(nproc)-1))); do
> PROBE="$PROBE""$(sed "s/cpu_nr/$i/" <<<' kfunc:sched_group_set_shares{
> printf("%-4d %-12llu %-13llu %-17llu %-11d %-16d %-20llu %d\n",
> cpu_nr,
> (args->tg->cfs_rq[cpu_nr])->load.weight,
> (args->tg->cfs_rq[cpu_nr])->avg.load_avg,
> (args->tg->cfs_rq[cpu_nr])->removed.load_avg,
> (args->tg->cfs_rq[cpu_nr])->removed.nr,
> (args->tg->cfs_rq[cpu_nr])->on_list,
> (args->tg->cfs_rq[cpu_nr])->tg_load_avg_contrib,
> args->tg->load_avg.counter
> );
> }')"
> done
> PROBE="$PROBE"'kfunc:sched_group_set_shares{
> printf("\n");
> }'
>
> bpftrace -e "$PROBE"
> --- bash end
>
>
> When running the bpftrace script while the sub cgroup example is running, and
> executing the following (just setting the weight of each cgroup to the same value
> as before, i.e. no change):
>
> --- bash start
> tee /sys/fs/cgroup/slice/cg-1/sub/cpu.weight <<< 1
> tee /sys/fs/cgroup/slice/cg-2/sub/cpu.weight <<< 10000
> tee /sys/fs/cgroup/slice/cg-1/cpu.weight <<< 100
> tee /sys/fs/cgroup/slice/cg-2/cpu.weight <<< 100
> --- bash end
>
> the output is:
>
> --- output start
> Attaching 6 probes...
> cgroup: cg-1/sub
> cpu load.weight avg.load_avg removed.load_avg removed.nr removed.on_list tg_load_avg_contrib tg->load_avg
> 0 1048576 1023 0 0 1 1034 1662
> 1 0 0 0 0 0 0 1662
> 2 0 0 0 0 0 0 1662
> 3 0 628 628 1 0 628 1662
>
> cgroup: cg-2/sub
> cpu load.weight avg.load_avg removed.load_avg removed.nr removed.on_list tg_load_avg_contrib tg->load_avg
> 0 1048576 1023 0 0 1 1023 1830
> 1 0 0 0 0 0 0 1830
> 2 0 0 0 0 0 0 1830
> 3 0 807 807 1 0 807 1830
>
> cgroup: slice/cg-1
> cpu load.weight avg.load_avg removed.load_avg removed.nr removed.on_list tg_load_avg_contrib tg->load_avg
> 0 6347 5 0 0 1 5 593
> 1 0 0 0 0 0 0 593
> 2 0 0 0 0 0 0 593
> 3 0 5 0 0 0 588 593
>
> cgroup: slice/cg-2
> cpu load.weight avg.load_avg removed.load_avg removed.nr removed.on_list tg_load_avg_contrib tg->load_avg
> 0 58642371 57263 0 0 1 57909 58665
> 1 0 0 0 0 0 0 58665
> 2 0 0 0 0 0 0 58665
> 3 0 75615 0 0 0 756 58665
>
> --- output end
>
> We can clearly see that both cg-1/sub and cg-2/sub have removed.nr==1 on cpu 3,
> and therefore still contribute to the tg->load_avg. Since removed.on_list==0, the load would
> never be "cleaned" up unless a new task starts on that cpu, but due to the cpuset, that would not
> be the case. slice/cg-1 and slice/cg-2 also have load attached to cpu 3 that isn't removed (but
> the patch will properly decay the load on them as well).
>
> With this patch, all of these examples just end up sharing cpu time in a fair 50/50 way, as expected.
>
> Odin Ugedal (1):
> sched/fair: Fix unfairness caused by missing load decay
>
> kernel/sched/fair.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> --
> 2.31.1
>

2021-04-26 16:35:12

by Odin Ugedal

Subject: Re: [PATCH 0/1] sched/fair: Fix unfairness caused by missing load decay

Hi,


> Have you been able to reproduce this on mainline ?

Yes. I have been debugging and testing with v5.12-rc8. After I found the
suspected commit in ~v4.8, I compiled both v4.4.267 and v4.9.267, and was able
to successfully reproduce it on v4.9.267 but not on v4.4.267. It is also
reproducible on 5.11.16-arch1-1 that my distro ships, and it is reproducible
on all the machines I have tested.

> When running the script below on v5.12, I'm not able to reproduce your problem

v5.12 is pretty fresh, so I have not tested on anything before v5.12-rc8. I did
compile v5.12.0 now, and I am able to reproduce it there as well.

Which version did you try (the one for cgroup v1 or v2)? And/or did you try
to run the inspection bpftrace script? If you tested the cgroup v1 version,
it will often end up at 50/50, 51/49 etc., and sometimes 60/40 or so, making
it hard to verify without inspection.

I have attached a version of the "sub cgroup" example for cgroup v1 that also
forces the process to start on cpu 1 (CPU_ME), and sends it over to cpu 0 (CPU)
after attaching it to the new cgroup. That makes it evident each time. This
example should also always end up with 50/50 per stress process, but "always"
ends up more like 99/1.

Can you confirm if you are able to reproduce with this version?

--- bash start
CGROUP_CPU=/sys/fs/cgroup/cpu/slice
CGROUP_CPUSET=/sys/fs/cgroup/cpuset/slice
CGROUP_CPUSET_ME=/sys/fs/cgroup/cpuset/me
CPU=0
CPU_ME=1

function run_sandbox {
local CG_CPUSET="$1"
local CG_CPU="$2"
local INNER_SHARES="$3"
local CMD="$4"

local PIPE="$(mktemp -u)"
mkfifo "$PIPE"
sh -c "read < $PIPE ; exec $CMD" &
local TASK="$!"
sleep .1
mkdir -p "$CG_CPUSET"
mkdir -p "$CG_CPU"/sub
tee "$CG_CPU"/sub/cgroup.procs <<< "$TASK"
tee "$CG_CPU"/sub/cpu.shares <<< "$INNER_SHARES"

tee "$CG_CPUSET"/cgroup.procs <<< "$TASK"

tee "$PIPE" <<< sandox_done
rm "$PIPE"
}

mkdir -p "$CGROUP_CPU"
mkdir -p "$CGROUP_CPUSET"
mkdir -p "$CGROUP_CPUSET_ME"

tee "$CGROUP_CPUSET"/cpuset.cpus <<< "$CPU"
tee "$CGROUP_CPUSET"/cpuset.mems <<< "$CPU"

tee "$CGROUP_CPUSET_ME"/cpuset.cpus <<< "$CPU_ME"
echo $$ | tee "$CGROUP_CPUSET_ME"/cgroup.procs

run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-1" 50000 "stress --cpu 1"
run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-2" 2 "stress --cpu 1"

read # click enter to cleanup and stop all stress procs
killall stress
sleep .2
rmdir /sys/fs/cgroup/cpuset/slice/
rmdir /sys/fs/cgroup/cpu/slice/{cg-{1,2}{/sub,},}
--- bash end


Thanks
Odin

2021-04-27 08:38:43

by Odin Ugedal

Subject: Re: [PATCH 0/1] sched/fair: Fix unfairness caused by missing load decay

Also, instead of bpftrace, one can look at the /proc/sched_debug file
and infer the state from there.

Something like:

$ cat /proc/sched_debug | grep ":/slice" -A 28 | egrep "(:/slice)|load_avg"

gives me the output (when one stress proc gets 99%, and the other one 1%):

cfs_rq[0]:/slice/cg-2/sub
.load_avg : 1023
.removed.load_avg : 0
.tg_load_avg_contrib : 1035
.tg_load_avg : 1870
.se->avg.load_avg : 56391
cfs_rq[0]:/slice/cg-1/sub
.load_avg : 1023
.removed.load_avg : 0
.tg_load_avg_contrib : 1024
.tg_load_avg : 1847
.se->avg.load_avg : 4
cfs_rq[0]:/slice/cg-1
.load_avg : 4
.removed.load_avg : 0
.tg_load_avg_contrib : 4
.tg_load_avg : 794
.se->avg.load_avg : 5
cfs_rq[0]:/slice/cg-2
.load_avg : 56401
.removed.load_avg : 0
.tg_load_avg_contrib : 56700
.tg_load_avg : 57496
.se->avg.load_avg : 1008
cfs_rq[0]:/slice
.load_avg : 1015
.removed.load_avg : 0
.tg_load_avg_contrib : 1009
.tg_load_avg : 2314
.se->avg.load_avg : 447


As can be seen here, no other cfs_rq's for the relevant cgroups are
"active" and listed, but they still contribute to e.g. the "tg_load_avg". In this
example, "cfs_rq[0]:/slice/cg-1" has a load_avg of 4, and contributes 4 to
tg_load_avg. However, the total tg_load_avg is 794. That means
that the other 790 have to come from somewhere, and in this example,
they come from the cfs_rq on another cpu.
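
For convenience, here is a small helper (my own sketch; it assumes the field
names shown above and that the interesting sections in /proc/sched_debug start
with "cfs_rq[") that sums the listed contributions for one cgroup and compares
them with tg_load_avg:

--- bash start
# Sum tg_load_avg_contrib over all listed cfs_rq's of one cgroup and compare
# it with the tg_load_avg the kernel reports. A large gap means some load is
# stuck on a cfs_rq that is no longer on the leaf_cfs_rq_list.
CG=":/slice/cg-1"
awk -v cg="$CG" '
  /^cfs_rq\[/                       { in_cg = ($0 ~ cg "$") }
  in_cg && /\.tg_load_avg_contrib/  { sum += $NF }
  in_cg && /\.tg_load_avg[^_]/      { tg = $NF }
  END { printf "listed contrib sum: %d, tg_load_avg: %d\n", sum, tg }
' /proc/sched_debug
--- bash end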

Hopefully that clarified a bit.

For reference, here is the output when the issue is not occurring:
cfs_rq[1]:/slice/cg-2/sub
.load_avg : 1024
.removed.load_avg : 0
.tg_load_avg_contrib : 1039
.tg_load_avg : 1039
.se->avg.load_avg : 1
cfs_rq[1]:/slice/cg-1/sub
.load_avg : 1023
.removed.load_avg : 0
.tg_load_avg_contrib : 1034
.tg_load_avg : 1034
.se->avg.load_avg : 49994
cfs_rq[1]:/slice/cg-1
.load_avg : 49998
.removed.load_avg : 0
.tg_load_avg_contrib : 49534
.tg_load_avg : 49534
.se->avg.load_avg : 1023
cfs_rq[1]:/slice/cg-2
.load_avg : 1
.removed.load_avg : 0
.tg_load_avg_contrib : 1
.tg_load_avg : 1
.se->avg.load_avg : 1023
cfs_rq[1]:/slice
.load_avg : 2048
.removed.load_avg : 0
.tg_load_avg_contrib : 2021
.tg_load_avg : 2021
.se->avg.load_avg : 1023


Odin

2021-04-27 10:58:42

by Vincent Guittot

Subject: Re: [PATCH 0/1] sched/fair: Fix unfairness caused by missing load decay

On Mon, 26 Apr 2021 at 18:33, Odin Ugedal <[email protected]> wrote:
>
> Hi,
>
>
> > Have you been able to reproduce this on mainline ?
>
> Yes. I have been debugging and testing with v5.12-rc8. After I found the
> suspected commit in ~v4.8, I compiled both v4.4.267 and v4.9.267, and was able
> to successfully reproduce it on v4.9.267 but not on v4.4.267. It is also
> reproducible on 5.11.16-arch1-1 that my distro ships, and it is reproducible
> on all the machines I have tested.
>
> > When running the script below on v5.12, I'm not able to reproduce your problem
>
> v5.12 is pretty fresh, so I have not tested on anything before v5.12-rc8. I did
> compile v5.12.0 now, and I am able to reproduce it there as well.

I wanted to say one v5.12-rcX version to make sure this is still a
valid problem on latest version

>
> Which version did you try (the one for cgroup v1 or v2)? And/or did you try
> to run the inspection bpftrace script? If you tested the cgroup v1 version,
> it will often end up at 50/50, 51/49 etc., and sometimes 60/40 or so, making
> it hard to verify without inspection.

I tried cgroup v1 and v2 but not the bpf script

>
> I have attached a version of the "sub cgroup" example for cgroup v1 that also
> forces the process to start on cpu 1 (CPU_ME), and sends it over to cpu 0 (CPU)
> after attaching it to the new cgroup. That makes it evident each time. This
> example should also always end up with 50/50 per stress process, but "always"
> ends up more like 99/1.
>
> Can you confirm if you are able to reproduce with this version?

I confirm that I can see a ratio of 4ms vs 204ms running time with the
patch below. But when I look more deeply in my trace (I have
instrumented the code), it seems that the 2 stress-ng don't belong to
the same cgroup but remained in cg-1 and cg-2 which explains such
running time difference. So your script doesn't reproduce the bug you
want to highlight. That being said, I can also see a diff between the
contrib of the cpu0 in the tg_load. I'm going to look further

>
> --- bash start
> CGROUP_CPU=/sys/fs/cgroup/cpu/slice
> CGROUP_CPUSET=/sys/fs/cgroup/cpuset/slice
> CGROUP_CPUSET_ME=/sys/fs/cgroup/cpuset/me
> CPU=0
> CPU_ME=1
>
> function run_sandbox {
> local CG_CPUSET="$1"
> local CG_CPU="$2"
> local INNER_SHARES="$3"
> local CMD="$4"
>
> local PIPE="$(mktemp -u)"
> mkfifo "$PIPE"
> sh -c "read < $PIPE ; exec $CMD" &
> local TASK="$!"
> sleep .1
> mkdir -p "$CG_CPUSET"
> mkdir -p "$CG_CPU"/sub
> tee "$CG_CPU"/sub/cgroup.procs <<< "$TASK"
> tee "$CG_CPU"/sub/cpu.shares <<< "$INNER_SHARES"
>
> tee "$CG_CPUSET"/cgroup.procs <<< "$TASK"
>
> tee "$PIPE" <<< sandox_done
> rm "$PIPE"
> }
>
> mkdir -p "$CGROUP_CPU"
> mkdir -p "$CGROUP_CPUSET"
> mkdir -p "$CGROUP_CPUSET_ME"
>
> tee "$CGROUP_CPUSET"/cpuset.cpus <<< "$CPU"
> tee "$CGROUP_CPUSET"/cpuset.mems <<< "$CPU"
>
> tee "$CGROUP_CPUSET_ME"/cpuset.cpus <<< "$CPU_ME"
> echo $$ | tee "$CGROUP_CPUSET_ME"/cgroup.procs
>
> run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-1" 50000 "stress --cpu 1"
> run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-2" 2 "stress --cpu 1"
>
> read # click enter to cleanup and stop all stress procs
> killall stress
> sleep .2
> rmdir /sys/fs/cgroup/cpuset/slice/
> rmdir /sys/fs/cgroup/cpu/slice/{cg-{1,2}{/sub,},}
> --- bash end
>
>
> Thanks
> Odin

2021-04-27 11:27:02

by Odin Ugedal

Subject: Re: [PATCH 0/1] sched/fair: Fix unfairness caused by missing load decay

Hi,

> I wanted to say one v5.12-rcX version to make sure this is still a
> valid problem on latest version

Ahh, I see. No problem. :) Thank you so much for taking the time to
look at this!

> I confirm that I can see a ratio of 4ms vs 204ms running time with the
> patch below.

(I assume you talk about the bash code for reproducing, not the actual
sched patch.)

> But when I look more deeply in my trace (I have
> instrumented the code), it seems that the 2 stress-ng don't belong to
> the same cgroup but remained in cg-1 and cg-2 which explains such
> running time difference.

(mail reply number two to your previous mail might also help surface it)

Not sure if I have stated it correctly, or if we are talking about the same
thing. It _is_ the intention that the two procs should not be in the same
cgroup. In the same way as people create "containers", each proc runs in a
separate cgroup in the example. The issue is not the balancing between the
procs themselves, but rather between the cgroups/sched_entities inside the
cgroup hierarchy (due to the fact that the vruntime of those sched_entities
ends up being calculated with more load than it is supposed to).

If you have any thought about the phrasing of the patch itself to make it
easier to understand, feel free to suggest.

Given the last cgroup v1 script, I get this:

- cat /proc/<stress-pid-1>/cgroup | grep cpu
11:cpu,cpuacct:/slice/cg-1/sub
3:cpuset:/slice

- cat /proc/<stress-pid-2>/cgroup | grep cpu
11:cpu,cpuacct:/slice/cg-2/sub
3:cpuset:/slice


The cgroup hierarchy will then roughly be like this (using cgroup v2 terms,
because I find them easier to reason about):

slice/
  cg-1/
    cpu.weight: 100
    sub/
      cpu.weight: 1
      cpuset.cpus: 1
      cgroup.procs - stress process 1 here
  cg-2/
    cpu.weight: 100
    sub/
      cpu.weight: 10000
      cpuset.cpus: 1
      cgroup.procs - stress process 2 here

This should result in 50/50 due to the fact that cg-1 and cg-2 both have a
weight of 100, and "live" inside the /slice cgroup. The inner weight should not
matter, since there is only one cgroup at that level.
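
To spell out the expected split with the usual weight-ratio rule (my own
arithmetic, nothing from the patch): each entity gets weight / sum-of-sibling-weights
of its parent's cpu time, so:

  cg-1 share of /slice: 100 / (100 + 100) = 1/2
  cg-2 share of /slice: 100 / (100 + 100) = 1/2
  sub share of cg-N:    w_sub / w_sub     = 1   (only child, so its weight cancels out)

Each stress process should therefore get 1/2 * 1 = 50% of the pinned cpu,
regardless of the inner weights 1 and 10000.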

> So your script doesn't reproduce the bug you
> want to highlight. That being said, I can also see a diff between the
> contrib of the cpu0 in the tg_load. I'm going to look further

There can definitely be some other issues involved, and I am pretty sure
you have way more knowledge about the scheduler than me... :) However,
I am pretty sure that it is in fact showing the issue I am talking about,
and applying the patch does indeed make it impossible to reproduce it
on my systems.

Odin

2021-04-27 12:48:34

by Vincent Guittot

Subject: Re: [PATCH 0/1] sched/fair: Fix unfairness caused by missing load decay

On Tue, 27 Apr 2021 at 13:24, Odin Ugedal <[email protected]> wrote:
>
> Hi,
>
> > I wanted to say one v5.12-rcX version to make sure this is still a
> > valid problem on latest version
>
> Ahh, I see. No problem. :) Thank you so much for taking the time to
> look at this!
>
> > I confirm that I can see a ratio of 4ms vs 204ms running time with the
> > patch below.
>
> (I assume you talk about the bash code for reproducing, not the actual
> sched patch.)

yes sorry

>
> > But when I look more deeply in my trace (I have
> > instrumented the code), it seems that the 2 stress-ng don't belong to
> > the same cgroup but remained in cg-1 and cg-2 which explains such
> > running time difference.
>
> (mail reply number two to your previous mail might also help surface it)
>
> Not sure if I have stated it correctly, or if we are talking about the same
> thing. It _is_ the intention that the two procs should not be in the same
> cgroup. In the same way as people create "containers", each proc runs in a
> separate cgroup in the example. The issue is not the balancing between the
> procs themselves, but rather between the cgroups/sched_entities inside the
> cgroup hierarchy (due to the fact that the vruntime of those sched_entities
> ends up being calculated with more load than it is supposed to).
>
> If you have any thought about the phrasing of the patch itself to make it
> easier to understand, feel free to suggest.
>
> Given the last cgroup v1 script, I get this:
>
> - cat /proc/<stress-pid-1>/cgroup | grep cpu
> 11:cpu,cpuacct:/slice/cg-1/sub
> 3:cpuset:/slice
>
> - cat /proc/<stress-pid-2>/cgroup | grep cpu
> 11:cpu,cpuacct:/slice/cg-2/sub
> 3:cpuset:/slice
>
>
> The cgroup hierarchy will then roughly be like this (using cgroup v2 terms,
> because I find them easier to reason about):
>
> slice/
>   cg-1/
>     cpu.weight: 100
>     sub/
>       cpu.weight: 1
>       cpuset.cpus: 1
>       cgroup.procs - stress process 1 here
>   cg-2/
>     cpu.weight: 100
>     sub/
>       cpu.weight: 10000
>       cpuset.cpus: 1
>       cgroup.procs - stress process 2 here
>
> This should result in 50/50 due to the fact that cg-1 and cg-2 both have a
> weight of 100, and "live" inside the /slice cgroup. The inner weight should not
> matter, since there is only one cgroup at that level.
>
> > So your script doesn't reproduce the bug you
> > want to highlight. That being said, I can also see a diff between the
> > contrib of the cpu0 in the tg_load. I'm going to look further
>
> There can definitely be some other issues involved, and I am pretty sure
> you have way more knowledge about the scheduler than me... :) However,
> I am pretty sure that it is in fact showing the issue I am talking about,
> and applying the patch does indeed make it impossible to reproduce it
> on my systems.

Your script is correct. I was wrongly interpreting my trace. I have
been able to reproduce your problem and your analysis is correct. Let
me continue on the patch itself

>
> Odin