2021-05-20 14:06:48

by Odin Ugedal

Subject: Re: [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller

Hi,

Here are some more thoughts and questions:

> The benefit of burst is seen when testing with schbench:
>
> echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
> echo 600000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
> echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
> echo 400000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
>
> # The average CPU usage is around 500%, which is 200ms CPU time
> # every 40ms.
> ./schbench -m 1 -t 30 -r 10 -c 10000 -R 500
>
> Without burst:
>
> Latency percentiles (usec)
> 50.0000th: 7
> 75.0000th: 8
> 90.0000th: 9
> 95.0000th: 10
> *99.0000th: 933
> 99.5000th: 981
> 99.9000th: 3068
> min=0, max=20054
> rps: 498.31 p95 (usec) 10 p99 (usec) 933 p95/cputime 0.10% p99/cputime 9.33%

It should be noted that this was run on a 64-core machine (if that was the case, as referenced in your previous patch).

I am curious how much you have tried tweaking both the period and the quota for this workload. I assume a longer period can help such a bursty application, and given the small slowdowns, a slightly higher quota could also help. I am not saying this is a bad idea, but we need to understand what it fixes, and how, in order to be able to understand how/if to use it.
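
For concreteness, the kind of alternative I have in mind would be something like the following (values are purely illustrative, keeping the same 6-CPU average but over a longer window):

echo 1200000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us   # still 6 CPUs on average
echo 200000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us   # but over a 200ms period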

Also, what value of the sysctl kernel.sched_cfs_bandwidth_slice_us are you using? The CONFIG_HZ you are using is also interesting, due to how bandwidth is accounted. There is some more info about it in Documentation/scheduler/sched-bwc.rst. I assume a smaller slice value may also help, and it would be interesting to see what implications it has. A high threads-to-(quota/period) ratio together with a high bandwidth_slice will probably cause some throttling, so one has to choose between precision and overhead.
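
For reference, the knobs I am referring to can be inspected and changed roughly like this (the kernel config location is distro-dependent, so treat these paths as examples):

sysctl kernel.sched_cfs_bandwidth_slice_us          # default is 5000 (5ms)
sysctl -w kernel.sched_cfs_bandwidth_slice_us=1000  # try a 1ms slice
grep 'CONFIG_HZ=' /boot/config-"$(uname -r)"        # location varies by distro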

Also, here you give a burst of 66% of the quota. Would that be a typical value for a cgroup, or is it just a result of testing? As I understand this patchset, your example would allow a constant 600% CPU load, then one period with 1000% load, then another "long set" of periods with 600% load. Have you discussed a way of limiting how long burst can be "saved" before expiring?

> @@ -9427,7 +9478,8 @@ static int cpu_max_show(struct seq_file *sf, void *v)
> {
> struct task_group *tg = css_tg(seq_css(sf));
>
> - cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
> + cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg),
> + tg_get_cfs_burst(tg));
> return 0;
> }

The current cgroup v2 docs say the following:

> cpu.max
> A read-write two value file which exists on non-root cgroups.
> The default is "max 100000".

This will become a "three value file", and I know a few user space projects that parse this file by splitting on the middle space. I am not sure if they are "wrong", but I don't think we usually break such things. Not sure what Tejun thinks about this.
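
As an illustration of the kind of consumer that would break (hypothetical cgroup v2 path, deliberately naive parsing):

$ cat /sys/fs/cgroup/test/cpu.max
max 100000
$ read -r quota period < /sys/fs/cgroup/test/cpu.max
# With a third burst field appended, $period silently becomes
# "100000 <burst>" for a reader written like this.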

Thanks
Odin


2021-05-21 10:49:22

by changhuaixin

Subject: Re: [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller



> On May 20, 2021, at 10:00 PM, Odin Ugedal <[email protected]> wrote:
>
> Hi,
>
> Here are some more thoughts and questions:
>
>> The benefit of burst is seen when testing with schbench:
>>
>> echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
>> echo 600000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
>> echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
>> echo 400000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
>>
>> # The average CPU usage is around 500%, which is 200ms CPU time
>> # every 40ms.
>> ./schbench -m 1 -t 30 -r 10 -c 10000 -R 500
>>
>> Without burst:
>>
>> Latency percentiles (usec)
>> 50.0000th: 7
>> 75.0000th: 8
>> 90.0000th: 9
>> 95.0000th: 10
>> *99.0000th: 933
>> 99.5000th: 981
>> 99.9000th: 3068
>> min=0, max=20054
>> rps: 498.31 p95 (usec) 10 p99 (usec) 933 p95/cputime 0.10% p99/cputime 9.33%
>
> It should be noted that this was running on a 64 core machine (if that was
> the case, ref. your previous patch).
>
> I am curious how much you have tried tweaking both the period and the quota
> for this workload. I assume a longer period can help such bursty application,
> and from the small slowdowns, a slightly higher quota could also help
> I guess. I am
> not saying this is a bad idea, but that we need to understand what it
> fixes, and how,
> in order to be able to understand how/if to use it.
>

Yeah, it is a well-tuned workload and configuration. I did this because benchmarks like schbench generate load in a fixed pattern without burst, so I set the schbench parameters carefully to generate burst within each 100ms period, to show that burst works. A longer period or a higher quota does help indeed; in that case a heavier workload can be used to generate the tail latency.

In my view, burst is the cfsb version of a token bucket. For the present cfsb, the bucket capacity is strictly limited to the quota, and that is now changed to quota + burst. Burst is meant to be used when tasks get throttled while the CPU is underutilized for the whole system.
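
A toy userspace model of that token-bucket view, using the numbers from the schbench example above (this only illustrates the accounting idea; it is not the kernel code):

#include <stdio.h>

/*
 * Toy model of the token-bucket view, with the numbers from the schbench
 * example (600ms quota, 400ms burst, 100ms period). Not the kernel code.
 */
struct bucket {
    long long quota;   /* refill per period, in us of CPU time */
    long long burst;   /* extra capacity on top of quota, in us */
    long long runtime; /* tokens currently available, in us */
};

static void refill(struct bucket *b)
{
    long long cap = b->quota + b->burst;

    b->runtime += b->quota;
    if (b->runtime > cap)   /* capacity is quota + burst instead of quota */
        b->runtime = cap;
}

int main(void)
{
    struct bucket b = { .quota = 600000, .burst = 400000, .runtime = 600000 };
    /* CPU time the group tries to consume in each 100ms period, in us */
    long long wanted[] = { 300000, 300000, 1000000, 600000 };

    for (int i = 0; i < 4; i++) {
        long long ran = wanted[i] < b.runtime ? wanted[i] : b.runtime;

        b.runtime -= ran;   /* anything beyond this would be throttled */
        printf("period %d: wanted %lld us, ran %lld us, left %lld us\n",
               i, wanted[i], ran, b.runtime);
        refill(&b);
    }
    return 0;
}

Periods that use less than the quota let the bucket grow towards quota + burst, and the saved runtime can then be spent in a later period.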

> Also, what value of the sysctl kernel.sched_cfs_bandwidth_slice_us are
> you using?
> What CONFIG_HZ you are using is also interesting, due to how bw is
> accounted for.
> There is some more info about it here: Documentation/scheduler/sched-bwc.rst. I
> assume a smaller slice value may also help, and it would be interesting to see
> what implications it gives. A high threads to (quota/period) ratio, together
> with a high bandwidth_slice will probably cause some throttling, so one has
> to choose between precision and overhead.
>

The default values of kernel.sched_cfs_bandwidth_slice_us (5ms) and CONFIG_HZ (1000) are used.

The following case can be used to rule out throttling caused by many threads and a high bandwidth slice:

mkdir /sys/fs/cgroup/cpu/test
echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us

./schbench -m 1 -t 3 -r 20 -c 80000 -R 20

On my machine, two workers work for 80ms and sleep for 120ms in each round. The average utilization is around 80%. This will work on a two-core system. It is recommended to run it multiple times, as getting throttled doesn't necessarily cause tail latency for schbench.
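
Whether a given run actually got throttled can be checked afterwards with:

grep nr_throttled /sys/fs/cgroup/cpu/test/cpu.stat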


> Also, here you give a burst of 66% the quota. Would that be a typical value
> for a cgroup, or is it just a result of testing? As I understand this

Yeah, it is not a typical value; it was tuned for this test.

> patchset, your example
> would allow 600% constant CPU load, then one period with 1000% load,
> then another
> "long set" of periods with 600% load. Have you discussed a way of limiting how
> long burst can be "saved" before expiring?

I haven't thought about it much. It is interesting, but I doubt there is a need for it.

>
>> @@ -9427,7 +9478,8 @@ static int cpu_max_show(struct seq_file *sf, void *v)
>> {
>> struct task_group *tg = css_tg(seq_css(sf));
>>
>> - cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
>> + cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg),
>> + tg_get_cfs_burst(tg));
>> return 0;
>> }
>
> The current cgroup v2 docs say the following:
>
>> cpu.max
>> A read-write two value file which exists on non-root cgroups.
>> The default is "max 100000".
>
> This will become a "three value file", and I know a few user space projects
> who parse this file by splitting on the middle space. I am not sure if they are
> "wrong", but I don't think we usually break such things. Not sure what
> Tejun thinks about this.
>

Thanks, it will be modified in the way Tejun suggests.

> Thanks
> Odin

2021-05-21 11:12:01

by Odin Ugedal

Subject: Re: [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller

Hi,

> Yeah, it is a well tuned workload and configuration. I did this because for benchmarks
> like schbench, workloads are generated in a fixed pattern without burst. So I set schbench
> params carefully to generate burst during each 100ms periods, to show burst works. Longer
> period or higher quota helps indeed, in which case more workloads can be used to generate
> tail latency then.

Yeah, that makes sense. When it comes to fairness (you are talking about generating tail latency), I think the configuration of cpu shares/weight between cgroups is more relevant.

How much more tail latency will a cgroup be able to "create" when doubling the period?


> In my view, burst is like the cfsb way of token bucket. For the present cfsb, bucket capacity
> is strictly limited to quota. And that is changed into quota + burst now. And it shall be used when
> tasks get throttled and CPU is under utilized for the whole system.

Well, it is as strict as we can make it, depending on how one looks at it. We cannot guarantee anything stricter than the length of a jiffy or kernel.sched_cfs_bandwidth_slice_us (simplified, of course), especially since we allow runtime from one period to be used in another. I think there is a "big" distinction between runtime transferred from the cfs_bw to cfs_rq's in a period and the actual runtime used.

> Default value of kernel.sched_cfs_bandwidth_slice_us(5ms) and CONFIG_HZ(1000) is used.

You should mention that in the msg then, since it is highly relevant to the results. Can you try tweaking kernel.sched_cfs_bandwidth_slice_us to something like 1ms and see what the result will be?

For such a workload and a high cfs_bw_slice, a smaller CONFIG_HZ might also be beneficial (although there are many things to consider when talking about that, and a lot of people know more about it than me).

> The following case might be used to prevent getting throttled from many threads and high bandwidth
> slice:
>
> mkdir /sys/fs/cgroup/cpu/test
> echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
> echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
> echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
>
> ./schbench -m 1 -t 3 -r 20 -c 80000 -R 20
>
> On my machine, two workers work for 80ms and sleep for 120ms in each round. The average utilization is
> around 80%. This will work on a two-core system. It is recommended to try it multiple times as getting
> throttled doesn't necessarily cause tail latency for schbench.

When I run this, I get the following results without cfs bandwidth enabled.

$ time ./schbench -m 1 -t 3 -r 20 -c 80000 -R 20
Latency percentiles (usec) runtime 20 (s) (398 total samples)
50.0th: 22 (201 samples)
75.0th: 50 (158 samples)
90.0th: 50 (0 samples)
95.0th: 51 (38 samples)
*99.0th: 51 (0 samples)
99.5th: 51 (0 samples)
99.9th: 52 (1 samples)
min=5, max=52
rps: 19900000.00 p95 (usec) 51 p99 (usec) 51 p95/cputime 0.06% p99/cputime 0.06%
./schbench -m 1 -t 3 -r 20 -c 80000 -R 20 31.85s user 0.00s system
159% cpu 20.021 total

In this case, I see 80% load on two cores, ending at a total of 160%. If you set period: 100ms and quota: 100ms (aka. 1 CPU), throttling is what you would expect, no? In that case, burst wouldn't matter?


Thanks
Odin

2021-05-21 12:55:17

by changhuaixin

Subject: Re: [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller



> On May 21, 2021, at 5:38 PM, Odin Ugedal <[email protected]> wrote:
>
> Hi,
>
>> Yeah, it is a well tuned workload and configuration. I did this because for benchmarks
>> like schbench, workloads are generated in a fixed pattern without burst. So I set schbench
>> params carefully to generate burst during each 100ms periods, to show burst works. Longer
>> period or higher quota helps indeed, in which case more workloads can be used to generate
>> tail latency then.
>
> Yeah, that makes sense. When it comes to fairness (you are talking
> about generating tail
> latency), I think configuration of cpu shares/weight between cgroups
> is more relevant.
>
> How much more tail latency will a cgroup be able to "create" when
> doubling the period?
>

Indeed, fairness is another factor relevant to tail latency. However, real workloads benefit from the burst feature, too. For Java workloads with equal fairness between cgroups, a huge drop in tail latency, from 500ms to 27ms, is seen after enabling the burst feature. I shouldn't have deleted this info from the msg.

I guess the tail latency from schbench is small here because schbench is simple and only measures wakeup latency. For workloads measuring round-trip time, the effect of getting throttled is more obvious.

>
>> In my view, burst is like the cfsb way of token bucket. For the present cfsb, bucket capacity
>> is strictly limited to quota. And that is changed into quota + burst now. And it shall be used when
>> tasks get throttled and CPU is under utilized for the whole system.
>
> Well, it is as strict as we can make it, depending on how one looks at it. We
> cannot guarantee anything more strict than the length of a jiffy or
> kernel.sched_cfs_bandwidth_slice_us (simplified ofc.), especially since we allow
> runtime from one period to be used in another. I think there is a
> "big" distinction between
> runtime transferred from the cfs_bw to cfs_rq's in a period compared
> to the actual runtime used.
>
>> Default value of kernel.sched_cfs_bandwidth_slice_us(5ms) and CONFIG_HZ(1000) is used.
>
> You should mention that in the msg then, since it is highly relevant
> to the results. Can you try to tweak

Sorry for the trouble reproducing this. I'll add this info.

> kernel.sched_cfs_bandwidth_slice_us to something like 1ms, and see
> what the result will be?
>

With a 1ms kernel.sched_cfs_bandwidth_slice_us, I see the 99.0th and 99.5th percentile latencies drop, while the 99.9th percentile latency remains at several ms. I guess I can't tell it apart from small spikes now.

# 1ms kernel.sched_cfs_bandwidth_slice_us
echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
echo 600000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
#echo 400000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
cat /sys/fs/cgroup/cpu/test/cpu.stat | grep nr_throttled

./schbench -m 1 -t 30 -r 10 -c 10000 -R 500

Latency percentiles (usec)
50.0000th: 8
75.0000th: 8
90.0000th: 9
95.0000th: 10
*99.0000th: 13
99.5000th: 17
99.9000th: 6408
min=0, max=7576
rps: 497.44 p95 (usec) 10 p99 (usec) 13 p95/cputime 0.10% p99/cputime 0.13%


> For such a workload and high cfs_bw_slice, a smaller CONFIG_HZ might
> also be beneficial (although
> there are many things to consider when talking about that, and a lot
> of people know more about that than me).
>
>> The following case might be used to prevent getting throttled from many threads and high bandwidth
>> slice:
>>
>> mkdir /sys/fs/cgroup/cpu/test
>> echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
>> echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
>> echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
>>
>> ./schbench -m 1 -t 3 -r 20 -c 80000 -R 20
>>
>> On my machine, two workers work for 80ms and sleep for 120ms in each round. The average utilization is
>> around 80%. This will work on a two-core system. It is recommended to try it multiple times as getting
>> throttled doesn't necessarily cause tail latency for schbench.
>
> When I run this, I get the following results without cfs bandwidth enabled.
>
> $ time ./schbench -m 1 -t 3 -r 20 -c 80000 -R 20
> Latency percentiles (usec) runtime 20 (s) (398 total samples)
> 50.0th: 22 (201 samples)
> 75.0th: 50 (158 samples)
> 90.0th: 50 (0 samples)
> 95.0th: 51 (38 samples)
> *99.0th: 51 (0 samples)
> 99.5th: 51 (0 samples)
> 99.9th: 52 (1 samples)
> min=5, max=52
> rps: 19900000.00 p95 (usec) 51 p99 (usec) 51 p95/cputime 0.06% p99/cputime 0.06%
> ./schbench -m 1 -t 3 -r 20 -c 80000 -R 20 31.85s user 0.00s system
> 159% cpu 20.021 total
>
> In this case, I see 80% load on two cores, ending at a total of 160%. If setting
> period: 100ms and quota: 100ms (aka. 1 cpu), throttling is what
> you would expect, or?. In this case, burst wouldn't matter?
>

Sorry for my mistake. The -R option should be 10 instead of 20. And the case should be:

# 1ms kernel.sched_cfs_bandwidth_slice_us
mkdir /sys/fs/cgroup/cpu/test
echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us

./schbench -m 1 -t 3 -r 20 -c 80000 -R 10

The average CPU usage is at 80%. I ran this 10 times, and got long tail latency 6 times and got throttled 8 times.

Tail latencies are shown below, and this wasn't the worst case.

Latency percentiles (usec)
50.0000th: 19872
75.0000th: 21344
90.0000th: 22176
95.0000th: 22496
*99.0000th: 22752
99.5000th: 22752
99.9000th: 22752
min=0, max=22727
rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%


Sometimes the measured period of schbench is not throttled and thus no tail latency is seen. Sometimes tasks do not get throttled because the offset of the schbench workers' start from the period start matters too. In this case, the two CPUs work for 80ms and sleep for 120ms; if the 80ms busy interval of the two workers is split across two cfsb periods, they might not get throttled.

I'll use this case in the commit log.

>
> Thanks
> Odin

2021-05-21 16:12:27

by Tejun Heo

Subject: Re: [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller

Hello, Odin.

On Thu, May 20, 2021 at 04:00:29PM +0200, Odin Ugedal wrote:
> > cpu.max
> > A read-write two value file which exists on non-root cgroups.
> > The default is "max 100000".
>
> This will become a "three value file", and I know a few user space projects
> who parse this file by splitting on the middle space. I am not sure if they are
> "wrong", but I don't think we usually break such things. Not sure what
> Tejun thinks about this.

Good point. I haven't thought about that. It would make more sense to split it out into a separate file then - e.g. something like cpu.max.burst - but it seems like there are important questions to answer before adding new interfaces.
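
A hypothetical sketch of how that could look from user space if it goes the separate-file route (nothing is decided here; the file name is just the example above):

echo "600000 100000" > /sys/fs/cgroup/test/cpu.max     # quota and period stay a two value file
echo 400000 > /sys/fs/cgroup/test/cpu.max.burst        # burst lives in its own single value file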

Thanks.

--
tejun