2021-05-20 14:17:06

by Odin Ugedal

Subject: Re: [PATCH v5 2/3] sched/fair: Add cfs bandwidth burst statistics

I am a bit sceptical about both the nr_burst and burst_time as they are now.

As an example: a control group that uses 99.9% of its quota each period
and is never throttled. With this patch and a burst of X, such a group
would still get nr_throttled = 0 (as before), but its nr_burst and
burst_time would keep increasing.

I think there is a big difference between runtime moved/taken from
cfs_b->runtime to cfs_rq->runtime_remaining and the actual runtime used
in the period. Currently, cfs bw can only supply info about the former,
not the latter.
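That distinction can be illustrated with a toy model of the pool accounting. The following is a hypothetical, simplified Python sketch, not kernel code: the 5ms slice size and the ~1ms of runtime a cfs_rq keeps locally follow the documented defaults, while the 8-runqueue workload numbers are made up for illustration.

```python
import math

# Toy model of per-period pool consumption vs. actual CPU use
# (hypothetical, not kernel code). Slice size and retained local runtime
# follow the defaults (5ms slice, ~1ms kept per cfs_rq); the workload
# numbers are invented.
QUOTA = 100.0      # ms of quota per 100ms period
SLICE = 5.0        # ms pulled from cfs_b->runtime at a time
MIN_LOCAL = 1.0    # ms a cfs_rq keeps back when returning slack

def one_period(n_rqs, used_per_rq):
    """Return (actual use, pool consumption) for one period, in ms."""
    transferred = 0.0
    for _ in range(n_rqs):
        pulled = math.ceil(used_per_rq / SLICE) * SLICE  # whole slices only
        slack = pulled - used_per_rq
        returned = max(0.0, slack - MIN_LOCAL)  # ~1ms stays cached locally
        transferred += pulled - returned
    return n_rqs * used_per_rq, transferred

used, consumed = one_period(n_rqs=8, used_per_rq=12.0)
print(used, consumed)              # 96.0 104.0
print(max(0.0, consumed - QUOTA))  # 4.0
```

Here the group actually used 96ms of a 100ms quota, yet 104ms left the global pool, so accounting based on cfs_b->runtime would record a 4ms burst; conversely, runtime already cached in the cfs_rqs can hide real overuse, giving the false negatives as well.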

I think that if people see nr_burst increasing, they will think they _have_
to use cfs burst in order to avoid being throttled, even though that might
not be the case. It is probably fine as is, as long as it is explicitly stated
what the values mean and imply, and what they do not. I cannot see another
way to calculate it as it is now, but maybe someone else has some thoughts.

Thanks
Odin


2021-05-21 20:15:35

by changhuaixin

Subject: Re: [PATCH v5 2/3] sched/fair: Add cfs bandwidth burst statistics



> On May 20, 2021, at 10:11 PM, Odin Ugedal <[email protected]> wrote:
>
> I am a bit sceptical about both the nr_burst and burst_time as they are now.
>
> As an example: a control group that uses 99.9% of its quota each period
> and is never throttled. With this patch and a burst of X, such a group
> would still get nr_throttled = 0 (as before), but its nr_burst and
> burst_time would keep increasing.
>

Agreed, there are false positive and false negative cases, as the current
implementation judges by cfs_b->runtime instead of the actual runtime used.

> I think there is a big difference between runtime moved/taken from
> cfs_b->runtime to cfs_rq->runtime_remaining and the actual runtime used
> in the period. Currently, cfs bw can only supply info about the former,
> not the latter.
>
> I think that if people see nr_burst increasing, they will think they _have_
> to use cfs burst in order to avoid being throttled, even though that might
> not be the case. It is probably fine as is, as long as it is explicitly stated

People can't be seeing nr_burst increasing first and adopting the cfs burst
feature afterwards, since nr_burst only grows once burst is configured. Do you
mean people who see nr_throttled increasing and enable cfs burst, while their
actual usage is below quota? In that case, tasks get throttled because there is
runtime still to be returned from a cfs_rq, and they get unthrottled shortly
after. That is a false positive for nr_throttled, and when users see it,
enabling burst can help.

> what the values mean and imply, and what they do not. I cannot see another
> way to calculate it as it is now, but maybe someone else has some thoughts.
>
> Thanks
> Odin

2021-05-21 20:17:30

by Peter Zijlstra

Subject: Re: [PATCH v5 2/3] sched/fair: Add cfs bandwidth burst statistics

On Thu, May 20, 2021 at 04:11:52PM +0200, Odin Ugedal wrote:
> I am a bit sceptical about both the nr_burst and burst_time as they are now.
>
> As an example: a control group that uses 99.9% of its quota each period
> and is never throttled. With this patch and a burst of X, such a group
> would still get nr_throttled = 0 (as before), but its nr_burst and
> burst_time would keep increasing.
>
> I think there is a big difference between runtime moved/taken from
> cfs_b->runtime to cfs_rq->runtime_remaining and the actual runtime used
> in the period. Currently, cfs bw can only supply info about the former,
> not the latter.
>
> I think that if people see nr_burst increasing, they will think they _have_
> to use cfs burst in order to avoid being throttled, even though that might
> not be the case. It is probably fine as is, as long as it is explicitly stated
> what the values mean and imply, and what they do not. I cannot see another
> way to calculate it as it is now, but maybe someone else has some thoughts.

You can always trace the system. I don't think we have nice tracepoints
for any of this, but much can be inferred from the scheduler and hrtimer
tracepoints. Kprobes might also be employed to stick in more appropriate
thingies, I suppose.

You can also run the workload without bandwidth controls and measure
its job execution times, and from that compute the bandwidth settings,
all without tracepoints.
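That last suggestion could be sketched roughly as below. This is a hypothetical Python sketch, not something from the thread: the sampling format and the percentile choices (median per-period use for quota, the peak for burst headroom) are illustrative assumptions, not a recommended policy.

```python
# Hypothetical sketch: derive cpu.max / cpu.max.burst settings from
# per-period CPU usage sampled while the workload runs uncapped. The
# percentile choices (median for quota, peak for burst headroom) are
# illustrative assumptions.
def suggest_bandwidth(per_period_usage_us, period_us=100_000):
    """Return (quota_us, burst_us, period_us) from sampled usage."""
    samples = sorted(per_period_usage_us)
    quota = samples[len(samples) // 2]   # cover the typical period
    burst = max(0, samples[-1] - quota)  # headroom for the worst spike
    return quota, burst, period_us

# CPU time used per 100ms period, sampled over 8 periods (microseconds)
usage = [42_000, 40_000, 45_000, 41_000, 90_000, 43_000, 44_000, 39_000]
quota, burst, period = suggest_bandwidth(usage)
print(quota, burst)  # 43000 47000
# e.g. write "quota period" to cpu.max and "burst" to cpu.max.burst
```

The point being that the spiky period (90ms here) becomes burst headroom rather than inflating the steady-state quota.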