by Benjamin Segall

Subject: Re: [PATCH 0/4] sched/fair: Burstable CFS bandwidth controller

The code for this looks fine, and the feature is something people do
seem to ask for occasionally. I agree with peterz that using this
generally means you lose any guarantees (which are already imperfect
given CFS), but I suspect that cfsb is being used in overload anyways.

The docs could use a grammar/wording pass maybe, but that's easy enough.

2020-12-18 09:59:09

2021-01-20 13:15:14

by changhuaixin

Subject: [PATCH 4/4] sched/fair: Add document for burstable CFS bandwidth control

Basic description of usage and effect for CFS Bandwidth Control Burst.

Signed-off-by: Huaixin Chang <[email protected]>
Signed-off-by: Shanpei Chen <[email protected]>
---
Documentation/scheduler/sched-bwc.rst | 70 +++++++++++++++++++++++++++++++++--
1 file changed, 66 insertions(+), 4 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst
index 9801d6b284b1..2214ecaad393 100644
--- a/Documentation/scheduler/sched-bwc.rst
+++ b/Documentation/scheduler/sched-bwc.rst
@@ -21,18 +21,46 @@ cfs_quota units at each period boundary. As threads consume this bandwidth it
is transferred to cpu-local "silos" on a demand basis. The amount transferred
within each of these updates is tunable and described as the "slice".

+By default, CPU bandwidth consumption is strictly limited to quota within each
+given period. For a sequence of CPU usage u_i served under CFS bandwidth
+control, where N(j,k) is the number of periods from u_j to u_k, for any j <= k:
+
+ u_j+...+u_k <= quota * N(j,k)
+
+For a bursty sequence in which the interval u_j...u_k is at the peak, CPU
+requests might have to wait for more periods to replenish enough quota.
+Otherwise, a larger quota is required.
+
+With a "burst" buffer, CPU requests can be served as long as:
+
+ u_j+...+u_k <= B_j + quota * N(j,k)
+
+where, for any j <= k, N(j,k) is the number of periods from u_j to u_k and B_j
+is the quota accumulated from previous periods in the burst buffer serving u_j.
+The burst buffer makes it possible to serve a whole run of bursty CPU requests
+without throttling, using a moderate quota plus the quota accumulated in the
+burst buffer, if:
+
+ u_0+...+u_n <= B_0 + quota * N(0,n)
+
+where B_0 is the initial state of the burst buffer. The maximum accumulated
+quota in the burst buffer is capped by burst. With a proper burst setting, the
+available bandwidth is still determined by quota and period in the long run.
+
Management
----------
-Quota and period are managed within the cpu subsystem via cgroupfs.
+Quota, period and burst are managed within the cpu subsystem via cgroupfs.

-cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
+cpu.cfs_quota_us: run-time replenished within a period (in microseconds)
cpu.cfs_period_us: the length of a period (in microseconds)
+cpu.cfs_burst_us: the maximum accumulated run-time (in microseconds)
cpu.stat: exports throttling statistics [explained further below]

The default values are::

cpu.cfs_period_us=100ms
- cpu.cfs_quota=-1
+ cpu.cfs_quota_us=-1
+ cpu.cfs_burst_us=0

A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
bandwidth restriction in place, such a group is described as an unconstrained
@@ -48,6 +76,11 @@ more detail below.
Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
and return the group to an unconstrained state once more.

+A value of 0 for cpu.cfs_burst_us indicates that the group cannot accumulate
+any unused bandwidth, which leaves the traditional CFS bandwidth control
+behavior unchanged. Writing any (valid) positive value into cpu.cfs_burst_us
+sets the cap on unused bandwidth accumulation.
+
Any updates to a group's bandwidth specification will result in it becoming
unthrottled if it is in a constrained state.

@@ -65,9 +98,21 @@ This is tunable via procfs::
Larger slice values will reduce transfer overheads, while smaller values allow
for more fine-grained consumption.

+There is also a global switch to turn off burst for all groups::
+ /proc/sys/kernel/sched_cfs_bw_burst_enabled (default=1)
+
+By default it is enabled. Writing 0 means no accumulated CPU time can be
+used by any group, even if cpu.cfs_burst_us is configured.
+
+Sometimes users might want a group to burst without accumulation. This is
+tunable via::
+ /proc/sys/kernel/sched_cfs_bw_burst_onset_percent (default=0)
+
+Up to 100% runtime of cpu.cfs_burst_us might be given on setting bandwidth.
+
Statistics
----------
-A group's bandwidth statistics are exported via 3 fields in cpu.stat.
+A group's bandwidth statistics are exported via 6 fields in cpu.stat.

cpu.stat:

@@ -75,6 +120,11 @@ cpu.stat:
- nr_throttled: Number of times the group has been throttled/limited.
- throttled_time: The total time duration (in nanoseconds) for which entities
of the group have been throttled.
+- current_bw: Current runtime in the global pool.
+- nr_burst: Number of periods in which a burst occurs.
+- burst_time: Cumulative wall-time that any CPU has used above quota in
+ respective periods.
+

This interface is read-only.

@@ -172,3 +222,15 @@ Examples

By using a small period here we are ensuring a consistent latency
response at the expense of burst capacity.
+
+4. Limit a group to 20% of 1 CPU, and allow it to accumulate up to an
+ additional 60% of 1 CPU, usable once accumulation has taken place.
+
+ With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU.
+ And 30ms burst will be equivalent to 60% of 1 CPU.
+
+ # echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
+ # echo 50000 > cpu.cfs_period_us /* period = 50ms */
+ # echo 30000 > cpu.cfs_burst_us /* burst = 30ms */
+
+ A larger buffer setting allows greater burst capacity.
--
2.14.4.44.g2045bb6
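
The inequality in the documentation patch above can be checked with a small
user-space model. The sketch below is not kernel code. It assumes the pool is
replenished by quota at each period boundary and capped at quota + burst, and
that demand the pool cannot cover carries over to the next period. The name
count_throttled_periods and its parameters are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * User-space model (an assumption-laden sketch, not the kernel code) of the
 * burst buffer. All times are in microseconds.
 *
 * demand[i] is the CPU time requested in period i. Demand that exceeds the
 * pool is deferred to the next period, which is how this model represents
 * throttling.
 *
 * Returns the number of periods in which demand exceeded the pool.
 */
static unsigned int count_throttled_periods(const uint64_t *demand, size_t n,
                                            uint64_t quota, uint64_t burst)
{
    uint64_t cap = quota + burst;  /* upper bound on accumulated runtime */
    uint64_t runtime = quota;      /* start with one quota, no onset burst */
    uint64_t backlog = 0;          /* unserved demand carried forward */
    unsigned int throttled = 0;

    assert(quota > 0);

    for (size_t i = 0; i < n; i++) {
        uint64_t want = backlog + demand[i];

        if (want > runtime) {
            /* Pool exhausted: serve what is there, defer the rest. */
            backlog = want - runtime;
            runtime = 0;
            throttled++;
        } else {
            backlog = 0;
            runtime -= want;
        }

        /* Period boundary: replenish, then cap the accumulation. */
        runtime += quota;
        if (runtime > cap)
            runtime = cap;
    }
    return throttled;
}
```

With the values from example 4 (quota 10000, burst 30000), a demand of 40000
arriving after three idle periods is served without throttling in this model.
With burst 0 the same demand is spread over several throttled periods.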

2021-01-20 13:46:48

by changhuaixin

Subject: [PATCH 2/4] sched/fair: Make CFS bandwidth controller burstable

Accumulate unused quota from previous periods so that the accumulated
bandwidth runtime can be used in the following periods. During
accumulation, take care of runtime overflow. The previous non-burstable
CFS bandwidth controller only assigns quota to runtime, which saves a lot.

A sysctl parameter sysctl_sched_cfs_bw_burst_onset_percent is introduced to
denote what percentage of burst is given on setting cfs bandwidth. By
default it is 0, which means no burst is allowed unless accumulated.

Also, parameter sysctl_sched_cfs_bw_burst_enabled is introduced as a
switch for burst. It is enabled by default.

Signed-off-by: Huaixin Chang <[email protected]>
Signed-off-by: Shanpei Chen <[email protected]>
Reported-by: kernel test robot <[email protected]>
---
include/linux/sched/sysctl.h | 2 ++
kernel/sched/core.c | 31 +++++++++++++++++++++++++----
kernel/sched/fair.c | 46 ++++++++++++++++++++++++++++++++++++--------
kernel/sched/sched.h | 4 ++--
kernel/sysctl.c | 18 +++++++++++++++++
5 files changed, 87 insertions(+), 14 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 3c31ba88aca5..3400828eaf2d 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -72,6 +72,8 @@ extern unsigned int sysctl_sched_uclamp_util_min_rt_default;

#ifdef CONFIG_CFS_BANDWIDTH
extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+extern unsigned int sysctl_sched_cfs_bw_burst_onset_percent;
+extern unsigned int sysctl_sched_cfs_bw_burst_enabled;
#endif

#ifdef CONFIG_SCHED_AUTOGROUP
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 48d3bad12be2..fecf0f05ef0c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -66,6 +66,16 @@ const_debug unsigned int sysctl_sched_features =
*/
const_debug unsigned int sysctl_sched_nr_migrate = 32;

+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * Percent of burst assigned to cfs_b->runtime on tg_set_cfs_bandwidth,
+ * 0 by default.
+ */
+unsigned int sysctl_sched_cfs_bw_burst_onset_percent;
+
+unsigned int sysctl_sched_cfs_bw_burst_enabled = 1;
+#endif
+
/*
* period over which we measure -rt task CPU usage in us.
* default: 1s
@@ -7891,7 +7901,7 @@ static DEFINE_MUTEX(cfs_constraints_mutex);
const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
static const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
/* More than 203 days if BW_SHIFT equals 20. */
-static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
+const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;

static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);

@@ -7900,7 +7910,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
{
int i, ret = 0, runtime_enabled, runtime_was_enabled;
struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
- u64 buffer;
+ u64 buffer, burst_onset;

if (tg == &root_task_group)
return -EINVAL;
@@ -7961,11 +7971,24 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
cfs_b->burst = burst;
cfs_b->buffer = buffer;

- __refill_cfs_bandwidth_runtime(cfs_b);
+ cfs_b->max_overrun = DIV_ROUND_UP_ULL(max_cfs_runtime, quota);
+ cfs_b->runtime = cfs_b->quota;
+
+ /* burst_onset needed */
+ if (cfs_b->quota != RUNTIME_INF &&
+ sysctl_sched_cfs_bw_burst_enabled &&
+ sysctl_sched_cfs_bw_burst_onset_percent > 0) {
+
+ burst_onset = div_u64(burst, 100) *
+ sysctl_sched_cfs_bw_burst_onset_percent;
+
+ cfs_b->runtime += burst_onset;
+ cfs_b->runtime = min(max_cfs_runtime, cfs_b->runtime);
+ }

/* Restart the period timer (if active) to handle new period expiry: */
if (runtime_enabled)
- start_cfs_bandwidth(cfs_b);
+ start_cfs_bandwidth(cfs_b, 1);

raw_spin_unlock_irq(&cfs_b->lock);

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6bb4f89259fd..38a726f77783 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4598,10 +4598,22 @@ static inline u64 sched_cfs_bandwidth_slice(void)
*
* requires cfs_b->lock
*/
-void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
+void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b, u64 overrun)
{
- if (cfs_b->quota != RUNTIME_INF)
- cfs_b->runtime = cfs_b->quota;
+ u64 refill;
+
+ if (cfs_b->quota != RUNTIME_INF) {
+
+ if (!sysctl_sched_cfs_bw_burst_enabled) {
+ cfs_b->runtime = cfs_b->quota;
+ return;
+ }
+
+ overrun = min(overrun, cfs_b->max_overrun);
+ refill = cfs_b->quota * overrun;
+ cfs_b->runtime += refill;
+ cfs_b->runtime = min(cfs_b->runtime, cfs_b->buffer);
+ }
}

static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4623,7 +4635,7 @@ static int __assign_cfs_rq_runtime(struct cfs_bandwidth *cfs_b,
if (cfs_b->quota == RUNTIME_INF)
amount = min_amount;
else {
- start_cfs_bandwidth(cfs_b);
+ start_cfs_bandwidth(cfs_b, 0);

if (cfs_b->runtime > 0) {
amount = min(cfs_b->runtime, min_amount);
@@ -4957,7 +4969,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
if (cfs_b->idle && !throttled)
goto out_deactivate;

- __refill_cfs_bandwidth_runtime(cfs_b);
+ __refill_cfs_bandwidth_runtime(cfs_b, overrun);

if (!throttled) {
/* mark as potentially idle for the upcoming period */
@@ -5181,6 +5193,7 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
}

extern const u64 max_cfs_quota_period;
+extern const u64 max_cfs_runtime;

static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
{
@@ -5210,7 +5223,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
new = old * 2;
if (new < max_cfs_quota_period) {
cfs_b->period = ns_to_ktime(new);
- cfs_b->quota *= 2;
+ cfs_b->quota = min(cfs_b->quota * 2,
+ max_cfs_runtime);
+
+ cfs_b->buffer = min(max_cfs_runtime,
+ cfs_b->quota + cfs_b->burst);
+ /* Add 1 in case max_overrun becomes 0. */
+ cfs_b->max_overrun >>= 1;
+ cfs_b->max_overrun++;

pr_warn_ratelimited(
"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
@@ -5259,16 +5279,26 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
INIT_LIST_HEAD(&cfs_rq->throttled_list);
}

-void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, int init)
{
+ u64 overrun;
+
lockdep_assert_held(&cfs_b->lock);

if (cfs_b->period_active)
return;

cfs_b->period_active = 1;
- hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
+ overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
+
+ /*
+ * When period timer stops, quota for the following period is not
+ * refilled, however period timer is already forwarded. We should
+ * accumulate quota once more than overrun here.
+ */
+ if (!init)
+ __refill_cfs_bandwidth_runtime(cfs_b, overrun + 1);
}

static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a8772eca8cbb..ff8b5382485d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -366,6 +366,7 @@ struct cfs_bandwidth {
u64 runtime;
u64 burst;
u64 buffer;
+ u64 max_overrun;
s64 hierarchical_quota;

u8 idle;
@@ -476,8 +477,7 @@ extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
struct sched_entity *parent);
extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);

-extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
-extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
+extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, int init);
extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);

extern void free_rt_sched_group(struct task_group *tg);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afad085960b8..291dca62a571 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1842,6 +1842,24 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_ONE,
},
+ {
+ .procname = "sched_cfs_bw_burst_onset_percent",
+ .data = &sysctl_sched_cfs_bw_burst_onset_percent,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = &one_hundred,
+ },
+ {
+ .procname = "sched_cfs_bw_burst_enabled",
+ .data = &sysctl_sched_cfs_bw_burst_enabled,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
#endif
#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
{
--
2.14.4.44.g2045bb6
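
The refill and onset arithmetic in this patch can be modelled in user space.
The sketch below is an illustration under stated assumptions, not the kernel
implementation: the names refill_runtime, onset_runtime and min_u64 are
hypothetical, and the onset is taken as the quotient burst / 100 multiplied
by the percentage.

```c
#include <assert.h>
#include <stdint.h>

/* Plain 64-bit minimum, standing in for the kernel's min() macro. */
static uint64_t min_u64(uint64_t a, uint64_t b)
{
    return a < b ? a : b;
}

/*
 * Model of the burst-enabled refill path: add one quota per elapsed period
 * (overrun), limit the multiplier to max_overrun so the product stays
 * bounded, then cap the pool at buffer (quota + burst in the patch).
 */
static uint64_t refill_runtime(uint64_t runtime, uint64_t quota,
                               uint64_t buffer, uint64_t overrun,
                               uint64_t max_overrun)
{
    overrun = min_u64(overrun, max_overrun);
    return min_u64(runtime + quota * overrun, buffer);
}

/*
 * Model of the initial runtime when bandwidth is set: quota plus a
 * percentage of burst, capped at max_runtime. The percentage is applied to
 * the quotient burst / 100 (an assumption of this sketch).
 */
static uint64_t onset_runtime(uint64_t quota, uint64_t burst,
                              unsigned int percent, uint64_t max_runtime)
{
    uint64_t onset;

    assert(percent <= 100);
    onset = burst / 100 * percent;
    return min_u64(quota + onset, max_runtime);
}
```

For instance, with quota 10000 and burst 30000, an onset percentage of 50
gives an initial runtime of 25000 in this model, and a refill after five
missed periods is capped at the buffer of 40000.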

2021-01-20 17:18:44

by kernel test robot

Subject: Re: [PATCH 2/4] sched/fair: Make CFS bandwidth controller burstable

Attachments:

(No filename) (3.95 kB)
.config.gz (42.35 kB)

Subject: Re: [PATCH v3 4/4] sched/fair: Add document for burstable CFS bandwidth control

On Thu, Jan 21, 2021 at 07:04:53PM +0800, Huaixin Chang wrote:
> Basic description of usage and effect for CFS Bandwidth Control Burst.
>
> Signed-off-by: Huaixin Chang <[email protected]>
> Signed-off-by: Shanpei Chen <[email protected]>

Guess :-)

> +Sometimes users might want a group to burst without accumulation. This is
> +tunable via::
> + /proc/sys/kernel/sched_cfs_bw_burst_onset_percent (default=0)
> +
> +Up to 100% runtime of cpu.cfs_burst_us might be given on setting bandwidth.

Sometimes is a very crap reason for code to exist. Also, everything is
in _us, why do we have this one thing as a percent?

2021-03-12 13:28:38

by changhuaixin

Subject: Re: [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller

> On Mar 10, 2021, at 7:11 PM, Odin Ugedal <[email protected]> wrote:
>
> Hi,
>
>> If there are cases where the "start bandwidth" matters, I think there is need to expose the
>> "start bandwidth" explicitly too. However, I doubt the existence of such cases from my view
>> and the two examples above.
>
> Yeah, I don't think there will be any cases where users will be
> "depending" on having burst available,
> so I agree in that sense.
>
>> In my thoughts, this patchset keeps cgroup usage within the quota in the longer term, and allows
>> cgroup to respond to a burst of work with the help of a reasonable burst buffer. If quota is set correctly
>> above average usage, and enough burst buffer is set to meet the needs of bursty work, then
>> it makes no difference whether this cgroup runs with 0 start bandwidth or all of it.
>> Thus I used sysctl_sched_cfs_bw_burst_onset_percent to decide the start bandwidth
>> to leave some convenience here. If this sysctl interface is confusing, I wonder whether it
>> is a good idea not to expose this interface.
>>
>> For the first case mentioned above, if Kubernetes users care about the "start bandwidth" for process startup,
>> maybe it is better to give all of it rather than a part?
>
> Yeah, I am a bit afraid there will be some confusion, so not sure if
> the sysctl is the best way to do it.
>
> But I would like feedback from others to highlight the problem as
> well, that would be helpful. I think a simple "API"
> where you get 0 burst or full burst on "set" (the one we decide on)
> would be best to avoid unnecessary complexity.
>
> Start burst when starting up a new process in a new cgroup might be
> helpful, so maybe that is a vote for
> full burst? However, in long term that doesn't matter, so 0 burst on
> start would work as well.
>
>> For the second case with quota changes over time, I think it is important making sure each change works
>> long enough to enforce average quota limit. Does it really matter to control "start burst" on each change?
>
> No, I don't think so. Doing so would be another thing to set per
> cgroup, and that would just clutter the api
> more than necessary imo., since we cannot come up with any real use cases.
>
>> It is a copy of runtime at period start, and used to calculate burst time during a period.
>> Not quite remaining_runtime_prev_period.
>
> Ahh, I see, I misunderstood the code. So in essence it is
> "runtime_at_period_start"?
>

Yes, it is "runtime_at_period_start".

>> Yeah, there is the updating problem. It is okay not to expose cfs_b->runtime then.
>
> Yeah, I think dropping it altogether is the best solution.
>
>
>> This comment does not mean any loss or any unnecessary throttling for the present cfsb.
>> All this means is that all quota refilling that is not done during timer stop should be
>> refilled on timer start, for the burstable cfsb.
>>
>> Maybe I shall change this comment in some way if it is misleading?
>
> I think I formulated my question badly. The comment makes sense, I am
> just trying to compare how "start_cfs_bandwidth"
> works after your patch compared to how it works currently. As I
> understand, without this patch "start_cfs_bandwidth" will
> never refill runtime, while with your patch, it will refill even when
> overrun=0 with burst disabled. Is that an intended change in
> behavior, or am I not understanding the patch?
>

Good point. The way "start_cfs_bandwidth" works has indeed changed. The present cfs_b doesn't
have to refill bandwidth because quota is not used during the period before the timer stops. With this patch,
runtime is refilled whether or not burst is enabled. Do you suggest not refilling runtime unless burst
is enabled here?

>
> On another note, I have also been testing this patch, and I am not
> able to reproduce your schbench results. Both with and without burst,
> it gets the same result, and nr_throttled stays at 0 (tested on a
> 32-core system). Can you try to rerun your tests with the mainline
> to see if you still get the same results? (Also, I see you are running
> with 30 threads. How many cores does your test setup have?). To actually
> say that the result is real, all cores used should maybe be
> exclusively reserved as well, to avoid issues where other processes
> cause a
> spike in latency.
>

Spikes indeed cause trouble. If nr_throttled stays at 0, I suggest changing quota from 700000 to 600000,
which is still above the average utilization of 500%. I have rerun on a 64-core system and reproduced the
results. And I think it should work on a 32-core system too, as there are 20 active workers in each round.

If you still have trouble, I suggest testing in the following way. It should work on a two-core system.

mkdir /sys/fs/cgroup/cpu/test
echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
echo 300000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us

./schbench -m 1 -t 3 -r 20 -c 200000 -R 4

On my machine, two workers work for 200ms and sleep for 300ms in each round. The average utilization is
around 80%.

>
> Odin
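
The numbers in the two-core test above can be worked through with a short
calculation. This is a rough sketch under assumptions that the message does
not state: the period is taken to be the default 100000 us, and the work phase
is assumed to line up with period boundaries. The function names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Average utilization, in percent of one CPU, of a work/sleep cycle. */
static uint64_t avg_utilization_pct(uint64_t workers, uint64_t work_us,
                                    uint64_t sleep_us)
{
    assert(work_us + sleep_us > 0);
    return workers * work_us * 100 / (work_us + sleep_us);
}

/*
 * CPU time (us) that has to come from previously accumulated runtime for the
 * work phase to run unthrottled: total demand minus the quota replenished
 * over the periods the work phase spans.
 */
static uint64_t burst_needed_us(uint64_t workers, uint64_t work_us,
                                uint64_t quota_us, uint64_t period_us)
{
    uint64_t demand, refilled;

    assert(period_us > 0);
    demand = workers * work_us;
    refilled = work_us / period_us * quota_us;
    return demand > refilled ? demand - refilled : 0;
}
```

Two workers running 200 ms and sleeping 300 ms give 80% average utilization,
below the 100% quota. Under the assumptions above, each work phase needs about
200000 us beyond what quota replenishes, which fits within the 300000 us
configured in cpu.cfs_burst_us.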