Changelog:
v5:
- Rearrange into 3 patches, one fewer than in the previous version.
- The interference with other groups is evaluated.
- Put a limit on burst, so that the code is further simplified.
- Rebase upon v5.13-rc3.
v4:
- Adjust assignments in tg_set_cfs_bandwidth(), saving unnecessary
assignments when quota == RUNTIME_INF.
- Get rid of sysctl_sched_cfs_bw_burst_onset_percent, as there seems to
be no justification either for controlling start bandwidth or for a
percentage-based interface.
- Improve the comment on the shifts in sched_cfs_period_timer(),
explaining why shifting max_overrun to 0 is a problem.
- Rename previous_runtime to runtime_at_period_start.
- Add cgroup2 interface and documentation.
- Stop exposing current_bw, as there is not enough justification for it
and it has an updating problem.
- Add justification for the cpu.stat change to the changelog.
- Rebase upon v5.12-rc3.
- Correct SoB chain.
- Several indentation fixes.
- Adjust quota in schbench test from 700000 to 600000.
Link:
https://lore.kernel.org/lkml/[email protected]/
v3:
- Fix another issue reported by test robot.
- Update docs as Randy Dunlap suggested.
Link:
https://lore.kernel.org/lkml/[email protected]/
v2:
- Fix an issue reported by test robot.
- Rewriting docs. Appreciate any further suggestions or help.
Link:
https://lore.kernel.org/lkml/[email protected]/
v1 Link:
https://lore.kernel.org/lkml/[email protected]/
Previously, Cong Wang and Konstantin Khlebnikov proposed similar
features:
https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/157476581065.5793.4518979877345136813.stgit@buzz/
This time we present more latency statistics and handle overflow while
accumulating.
Huaixin Chang (3):
sched/fair: Introduce the burstable CFS controller
sched/fair: Add cfs bandwidth burst statistics
sched/fair: Add document for burstable CFS bandwidth
Documentation/admin-guide/cgroup-v2.rst | 17 +++---
Documentation/scheduler/sched-bwc.rst | 76 ++++++++++++++++++++++----
include/linux/sched/sysctl.h | 1 +
kernel/sched/core.c | 96 ++++++++++++++++++++++++++-------
kernel/sched/fair.c | 32 ++++++++++-
kernel/sched/sched.h | 4 ++
kernel/sysctl.c | 9 ++++
7 files changed, 200 insertions(+), 35 deletions(-)
--
2.14.4.44.g2045bb6
When using CFS bandwidth control and encountering throttled periods,
users can configure a burst buffer to accommodate bursty workloads.
Besides configuring a burst buffer and watching whether the throttled
periods disappear, statistics on burst buffer usage are also helpful.
Thus expose the following statistics in the cpu.stat file:
nr_burst: number of periods in which a bandwidth burst occurs
burst_time: cumulative wall-time that any CPUs have used
above quota in the respective periods
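For example, after this change the bandwidth section of a cgroup v2
cpu.stat file would contain lines like the following (field names as
printed by cpu_extra_stat_show() below; the numbers here are purely
illustrative):

  nr_periods 1000
  nr_throttled 12
  throttled_usec 30000
  nr_burst 7
  burst_usec 42000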
Co-developed-by: Shanpei Chen <[email protected]>
Signed-off-by: Shanpei Chen <[email protected]>
Co-developed-by: Tianchen Ding <[email protected]>
Signed-off-by: Tianchen Ding <[email protected]>
Signed-off-by: Huaixin Chang <[email protected]>
---
kernel/sched/core.c | 13 ++++++++++---
kernel/sched/fair.c | 11 +++++++++++
kernel/sched/sched.h | 3 +++
3 files changed, 24 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7d34b08ee0e5..d442fcd85374 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9265,6 +9265,9 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
seq_printf(sf, "wait_sum %llu\n", ws);
}
+ seq_printf(sf, "nr_burst %d\n", cfs_b->nr_burst);
+ seq_printf(sf, "burst_time %llu\n", cfs_b->burst_time);
+
return 0;
}
#endif /* CONFIG_CFS_BANDWIDTH */
@@ -9361,16 +9364,20 @@ static int cpu_extra_stat_show(struct seq_file *sf,
{
struct task_group *tg = css_tg(css);
struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
- u64 throttled_usec;
+ u64 throttled_usec, burst_usec;
throttled_usec = cfs_b->throttled_time;
do_div(throttled_usec, NSEC_PER_USEC);
+ burst_usec = cfs_b->burst_time;
+ do_div(burst_usec, NSEC_PER_USEC);
seq_printf(sf, "nr_periods %d\n"
"nr_throttled %d\n"
- "throttled_usec %llu\n",
+ "throttled_usec %llu\n"
+ "nr_burst %d\n"
+ "burst_usec %llu\n",
cfs_b->nr_periods, cfs_b->nr_throttled,
- throttled_usec);
+ throttled_usec, cfs_b->nr_burst, burst_usec);
}
#endif
return 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48fad5cf0f7a..d4783b62a010 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4635,6 +4635,8 @@ static inline u64 sched_cfs_bandwidth_slice(void)
*/
void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
{
+ u64 runtime;
+
if (unlikely(cfs_b->quota == RUNTIME_INF))
return;
@@ -4643,8 +4645,17 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
return;
}
+ if (cfs_b->runtime_at_period_start > cfs_b->runtime) {
+ runtime = cfs_b->runtime_at_period_start - cfs_b->runtime;
+ if (runtime > cfs_b->quota) {
+ cfs_b->burst_time += runtime - cfs_b->quota;
+ cfs_b->nr_burst++;
+ }
+ }
+
cfs_b->runtime += cfs_b->quota;
cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
+ cfs_b->runtime_at_period_start = cfs_b->runtime;
}
static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d317ca74a48c..b770b553dfbb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -367,6 +367,7 @@ struct cfs_bandwidth {
u64 quota;
u64 runtime;
u64 burst;
+ u64 runtime_at_period_start;
s64 hierarchical_quota;
u8 idle;
@@ -379,7 +380,9 @@ struct cfs_bandwidth {
/* Statistics: */
int nr_periods;
int nr_throttled;
+ int nr_burst;
u64 throttled_time;
+ u64 burst_time;
#endif
};
--
2.14.4.44.g2045bb6
The CFS bandwidth controller limits the CPU requests of a task group to
quota during each period. However, parallel workloads might be bursty,
so they get throttled even when their average utilization is under
quota. At the same time they are latency sensitive, so throttling them
is undesirable.
Scaling up period and quota allows greater burst capacity, but it might
cause a longer wait until the next refill. Introduce "burst" to allow
unused quota to accumulate from previous periods, and to be consumed
when a task group requests more CPU than quota during a specific period.
Introducing a burst buffer might also cause interference with other
groups. Thus limit the maximum accumulated buffer by "burst", and limit
the maximum allowed burst by quota as well.
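Concretely, at each period boundary the refill performed by
__refill_cfs_bandwidth_runtime() becomes (the two core lines of this
patch):

	cfs_b->runtime += cfs_b->quota;
	cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);

i.e. unused quota carries over, but the accumulated runtime never
exceeds quota + burst.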
The benefit of burst is seen when testing with schbench:
echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
echo 600000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
echo 400000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
# The average CPU usage is around 500%, which is 200ms CPU time
# every 40ms.
./schbench -m 1 -t 30 -r 10 -c 10000 -R 500
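(With the cgroup2 interface added in this series, the equivalent
configuration writes all three values to cpu.max, e.g.
"600000 100000 400000" for quota, period and burst respectively, as
parsed by cpu_period_quota_parse() below.)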
Without burst:
Latency percentiles (usec)
50.0000th: 7
75.0000th: 8
90.0000th: 9
95.0000th: 10
*99.0000th: 933
99.5000th: 981
99.9000th: 3068
min=0, max=20054
rps: 498.31 p95 (usec) 10 p99 (usec) 933 p95/cputime 0.10% p99/cputime 9.33%
With burst:
Latency percentiles (usec)
50.0000th: 7
75.0000th: 8
90.0000th: 9
95.0000th: 9
*99.0000th: 12
99.5000th: 13
99.9000th: 19
min=0, max=406
rps: 498.36 p95 (usec) 9 p99 (usec) 12 p95/cputime 0.09% p99/cputime 0.12%
The interference when using burst is evaluated via the probability of
missing the deadline and the average WCET. Test results showed that when
there are many cgroups or the CPU is under-utilized, the interference is
limited. More details are shown in:
https://lore.kernel.org/lkml/[email protected]/
Co-developed-by: Shanpei Chen <[email protected]>
Signed-off-by: Shanpei Chen <[email protected]>
Co-developed-by: Tianchen Ding <[email protected]>
Signed-off-by: Tianchen Ding <[email protected]>
Signed-off-by: Huaixin Chang <[email protected]>
---
include/linux/sched/sysctl.h | 1 +
kernel/sched/core.c | 83 ++++++++++++++++++++++++++++++++++++--------
kernel/sched/fair.c | 21 ++++++++++-
kernel/sched/sched.h | 1 +
kernel/sysctl.c | 9 +++++
5 files changed, 99 insertions(+), 16 deletions(-)
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index db2c0f34aaaf..08432aeb742e 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -73,6 +73,7 @@ extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
#ifdef CONFIG_CFS_BANDWIDTH
extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+extern unsigned int sysctl_sched_cfs_bw_burst_enabled;
#endif
#ifdef CONFIG_SCHED_AUTOGROUP
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5226cc26a095..7d34b08ee0e5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8962,7 +8962,8 @@ static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
-static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
+ u64 burst)
{
int i, ret = 0, runtime_enabled, runtime_was_enabled;
struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
@@ -8992,6 +8993,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
if (quota != RUNTIME_INF && quota > max_cfs_runtime)
return -EINVAL;
+ if (quota != RUNTIME_INF && (burst > quota ||
+ burst + quota > max_cfs_runtime))
+ return -EINVAL;
+
/*
* Prevent race between setting of cfs_rq->runtime_enabled and
* unthrottle_offline_cfs_rqs().
@@ -9013,6 +9018,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
raw_spin_lock_irq(&cfs_b->lock);
cfs_b->period = ns_to_ktime(period);
cfs_b->quota = quota;
+ cfs_b->burst = burst;
__refill_cfs_bandwidth_runtime(cfs_b);
@@ -9046,9 +9052,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
{
- u64 quota, period;
+ u64 quota, period, burst;
period = ktime_to_ns(tg->cfs_bandwidth.period);
+ burst = tg->cfs_bandwidth.burst;
if (cfs_quota_us < 0)
quota = RUNTIME_INF;
else if ((u64)cfs_quota_us <= U64_MAX / NSEC_PER_USEC)
@@ -9056,7 +9063,7 @@ static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
else
return -EINVAL;
- return tg_set_cfs_bandwidth(tg, period, quota);
+ return tg_set_cfs_bandwidth(tg, period, quota, burst);
}
static long tg_get_cfs_quota(struct task_group *tg)
@@ -9074,15 +9081,16 @@ static long tg_get_cfs_quota(struct task_group *tg)
static int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
{
- u64 quota, period;
+ u64 quota, period, burst;
if ((u64)cfs_period_us > U64_MAX / NSEC_PER_USEC)
return -EINVAL;
period = (u64)cfs_period_us * NSEC_PER_USEC;
quota = tg->cfs_bandwidth.quota;
+ burst = tg->cfs_bandwidth.burst;
- return tg_set_cfs_bandwidth(tg, period, quota);
+ return tg_set_cfs_bandwidth(tg, period, quota, burst);
}
static long tg_get_cfs_period(struct task_group *tg)
@@ -9095,6 +9103,30 @@ static long tg_get_cfs_period(struct task_group *tg)
return cfs_period_us;
}
+static int tg_set_cfs_burst(struct task_group *tg, long cfs_burst_us)
+{
+ u64 quota, period, burst;
+
+ if ((u64)cfs_burst_us > U64_MAX / NSEC_PER_USEC)
+ return -EINVAL;
+
+ burst = (u64)cfs_burst_us * NSEC_PER_USEC;
+ period = ktime_to_ns(tg->cfs_bandwidth.period);
+ quota = tg->cfs_bandwidth.quota;
+
+ return tg_set_cfs_bandwidth(tg, period, quota, burst);
+}
+
+static long tg_get_cfs_burst(struct task_group *tg)
+{
+ u64 burst_us;
+
+ burst_us = tg->cfs_bandwidth.burst;
+ do_div(burst_us, NSEC_PER_USEC);
+
+ return burst_us;
+}
+
static s64 cpu_cfs_quota_read_s64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
@@ -9119,6 +9151,18 @@ static int cpu_cfs_period_write_u64(struct cgroup_subsys_state *css,
return tg_set_cfs_period(css_tg(css), cfs_period_us);
}
+static s64 cpu_cfs_burst_read_s64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return tg_get_cfs_burst(css_tg(css));
+}
+
+static int cpu_cfs_burst_write_s64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, s64 cfs_burst_us)
+{
+ return tg_set_cfs_burst(css_tg(css), cfs_burst_us);
+}
+
struct cfs_schedulable_data {
struct task_group *tg;
u64 period, quota;
@@ -9271,6 +9315,11 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_cfs_period_read_u64,
.write_u64 = cpu_cfs_period_write_u64,
},
+ {
+ .name = "cfs_burst_us",
+ .read_s64 = cpu_cfs_burst_read_s64,
+ .write_s64 = cpu_cfs_burst_write_s64,
+ },
{
.name = "stat",
.seq_show = cpu_cfs_stat_show,
@@ -9390,27 +9439,29 @@ static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css,
}
#endif
-static void __maybe_unused cpu_period_quota_print(struct seq_file *sf,
- long period, long quota)
+static void __maybe_unused
+cpu_period_quota_print(struct seq_file *sf, long period, long quota,
+ long burst)
{
if (quota < 0)
seq_puts(sf, "max");
else
seq_printf(sf, "%ld", quota);
- seq_printf(sf, " %ld\n", period);
+ seq_printf(sf, " %ld %ld\n", period, burst);
}
-/* caller should put the current value in *@periodp before calling */
-static int __maybe_unused cpu_period_quota_parse(char *buf,
- u64 *periodp, u64 *quotap)
+/* caller should put the current value in *@periodp and *@burstp before calling */
+static int __maybe_unused cpu_period_quota_parse(char *buf, u64 *periodp,
+ u64 *quotap, u64 *burstp)
{
char tok[21]; /* U64_MAX */
- if (sscanf(buf, "%20s %llu", tok, periodp) < 1)
+ if (sscanf(buf, "%20s %llu %llu", tok, periodp, burstp) < 1)
return -EINVAL;
*periodp *= NSEC_PER_USEC;
+ *burstp *= NSEC_PER_USEC;
if (sscanf(tok, "%llu", quotap))
*quotap *= NSEC_PER_USEC;
@@ -9427,7 +9478,8 @@ static int cpu_max_show(struct seq_file *sf, void *v)
{
struct task_group *tg = css_tg(seq_css(sf));
- cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
+ cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg),
+ tg_get_cfs_burst(tg));
return 0;
}
@@ -9436,12 +9488,13 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
{
struct task_group *tg = css_tg(of_css(of));
u64 period = tg_get_cfs_period(tg);
+ u64 burst = tg_get_cfs_burst(tg);
u64 quota;
int ret;
- ret = cpu_period_quota_parse(buf, &period, "a);
+ ret = cpu_period_quota_parse(buf, &period, "a, &burst);
if (!ret)
- ret = tg_set_cfs_bandwidth(tg, period, quota);
+ ret = tg_set_cfs_bandwidth(tg, period, quota, burst);
return ret ?: nbytes;
}
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3248e24a90b0..48fad5cf0f7a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -134,6 +134,13 @@ int __weak arch_asym_cpu_priority(int cpu)
* (default: 5 msec, units: microseconds)
*/
unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
+
+/*
+ * A switch for cfs bandwidth burst.
+ *
+ * (default: 1, enabled)
+ */
+unsigned int sysctl_sched_cfs_bw_burst_enabled = 1;
#endif
static inline void update_load_add(struct load_weight *lw, unsigned long inc)
@@ -4628,8 +4635,16 @@ static inline u64 sched_cfs_bandwidth_slice(void)
*/
void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
{
- if (cfs_b->quota != RUNTIME_INF)
+ if (unlikely(cfs_b->quota == RUNTIME_INF))
+ return;
+
+ if (!sysctl_sched_cfs_bw_burst_enabled) {
cfs_b->runtime = cfs_b->quota;
+ return;
+ }
+
+ cfs_b->runtime += cfs_b->quota;
+ cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
}
static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4651,6 +4666,9 @@ static int __assign_cfs_rq_runtime(struct cfs_bandwidth *cfs_b,
if (cfs_b->quota == RUNTIME_INF)
amount = min_amount;
else {
+ if (!cfs_b->period_active)
+ __refill_cfs_bandwidth_runtime(cfs_b);
+
start_cfs_bandwidth(cfs_b);
if (cfs_b->runtime > 0) {
@@ -5279,6 +5297,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
cfs_b->runtime = 0;
cfs_b->quota = RUNTIME_INF;
cfs_b->period = ns_to_ktime(default_cfs_period());
+ cfs_b->burst = 0;
INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a189bec13729..d317ca74a48c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -366,6 +366,7 @@ struct cfs_bandwidth {
ktime_t period;
u64 quota;
u64 runtime;
+ u64 burst;
s64 hierarchical_quota;
u8 idle;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 14edf84cc571..fb27bae7ace2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1816,6 +1816,15 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_ONE,
},
+ {
+ .procname = "sched_cfs_bw_burst_enabled",
+ .data = &sysctl_sched_cfs_bw_burst_enabled,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
#endif
#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
{
--
2.14.4.44.g2045bb6
On Thu, May 20, 2021 at 08:34:17PM +0800, Huaixin Chang wrote:
> The CFS bandwidth controller limits the CPU requests of a task group to
> quota during each period. However, parallel workloads might be bursty,
> so they get throttled even when their average utilization is under
> quota. At the same time they are latency sensitive, so throttling them
> is undesirable.
>
> Scaling up period and quota allows greater burst capacity, but it might
> cause a longer wait until the next refill. Introduce "burst" to allow
> unused quota to accumulate from previous periods, and to be consumed
> when a task group requests more CPU than quota during a specific period.
>
> Introducing a burst buffer might also cause interference with other
> groups. Thus limit the maximum accumulated buffer by "burst", and limit
> the maximum allowed burst by quota as well.
Overall, *much* better than before.
However I would like a little bit better discussion of how exactly
people are supposed to reason about this. That will also help with the
question from Odin on how people are supposed to set/compute this burst
value.
So traditional (UP-EDF) bandwidth control is something like:
(U = \Sum u_i) <= 1
This guarantees both that every deadline is met and that the system is
stable. After all, if U were > 1, then for every second of walltime,
we'd have to run more than a second of program time, and obviously miss
our deadline, but the next deadline will be further out still, there is
never time to catch up, unbounded fail.
This work observes that a workload doesn't always execute the full
quota; this enables one to describe u_i as a statistical distribution.
For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
(the traditional WCET). This effectively allows u to be smaller,
increasing the efficiency (we can pack more tasks in the system), but at
the cost of missing deadlines when all the odds line up. However, it
does maintain stability, since every overrun must be paired with an
underrun as long as our x is above the average.
That is, suppose we have 2 tasks, both specify a p(95) value, then we
have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
everything is good. At the same time we have a p(5)p(5) = 0.25% chance
both tasks will exceed their quota at the same time (guaranteed deadline
fail). Somewhere in between there's a threshold where one exceeds and
the other doesn't underrun enough to compensate; this depends on the
specific CDFs.
At the same time, we can say that the worst case deadline miss, will be
\Sum e_i; that is, there is a bounded tardiness (under the assumption
that x+e is indeed WCET).
And I think you can compute more fun properties.
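(For reference, the two-task arithmetic above generalizes to n tasks
that each stay within quota with independent probability p:

  P(all within quota) = p^n       (n=2, p=0.95: 90.25%)
  P(all exceed)       = (1-p)^n   (n=2, p=0.95: 0.25%)

so the guaranteed-fail case shrinks quickly with n, while the threshold
in between still depends on the specific CDFs.)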
Now, CFS bandwidth control is not EDF, and the above doesn't fully
translate, but much does I think.
We borrow time now against our future underrun, at the cost of increased
interference against the other system users. All nicely bounded etc..
> Co-developed-by: Shanpei Chen <[email protected]>
> Signed-off-by: Shanpei Chen <[email protected]>
> Co-developed-by: Tianchen Ding <[email protected]>
> Signed-off-by: Tianchen Ding <[email protected]>
> Signed-off-by: Huaixin Chang <[email protected]>
> ---
> include/linux/sched/sysctl.h | 1 +
> kernel/sched/core.c | 83 ++++++++++++++++++++++++++++++++++++--------
> kernel/sched/fair.c | 21 ++++++++++-
> kernel/sched/sched.h | 1 +
> kernel/sysctl.c | 9 +++++
> 5 files changed, 99 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index db2c0f34aaaf..08432aeb742e 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -73,6 +73,7 @@ extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
>
> #ifdef CONFIG_CFS_BANDWIDTH
> extern unsigned int sysctl_sched_cfs_bandwidth_slice;
> +extern unsigned int sysctl_sched_cfs_bw_burst_enabled;
> #endif
>
> #ifdef CONFIG_SCHED_AUTOGROUP
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5226cc26a095..7d34b08ee0e5 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8962,7 +8962,8 @@ static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
>
> static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
>
> -static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> +static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
> + u64 burst)
> {
> int i, ret = 0, runtime_enabled, runtime_was_enabled;
> struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
> @@ -8992,6 +8993,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> if (quota != RUNTIME_INF && quota > max_cfs_runtime)
> return -EINVAL;
>
> + if (quota != RUNTIME_INF && (burst > quota ||
> + burst + quota > max_cfs_runtime))
> + return -EINVAL;
> +
> /*
> * Prevent race between setting of cfs_rq->runtime_enabled and
> * unthrottle_offline_cfs_rqs().
<snip all API crud>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3248e24a90b0..48fad5cf0f7a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -134,6 +134,13 @@ int __weak arch_asym_cpu_priority(int cpu)
> * (default: 5 msec, units: microseconds)
> */
> unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
> +
> +/*
> + * A switch for cfs bandwidth burst.
> + *
> + * (default: 1, enabled)
> + */
> +unsigned int sysctl_sched_cfs_bw_burst_enabled = 1;
> #endif
>
> static inline void update_load_add(struct load_weight *lw, unsigned long inc)
> @@ -4628,8 +4635,16 @@ static inline u64 sched_cfs_bandwidth_slice(void)
> */
> void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
> {
> - if (cfs_b->quota != RUNTIME_INF)
> + if (unlikely(cfs_b->quota == RUNTIME_INF))
> + return;
> +
> + if (!sysctl_sched_cfs_bw_burst_enabled) {
> cfs_b->runtime = cfs_b->quota;
> + return;
> + }
> +
> + cfs_b->runtime += cfs_b->quota;
> + cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
> }
>
> static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> @@ -4651,6 +4666,9 @@ static int __assign_cfs_rq_runtime(struct cfs_bandwidth *cfs_b,
> if (cfs_b->quota == RUNTIME_INF)
> amount = min_amount;
> else {
> + if (!cfs_b->period_active)
> + __refill_cfs_bandwidth_runtime(cfs_b);
Why this call?
> start_cfs_bandwidth(cfs_b);
>
> if (cfs_b->runtime > 0) {
> @@ -5279,6 +5297,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> cfs_b->runtime = 0;
> cfs_b->quota = RUNTIME_INF;
> cfs_b->period = ns_to_ktime(default_cfs_period());
> + cfs_b->burst = 0;
>
> INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
> hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index a189bec13729..d317ca74a48c 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -366,6 +366,7 @@ struct cfs_bandwidth {
> ktime_t period;
> u64 quota;
> u64 runtime;
> + u64 burst;
> s64 hierarchical_quota;
>
> u8 idle;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 14edf84cc571..fb27bae7ace2 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1816,6 +1816,15 @@ static struct ctl_table kern_table[] = {
> .proc_handler = proc_dointvec_minmax,
> .extra1 = SYSCTL_ONE,
> },
> + {
> + .procname = "sched_cfs_bw_burst_enabled",
> + .data = &sysctl_sched_cfs_bw_burst_enabled,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = SYSCTL_ZERO,
> + .extra2 = SYSCTL_ONE,
> + },
> #endif
> #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
> {
What's the purpose of this new sysctl? By default it is disabled because
burst==0, only if you set burst to some !0 value does this actually do
anything.
On Thu, May 20, 2021 at 08:34:18PM +0800, Huaixin Chang wrote:
> When using CFS bandwidth control and encountering throttled periods,
> users can configure a burst buffer to accommodate bursty workloads.
> Besides configuring a burst buffer and watching whether the throttled
> periods disappear, statistics on burst buffer usage are also helpful.
> Thus expose the following statistics in the cpu.stat file:
>
> nr_burst: number of periods in which a bandwidth burst occurs
nr_bursts
> burst_time: cumulative wall-time that any CPUs have used
> above quota in the respective periods
burst_usec may be more in line with other fields in that file.
Thanks.
--
tejun
On Thu, May 20, 2021 at 08:34:18PM +0800, Huaixin Chang wrote:
> When using CFS bandwidth control and encountering throttled periods,
> users can configure a burst buffer to accommodate bursty workloads.
> Besides configuring a burst buffer and watching whether the throttled
> periods disappear, statistics on burst buffer usage are also helpful.
> Thus expose the following statistics in the cpu.stat file:
>
Helpful how.. the above is a bunch of words without any actual
justification for any of this.
> On May 21, 2021, at 10:00 PM, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, May 20, 2021 at 08:34:17PM +0800, Huaixin Chang wrote:
>> The CFS bandwidth controller limits the CPU requests of a task group to
>> quota during each period. However, parallel workloads might be bursty,
>> so they get throttled even when their average utilization is under
>> quota. At the same time they are latency sensitive, so throttling them
>> is undesirable.
>>
>> Scaling up period and quota allows greater burst capacity, but it might
>> cause a longer wait until the next refill. Introduce "burst" to allow
>> unused quota to accumulate from previous periods, and to be consumed
>> when a task group requests more CPU than quota during a specific period.
>>
>> Introducing a burst buffer might also cause interference with other
>> groups. Thus limit the maximum accumulated buffer by "burst", and limit
>> the maximum allowed burst by quota as well.
>
> Overall, *much* better than before.
>
> However I would like a little bit better discussion of how exactly
> people are supposed to reason about this. That will also help with the
> question from Odin on how people are supposed to set/compute this burst
> value.
>
> So traditional (UP-EDF) bandwidth control is something like:
>
> (U = \Sum u_i) <= 1
>
> This guarantees both that every deadline is met and that the system is
> stable. After all, if U were > 1, then for every second of walltime,
> we'd have to run more than a second of program time, and obviously miss
> our deadline, but the next deadline will be further out still, there is
> never time to catch up, unbounded fail.
>
> This work observes that a workload doesn't always execute the full
> quota; this enables one to describe u_i as a statistical distribution.
>
> For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
> (the traditional WCET). This effectively allows u to be smaller,
> increasing the efficiency (we can pack more tasks in the system), but at
> the cost of missing deadlines when all the odds line up. However, it
> does maintain stability, since every overrun must be paired with an
> underrun as long as our x is above the average.
>
> That is, suppose we have 2 tasks, both specify a p(95) value, then we
> have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
> everything is good. At the same time we have a p(5)p(5) = 0.25% chance
> both tasks will exceed their quota at the same time (guaranteed deadline
> fail). Somewhere in between there's a threshold where one exceeds and
> the other doesn't underrun enough to compensate; this depends on the
> specific CDFs.
>
> At the same time, we can say that the worst case deadline miss, will be
> \Sum e_i; that is, there is a bounded tardiness (under the assumption
> that x+e is indeed WCET).
>
> And I think you can compute more fun properties.
>
> Now, CFS bandwidth control is not EDF, and the above doesn't fully
> translate, but much does I think.
>
> We borrow time now against our future underrun, at the cost of increased
> interference against the other system users. All nicely bounded etc..
>
I shall improve the commit log then.
We did some computation on the probability that the deadline is missed, and on the
expected bound. These values are calculated with different numbers of control groups and
variable CPU utilization, with runtime under an exponential distribution, a Poisson
distribution or a Pareto distribution.
The more control groups there are, the more likely the deadline is met and the smaller the
average WCET to expect, because many equal control groups mean a small chance of U > 1.
And the more under-utilized the whole system is, the more likely the deadline is met and
the smaller the average WCET to expect.
More details are posted in
https://lore.kernel.org/lkml/[email protected]/.
>
>
>> Co-developed-by: Shanpei Chen <[email protected]>
>> Signed-off-by: Shanpei Chen <[email protected]>
>> Co-developed-by: Tianchen Ding <[email protected]>
>> Signed-off-by: Tianchen Ding <[email protected]>
>> Signed-off-by: Huaixin Chang <[email protected]>
>> ---
>> include/linux/sched/sysctl.h | 1 +
>> kernel/sched/core.c | 83 ++++++++++++++++++++++++++++++++++++--------
>> kernel/sched/fair.c | 21 ++++++++++-
>> kernel/sched/sched.h | 1 +
>> kernel/sysctl.c | 9 +++++
>> 5 files changed, 99 insertions(+), 16 deletions(-)
>>
>> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
>> index db2c0f34aaaf..08432aeb742e 100644
>> --- a/include/linux/sched/sysctl.h
>> +++ b/include/linux/sched/sysctl.h
>> @@ -73,6 +73,7 @@ extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
>>
>> #ifdef CONFIG_CFS_BANDWIDTH
>> extern unsigned int sysctl_sched_cfs_bandwidth_slice;
>> +extern unsigned int sysctl_sched_cfs_bw_burst_enabled;
>> #endif
>>
>> #ifdef CONFIG_SCHED_AUTOGROUP
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 5226cc26a095..7d34b08ee0e5 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -8962,7 +8962,8 @@ static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
>>
>> static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
>>
>> -static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>> +static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
>> + u64 burst)
>> {
>> int i, ret = 0, runtime_enabled, runtime_was_enabled;
>> struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
>> @@ -8992,6 +8993,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>> if (quota != RUNTIME_INF && quota > max_cfs_runtime)
>> return -EINVAL;
>>
>> + if (quota != RUNTIME_INF && (burst > quota ||
>> + burst + quota > max_cfs_runtime))
>> + return -EINVAL;
>> +
>> /*
>> * Prevent race between setting of cfs_rq->runtime_enabled and
>> * unthrottle_offline_cfs_rqs().
>
> <snip all API crud>
>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 3248e24a90b0..48fad5cf0f7a 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -134,6 +134,13 @@ int __weak arch_asym_cpu_priority(int cpu)
>> * (default: 5 msec, units: microseconds)
>> */
>> unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
>> +
>> +/*
>> + * A switch for cfs bandwidth burst.
>> + *
>> + * (default: 1, enabled)
>> + */
>> +unsigned int sysctl_sched_cfs_bw_burst_enabled = 1;
>> #endif
>>
>> static inline void update_load_add(struct load_weight *lw, unsigned long inc)
>> @@ -4628,8 +4635,16 @@ static inline u64 sched_cfs_bandwidth_slice(void)
>> */
>> void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>> {
>> - if (cfs_b->quota != RUNTIME_INF)
>> + if (unlikely(cfs_b->quota == RUNTIME_INF))
>> + return;
>> +
>> + if (!sysctl_sched_cfs_bw_burst_enabled) {
>> cfs_b->runtime = cfs_b->quota;
>> + return;
>> + }
>> +
>> + cfs_b->runtime += cfs_b->quota;
>> + cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
>> }
>>
>> static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
>> @@ -4651,6 +4666,9 @@ static int __assign_cfs_rq_runtime(struct cfs_bandwidth *cfs_b,
>> if (cfs_b->quota == RUNTIME_INF)
>> amount = min_amount;
>> else {
>> + if (!cfs_b->period_active)
>> + __refill_cfs_bandwidth_runtime(cfs_b);
>
> Why this call?
Since the cfs bandwidth timer stops on idle with runtime unfilled, refill the runtime when it
restarts, to make use of the underrun accumulated while the period timer was stopped. Another
way to do this might be:
throttled = !list_empty(&cfs_b->throttled_cfs_rq);
cfs_b->nr_periods += overrun;
+ __refill_cfs_bandwidth_runtime(cfs_b);
+
/*
* idle depends on !throttled (for the case of a large deficit), and if
* we're going inactive then everything else can be deferred
*/
if (cfs_b->idle && !throttled)
goto out_deactivate;
- __refill_cfs_bandwidth_runtime(cfs_b);
-
if (!throttled) {
>
>> start_cfs_bandwidth(cfs_b);
>>
>> if (cfs_b->runtime > 0) {
>> @@ -5279,6 +5297,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> cfs_b->runtime = 0;
>> cfs_b->quota = RUNTIME_INF;
>> cfs_b->period = ns_to_ktime(default_cfs_period());
>> + cfs_b->burst = 0;
>>
>> INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
>> hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index a189bec13729..d317ca74a48c 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -366,6 +366,7 @@ struct cfs_bandwidth {
>> ktime_t period;
>> u64 quota;
>> u64 runtime;
>> + u64 burst;
>> s64 hierarchical_quota;
>>
>> u8 idle;
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index 14edf84cc571..fb27bae7ace2 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -1816,6 +1816,15 @@ static struct ctl_table kern_table[] = {
>> .proc_handler = proc_dointvec_minmax,
>> .extra1 = SYSCTL_ONE,
>> },
>> + {
>> + .procname = "sched_cfs_bw_burst_enabled",
>> + .data = &sysctl_sched_cfs_bw_burst_enabled,
>> + .maxlen = sizeof(unsigned int),
>> + .mode = 0644,
>> + .proc_handler = proc_dointvec_minmax,
>> + .extra1 = SYSCTL_ZERO,
>> + .extra2 = SYSCTL_ONE,
>> + },
>> #endif
>> #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
>> {
>
> What's the purpose of this new sysctl? By default it is disabled because
> burst==0, only if you set burst to some !0 value does this actually do
> anything.
Originally, this was designed to turn the burst feature off if the system became unstable.
I guess we can remove this, as you have questioned it.
On Mon, May 24, 2021 at 08:42:03PM +0800, changhuaixin wrote:
> >> --- a/kernel/sysctl.c
> >> +++ b/kernel/sysctl.c
> >> @@ -1816,6 +1816,15 @@ static struct ctl_table kern_table[] = {
> >> .proc_handler = proc_dointvec_minmax,
> >> .extra1 = SYSCTL_ONE,
> >> },
> >> + {
> >> + .procname = "sched_cfs_bw_burst_enabled",
> >> + .data = &sysctl_sched_cfs_bw_burst_enabled,
> >> + .maxlen = sizeof(unsigned int),
> >> + .mode = 0644,
> >> + .proc_handler = proc_dointvec_minmax,
> >> + .extra1 = SYSCTL_ZERO,
> >> + .extra2 = SYSCTL_ONE,
> >> + },
> >> #endif
> >> #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
> >> {
> >
> > What's the purpose of this new sysctl? By default it is disabled because
> > burst==0, only if you set burst to some !0 value does this actually do
> > anything.
>
> Originally, this was designed to turn the burst feature off if the system became unstable.
> I guess we can remove this, as you have questioned it.
Is stability a concern? This is CFS after all, if we overload, we simply
share time as per usual.
If there is a real use-case for a global knob to limit/disable this I
don't object too much, but then please explicitly mention it.
On Mon, May 24, 2021 at 08:42:03PM +0800, changhuaixin wrote:
> >> static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> >> @@ -4651,6 +4666,9 @@ static int __assign_cfs_rq_runtime(struct cfs_bandwidth *cfs_b,
> >> if (cfs_b->quota == RUNTIME_INF)
> >> amount = min_amount;
> >> else {
> >> + if (!cfs_b->period_active)
> >> + __refill_cfs_bandwidth_runtime(cfs_b);
> >
> > Why this call?
>
> Since the cfs bandwidth timer stops on idle with runtime unfilled, refill the runtime when it
> restarts, to make use of the underrun accumulated while the period timer was stopped. Another
> way to do this might be:
>
> throttled = !list_empty(&cfs_b->throttled_cfs_rq);
> cfs_b->nr_periods += overrun;
>
> + __refill_cfs_bandwidth_runtime(cfs_b);
> +
> /*
> * idle depends on !throttled (for the case of a large deficit), and if
> * we're going inactive then everything else can be deferred
> */
> if (cfs_b->idle && !throttled)
> goto out_deactivate;
>
> - __refill_cfs_bandwidth_runtime(cfs_b);
> -
> if (!throttled) {
>
Ben, do you have a preference?
On Mon, May 24, 2021 at 08:42:03PM +0800, changhuaixin wrote:
> > On May 21, 2021, at 10:00 PM, Peter Zijlstra <[email protected]> wrote:
> >
> > On Thu, May 20, 2021 at 08:34:17PM +0800, Huaixin Chang wrote:
> >> The CFS bandwidth controller limits the CPU requests of a task group to
> >> quota during each period. However, parallel workloads might be bursty,
> >> so they get throttled even when their average utilization is under
> >> quota. At the same time they are latency sensitive, so throttling them
> >> is undesirable.
> >>
> >> Scaling up period and quota allows greater burst capacity, but it might
> >> cause a longer wait until the next refill. Introduce "burst" to allow
> >> unused quota to accumulate from previous periods, and to be consumed
> >> when a task group requests more CPU than quota during a specific period.
> >>
> >> Introducing a burst buffer might also cause interference with other
> >> groups. Thus limit the maximum accumulated buffer by "burst", and limit
> >> the maximum allowed burst by quota as well.
> >
> > Overall, *much* better than before.
> >
> > However I would like a little bit better discussion of how exactly
> > people are supposed to reason about this. That will also help with the
> > question from Odin on how people are supposed to set/compute this burst
> > value.
> >
> > So traditional (UP-EDF) bandwidth control is something like:
> >
> > (U = \Sum u_i) <= 1
> >
> > This guarantees both that every deadline is met and that the system is
> > stable. After all, if U were > 1, then for every second of walltime,
> > we'd have to run more than a second of program time, and obviously miss
> > our deadline, but the next deadline will be further out still, there is
> > never time to catch up, unbounded fail.
> >
> > This work observes that a workload doesn't always execute the full
> > quota; this enables one to describe u_i as a statistical distribution.
> >
> > For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
> > (the traditional WCET). This effectively allows u to be smaller,
> > increasing the efficiency (we can pack more tasks in the system), but at
> > the cost of missing deadlines when all the odds line up. However, it
> > does maintain stability, since every overrun must be paired with an
> > underrun as long as our x is above the average.
> >
> > That is, suppose we have 2 tasks, both specify a p(95) value, then we
> > have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
> > everything is good. At the same time we have a p(5)p(5) = 0.25% chance
> > both tasks will exceed their quota at the same time (guaranteed deadline
> > fail). Somewhere in between there's a threshold where one exceeds and
> > the other doesn't underrun enough to compensate; this depends on the
> > specific CDFs.
> >
> > At the same time, we can say that the worst case deadline miss, will be
> > \Sum e_i; that is, there is a bounded tardiness (under the assumption
> > that x+e is indeed WCET).
Having second thoughts about this exact claim; lightning can strike
twice, and if we exceed bounds again before having recovered from the
last time we might exceed the bound mentioned. I _think_ the property
holds, but the bound might need work.
> > And I think you can compute more fun properties.
> >
> > Now, CFS bandwidth control is not EDF, and the above doesn't fully
> > translate, but much does I think.
> >
> > We borrow time now against our future underrun, at the cost of increased
> > interference against the other system users. All nicely bounded etc..
> >
>
> I shall improve the commit log then.
Thanks!
> We did some computation on the probability that the deadline is missed, and on the
> expected bound. These values are calculated with different numbers of control groups and
> variable CPU utilization, with runtime under an exponential distribution, a Poisson
> distribution or a Pareto distribution.
>
> The more control groups there are, the more likely the deadline is met and the smaller the
> average WCET to expect, because many equal control groups mean a small chance of U > 1.
>
> And the more under-utilized the whole system is, the more likely the deadline is met and
> the smaller the average WCET to expect.
>
> More details are posted in
> https://lore.kernel.org/lkml/[email protected]/.
Indeed you did; I'm a bit sad it's so hard to find papers that cover
this. When one Googles for 'Probabilistic WCET' there's a fair number of
papers about using Extreme Value Theory to estimate the traditional WCET
given measurement based input. Many from the excellent WCET track at
ECRTS.
The thing is, the last time I attended that conference (which appears to
be almost 4 years ago :/), I'm sure I spoke to people about exactly the
thing explored here. Albeit, at the time we discussed this as a
SCHED_DEADLINE task model extension.
Let me Cc a bunch of people that might know more...
Peter Zijlstra <[email protected]> writes:
> On Mon, May 24, 2021 at 08:42:03PM +0800, changhuaixin wrote:
>
>> >> static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
>> >> @@ -4651,6 +4666,9 @@ static int __assign_cfs_rq_runtime(struct cfs_bandwidth *cfs_b,
>> >> if (cfs_b->quota == RUNTIME_INF)
>> >> amount = min_amount;
>> >> else {
>> >> + if (!cfs_b->period_active)
>> >> + __refill_cfs_bandwidth_runtime(cfs_b);
>> >
>> > Why this call?
>>
>> Since the cfs bandwidth timer stops on idle with runtime unfilled, refill the runtime when it
>> restarts, to make use of the underrun accumulated while the period timer was stopped. Another
>> way to do this might be:
>>
>> throttled = !list_empty(&cfs_b->throttled_cfs_rq);
>> cfs_b->nr_periods += overrun;
>>
>> + __refill_cfs_bandwidth_runtime(cfs_b);
>> +
>> /*
>> * idle depends on !throttled (for the case of a large deficit), and if
>> * we're going inactive then everything else can be deferred
>> */
>> if (cfs_b->idle && !throttled)
>> goto out_deactivate;
>>
>> - __refill_cfs_bandwidth_runtime(cfs_b);
>> -
>> if (!throttled) {
>>
>
> Ben, do you have a preference?
I think I prefer the latter, possibly with a
/* Refill extra burst quota even if cfs_b->idle */
Hi all,
On Tue, 25 May 2021 12:46:52 +0200
Peter Zijlstra <[email protected]> wrote:
[...]
> > We did some computation on the probability that the deadline is missed, and on the
> > expected bound. These values are calculated with different numbers of control groups and
> > variable CPU utilization, with runtime under an exponential distribution, a Poisson
> > distribution or a Pareto distribution.
> >
> > The more control groups there are, the more likely the deadline is met and the smaller the
> > average WCET to expect, because many equal control groups mean a small chance of U > 1.
> >
> > And the more under-utilized the whole system is, the more likely the deadline is met and
> > the smaller the average WCET to expect.
> >
> > More details are posted in
> > https://lore.kernel.org/lkml/[email protected]/.
>
> Indeed you did; I'm a bit sad it's so hard to find papers that cover
> this. When one Googles for 'Probabilistic WCET' there's a fair number of
> papers about using Extreme Value Theory to estimate the traditional WCET
> given measurement based input. Many from the excellent WCET track at
> ECRTS.
If I understand the context correctly, you do not need probabilistic WCET
here...
If you assume that the probability distributions of the inter-arrival
times and execution times are known (this is what is assumed at
https://lore.kernel.org/lkml/[email protected]/,
right?), then you can use "standard" queuing theory to compute the
response time distribution.
If I understand correctly, in the link mentioned above the response times are
measured by simulating a model of the scheduler. Queuing theory can be
used instead, as shown (for example) in these papers:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.7683&rep=rep1&type=pdf
http://retis.sssup.it/~giorgio/paps/2001/wpdrts01.pdf
(these papers consider a scheduler similar to SCHED_DEADLINE, but the
approach can be easily applied to every scheduler that guarantees a
runtime in a period --- I think the CFS controller falls in this
category, right?)
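(As a concrete example of such a guarantee: a budget Q supplied every
period P is the classic periodic resource model, whose well-known
linear supply lower bound is

  sbf(t) >= max(0, (Q/P) * (t - 2*(P - Q)))

Intuitively, burst should only add a bounded one-off term to such a
model, since the accumulated runtime is capped at Q + burst.)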
I think the burst mentioned above can be added to this queuing model;
I'll have a look at this in the next days.
The problem with this approach is that the execution times of different
activations of a task are considered to be independent and identically
distributed (this is the infamous "IID assumption"). And this
assumption is often unrealistic...
The probabilistic WCET approach mentioned above allows you to analyze
the behaviour of a scheduler without assuming that the execution (and/or
inter-activation) times are IID.
Luca
> The thing is, the last time I attended that conference (which appears to
> be almost 4 years ago :/), I'm sure I spoke to people about exactly the
> thing explored here. Albeit, at the time we discussed this as a
> SCHED_DEADLINE task model extension.
>
> Let me Cc a bunch of people that might know more...