2019-05-17 20:23:49

by Dave Chiluk

Subject: [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu slices

It has been observed that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to threads being allocated per-cpu bandwidth
slices, and then not fully using those slices within the period, at
which point the slice and the quota it represents expire. This
expiration of unused slices results in applications not being able to
utilize the quota they are allocated.
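
To make the failure mode concrete, here is a rough userspace model of
the accounting; this is not kernel code, the 5ms slice is merely the
sched_cfs_bandwidth_slice_us default, and the quota, cpu count and
per-thread work below are made-up numbers for illustration:

  /*
   * Back-of-the-envelope model of the problem, not kernel code.
   * Assumed numbers: 5ms slice, 50ms quota per period, and 8 worker
   * threads that each wake on their own cpu and do ~1ms of work.
   */
  #include <stdio.h>

  int main(void)
  {
          const unsigned int slice_us = 5000;   /* runtime pulled per cpu, on demand */
          const unsigned int quota_us = 50000;  /* cpu.cfs_quota_us                  */
          const unsigned int nr_cpus  = 8;      /* cpus the worker threads wake on   */
          const unsigned int work_us  = 1000;   /* work each thread actually does    */

          unsigned int pulled   = nr_cpus * slice_us;  /* 40ms leaves the global pool */
          unsigned int used     = nr_cpus * work_us;   /*  8ms of real work           */
          unsigned int stranded = pulled - used;       /* 32ms parked on local silos  */
          unsigned int pool     = quota_us - pulled;   /* 10ms left globally          */

          printf("%uus of %uus quota pulled into cpu-local silos, %uus used\n",
                 pulled, quota_us, used);
          printf("%uus stranded on silos, only %uus left in the global pool\n",
                 stranded, pool);
          return 0;
  }

With expiration, the stranded runtime is simply discarded at the period
boundary, so threads that keep running can only draw on what is left in
the global pool and are throttled even though the group has done far
less work than its quota allows; without expiration that stranded
runtime remains usable.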

The expiration of per-cpu slices was recently fixed by commit
512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition").
Prior to that it appears that this had been broken since at least commit
51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period"), which was introduced in v3.16-rc1 in 2014. That
commit added the following conditional, which resulted in slices never
being expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
/* extend local deadline, drift is bounded above by 2 ticks */
cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has only recently been
fixed, the effects are now being noticed by many users running
kubernetes (https://github.com/kubernetes/kubernetes/issues/67577). It
is my opinion that the mechanisms around expiring runtime should be
removed altogether.

This change allows only per-cpu slices to live longer than the period
boundary. Threads on runqueues that do not use much CPU can therefore
continue to use their remaining slice over a longer period of time than
cpu.cfs_period_us. In turn, this helps prevent the above condition of
hitting throttling while also not fully utilizing the cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining per-cpu slice that was left unused in the previous period.
For CPU-bound tasks this will change nothing, as they should
theoretically fully utilize all of their quota and slices in each
period. For the user-interactive tasks described above this provides a
much better user/application experience, as their cpu utilization will
more closely match the amount they requested when they hit throttling.
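
To give a feel for how large that overflow can get, here is a small
sketch with assumed numbers (8 cpus, the default 5ms slice, 100ms of
quota); the bound is simply one leftover slice per cpu at most:

  /* Worst-case per-period overrun sketch; all numbers are assumptions. */
  #include <stdio.h>

  int main(void)
  {
          const unsigned int slice_us = 5000;    /* default bandwidth slice     */
          const unsigned int nr_cpus  = 8;       /* cpus the task group runs on */
          const unsigned int quota_us = 100000;  /* cpu.cfs_quota_us            */

          /* at most one unused slice per cpu can carry over from the last period */
          unsigned int worst_burst = nr_cpus * slice_us;

          printf("worst-case extra runtime in a period: %uus (%u%% of quota)\n",
                 worst_burst, 100 * worst_burst / quota_us);
          return 0;
  }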

This greatly improves the performance of high-thread-count, non-cpu
bound applications with a low cfs_quota_us allocation on high-core-count
machines. With an artificial testcase, this has been observed to yield
almost a 30x performance improvement, while still maintaining correct
cpu quota restrictions, albeit over longer time intervals than
cpu.cfs_period_us. That testcase is available at
https://github.com/indeedeng/fibtest.
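
For anyone wanting to reproduce or observe this, the knobs involved are
the usual cgroup-v1 cpu controller files. A minimal sketch follows; it
assumes the controller is mounted at /sys/fs/cgroup/cpu and that a group
named "test" already exists, so adjust the paths for your setup:

  /* Minimal userspace sketch: set a quota/period and dump cpu.stat. */
  #include <stdio.h>

  static int write_val(const char *path, long val)
  {
          FILE *f = fopen(path, "w");

          if (!f)
                  return -1;
          fprintf(f, "%ld\n", val);
          return fclose(f);
  }

  int main(void)
  {
          /* 100ms period, 50ms quota: the group may use 0.5 cpu on average */
          write_val("/sys/fs/cgroup/cpu/test/cpu.cfs_period_us", 100000);
          write_val("/sys/fs/cgroup/cpu/test/cpu.cfs_quota_us", 50000);

          /* nr_periods/nr_throttled/throttled_time show how often the limit is hit */
          FILE *f = fopen("/sys/fs/cgroup/cpu/test/cpu.stat", "r");
          char line[128];

          if (f) {
                  while (fgets(line, sizeof(line), f))
                          fputs(line, stdout);
                  fclose(f);
          }
          return 0;
  }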

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <[email protected]>
---
Documentation/scheduler/sched-bwc.txt | 29 +++++++++++---
kernel/sched/fair.c | 71 +++--------------------------------
kernel/sched/sched.h | 4 --
3 files changed, 29 insertions(+), 75 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
index f6b1873..4ded8ae 100644
--- a/Documentation/scheduler/sched-bwc.txt
+++ b/Documentation/scheduler/sched-bwc.txt
@@ -8,16 +8,33 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
specification of the maximum CPU bandwidth available to a group or hierarchy.

The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time. When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time. When the CPU bandwidth consumption of a group
+exceeds this limit (for that period), the tasks belonging to its hierarchy will
+be throttled and are not allowed to run again until the next period.

A group's unused runtime is globally tracked, being refreshed with quota units
above at each period boundary. As threads consume this bandwidth it is
transferred to cpu-local "silos" on a demand basis. The amount transferred
-within each of these updates is tunable and described as the "slice".
+within each of these updates is tunable and described as the "slice". Slices
+that are allocated to cpu-local silos do not expire at the end of the period,
+but unallocated quota does. This doesn't affect cpu-bound applications as they
+by definition consume all of their bandwidth in each period.
+
+However, for highly-threaded, user-interactive/non-cpu bound applications this
+non-expiration nuance allows applications to burst past their quota limits by
+the amount of unused slice per cpu that the task group is running on.
+This slight burst requires that quota had gone unused in previous periods.
+Additionally, this burst amount is limited to the size of a slice for every cpu
+a task group is run on. As a result, this mechanism still strictly limits the
+task group to its quota on average over longer time windows. This provides a
+better, more predictable user experience for highly threaded applications with
+small quota limits on high core count machines. It also eliminates the
+propensity to throttle these applications while simultaneously using less than
+quota amounts of cpu. Another way to say this is that by allowing the unused
+portion of a slice to be used in following periods we have decreased the
+possibility of wasting unused quota on cpu-local silos that don't need much cpu
+time.

Management
----------
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..a675c69 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)

now = sched_clock_cpu(smp_processor_id());
cfs_b->runtime = cfs_b->quota;
- cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
- cfs_b->expires_seq++;
}

static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
struct task_group *tg = cfs_rq->tg;
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
- u64 amount = 0, min_amount, expires;
- int expires_seq;
+ u64 amount = 0, min_amount;

/* note: this is a positive sum as runtime_remaining <= 0 */
min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
cfs_b->idle = 0;
}
}
- expires_seq = cfs_b->expires_seq;
- expires = cfs_b->runtime_expires;
raw_spin_unlock(&cfs_b->lock);

cfs_rq->runtime_remaining += amount;
- /*
- * we may have advanced our local expiration to account for allowed
- * spread between our sched_clock and the one on which runtime was
- * issued.
- */
- if (cfs_rq->expires_seq != expires_seq) {
- cfs_rq->expires_seq = expires_seq;
- cfs_rq->runtime_expires = expires;
- }

return cfs_rq->runtime_remaining > 0;
}

-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
- struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
- /* if the deadline is ahead of our clock, nothing to do */
- if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
- return;
-
- if (cfs_rq->runtime_remaining < 0)
- return;
-
- /*
- * If the local deadline has passed we have to consider the
- * possibility that our sched_clock is 'fast' and the global deadline
- * has not truly expired.
- *
- * Fortunately we can check determine whether this the case by checking
- * whether the global deadline(cfs_b->expires_seq) has advanced.
- */
- if (cfs_rq->expires_seq == cfs_b->expires_seq) {
- /* extend local deadline, drift is bounded above by 2 ticks */
- cfs_rq->runtime_expires += TICK_NSEC;
- } else {
- /* global deadline is ahead, expiration has passed */
- cfs_rq->runtime_remaining = 0;
- }
-}
-
static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
{
/* dock delta_exec before expiring quota (as it could span periods) */
cfs_rq->runtime_remaining -= delta_exec;
- expire_cfs_rq_runtime(cfs_rq);

if (likely(cfs_rq->runtime_remaining > 0))
return;
@@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
resched_curr(rq);
}

-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
- u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
{
struct cfs_rq *cfs_rq;
u64 runtime;
@@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
remaining -= runtime;

cfs_rq->runtime_remaining += runtime;
- cfs_rq->runtime_expires = expires;

/* we check whether we're throttled above */
if (cfs_rq->runtime_remaining > 0)
@@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
*/
static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
{
- u64 runtime, runtime_expires;
+ u64 runtime;
int throttled;

/* no need to continue the timer with no bandwidth constraint */
@@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
/* account preceding periods in which throttling occurred */
cfs_b->nr_throttled += overrun;

- runtime_expires = cfs_b->runtime_expires;
-
/*
* This check is repeated as we are holding onto the new bandwidth while
* we unthrottle. This can potentially race with an unthrottled group
@@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
cfs_b->distribute_running = 1;
raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
/* we can't nest cfs_b->lock while distributing bandwidth */
- runtime = distribute_cfs_runtime(cfs_b, runtime,
- runtime_expires);
+ runtime = distribute_cfs_runtime(cfs_b, runtime);
raw_spin_lock_irqsave(&cfs_b->lock, flags);

cfs_b->distribute_running = 0;
@@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
return;

raw_spin_lock(&cfs_b->lock);
- if (cfs_b->quota != RUNTIME_INF &&
- cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+ if (cfs_b->quota != RUNTIME_INF) {
cfs_b->runtime += slack_runtime;

/* we are under rq->lock, defer unthrottling using a timer */
@@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
{
u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
unsigned long flags;
- u64 expires;

/* confirm we're still not at a refresh boundary */
raw_spin_lock_irqsave(&cfs_b->lock, flags);
@@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
runtime = cfs_b->runtime;

- expires = cfs_b->runtime_expires;
if (runtime)
cfs_b->distribute_running = 1;

@@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
if (!runtime)
return;

- runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+ runtime = distribute_cfs_runtime(cfs_b, runtime);

raw_spin_lock_irqsave(&cfs_b->lock, flags);
- if (expires == cfs_b->runtime_expires)
- lsub_positive(&cfs_b->runtime, runtime);
cfs_b->distribute_running = 0;
raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
}
@@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)

cfs_b->period_active = 1;
overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
- cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
- cfs_b->expires_seq++;
hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..0c0ed23 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -341,8 +341,6 @@ struct cfs_bandwidth {
u64 quota;
u64 runtime;
s64 hierarchical_quota;
- u64 runtime_expires;
- int expires_seq;

short idle;
short period_active;
@@ -562,8 +560,6 @@ struct cfs_rq {

#ifdef CONFIG_CFS_BANDWIDTH
int runtime_enabled;
- int expires_seq;
- u64 runtime_expires;
s64 runtime_remaining;

u64 throttled_clock;
--
1.8.3.1


2019-05-23 18:46:16

by Dave Chiluk

Subject: [PATCH v2 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

Changelog v2
- Fixed some checkpatch errors in the commit message.

2019-05-23 18:47:29

by Dave Chiluk

Subject: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

It has been observed that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to threads being allocated per-cpu bandwidth
slices, and then not fully using those slices within the period, at
which point the slice and the quota it represents expire. This
expiration of unused slices results in applications not being able to
utilize the quota they are allocated.

The expiration of per-cpu slices was recently fixed by
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this had been broken since
at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")', which was introduced in v3.16-rc1 in 2014. That
commit added the following conditional, which resulted in slices never
being expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
/* extend local deadline, drift is bounded above by 2 ticks */
cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has only recently been
fixed, the effects are now being noticed by many users running
kubernetes (https://github.com/kubernetes/kubernetes/issues/67577). It
is my opinion that the mechanisms around expiring runtime should be
removed altogether.

This change allows only per-cpu slices to live longer than the period
boundary. Threads on runqueues that do not use much CPU can therefore
continue to use their remaining slice over a longer period of time than
cpu.cfs_period_us. In turn, this helps prevent the above condition of
hitting throttling while also not fully utilizing the cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining per-cpu slice that was left unused in the previous period.
For CPU-bound tasks this will change nothing, as they should
theoretically fully utilize all of their quota and slices in each
period. For the user-interactive tasks described above this provides a
much better user/application experience, as their cpu utilization will
more closely match the amount they requested when they hit throttling.

This greatly improves the performance of high-thread-count, non-cpu
bound applications with a low cfs_quota_us allocation on high-core-count
machines. With an artificial testcase, this has been observed to yield
almost a 30x performance improvement, while still maintaining correct
cpu quota restrictions, albeit over longer time intervals than
cpu.cfs_period_us. That testcase is available at
https://github.com/indeedeng/fibtest.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <[email protected]>
---
Documentation/scheduler/sched-bwc.txt | 29 +++++++++++---
kernel/sched/fair.c | 71 +++--------------------------------
kernel/sched/sched.h | 4 --
3 files changed, 29 insertions(+), 75 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
index f6b1873..4ded8ae 100644
--- a/Documentation/scheduler/sched-bwc.txt
+++ b/Documentation/scheduler/sched-bwc.txt
@@ -8,16 +8,33 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
specification of the maximum CPU bandwidth available to a group or hierarchy.

The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time. When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time. When the CPU bandwidth consumption of a group
+exceeds this limit (for that period), the tasks belonging to its hierarchy will
+be throttled and are not allowed to run again until the next period.

A group's unused runtime is globally tracked, being refreshed with quota units
above at each period boundary. As threads consume this bandwidth it is
transferred to cpu-local "silos" on a demand basis. The amount transferred
-within each of these updates is tunable and described as the "slice".
+within each of these updates is tunable and described as the "slice". Slices
+that are allocated to cpu-local silos do not expire at the end of the period,
+but unallocated quota does. This doesn't affect cpu-bound applications as they
+by definition consume all of their bandwidth in each period.
+
+However, for highly-threaded, user-interactive/non-cpu bound applications this
+non-expiration nuance allows applications to burst past their quota limits by
+the amount of unused slice per cpu that the task group is running on.
+This slight burst requires that quota had gone unused in previous periods.
+Additionally, this burst amount is limited to the size of a slice for every cpu
+a task group is run on. As a result, this mechanism still strictly limits the
+task group to its quota on average over longer time windows. This provides a
+better, more predictable user experience for highly threaded applications with
+small quota limits on high core count machines. It also eliminates the
+propensity to throttle these applications while simultaneously using less than
+quota amounts of cpu. Another way to say this is that by allowing the unused
+portion of a slice to be used in following periods we have decreased the
+possibility of wasting unused quota on cpu-local silos that don't need much cpu
+time.

Management
----------
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..a675c69 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)

now = sched_clock_cpu(smp_processor_id());
cfs_b->runtime = cfs_b->quota;
- cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
- cfs_b->expires_seq++;
}

static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
struct task_group *tg = cfs_rq->tg;
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
- u64 amount = 0, min_amount, expires;
- int expires_seq;
+ u64 amount = 0, min_amount;

/* note: this is a positive sum as runtime_remaining <= 0 */
min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
cfs_b->idle = 0;
}
}
- expires_seq = cfs_b->expires_seq;
- expires = cfs_b->runtime_expires;
raw_spin_unlock(&cfs_b->lock);

cfs_rq->runtime_remaining += amount;
- /*
- * we may have advanced our local expiration to account for allowed
- * spread between our sched_clock and the one on which runtime was
- * issued.
- */
- if (cfs_rq->expires_seq != expires_seq) {
- cfs_rq->expires_seq = expires_seq;
- cfs_rq->runtime_expires = expires;
- }

return cfs_rq->runtime_remaining > 0;
}

-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
- struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
- /* if the deadline is ahead of our clock, nothing to do */
- if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
- return;
-
- if (cfs_rq->runtime_remaining < 0)
- return;
-
- /*
- * If the local deadline has passed we have to consider the
- * possibility that our sched_clock is 'fast' and the global deadline
- * has not truly expired.
- *
- * Fortunately we can check determine whether this the case by checking
- * whether the global deadline(cfs_b->expires_seq) has advanced.
- */
- if (cfs_rq->expires_seq == cfs_b->expires_seq) {
- /* extend local deadline, drift is bounded above by 2 ticks */
- cfs_rq->runtime_expires += TICK_NSEC;
- } else {
- /* global deadline is ahead, expiration has passed */
- cfs_rq->runtime_remaining = 0;
- }
-}
-
static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
{
/* dock delta_exec before expiring quota (as it could span periods) */
cfs_rq->runtime_remaining -= delta_exec;
- expire_cfs_rq_runtime(cfs_rq);

if (likely(cfs_rq->runtime_remaining > 0))
return;
@@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
resched_curr(rq);
}

-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
- u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
{
struct cfs_rq *cfs_rq;
u64 runtime;
@@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
remaining -= runtime;

cfs_rq->runtime_remaining += runtime;
- cfs_rq->runtime_expires = expires;

/* we check whether we're throttled above */
if (cfs_rq->runtime_remaining > 0)
@@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
*/
static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
{
- u64 runtime, runtime_expires;
+ u64 runtime;
int throttled;

/* no need to continue the timer with no bandwidth constraint */
@@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
/* account preceding periods in which throttling occurred */
cfs_b->nr_throttled += overrun;

- runtime_expires = cfs_b->runtime_expires;
-
/*
* This check is repeated as we are holding onto the new bandwidth while
* we unthrottle. This can potentially race with an unthrottled group
@@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
cfs_b->distribute_running = 1;
raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
/* we can't nest cfs_b->lock while distributing bandwidth */
- runtime = distribute_cfs_runtime(cfs_b, runtime,
- runtime_expires);
+ runtime = distribute_cfs_runtime(cfs_b, runtime);
raw_spin_lock_irqsave(&cfs_b->lock, flags);

cfs_b->distribute_running = 0;
@@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
return;

raw_spin_lock(&cfs_b->lock);
- if (cfs_b->quota != RUNTIME_INF &&
- cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+ if (cfs_b->quota != RUNTIME_INF) {
cfs_b->runtime += slack_runtime;

/* we are under rq->lock, defer unthrottling using a timer */
@@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
{
u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
unsigned long flags;
- u64 expires;

/* confirm we're still not at a refresh boundary */
raw_spin_lock_irqsave(&cfs_b->lock, flags);
@@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
runtime = cfs_b->runtime;

- expires = cfs_b->runtime_expires;
if (runtime)
cfs_b->distribute_running = 1;

@@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
if (!runtime)
return;

- runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+ runtime = distribute_cfs_runtime(cfs_b, runtime);

raw_spin_lock_irqsave(&cfs_b->lock, flags);
- if (expires == cfs_b->runtime_expires)
- lsub_positive(&cfs_b->runtime, runtime);
cfs_b->distribute_running = 0;
raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
}
@@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)

cfs_b->period_active = 1;
overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
- cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
- cfs_b->expires_seq++;
hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..0c0ed23 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -341,8 +341,6 @@ struct cfs_bandwidth {
u64 quota;
u64 runtime;
s64 hierarchical_quota;
- u64 runtime_expires;
- int expires_seq;

short idle;
short period_active;
@@ -562,8 +560,6 @@ struct cfs_rq {

#ifdef CONFIG_CFS_BANDWIDTH
int runtime_enabled;
- int expires_seq;
- u64 runtime_expires;
s64 runtime_remaining;

u64 throttled_clock;
--
1.8.3.1

2019-05-23 21:04:02

by Peter Oskolkov

Subject: Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

On Thu, May 23, 2019 at 11:44 AM Dave Chiluk <[email protected]> wrote:
>
> It has been observed, that highly-threaded, non-cpu-bound applications
> running under cpu.cfs_quota_us constraints can hit a high percentage of
> periods throttled while simultaneously not consuming the allocated
> amount of quota. This use case is typical of user-interactive non-cpu
> bound applications, such as those running in kubernetes or mesos when
> run on multiple cpu cores.
>
> This has been root caused to threads being allocated per cpu bandwidth
> slices, and then not fully using that slice within the period. At which
> point the slice and quota expires. This expiration of unused slice
> results in applications not being able to utilize the quota for which
> they are allocated.
>
> The expiration of per-cpu slices was recently fixed by
> 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> condition")'. Prior to that it appears that this has been broken since
> at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
> added the following conditional which resulted in slices never being
> expired.
>
> if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> /* extend local deadline, drift is bounded above by 2 ticks */
> cfs_rq->runtime_expires += TICK_NSEC;
>
> Because this was broken for nearly 5 years, and has recently been fixed
> and is now being noticed by many users running kubernetes
> (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
> that the mechanisms around expiring runtime should be removed
> altogether.
>
> This allows only per-cpu slices to live longer than the period boundary.
> This allows threads on runqueues that do not use much CPU to continue to
> use their remaining slice over a longer period of time than
> cpu.cfs_period_us. However, this helps prevents the above condition of
> hitting throttling while also not fully utilizing your cpu quota.
>
> This theoretically allows a machine to use slightly more than it's
> allotted quota in some periods. This overflow would be bounded by the
> remaining per-cpu slice that was left un-used in the previous period.
> For CPU bound tasks this will change nothing, as they should
> theoretically fully utilize all of their quota and slices in each
> period. For user-interactive tasks as described above this provides a
> much better user/application experience as their cpu utilization will
> more closely match the amount they requested when they hit throttling.
>
> This greatly improves performance of high-thread-count, non-cpu bound
> applications with low cfs_quota_us allocation on high-core-count
> machines. In the case of an artificial testcase, this performance
> discrepancy has been observed to be almost 30x performance improvement,
> while still maintaining correct cpu quota restrictions albeit over
> longer time intervals than cpu.cfs_period_us.

If the machine runs at/close to capacity, won't the overallocation
of the quota to bursty tasks necessarily negatively impact every other
task? Should the "unused" quota be available only on idle CPUs?
(Or maybe this is the behavior achieved here, and only the comment and
the commit message should be fixed...)

> That testcase is
> available at https://github.com/indeedeng/fibtest.
>
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Signed-off-by: Dave Chiluk <[email protected]>
> ---
> Documentation/scheduler/sched-bwc.txt | 29 +++++++++++---
> kernel/sched/fair.c | 71 +++--------------------------------
> kernel/sched/sched.h | 4 --
> 3 files changed, 29 insertions(+), 75 deletions(-)
>
> diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
> index f6b1873..4ded8ae 100644
> --- a/Documentation/scheduler/sched-bwc.txt
> +++ b/Documentation/scheduler/sched-bwc.txt
> @@ -8,16 +8,33 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
> specification of the maximum CPU bandwidth available to a group or hierarchy.
>
> The bandwidth allowed for a group is specified using a quota and period. Within
> -each given "period" (microseconds), a group is allowed to consume only up to
> -"quota" microseconds of CPU time. When the CPU bandwidth consumption of a
> -group exceeds this limit (for that period), the tasks belonging to its
> -hierarchy will be throttled and are not allowed to run again until the next
> -period.
> +each given "period" (microseconds), a task group is allocated up to "quota"
> +microseconds of CPU time. When the CPU bandwidth consumption of a group
> +exceeds this limit (for that period), the tasks belonging to its hierarchy will
> +be throttled and are not allowed to run again until the next period.
>
> A group's unused runtime is globally tracked, being refreshed with quota units
> above at each period boundary. As threads consume this bandwidth it is
> transferred to cpu-local "silos" on a demand basis. The amount transferred
> -within each of these updates is tunable and described as the "slice".
> +within each of these updates is tunable and described as the "slice". Slices
> +that are allocated to cpu-local silos do not expire at the end of the period,
> +but unallocated quota does. This doesn't affect cpu-bound applications as they
> +by definition consume all of their bandwidth in each each period.
> +
> +However for highly-threaded user-interactive/non-cpu bound applications this
> +non-expiration nuance allows applications to burst past their quota limits
> +equal to the amount of unused slice per cpu that the task group is running on.
> +This slight burst requires that quota had gone unused in previous periods.
> +Additionally this burst amount is limited to the size of a slice for every cpu
> +a task group is run on. As a result, this mechanism still strictly limits the
> +task group to quota average usage over a longer time windows. This provides
> +better more predictable user experience for highly threaded applications with
> +small quota limits on high core count machines. It also eliminates the
> +propensity to throttle these applications while simultanously using less than
> +quota amounts of cpu. Another way to say this, is that by allowing the unused
> +portion of a slice to be used in following periods we have decreased the
> +possibility of wasting unused quota on cpu-local silos that don't need much cpu
> +time.
>
> Management
> ----------
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f..a675c69 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>
> now = sched_clock_cpu(smp_processor_id());
> cfs_b->runtime = cfs_b->quota;
> - cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
> - cfs_b->expires_seq++;
> }
>
> static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> @@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> {
> struct task_group *tg = cfs_rq->tg;
> struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> - u64 amount = 0, min_amount, expires;
> - int expires_seq;
> + u64 amount = 0, min_amount;
>
> /* note: this is a positive sum as runtime_remaining <= 0 */
> min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
> @@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> cfs_b->idle = 0;
> }
> }
> - expires_seq = cfs_b->expires_seq;
> - expires = cfs_b->runtime_expires;
> raw_spin_unlock(&cfs_b->lock);
>
> cfs_rq->runtime_remaining += amount;
> - /*
> - * we may have advanced our local expiration to account for allowed
> - * spread between our sched_clock and the one on which runtime was
> - * issued.
> - */
> - if (cfs_rq->expires_seq != expires_seq) {
> - cfs_rq->expires_seq = expires_seq;
> - cfs_rq->runtime_expires = expires;
> - }
>
> return cfs_rq->runtime_remaining > 0;
> }
>
> -/*
> - * Note: This depends on the synchronization provided by sched_clock and the
> - * fact that rq->clock snapshots this value.
> - */
> -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> -{
> - struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> -
> - /* if the deadline is ahead of our clock, nothing to do */
> - if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
> - return;
> -
> - if (cfs_rq->runtime_remaining < 0)
> - return;
> -
> - /*
> - * If the local deadline has passed we have to consider the
> - * possibility that our sched_clock is 'fast' and the global deadline
> - * has not truly expired.
> - *
> - * Fortunately we can check determine whether this the case by checking
> - * whether the global deadline(cfs_b->expires_seq) has advanced.
> - */
> - if (cfs_rq->expires_seq == cfs_b->expires_seq) {
> - /* extend local deadline, drift is bounded above by 2 ticks */
> - cfs_rq->runtime_expires += TICK_NSEC;
> - } else {
> - /* global deadline is ahead, expiration has passed */
> - cfs_rq->runtime_remaining = 0;
> - }
> -}
> -
> static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
> {
> /* dock delta_exec before expiring quota (as it could span periods) */
> cfs_rq->runtime_remaining -= delta_exec;
> - expire_cfs_rq_runtime(cfs_rq);
>
> if (likely(cfs_rq->runtime_remaining > 0))
> return;
> @@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> resched_curr(rq);
> }
>
> -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> - u64 remaining, u64 expires)
> +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
> {
> struct cfs_rq *cfs_rq;
> u64 runtime;
> @@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> remaining -= runtime;
>
> cfs_rq->runtime_remaining += runtime;
> - cfs_rq->runtime_expires = expires;
>
> /* we check whether we're throttled above */
> if (cfs_rq->runtime_remaining > 0)
> @@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> */
> static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
> {
> - u64 runtime, runtime_expires;
> + u64 runtime;
> int throttled;
>
> /* no need to continue the timer with no bandwidth constraint */
> @@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
> /* account preceding periods in which throttling occurred */
> cfs_b->nr_throttled += overrun;
>
> - runtime_expires = cfs_b->runtime_expires;
> -
> /*
> * This check is repeated as we are holding onto the new bandwidth while
> * we unthrottle. This can potentially race with an unthrottled group
> @@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
> cfs_b->distribute_running = 1;
> raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> /* we can't nest cfs_b->lock while distributing bandwidth */
> - runtime = distribute_cfs_runtime(cfs_b, runtime,
> - runtime_expires);
> + runtime = distribute_cfs_runtime(cfs_b, runtime);
> raw_spin_lock_irqsave(&cfs_b->lock, flags);
>
> cfs_b->distribute_running = 0;
> @@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> return;
>
> raw_spin_lock(&cfs_b->lock);
> - if (cfs_b->quota != RUNTIME_INF &&
> - cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> + if (cfs_b->quota != RUNTIME_INF) {
> cfs_b->runtime += slack_runtime;
>
> /* we are under rq->lock, defer unthrottling using a timer */
> @@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> {
> u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
> unsigned long flags;
> - u64 expires;
>
> /* confirm we're still not at a refresh boundary */
> raw_spin_lock_irqsave(&cfs_b->lock, flags);
> @@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
> runtime = cfs_b->runtime;
>
> - expires = cfs_b->runtime_expires;
> if (runtime)
> cfs_b->distribute_running = 1;
>
> @@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> if (!runtime)
> return;
>
> - runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
> + runtime = distribute_cfs_runtime(cfs_b, runtime);
>
> raw_spin_lock_irqsave(&cfs_b->lock, flags);
> - if (expires == cfs_b->runtime_expires)
> - lsub_positive(&cfs_b->runtime, runtime);
> cfs_b->distribute_running = 0;
> raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> }
> @@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>
> cfs_b->period_active = 1;
> overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> - cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
> - cfs_b->expires_seq++;
> hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b52ed1a..0c0ed23 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -341,8 +341,6 @@ struct cfs_bandwidth {
> u64 quota;
> u64 runtime;
> s64 hierarchical_quota;
> - u64 runtime_expires;
> - int expires_seq;
>
> short idle;
> short period_active;
> @@ -562,8 +560,6 @@ struct cfs_rq {
>
> #ifdef CONFIG_CFS_BANDWIDTH
> int runtime_enabled;
> - int expires_seq;
> - u64 runtime_expires;
> s64 runtime_remaining;
>
> u64 throttled_clock;
> --
> 1.8.3.1
>

2019-05-24 08:59:49

by Peter Zijlstra

Subject: Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices


(it always helps to Cc the people who actually wrote the code)

Ben, can you have a look at this?

On Thu, May 23, 2019 at 01:44:47PM -0500, Dave Chiluk wrote:
> It has been observed, that highly-threaded, non-cpu-bound applications
> running under cpu.cfs_quota_us constraints can hit a high percentage of
> periods throttled while simultaneously not consuming the allocated
> amount of quota. This use case is typical of user-interactive non-cpu
> bound applications, such as those running in kubernetes or mesos when
> run on multiple cpu cores.
>
> This has been root caused to threads being allocated per cpu bandwidth
> slices, and then not fully using that slice within the period. At which
> point the slice and quota expires. This expiration of unused slice
> results in applications not being able to utilize the quota for which
> they are allocated.
>
> The expiration of per-cpu slices was recently fixed by
> 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> condition")'. Prior to that it appears that this has been broken since
> at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
> added the following conditional which resulted in slices never being
> expired.
>
> if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> /* extend local deadline, drift is bounded above by 2 ticks */
> cfs_rq->runtime_expires += TICK_NSEC;
>
> Because this was broken for nearly 5 years, and has recently been fixed
> and is now being noticed by many users running kubernetes
> (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
> that the mechanisms around expiring runtime should be removed
> altogether.
>
> This allows only per-cpu slices to live longer than the period boundary.
> This allows threads on runqueues that do not use much CPU to continue to
> use their remaining slice over a longer period of time than
> cpu.cfs_period_us. However, this helps prevents the above condition of
> hitting throttling while also not fully utilizing your cpu quota.
>
> This theoretically allows a machine to use slightly more than it's
> allotted quota in some periods. This overflow would be bounded by the
> remaining per-cpu slice that was left un-used in the previous period.
> For CPU bound tasks this will change nothing, as they should
> theoretically fully utilize all of their quota and slices in each
> period. For user-interactive tasks as described above this provides a
> much better user/application experience as their cpu utilization will
> more closely match the amount they requested when they hit throttling.
>
> This greatly improves performance of high-thread-count, non-cpu bound
> applications with low cfs_quota_us allocation on high-core-count
> machines. In the case of an artificial testcase, this performance
> discrepancy has been observed to be almost 30x performance improvement,
> while still maintaining correct cpu quota restrictions albeit over
> longer time intervals than cpu.cfs_period_us. That testcase is
> available at https://github.com/indeedeng/fibtest.
>
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Signed-off-by: Dave Chiluk <[email protected]>
> ---
> Documentation/scheduler/sched-bwc.txt | 29 +++++++++++---
> kernel/sched/fair.c | 71 +++--------------------------------
> kernel/sched/sched.h | 4 --
> 3 files changed, 29 insertions(+), 75 deletions(-)
>
> diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
> index f6b1873..4ded8ae 100644
> --- a/Documentation/scheduler/sched-bwc.txt
> +++ b/Documentation/scheduler/sched-bwc.txt
> @@ -8,16 +8,33 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
> specification of the maximum CPU bandwidth available to a group or hierarchy.
>
> The bandwidth allowed for a group is specified using a quota and period. Within
> -each given "period" (microseconds), a group is allowed to consume only up to
> -"quota" microseconds of CPU time. When the CPU bandwidth consumption of a
> -group exceeds this limit (for that period), the tasks belonging to its
> -hierarchy will be throttled and are not allowed to run again until the next
> -period.
> +each given "period" (microseconds), a task group is allocated up to "quota"
> +microseconds of CPU time. When the CPU bandwidth consumption of a group
> +exceeds this limit (for that period), the tasks belonging to its hierarchy will
> +be throttled and are not allowed to run again until the next period.
>
> A group's unused runtime is globally tracked, being refreshed with quota units
> above at each period boundary. As threads consume this bandwidth it is
> transferred to cpu-local "silos" on a demand basis. The amount transferred
> -within each of these updates is tunable and described as the "slice".
> +within each of these updates is tunable and described as the "slice". Slices
> +that are allocated to cpu-local silos do not expire at the end of the period,
> +but unallocated quota does. This doesn't affect cpu-bound applications as they
> +by definition consume all of their bandwidth in each each period.
> +
> +However for highly-threaded user-interactive/non-cpu bound applications this
> +non-expiration nuance allows applications to burst past their quota limits
> +equal to the amount of unused slice per cpu that the task group is running on.
> +This slight burst requires that quota had gone unused in previous periods.
> +Additionally this burst amount is limited to the size of a slice for every cpu
> +a task group is run on. As a result, this mechanism still strictly limits the
> +task group to quota average usage over a longer time windows. This provides
> +better more predictable user experience for highly threaded applications with
> +small quota limits on high core count machines. It also eliminates the
> +propensity to throttle these applications while simultanously using less than
> +quota amounts of cpu. Another way to say this, is that by allowing the unused
> +portion of a slice to be used in following periods we have decreased the
> +possibility of wasting unused quota on cpu-local silos that don't need much cpu
> +time.
>
> Management
> ----------
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f..a675c69 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>
> now = sched_clock_cpu(smp_processor_id());
> cfs_b->runtime = cfs_b->quota;
> - cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
> - cfs_b->expires_seq++;
> }
>
> static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> @@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> {
> struct task_group *tg = cfs_rq->tg;
> struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> - u64 amount = 0, min_amount, expires;
> - int expires_seq;
> + u64 amount = 0, min_amount;
>
> /* note: this is a positive sum as runtime_remaining <= 0 */
> min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
> @@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> cfs_b->idle = 0;
> }
> }
> - expires_seq = cfs_b->expires_seq;
> - expires = cfs_b->runtime_expires;
> raw_spin_unlock(&cfs_b->lock);
>
> cfs_rq->runtime_remaining += amount;
> - /*
> - * we may have advanced our local expiration to account for allowed
> - * spread between our sched_clock and the one on which runtime was
> - * issued.
> - */
> - if (cfs_rq->expires_seq != expires_seq) {
> - cfs_rq->expires_seq = expires_seq;
> - cfs_rq->runtime_expires = expires;
> - }
>
> return cfs_rq->runtime_remaining > 0;
> }
>
> -/*
> - * Note: This depends on the synchronization provided by sched_clock and the
> - * fact that rq->clock snapshots this value.
> - */
> -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> -{
> - struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> -
> - /* if the deadline is ahead of our clock, nothing to do */
> - if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
> - return;
> -
> - if (cfs_rq->runtime_remaining < 0)
> - return;
> -
> - /*
> - * If the local deadline has passed we have to consider the
> - * possibility that our sched_clock is 'fast' and the global deadline
> - * has not truly expired.
> - *
> - * Fortunately we can check determine whether this the case by checking
> - * whether the global deadline(cfs_b->expires_seq) has advanced.
> - */
> - if (cfs_rq->expires_seq == cfs_b->expires_seq) {
> - /* extend local deadline, drift is bounded above by 2 ticks */
> - cfs_rq->runtime_expires += TICK_NSEC;
> - } else {
> - /* global deadline is ahead, expiration has passed */
> - cfs_rq->runtime_remaining = 0;
> - }
> -}
> -
> static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
> {
> /* dock delta_exec before expiring quota (as it could span periods) */
> cfs_rq->runtime_remaining -= delta_exec;
> - expire_cfs_rq_runtime(cfs_rq);
>
> if (likely(cfs_rq->runtime_remaining > 0))
> return;
> @@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> resched_curr(rq);
> }
>
> -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> - u64 remaining, u64 expires)
> +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
> {
> struct cfs_rq *cfs_rq;
> u64 runtime;
> @@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> remaining -= runtime;
>
> cfs_rq->runtime_remaining += runtime;
> - cfs_rq->runtime_expires = expires;
>
> /* we check whether we're throttled above */
> if (cfs_rq->runtime_remaining > 0)
> @@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> */
> static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
> {
> - u64 runtime, runtime_expires;
> + u64 runtime;
> int throttled;
>
> /* no need to continue the timer with no bandwidth constraint */
> @@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
> /* account preceding periods in which throttling occurred */
> cfs_b->nr_throttled += overrun;
>
> - runtime_expires = cfs_b->runtime_expires;
> -
> /*
> * This check is repeated as we are holding onto the new bandwidth while
> * we unthrottle. This can potentially race with an unthrottled group
> @@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
> cfs_b->distribute_running = 1;
> raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> /* we can't nest cfs_b->lock while distributing bandwidth */
> - runtime = distribute_cfs_runtime(cfs_b, runtime,
> - runtime_expires);
> + runtime = distribute_cfs_runtime(cfs_b, runtime);
> raw_spin_lock_irqsave(&cfs_b->lock, flags);
>
> cfs_b->distribute_running = 0;
> @@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> return;
>
> raw_spin_lock(&cfs_b->lock);
> - if (cfs_b->quota != RUNTIME_INF &&
> - cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> + if (cfs_b->quota != RUNTIME_INF) {
> cfs_b->runtime += slack_runtime;
>
> /* we are under rq->lock, defer unthrottling using a timer */
> @@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> {
> u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
> unsigned long flags;
> - u64 expires;
>
> /* confirm we're still not at a refresh boundary */
> raw_spin_lock_irqsave(&cfs_b->lock, flags);
> @@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
> runtime = cfs_b->runtime;
>
> - expires = cfs_b->runtime_expires;
> if (runtime)
> cfs_b->distribute_running = 1;
>
> @@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> if (!runtime)
> return;
>
> - runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
> + runtime = distribute_cfs_runtime(cfs_b, runtime);
>
> raw_spin_lock_irqsave(&cfs_b->lock, flags);
> - if (expires == cfs_b->runtime_expires)
> - lsub_positive(&cfs_b->runtime, runtime);
> cfs_b->distribute_running = 0;
> raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> }
> @@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>
> cfs_b->period_active = 1;
> overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> - cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
> - cfs_b->expires_seq++;
> hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b52ed1a..0c0ed23 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -341,8 +341,6 @@ struct cfs_bandwidth {
> u64 quota;
> u64 runtime;
> s64 hierarchical_quota;
> - u64 runtime_expires;
> - int expires_seq;
>
> short idle;
> short period_active;
> @@ -562,8 +560,6 @@ struct cfs_rq {
>
> #ifdef CONFIG_CFS_BANDWIDTH
> int runtime_enabled;
> - int expires_seq;
> - u64 runtime_expires;
> s64 runtime_remaining;
>
> u64 throttled_clock;
> --
> 1.8.3.1
>

2019-05-24 14:35:14

by Phil Auld

Subject: Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:
> On Thu, May 23, 2019 at 11:44 AM Dave Chiluk <[email protected]> wrote:
> >
> > It has been observed, that highly-threaded, non-cpu-bound applications
> > running under cpu.cfs_quota_us constraints can hit a high percentage of
> > periods throttled while simultaneously not consuming the allocated
> > amount of quota. This use case is typical of user-interactive non-cpu
> > bound applications, such as those running in kubernetes or mesos when
> > run on multiple cpu cores.
> >
> > This has been root caused to threads being allocated per cpu bandwidth
> > slices, and then not fully using that slice within the period. At which
> > point the slice and quota expires. This expiration of unused slice
> > results in applications not being able to utilize the quota for which
> > they are allocated.
> >
> > The expiration of per-cpu slices was recently fixed by
> > 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> > condition")'. Prior to that it appears that this has been broken since
> > at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> > cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
> > added the following conditional which resulted in slices never being
> > expired.
> >
> > if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> > /* extend local deadline, drift is bounded above by 2 ticks */
> > cfs_rq->runtime_expires += TICK_NSEC;
> >
> > Because this was broken for nearly 5 years, and has recently been fixed
> > and is now being noticed by many users running kubernetes
> > (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
> > that the mechanisms around expiring runtime should be removed
> > altogether.
> >
> > This allows only per-cpu slices to live longer than the period boundary.
> > This allows threads on runqueues that do not use much CPU to continue to
> > use their remaining slice over a longer period of time than
> > cpu.cfs_period_us. However, this helps prevents the above condition of
> > hitting throttling while also not fully utilizing your cpu quota.
> >
> > This theoretically allows a machine to use slightly more than it's
> > allotted quota in some periods. This overflow would be bounded by the
> > remaining per-cpu slice that was left un-used in the previous period.
> > For CPU bound tasks this will change nothing, as they should
> > theoretically fully utilize all of their quota and slices in each
> > period. For user-interactive tasks as described above this provides a
> > much better user/application experience as their cpu utilization will
> > more closely match the amount they requested when they hit throttling.
> >
> > This greatly improves performance of high-thread-count, non-cpu bound
> > applications with low cfs_quota_us allocation on high-core-count
> > machines. In the case of an artificial testcase, this performance
> > discrepancy has been observed to be almost 30x performance improvement,
> > while still maintaining correct cpu quota restrictions albeit over
> > longer time intervals than cpu.cfs_period_us.
>
> If the machine runs at/close to capacity, won't the overallocation
> of the quota to bursty tasks necessarily negatively impact every other
> task? Should the "unused" quota be available only on idle CPUs?
> (Or maybe this is the behavior achieved here, and only the comment and
> the commit message should be fixed...)
>

It's bounded by the amount left unused from the previous period. So
theoretically a process could use almost twice its quota. But then it
would have nothing left over in the next period. To repeat that burst it
would have to not use any quota in that next period. Over a longer number
of periods it's the same amount of CPU usage.
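
A tiny simulation of that carry-over argument, treating (as above)
whatever went unused as carried over on the cpu-local silos; all the
numbers here are assumptions:

  #include <stdio.h>

  int main(void)
  {
          const unsigned int quota_us = 10000;   /* per-period quota */
          unsigned int carry = 0, total = 0, periods = 8;

          for (unsigned int p = 0; p < periods; p++) {
                  /* odd periods are busy and also drain the carry-over */
                  unsigned int want  = (p & 1) ? quota_us + carry : 2000;
                  unsigned int avail = quota_us + carry;
                  unsigned int used  = want < avail ? want : avail;

                  carry  = avail - used;  /* leftover usable next period */
                  total += used;
                  printf("period %u: used %5uus (carry now %uus)\n", p, used, carry);
          }
          printf("average per period: %uus vs quota %uus\n",
                 total / periods, quota_us);
          return 0;
  }

The busy periods run over quota, but only by what the idle periods left
behind, so the long-run average still converges to the quota.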

I think that is more fair than throttling a process that has never used
its full quota.

And it removes complexity.

Cheers,
Phil

> > That testcase is
> > available at https://github.com/indeedeng/fibtest.
> >
> > Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> > Signed-off-by: Dave Chiluk <[email protected]>
> > ---
> > Documentation/scheduler/sched-bwc.txt | 29 +++++++++++---
> > kernel/sched/fair.c | 71 +++--------------------------------
> > kernel/sched/sched.h | 4 --
> > 3 files changed, 29 insertions(+), 75 deletions(-)
> >
> > diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
> > index f6b1873..4ded8ae 100644
> > --- a/Documentation/scheduler/sched-bwc.txt
> > +++ b/Documentation/scheduler/sched-bwc.txt
> > @@ -8,16 +8,33 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
> > specification of the maximum CPU bandwidth available to a group or hierarchy.
> >
> > The bandwidth allowed for a group is specified using a quota and period. Within
> > -each given "period" (microseconds), a group is allowed to consume only up to
> > -"quota" microseconds of CPU time. When the CPU bandwidth consumption of a
> > -group exceeds this limit (for that period), the tasks belonging to its
> > -hierarchy will be throttled and are not allowed to run again until the next
> > -period.
> > +each given "period" (microseconds), a task group is allocated up to "quota"
> > +microseconds of CPU time. When the CPU bandwidth consumption of a group
> > +exceeds this limit (for that period), the tasks belonging to its hierarchy will
> > +be throttled and are not allowed to run again until the next period.
> >
> > A group's unused runtime is globally tracked, being refreshed with quota units
> > above at each period boundary. As threads consume this bandwidth it is
> > transferred to cpu-local "silos" on a demand basis. The amount transferred
> > -within each of these updates is tunable and described as the "slice".
> > +within each of these updates is tunable and described as the "slice". Slices
> > +that are allocated to cpu-local silos do not expire at the end of the period,
> > +but unallocated quota does. This doesn't affect cpu-bound applications as they
> > +by definition consume all of their bandwidth in each each period.
> > +
> > +However for highly-threaded user-interactive/non-cpu bound applications this
> > +non-expiration nuance allows applications to burst past their quota limits
> > +equal to the amount of unused slice per cpu that the task group is running on.
> > +This slight burst requires that quota had gone unused in previous periods.
> > +Additionally this burst amount is limited to the size of a slice for every cpu
> > +a task group is run on. As a result, this mechanism still strictly limits the
> > +task group to quota average usage over a longer time windows. This provides
> > +better more predictable user experience for highly threaded applications with
> > +small quota limits on high core count machines. It also eliminates the
> > +propensity to throttle these applications while simultanously using less than
> > +quota amounts of cpu. Another way to say this, is that by allowing the unused
> > +portion of a slice to be used in following periods we have decreased the
> > +possibility of wasting unused quota on cpu-local silos that don't need much cpu
> > +time.
> >
> > Management
> > ----------
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index f35930f..a675c69 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
> >
> > now = sched_clock_cpu(smp_processor_id());
> > cfs_b->runtime = cfs_b->quota;
> > - cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
> > - cfs_b->expires_seq++;
> > }
> >
> > static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> > @@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> > {
> > struct task_group *tg = cfs_rq->tg;
> > struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> > - u64 amount = 0, min_amount, expires;
> > - int expires_seq;
> > + u64 amount = 0, min_amount;
> >
> > /* note: this is a positive sum as runtime_remaining <= 0 */
> > min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
> > @@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> > cfs_b->idle = 0;
> > }
> > }
> > - expires_seq = cfs_b->expires_seq;
> > - expires = cfs_b->runtime_expires;
> > raw_spin_unlock(&cfs_b->lock);
> >
> > cfs_rq->runtime_remaining += amount;
> > - /*
> > - * we may have advanced our local expiration to account for allowed
> > - * spread between our sched_clock and the one on which runtime was
> > - * issued.
> > - */
> > - if (cfs_rq->expires_seq != expires_seq) {
> > - cfs_rq->expires_seq = expires_seq;
> > - cfs_rq->runtime_expires = expires;
> > - }
> >
> > return cfs_rq->runtime_remaining > 0;
> > }
> >
> > -/*
> > - * Note: This depends on the synchronization provided by sched_clock and the
> > - * fact that rq->clock snapshots this value.
> > - */
> > -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> > -{
> > - struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> > -
> > - /* if the deadline is ahead of our clock, nothing to do */
> > - if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
> > - return;
> > -
> > - if (cfs_rq->runtime_remaining < 0)
> > - return;
> > -
> > - /*
> > - * If the local deadline has passed we have to consider the
> > - * possibility that our sched_clock is 'fast' and the global deadline
> > - * has not truly expired.
> > - *
> > - * Fortunately we can check determine whether this the case by checking
> > - * whether the global deadline(cfs_b->expires_seq) has advanced.
> > - */
> > - if (cfs_rq->expires_seq == cfs_b->expires_seq) {
> > - /* extend local deadline, drift is bounded above by 2 ticks */
> > - cfs_rq->runtime_expires += TICK_NSEC;
> > - } else {
> > - /* global deadline is ahead, expiration has passed */
> > - cfs_rq->runtime_remaining = 0;
> > - }
> > -}
> > -
> > static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
> > {
> > /* dock delta_exec before expiring quota (as it could span periods) */
> > cfs_rq->runtime_remaining -= delta_exec;
> > - expire_cfs_rq_runtime(cfs_rq);
> >
> > if (likely(cfs_rq->runtime_remaining > 0))
> > return;
> > @@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> > resched_curr(rq);
> > }
> >
> > -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> > - u64 remaining, u64 expires)
> > +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
> > {
> > struct cfs_rq *cfs_rq;
> > u64 runtime;
> > @@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> > remaining -= runtime;
> >
> > cfs_rq->runtime_remaining += runtime;
> > - cfs_rq->runtime_expires = expires;
> >
> > /* we check whether we're throttled above */
> > if (cfs_rq->runtime_remaining > 0)
> > @@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> > */
> > static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
> > {
> > - u64 runtime, runtime_expires;
> > + u64 runtime;
> > int throttled;
> >
> > /* no need to continue the timer with no bandwidth constraint */
> > @@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
> > /* account preceding periods in which throttling occurred */
> > cfs_b->nr_throttled += overrun;
> >
> > - runtime_expires = cfs_b->runtime_expires;
> > -
> > /*
> > * This check is repeated as we are holding onto the new bandwidth while
> > * we unthrottle. This can potentially race with an unthrottled group
> > @@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
> > cfs_b->distribute_running = 1;
> > raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> > /* we can't nest cfs_b->lock while distributing bandwidth */
> > - runtime = distribute_cfs_runtime(cfs_b, runtime,
> > - runtime_expires);
> > + runtime = distribute_cfs_runtime(cfs_b, runtime);
> > raw_spin_lock_irqsave(&cfs_b->lock, flags);
> >
> > cfs_b->distribute_running = 0;
> > @@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> > return;
> >
> > raw_spin_lock(&cfs_b->lock);
> > - if (cfs_b->quota != RUNTIME_INF &&
> > - cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> > + if (cfs_b->quota != RUNTIME_INF) {
> > cfs_b->runtime += slack_runtime;
> >
> > /* we are under rq->lock, defer unthrottling using a timer */
> > @@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> > {
> > u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
> > unsigned long flags;
> > - u64 expires;
> >
> > /* confirm we're still not at a refresh boundary */
> > raw_spin_lock_irqsave(&cfs_b->lock, flags);
> > @@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> > if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
> > runtime = cfs_b->runtime;
> >
> > - expires = cfs_b->runtime_expires;
> > if (runtime)
> > cfs_b->distribute_running = 1;
> >
> > @@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> > if (!runtime)
> > return;
> >
> > - runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
> > + runtime = distribute_cfs_runtime(cfs_b, runtime);
> >
> > raw_spin_lock_irqsave(&cfs_b->lock, flags);
> > - if (expires == cfs_b->runtime_expires)
> > - lsub_positive(&cfs_b->runtime, runtime);
> > cfs_b->distribute_running = 0;
> > raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> > }
> > @@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> >
> > cfs_b->period_active = 1;
> > overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> > - cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
> > - cfs_b->expires_seq++;
> > hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> > }
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index b52ed1a..0c0ed23 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -341,8 +341,6 @@ struct cfs_bandwidth {
> > u64 quota;
> > u64 runtime;
> > s64 hierarchical_quota;
> > - u64 runtime_expires;
> > - int expires_seq;
> >
> > short idle;
> > short period_active;
> > @@ -562,8 +560,6 @@ struct cfs_rq {
> >
> > #ifdef CONFIG_CFS_BANDWIDTH
> > int runtime_enabled;
> > - int expires_seq;
> > - u64 runtime_expires;
> > s64 runtime_remaining;
> >
> > u64 throttled_clock;
> > --
> > 1.8.3.1
> >

--

2019-05-24 15:18:10

by Dave Chiluk

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

On Fri, May 24, 2019 at 9:32 AM Phil Auld <[email protected]> wrote:
> On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:

> > If the machine runs at/close to capacity, won't the overallocation
> > of the quota to bursty tasks necessarily negatively impact every other
> > task? Should the "unused" quota be available only on idle CPUs?
> > (Or maybe this is the behavior achieved here, and only the comment and
> > the commit message should be fixed...)
> >
>
> It's bounded by the amount left unused from the previous period. So
> theoretically a process could use almost twice its quota. But then it
> would have nothing left over in the next period. To repeat it would have
> to not use any that next period. Over a longer number of periods it's the
> same amount of CPU usage.
>
> I think that is more fair than throttling a process that has never used
> its full quota.
>
> And it removes complexity.
>
> Cheers,
> Phil

Actually it's not even that bad. The overallocation of quota to a
bursty task in a period is limited to at most one slice per cpu, and
that slice must not have been used in the previous periods. The slice
size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and
defaults to 5ms. If a bursty task goes from underutilizing quota to
using its entire quota, it will not be able to burst in the
subsequent periods. Therefore in an absolute worst case contrived
scenario, a bursty task can add at most 5ms to the latency of other
threads on the same CPU. I think this worst case 5ms tradeoff is
entirely worth it.

This does mean that, theoretically, a poorly written massively
threaded application on an 80 core box that spreads itself onto 80
cpu run queues can overutilize its quota in a single period by at
most 5ms * 80 CPUs (slice * number of runqueues the application is
running on). But that means that each of those threads would have had
to leave its slice unused in a previous period, and it also means
that the application would have to be carefully written to exacerbate
this behavior.

Additionally, if cpu bound threads underutilize a slice of their
quota in a period due to CFS choosing a bursty task to run, they
should theoretically be able to make it up in the following periods
when the bursty task is unable to "burst".

Please be careful here: quota and slices are being treated
differently. Quota does not roll over between periods; only slices of
quota that have already been allocated to per-cpu run queues do. If
you allocate 100ms of quota per period to an application, but it only
spreads onto 3 cpu run queues, that means it can in the worst case use
an extra 3 x slice size = 15ms in periods following underutilization.
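
Spelling out the arithmetic of that worst case (slice_us hardcodes the
5ms default mentioned above, and worst_case_burst_us() is just an
illustrative helper, not a kernel function):

#include <stdio.h>

/* worst case: one previously unused slice carried per cpu runqueue */
static long worst_case_burst_us(long slice_us, int nr_runqueues)
{
        return slice_us * nr_runqueues;
}

int main(void)
{
        const long slice_us = 5000;  /* default sched_cfs_bandwidth_slice_us */

        printf("80 runqueues: up to %ld us of one-time burst\n",
               worst_case_burst_us(slice_us, 80));
        printf(" 3 runqueues: up to %ld us of one-time burst\n",
               worst_case_burst_us(slice_us, 3));
        return 0;
}

That prints 400000 us (the 5ms * 80 CPUs case) and 15000 us (the
3 runqueue case), matching the numbers above.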

So why does this matter? Well, applications that use thread pools
(*cough* java *cough*) with lots of tiny little worker threads tend to
spread themselves out onto a lot of run queues. These worker threads
grab quota slices in order to run, then rarely use all of their slice
(1 or 2ms out of the 5ms). This results in those worker threads
starving the main application of quota, and then expiring the
remainder of that quota slice on the per-cpu runqueue. Going back to
my earlier 100ms quota / 80 cpu example: that means only
100ms / cfs_bandwidth_slice_us (5ms) = 20 slices are available in a
period, so only 20 out of these 80 cpus ever get a slice allocated to
them. By allowing these per-cpu run queues to use their remaining
slice in following periods, these worker threads do not need to be
allocated additional slices, and thereby the main threads are actually
able to use the allocated cpu quota.
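
The slice-starvation arithmetic from that example, written down
(quota_us, slice_us and nr_cpus are just the numbers quoted above;
nothing is read from the kernel):

#include <stdio.h>

int main(void)
{
        const long quota_us = 100000;  /* cpu.cfs_quota_us = 100ms */
        const long slice_us = 5000;    /* bandwidth slice  =   5ms */
        const int nr_cpus = 80;

        long slices_per_period = quota_us / slice_us;

        printf("%ld slices available per period\n", slices_per_period);
        printf("only %ld of %d cpus can get a slice in any one period\n",
               slices_per_period, nr_cpus);
        return 0;
}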

This can be experienced by running fibtest available at
https://github.com/indeedeng/fibtest/.
$ runfibtest 1
runs a single fast thread taskset to cpu 0
$ runfibtest 8
Runs a single fast thread taskset to cpu 0, and 7 slow threads taskset
to cpus 1-7. This run is expected to show fewer iterations, but the
worse problem is that the cpu usage is far less than the 500ms that it
should have received.
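
If you want to eyeball the symptom while fibtest runs, something like
the following is enough to dump the throttling counters and total
usage. The cgroup v1 mount point and the group name "fibtest" are
assumptions here; adjust both for your system:

#include <stdio.h>

static void dump(const char *path)
{
        char buf[256];
        FILE *f = fopen(path, "r");

        if (!f) {
                perror(path);
                return;
        }
        printf("%s:\n", path);
        while (fgets(buf, sizeof(buf), f))
                printf("  %s", buf);
        fclose(f);
}

int main(void)
{
        /* hypothetical cgroup v1 paths -- adjust mount and group name */
        dump("/sys/fs/cgroup/cpu,cpuacct/fibtest/cpu.stat");
        dump("/sys/fs/cgroup/cpu,cpuacct/fibtest/cpuacct.usage");
        return 0;
}

A low cpuacct.usage together with a high nr_throttled count in
cpu.stat is the signature of the problem being described here.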

Thanks for the engagement on this,
Dave Chiluk

2019-05-24 16:00:48

by Phil Auld

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

On Fri, May 24, 2019 at 10:14:36AM -0500 Dave Chiluk wrote:
> On Fri, May 24, 2019 at 9:32 AM Phil Auld <[email protected]> wrote:
> > On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:
>
> > > If the machine runs at/close to capacity, won't the overallocation
> > > of the quota to bursty tasks necessarily negatively impact every other
> > > task? Should the "unused" quota be available only on idle CPUs?
> > > (Or maybe this is the behavior achieved here, and only the comment and
> > > the commit message should be fixed...)
> > >
> >
> > It's bounded by the amount left unused from the previous period. So
> > theoretically a process could use almost twice its quota. But then it
> > would have nothing left over in the next period. To repeat it would have
> > to not use any that next period. Over a longer number of periods it's the
> > same amount of CPU usage.
> >
> > I think that is more fair than throttling a process that has never used
> > its full quota.
> >
> > And it removes complexity.
> >
> > Cheers,
> > Phil
>
> Actually it's not even that bad. The overallocation of quota to a
> bursty task in a period is limited to at most one slice per cpu, and
> that slice must not have been used in the previous periods. The slice
> size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and
> defaults to 5ms. If a bursty task goes from underutilizing quota to
> using it's entire quota, it will not be able to burst in the
> subsequent periods. Therefore in an absolute worst case contrived
> scenario, a bursty task can add at most 5ms to the latency of other
> threads on the same CPU. I think this worst case 5ms tradeoff is
> entirely worth it.
>
> This does mean that a theoretically a poorly written massively
> threaded application on an 80 core box, that spreads itself onto 80
> cpu run queues, can overutilize it's quota in a period by at most 5ms
> * 80 CPUs in a sincle period (slice * number of runqueues the
> application is running on). But that means that each of those threads
> would have had to not be use their quota in a previous period, and it
> also means that the application would have to be carefully written to
> exacerbate this behavior.
>
> Additionally if cpu bound threads underutilize a slice of their quota
> in a period due to the cfs choosing a bursty task to run, they should
> theoretically be able to make it up in the following periods when the
> bursty task is unable to "burst".
>
> Please be careful here quota and slice are being treated differently.
> Quota does not roll-over between periods, only slices of quota that
> has already been allocated to per cpu run queues. If you allocate
> 100ms of quota per period to an application, but it only spreads onto
> 3 cpu run queues that means it can in the worst case use 3 x slice
> size = 15ms in periods following underutilization.
>
> So why does this matter. Well applications that use thread pools
> *(*cough* java *cough*) with lots of tiny little worker threads, tend
> to spread themselves out onto a lot of run queues. These worker
> threads grab quota slices in order to run, then rarely use all of
> their slice (1 or 2ms out of the 5ms). This results in those worker
> threads starving the main application of quota, and then expiring the
> remainder of that quota slice on the per-cpu. Going back to my
> earlier 100ms quota / 80 cpu example. That means only
> 100ms/cfs_bandwidth_slice_us(5ms) = 20 slices are available in a
> period. So only 20 out of these 80 cpus ever get a slice allocated to
> them. By allowing these per-cpu run queues to use their remaining
> slice in following periods these worker threads do not need to be
> allocated additional slice, and thereby the main threads are actually
> able to use the allocated cpu quota.
>
> This can be experienced by running fibtest available at
> https://github.com/indeedeng/fibtest/.
> $ runfibtest 1
> runs a single fast thread taskset to cpu 0
> $ runfibtest 8
> Runs a single fast thread taskset to cpu 0, and 7 slow threads taskset
> to cpus 1-7. This run is expected to show less iterations, but the
> worse problem is that the cpu usage is far less than the 500ms that it
> should have received.
>
> Thanks for the engagement on this,
> Dave Chiluk

Thanks for the clarification. This is an even better explanation.

Fwiw, I ran some of my cfs throttling tests with this, none of which are
designed to measure or hit this particular issue. They are more focused
on starvation and hard lockups that I've hit. But I did not see any issues
or oddities with this patch.

Cheers,
Phil

--

2019-05-24 16:31:21

by Peter Oskolkov

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

On Fri, May 24, 2019 at 8:15 AM Dave Chiluk <[email protected]> wrote:
>
> On Fri, May 24, 2019 at 9:32 AM Phil Auld <[email protected]> wrote:
> > On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:
>
> > > If the machine runs at/close to capacity, won't the overallocation
> > > of the quota to bursty tasks necessarily negatively impact every other
> > > task? Should the "unused" quota be available only on idle CPUs?
> > > (Or maybe this is the behavior achieved here, and only the comment and
> > > the commit message should be fixed...)
> > >
> >
> > It's bounded by the amount left unused from the previous period. So
> > theoretically a process could use almost twice its quota. But then it
> > would have nothing left over in the next period. To repeat it would have
> > to not use any that next period. Over a longer number of periods it's the
> > same amount of CPU usage.
> >
> > I think that is more fair than throttling a process that has never used
> > its full quota.
> >
> > And it removes complexity.
> >
> > Cheers,
> > Phil
>
> Actually it's not even that bad. The overallocation of quota to a
> bursty task in a period is limited to at most one slice per cpu, and
> that slice must not have been used in the previous periods. The slice
> size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and
> defaults to 5ms. If a bursty task goes from underutilizing quota to
> using it's entire quota, it will not be able to burst in the
> subsequent periods. Therefore in an absolute worst case contrived
> scenario, a bursty task can add at most 5ms to the latency of other
> threads on the same CPU. I think this worst case 5ms tradeoff is
> entirely worth it.
>
> This does mean that a theoretically a poorly written massively
> threaded application on an 80 core box, that spreads itself onto 80
> cpu run queues, can overutilize it's quota in a period by at most 5ms
> * 80 CPUs in a sincle period (slice * number of runqueues the
> application is running on). But that means that each of those threads
> would have had to not be use their quota in a previous period, and it
> also means that the application would have to be carefully written to
> exacerbate this behavior.
>
> Additionally if cpu bound threads underutilize a slice of their quota
> in a period due to the cfs choosing a bursty task to run, they should
> theoretically be able to make it up in the following periods when the
> bursty task is unable to "burst".

OK, so it is indeed possible that CPU bound threads will underutilize a slice
of their quota in a period as a result of this patch. This should probably
be clearly stated in the code comments and in the commit message.

In addition, I believe that although many workloads will indeed be
indifferent to getting their fair share "later", some latency-sensitive
workloads will definitely be negatively affected by this temporary
CPU quota stealing by bursty antagonists. So there should probably be
a way to limit this behavior; for example, by making it tunable
per cgroup.

>
> Please be careful here quota and slice are being treated differently.
> Quota does not roll-over between periods, only slices of quota that
> has already been allocated to per cpu run queues. If you allocate
> 100ms of quota per period to an application, but it only spreads onto
> 3 cpu run queues that means it can in the worst case use 3 x slice
> size = 15ms in periods following underutilization.
>
> So why does this matter. Well applications that use thread pools
> *(*cough* java *cough*) with lots of tiny little worker threads, tend
> to spread themselves out onto a lot of run queues. These worker
> threads grab quota slices in order to run, then rarely use all of
> their slice (1 or 2ms out of the 5ms). This results in those worker
> threads starving the main application of quota, and then expiring the
> remainder of that quota slice on the per-cpu. Going back to my
> earlier 100ms quota / 80 cpu example. That means only
> 100ms/cfs_bandwidth_slice_us(5ms) = 20 slices are available in a
> period. So only 20 out of these 80 cpus ever get a slice allocated to
> them. By allowing these per-cpu run queues to use their remaining
> slice in following periods these worker threads do not need to be
> allocated additional slice, and thereby the main threads are actually
> able to use the allocated cpu quota.
>
> This can be experienced by running fibtest available at
> https://github.com/indeedeng/fibtest/.
> $ runfibtest 1
> runs a single fast thread taskset to cpu 0
> $ runfibtest 8
> Runs a single fast thread taskset to cpu 0, and 7 slow threads taskset
> to cpus 1-7. This run is expected to show less iterations, but the
> worse problem is that the cpu usage is far less than the 500ms that it
> should have received.
>
> Thanks for the engagement on this,
> Dave Chiluk

2019-05-24 21:37:14

by Dave Chiluk

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

On Fri, May 24, 2019 at 11:28 AM Peter Oskolkov <[email protected]> wrote:
>
> On Fri, May 24, 2019 at 8:15 AM Dave Chiluk <[email protected]> wrote:
> >
> > On Fri, May 24, 2019 at 9:32 AM Phil Auld <[email protected]> wrote:
> > > On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:
> >
> > > > If the machine runs at/close to capacity, won't the overallocation
> > > > of the quota to bursty tasks necessarily negatively impact every other
> > > > task? Should the "unused" quota be available only on idle CPUs?
> > > > (Or maybe this is the behavior achieved here, and only the comment and
> > > > the commit message should be fixed...)
> > > >
> > >
> > > It's bounded by the amount left unused from the previous period. So
> > > theoretically a process could use almost twice its quota. But then it
> > > would have nothing left over in the next period. To repeat it would have
> > > to not use any that next period. Over a longer number of periods it's the
> > > same amount of CPU usage.
> > >
> > > I think that is more fair than throttling a process that has never used
> > > its full quota.
> > >
> > > And it removes complexity.
> > >
> > > Cheers,
> > > Phil
> >
> > Actually it's not even that bad. The overallocation of quota to a
> > bursty task in a period is limited to at most one slice per cpu, and
> > that slice must not have been used in the previous periods. The slice
> > size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and
> > defaults to 5ms. If a bursty task goes from underutilizing quota to
> > using it's entire quota, it will not be able to burst in the
> > subsequent periods. Therefore in an absolute worst case contrived
> > scenario, a bursty task can add at most 5ms to the latency of other
> > threads on the same CPU. I think this worst case 5ms tradeoff is
> > entirely worth it.
> >
> > This does mean that a theoretically a poorly written massively
> > threaded application on an 80 core box, that spreads itself onto 80
> > cpu run queues, can overutilize it's quota in a period by at most 5ms
> > * 80 CPUs in a sincle period (slice * number of runqueues the
> > application is running on). But that means that each of those threads
> > would have had to not be use their quota in a previous period, and it
> > also means that the application would have to be carefully written to
> > exacerbate this behavior.
> >
> > Additionally if cpu bound threads underutilize a slice of their quota
> > in a period due to the cfs choosing a bursty task to run, they should
> > theoretically be able to make it up in the following periods when the
> > bursty task is unable to "burst".
>
> OK, so it is indeed possible that CPU bound threads will underutilize a slice
> of their quota in a period as a result of this patch. This should probably
> be clearly stated in the code comments and in the commit message.
>
> In addition, I believe that although many workloads will indeed be
> indifferent to getting their fair share "later", some latency-sensitive
> workloads will definitely be negatively affected by this temporary
> CPU quota stealing by bursty antagonists. So there should probably be
> a way to limit this behavior; for example, by making it tunable
> per cgroup.
>
This patch restores the behavior that existed from at least
v3.16..v4.18, and the current Redhat 7 kernels. So you are kind of
championing a moot point as no one has noticed this "bursting"
behavior in over 5 years. By removing this slice expiration
altogether we restore the behavior and also solve the root issue of
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'.

Since 512ac999d275, many people are now noticing the slice expiration
and are very displeased with the behavior change.
see: https://github.com/kubernetes/kubernetes/issues/67577

I would love to hear if you know of a single complaint during that 5
year time window, where someone noticed this bursting and reported
that it negatively affected their application.

As for the documentation, I thought about documenting the possible
adverse side-effect, but I didn't feel it was worthwhile since no one
had noticed that corner case in the 5 years. I also could not figure
out a concise way of describing the slight corner case issue without
overly complicating the documentation. I felt that adding that corner
case would have detracted rather than added to the documentation's
usefulness. As far as commenting in code, considering most of this
commit removes lines, there's not a really great place for it. That's
why I did my best to describe the behavior in the documentation.
Smart people can draw conclusions from there as you have done. Please
keep in mind that this "bursting" is again limited to a single slice
on each per-cpu run-queue.

Thank you,
Dave

2019-05-24 22:09:02

by Peter Oskolkov

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

On Fri, May 24, 2019 at 2:35 PM Dave Chiluk <[email protected]> wrote:
>
> On Fri, May 24, 2019 at 11:28 AM Peter Oskolkov <[email protected]> wrote:
> >
> > On Fri, May 24, 2019 at 8:15 AM Dave Chiluk <[email protected]> wrote:
> > >
> > > On Fri, May 24, 2019 at 9:32 AM Phil Auld <[email protected]> wrote:
> > > > On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:
> > >
> > > > > If the machine runs at/close to capacity, won't the overallocation
> > > > > of the quota to bursty tasks necessarily negatively impact every other
> > > > > task? Should the "unused" quota be available only on idle CPUs?
> > > > > (Or maybe this is the behavior achieved here, and only the comment and
> > > > > the commit message should be fixed...)
> > > > >
> > > >
> > > > It's bounded by the amount left unused from the previous period. So
> > > > theoretically a process could use almost twice its quota. But then it
> > > > would have nothing left over in the next period. To repeat it would have
> > > > to not use any that next period. Over a longer number of periods it's the
> > > > same amount of CPU usage.
> > > >
> > > > I think that is more fair than throttling a process that has never used
> > > > its full quota.
> > > >
> > > > And it removes complexity.
> > > >
> > > > Cheers,
> > > > Phil
> > >
> > > Actually it's not even that bad. The overallocation of quota to a
> > > bursty task in a period is limited to at most one slice per cpu, and
> > > that slice must not have been used in the previous periods. The slice
> > > size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and
> > > defaults to 5ms. If a bursty task goes from underutilizing quota to
> > > using it's entire quota, it will not be able to burst in the
> > > subsequent periods. Therefore in an absolute worst case contrived
> > > scenario, a bursty task can add at most 5ms to the latency of other
> > > threads on the same CPU. I think this worst case 5ms tradeoff is
> > > entirely worth it.
> > >
> > > This does mean that a theoretically a poorly written massively
> > > threaded application on an 80 core box, that spreads itself onto 80
> > > cpu run queues, can overutilize it's quota in a period by at most 5ms
> > > * 80 CPUs in a sincle period (slice * number of runqueues the
> > > application is running on). But that means that each of those threads
> > > would have had to not be use their quota in a previous period, and it
> > > also means that the application would have to be carefully written to
> > > exacerbate this behavior.
> > >
> > > Additionally if cpu bound threads underutilize a slice of their quota
> > > in a period due to the cfs choosing a bursty task to run, they should
> > > theoretically be able to make it up in the following periods when the
> > > bursty task is unable to "burst".
> >
> > OK, so it is indeed possible that CPU bound threads will underutilize a slice
> > of their quota in a period as a result of this patch. This should probably
> > be clearly stated in the code comments and in the commit message.
> >
> > In addition, I believe that although many workloads will indeed be
> > indifferent to getting their fair share "later", some latency-sensitive
> > workloads will definitely be negatively affected by this temporary
> > CPU quota stealing by bursty antagonists. So there should probably be
> > a way to limit this behavior; for example, by making it tunable
> > per cgroup.
> >
> This patch restores the behavior that existed from at least
> v3.16..v4.18, and the current Redhat 7 kernels. So you are kind of
> championing a moot point as no one has noticed this "bursting"
> behavior in over 5 years. By removing this slice expiration
> altogether we restore the behavior and also solve the root issue of
> 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> condition")'.
>
> Since 512ac999d275, many people are now noticing the slice expiration
> and very displeased with the behavior change.
> see: https://github.com/kubernetes/kubernetes/issues/67577
>
> I would love to hear if you know of a single complaint during that 5
> year time window, where someone noticed this bursting and reported
> that it negatively affected their application.

Linux CPU scheduling tail latency is a well-known issue and a major
pain point in some workloads:
https://www.google.com/search?q=linux+cpu+scheduling+tail+latency

Even assuming that nobody noticed this particular cause
of CPU scheduling latencies, it does not mean the problem should be waved
away. At least it should be documented, if at this point it is decided
that it is difficult to address in a meaningful way. And, preferably, a way
to address the issue later on should be discussed and hopefully agreed to.

>
> As for the documentation, I thought about documenting the possible
> adverse side-effect, but I didn't feel it was worthwhile since no one
> had noticed that corner case in the 5 years. I also could not figure
> out a concise way of describing the slight corner case issue without
> overly complicating the documentation. I felt that adding that corner
> case would have detracted rather than added to the documentation's
> usefulness. As far as commenting in code, considering most of this
> commit removes lines, there's not a really great place for it. That's
> why I did my best to describe the behavior in the documentation.
> Smart people can draw conclusions from there as you have done. Please
> keep in mind that this "bursting" is again limited to a single slice
> on each per-cpu run-queue.
>
> Thank you,
> Dave

2019-05-28 22:29:14

by Dave Chiluk

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

On Fri, May 24, 2019 at 5:07 PM Peter Oskolkov <[email protected]> wrote:
> Linux CPU scheduling tail latency is a well-known issue and a major
> pain point in some workloads:
> https://www.google.com/search?q=linux+cpu+scheduling+tail+latency
>
> Even assuming that nobody noticed this particular cause
> of CPU scheduling latencies, it does not mean the problem should be waved
> away. At least it should be documented, if at this point it decided that
> it is difficult to address it in a meaningful way. And, preferably, a way
> to address the issue later on should be discussed and hopefully agreed to.

Reducing tail latencies for our web applications is the precise
reason I created this patch set. Applications that previously
responded in 20ms at the 95th percentile were now taking 220ms. Those
applications were correctly sized prior to 512ac999; after it, they
started seeing massive increases in their latencies due to hitting
throttling while using less than their quota of cpu.

I'll see if I can rework the documentation. Any specific
suggestions for how that can be worded would be appreciated.

2019-05-29 19:10:48

by Dave Chiluk

[permalink] [raw]
Subject: [PATCH v3 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

Changelog v3
- Reworked documentation to better describe behavior of slice expiration
per feedback from Peter Oskolkov

Changelog v2
- Fixed some checkpatch errors in the commit message.

2019-06-24 19:49:25

by Dave Chiluk

[permalink] [raw]
Subject:


Changelog v4
- Rewrote patch to return all quota when cfs_b has very little.
- Removed documentation changes, as bursting is no longer possible with this
new solution.

After the suggestion from Ben Segall to set min_cfs_rq_runtime=0, I came up
with this in an attempt to balance the desire to leave runtime on the per-cpu
queues with the desire to use this quota on other per-cpu rqs.

Basically we now check the cfs_b on each return, and decide whether all the
remaining time should be returned or min_cfs_rq_runtime should be left on the
per-cpu queue, based on how much time is remaining on the cfs_b. As a result
this mostly gives us the best of both worlds.
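
For readability, the decision described above can also be restated as
a standalone userspace sketch (simplified types, no locking, and the
runtime_expires check omitted; the 1ms/5ms constants mirror the
defaults discussed in this thread, and slack_to_return() is only an
illustrative name):

#include <stdio.h>
#include <stdint.h>

#define RUNTIME_INF ((uint64_t)~0ULL)

static const int64_t min_cfs_rq_runtime = 1000000;      /* 1ms in ns */

static uint64_t sched_cfs_bandwidth_slice(void)
{
        return 5000000;         /* 5ms in ns, the default slice */
}

/* how much of a cfs_rq's leftover runtime goes back to the global pool */
static int64_t slack_to_return(uint64_t cfs_b_quota, uint64_t cfs_b_runtime,
                               int64_t runtime_remaining)
{
        int64_t slack = runtime_remaining;

        if (slack <= 0 || cfs_b_quota == RUNTIME_INF)
                return 0;

        /* global pool still healthy: keep min_cfs_rq_runtime cpu-locally */
        if (cfs_b_runtime >= sched_cfs_bandwidth_slice() * 3 &&
            slack > min_cfs_rq_runtime)
                slack -= min_cfs_rq_runtime;

        /* otherwise the pool is nearly empty and everything goes back */
        return slack;
}

int main(void)
{
        /* plenty of global runtime left: 1ms stays on the cpu */
        printf("%lld\n",
               (long long)slack_to_return(100000000, 50000000, 4000000));
        /* global pool nearly drained: the full 4ms goes back */
        printf("%lld\n",
               (long long)slack_to_return(100000000, 10000000, 4000000));
        return 0;
}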

2019-06-24 19:49:34

by Dave Chiluk

[permalink] [raw]
Subject: [PATCH v4 1/1] sched/fair: Return all runtime when cfs_b has very little remaining.

It has been observed, that highly-threaded, user-interactive
applications running under cpu.cfs_quota_us constraints can hit a high
percentage of periods throttled while simultaneously not consuming the
allocated amount of quota. This impacts user-interactive non-cpu bound
applications, such as those running in kubernetes or mesos when run on
multiple cores.

This has been root caused to threads being allocated per cpu bandwidth
slices, and then not fully using that slice within the period. This
results in min_cfs_rq_runtime remaining on each per-cpu cfs_rq. At the
end of the period this remaining quota goes unused and expires. This
expiration of unused time on per-cpu runqueues results in applications
under-utilizing their quota while simultaneously hitting throttling.

The solution is to return all spare cfs_rq->runtime_remaining when
cfs_b->runtime nears the sched_cfs_bandwidth_slice. This balances the
desire to prevent cfs_rq from always pulling quota with the desire to
allow applications to fully utilize their quota.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <[email protected]>
---
kernel/sched/fair.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..4894eda 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4695,7 +4695,9 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
return 1;
}

-/* a cfs_rq won't donate quota below this amount */
+/* a cfs_rq won't donate quota below this amount unless cfs_b has very little
+ * remaining runtime.
+ */
static const u64 min_cfs_rq_runtime = 1 * NSEC_PER_MSEC;
/* minimum remaining period time to redistribute slack quota */
static const u64 min_bandwidth_expiration = 2 * NSEC_PER_MSEC;
@@ -4743,16 +4745,27 @@ static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
- s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime;
+ s64 slack_runtime = cfs_rq->runtime_remaining;

+ /* There is no runtime to return. */
if (slack_runtime <= 0)
return;

raw_spin_lock(&cfs_b->lock);
if (cfs_b->quota != RUNTIME_INF &&
cfs_rq->runtime_expires == cfs_b->runtime_expires) {
- cfs_b->runtime += slack_runtime;
+ /* As we near 0 quota remaining on cfs_b start returning all
+ * remaining runtime. This avoids stranding and then expiring
+ * runtime on per-cpu cfs_rq.
+ *
+ * If cfs_b still has plenty of runtime, leave min_cfs_rq_runtime of
+ * runtime on this cfs_rq.
+ */
+ if (cfs_b->runtime >= sched_cfs_bandwidth_slice() * 3 &&
+ slack_runtime > min_cfs_rq_runtime)
+ slack_runtime -= min_cfs_rq_runtime;

+ cfs_b->runtime += slack_runtime;
/* we are under rq->lock, defer unthrottling using a timer */
if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
!list_empty(&cfs_b->throttled_cfs_rq))
--
1.8.3.1

2019-06-24 20:53:12

by Benjamin Segall

[permalink] [raw]
Subject: Re: [PATCH v4 1/1] sched/fair: Return all runtime when cfs_b has very little remaining.

Dave Chiluk <[email protected]> writes:

> It has been observed, that highly-threaded, user-interactive
> applications running under cpu.cfs_quota_us constraints can hit a high
> percentage of periods throttled while simultaneously not consuming the
> allocated amount of quota. This impacts user-interactive non-cpu bound
> applications, such as those running in kubernetes or mesos when run on
> multiple cores.
>
> This has been root caused to threads being allocated per cpu bandwidth
> slices, and then not fully using that slice within the period. This
> results in min_cfs_rq_runtime remaining on each per-cpu cfs_rq. At the
> end of the period this remaining quota goes unused and expires. This
> expiration of unused time on per-cpu runqueues results in applications
> under-utilizing their quota while simultaneously hitting throttling.
>
> The solution is to return all spare cfs_rq->runtime_remaining when
> cfs_b->runtime nears the sched_cfs_bandwidth_slice. This balances the
> desire to prevent cfs_rq from always pulling quota with the desire to
> allow applications to fully utilize their quota.
>
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Signed-off-by: Dave Chiluk <[email protected]>
> ---
> kernel/sched/fair.c | 19 ++++++++++++++++---
> 1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f..4894eda 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4695,7 +4695,9 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
> return 1;
> }
>
> -/* a cfs_rq won't donate quota below this amount */
> +/* a cfs_rq won't donate quota below this amount unless cfs_b has very little
> + * remaining runtime.
> + */
> static const u64 min_cfs_rq_runtime = 1 * NSEC_PER_MSEC;
> /* minimum remaining period time to redistribute slack quota */
> static const u64 min_bandwidth_expiration = 2 * NSEC_PER_MSEC;
> @@ -4743,16 +4745,27 @@ static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
> static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> {
> struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> - s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime;
> + s64 slack_runtime = cfs_rq->runtime_remaining;
>
> + /* There is no runtime to return. */
> if (slack_runtime <= 0)
> return;
>
> raw_spin_lock(&cfs_b->lock);
> if (cfs_b->quota != RUNTIME_INF &&
> cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> - cfs_b->runtime += slack_runtime;
> + /* As we near 0 quota remaining on cfs_b start returning all
> + * remaining runtime. This avoids stranding and then expiring
> + * runtime on per-cpu cfs_rq.
> + *
> + * cfs->b has plenty of runtime leave min_cfs_rq_runtime of
> + * runtime on this cfs_rq.
> + */
> + if (cfs_b->runtime >= sched_cfs_bandwidth_slice() * 3 &&
> + slack_runtime > min_cfs_rq_runtime)
> + slack_runtime -= min_cfs_rq_runtime;
>
> + cfs_b->runtime += slack_runtime;
> /* we are under rq->lock, defer unthrottling using a timer */
> if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
> !list_empty(&cfs_b->throttled_cfs_rq))


This still has a similar cost to reducing min_cfs_rq_runtime to 0 - we
now take a tg-global lock on every group se dequeue. Setting min=0 means
that we have to take it on both enqueue and dequeue, while baseline
means we take it once per min_cfs_rq_runtime in the worst case.

In addition how much this helps is very dependent on the exact pattern
of sleep/wake - you can still strand all but 15ms of runtime with a
pretty reasonable pattern.

If the cost of taking this global lock across all cpus without a
ratelimit was somehow not a problem, I'd much prefer to just set
min_cfs_rq_runtime = 0. (Assuming it is, I definitely prefer the "lie
and sorta have 2x period 2x runtime" solution of removing expiration)

2019-06-26 22:11:09

by Dave Chiluk

[permalink] [raw]
Subject: Re: [PATCH v4 1/1] sched/fair: Return all runtime when cfs_b has very little remaining.

On Mon, Jun 24, 2019 at 10:33:07AM -0700, [email protected] wrote:
> This still has a similar cost as reducing min_cfs_rq_runtime to 0 - we
> now take a tg-global lock on every group se dequeue. Setting min=0 means
> that we have to take it on both enqueue and dequeue, while baseline
> means we take it once per min_cfs_rq_runtime in the worst case.
Yep, it's only slightly better than simply setting min_cfs_rq_runtime=0.
There's definitely a tradeoff of having to grab the lock every time.

The other non-obvious benefit to this is that when the application
starts hitting throttling, the whole application hits it closer to
simultaneously. Previously, threads that don't do much could continue
sipping on their 1ms of min_cfs_rq_runtime while the main threads were
throttled. With this the application would more or less halt within
5ms of reaching full quota usage.

> In addition how much this helps is very dependent on the exact pattern
> of sleep/wake - you can still strand all but 15ms of runtime with a
> pretty reasonable pattern.

I thought this change would have an upper bound on stranded time of:
NUMCPUs * min_cfs_rq_runtime - 3 * sched_cfs_bandwidth_slice().
That comes from the stranding of 1ms on each queue minus the amount
that we'd return once we forcibly hit the 3 x bandwidth slice
threshold. Am I missing something here? Additionally that's the worst
case, and would require that threads be scheduled on distinct cpus
mostly sequentially, and then never run again.
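
Just plugging the defaults mentioned in this thread (1ms
min_cfs_rq_runtime, 5ms slice, 80 cpus) into that formula as written;
this only evaluates the expression, it does not show that the formula
itself is right:

#include <stdio.h>

int main(void)
{
        const long min_cfs_rq_runtime_us = 1000;  /* 1ms kept per cpu rq */
        const long slice_us = 5000;               /* 5ms bandwidth slice */
        const int nr_cpus = 80;

        long stranded_us = (long)nr_cpus * min_cfs_rq_runtime_us
                           - 3 * slice_us;

        printf("upper bound on stranded runtime: %ld us\n", stranded_us);
        return 0;
}

That works out to 65ms for this configuration.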

> If the cost of taking this global lock across all cpus without a
> ratelimit was somehow not a problem, I'd much prefer to just set
> min_cfs_rq_runtime = 0. (Assuming it is, I definitely prefer the "lie
> and sorta have 2x period 2x runtime" solution of removing expiration)

Can I take this as a technical ack of the v3 patchset? Do you have any
other comments on that patchset you'd like to see corrected before it
gets integrated? If you are indeed good with that patchset, I'll clean
up the comment and Documentation text and re-submit as a v5 patchset.
I actually like that patchset best as well, for all the reasons
already discussed, especially the fact that it ran like this for 5
years prior to 512ac999.

Also, given my new-found understanding of min_cfs_rq_runtime, doesn't
this mean that if we don't expire runtime we could potentially have a
worst case of 1ms * NUMCPUs of extra runtime instead of 2x runtime?
This is because min_cfs_rq_runtime could theoretically live on a
per-cpu runqueue indefinitely. Still, this is actually much smaller
than I had previously feared. It also still holds that usage is
bounded by quota/period on average.

Thanks

2019-06-27 19:09:55

by Dave Chiluk

[permalink] [raw]
Subject: [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

It has been observed, that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to threads being allocated per cpu bandwidth
slices, and then not fully using that slice within the period. At which
point the slice and quota expires. This expiration of unused slice
results in applications not being able to utilize the quota for which
they are allocated.

The expiration of per-cpu slices was recently fixed by
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this has been broken since
at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
added the following conditional which resulted in slices never being
expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
/* extend local deadline, drift is bounded above by 2 ticks */
cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has recently been fixed
and is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
that the mechanisms around expiring runtime should be removed
altogether.

This allows quota already allocated to per-cpu run-queues to live longer
than the period boundary. This allows threads on runqueues that do not
use much CPU to continue to use their remaining slice over a longer
period of time than cpu.cfs_period_us. However, this helps prevent the
above condition of hitting throttling while also not fully utilizing
your cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueue. This is typically no
more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
change nothing, as they should theoretically fully utilize all of their
quota in each period. For user-interactive tasks as described above this
provides a much better user/application experience as their cpu
utilization will more closely match the amount they requested when they
hit throttling. This means that cpu limits no longer strictly apply per
period for non-cpu bound applications, but that they are still accurate
over longer timeframes.

This greatly improves performance of high-thread-count, non-cpu bound
applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase (10ms/100ms of quota on
80 CPU machine), this commit resulted in almost 30x performance
improvement, while still maintaining correct cpu quota restrictions.
That testcase is available at https://github.com/indeedeng/fibtest.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <[email protected]>
---
Documentation/scheduler/sched-bwc.txt | 73 ++++++++++++++++++++++++++++-------
kernel/sched/fair.c | 71 +++-------------------------------
kernel/sched/sched.h | 4 --
3 files changed, 65 insertions(+), 83 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
index f6b1873..4c19c94 100644
--- a/Documentation/scheduler/sched-bwc.txt
+++ b/Documentation/scheduler/sched-bwc.txt
@@ -8,15 +8,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
specification of the maximum CPU bandwidth available to a group or hierarchy.

The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time. When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
-
-A group's unused runtime is globally tracked, being refreshed with quota units
-above at each period boundary. As threads consume this bandwidth it is
-transferred to cpu-local "silos" on a demand basis. The amount transferred
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time. That quota is assigned to per cpu run queues in
+slices as threads in the cgroup become runnable. Once all quota has been
+assigned any additional requests for quota will result in those threads being
+throttled. Throttled threads will not be able to run again until the next
+period when the quota is replenished.
+
+A group's unassigned quota is globally tracked, being refreshed back to
+cfs_quota units at each period boundary. As threads consume this bandwidth it
+is transferred to cpu-local "silos" on a demand basis. The amount transferred
within each of these updates is tunable and described as the "slice".

Management
@@ -33,12 +34,12 @@ The default values are:

A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
bandwidth restriction in place, such a group is described as an unconstrained
-bandwidth group. This represents the traditional work-conserving behavior for
+bandwidth group. This represents the traditional work-conserving behavior for
CFS.

Writing any (valid) positive value(s) will enact the specified bandwidth limit.
-The minimum quota allowed for the quota or period is 1ms. There is also an
-upper bound on the period length of 1s. Additional restrictions exist when
+The minimum quota allowed for the quota or period is 1ms. There is also an
+upper bound on the period length of 1s. Additional restrictions exist when
bandwidth limits are used in a hierarchical fashion, these are explained in
more detail below.

@@ -51,8 +52,8 @@ unthrottled if it is in a constrained state.
System wide settings
--------------------
For efficiency run-time is transferred between the global pool and CPU local
-"silos" in a batch fashion. This greatly reduces global accounting pressure
-on large systems. The amount transferred each time such an update is required
+"silos" in a batch fashion. This greatly reduces global accounting pressure
+on large systems. The amount transferred each time such an update is required
is described as the "slice".

This is tunable via procfs:
@@ -90,6 +91,50 @@ There are two ways in which a group may become throttled:
In case b) above, even though the child may have runtime remaining it will not
be allowed to until the parent's runtime is refreshed.

+Real-world behavior
+-------------------
+Once a slice is assigned to a cpu it does not expire, but all except 1ms of it
+may be returned to the global pool if all threads on that cpu become
+unrunnable. This is configured at compile time by the min_cfs_rq_runtime
+variable.
+
+The fact that cpu-local slices do not expire results in some interesting corner
+cases that should be understood.
+
+For cpu-constrained applications in a cgroup this is a relatively moot point
+because they will naturally consume the entirety of their quota as well as the
+entirety of each cpu-local slice in each period. As a result it is expected
+that nr_periods roughly equals nr_throttled, and that cpuacct.usage will
+increase by roughly cfs_quota_us in each period.
+
+However, in a worst-case scenario for highly-threaded, interactive/non-cpu
+bound applications, this non-expiration nuance allows applications to briefly
+burst past their quota limits by the amount of unused slice on each cpu that
+the task group is running on (typically at most 1ms per cpu). This slight
+burst only applies if quota had been assigned to a cpu and then not fully used
+or returned in previous periods. This burst amount will not be transferred
+between cores. As a result, this mechanism still strictly limits the task
+group to its quota on average, albeit over a longer time window than the
+period. This also limits the burst ability to no more than 1ms per cpu. This
+provides a better, more predictable user experience for highly threaded
+applications with small quota limits on high core count machines. It also
+eliminates the propensity to throttle these applications while simultaneously
+using less than quota amounts of cpu. Another way to say this is that by
+allowing the unused portion of a slice to remain valid across periods we have
+decreased the possibility of wasting quota on cpu-local silos that don't need
+a full slice's amount of cpu time.
+
+The interaction between cpu-bound and non-cpu-bound-interactive applications
+should also be considered, especially when single core usage hits 100%. If you
+gave each of these applications half of a cpu-core and they both got scheduled
+on the same CPU it is theoretically possible that the non-cpu bound application
+will use up to 1ms additional quota in some periods, thereby preventing the
+cpu-bound application from fully using its quota by that same amount. In these
+instances it will be up to the CFS algorithm (see sched-design-CFS.txt) to
+decide which application is chosen to run, as they will both be runnable and
+have remaining quota. This runtime discrepancy will be made up in the following
+periods when the interactive application idles.
+
Examples
--------
1. Limit a group to 1 CPU worth of runtime.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..a675c69 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)

now = sched_clock_cpu(smp_processor_id());
cfs_b->runtime = cfs_b->quota;
- cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
- cfs_b->expires_seq++;
}

static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
struct task_group *tg = cfs_rq->tg;
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
- u64 amount = 0, min_amount, expires;
- int expires_seq;
+ u64 amount = 0, min_amount;

/* note: this is a positive sum as runtime_remaining <= 0 */
min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
cfs_b->idle = 0;
}
}
- expires_seq = cfs_b->expires_seq;
- expires = cfs_b->runtime_expires;
raw_spin_unlock(&cfs_b->lock);

cfs_rq->runtime_remaining += amount;
- /*
- * we may have advanced our local expiration to account for allowed
- * spread between our sched_clock and the one on which runtime was
- * issued.
- */
- if (cfs_rq->expires_seq != expires_seq) {
- cfs_rq->expires_seq = expires_seq;
- cfs_rq->runtime_expires = expires;
- }

return cfs_rq->runtime_remaining > 0;
}

-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
- struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
- /* if the deadline is ahead of our clock, nothing to do */
- if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
- return;
-
- if (cfs_rq->runtime_remaining < 0)
- return;
-
- /*
- * If the local deadline has passed we have to consider the
- * possibility that our sched_clock is 'fast' and the global deadline
- * has not truly expired.
- *
- * Fortunately we can check determine whether this the case by checking
- * whether the global deadline(cfs_b->expires_seq) has advanced.
- */
- if (cfs_rq->expires_seq == cfs_b->expires_seq) {
- /* extend local deadline, drift is bounded above by 2 ticks */
- cfs_rq->runtime_expires += TICK_NSEC;
- } else {
- /* global deadline is ahead, expiration has passed */
- cfs_rq->runtime_remaining = 0;
- }
-}
-
static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
{
/* dock delta_exec before expiring quota (as it could span periods) */
cfs_rq->runtime_remaining -= delta_exec;
- expire_cfs_rq_runtime(cfs_rq);

if (likely(cfs_rq->runtime_remaining > 0))
return;
@@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
resched_curr(rq);
}

-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
- u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
{
struct cfs_rq *cfs_rq;
u64 runtime;
@@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
remaining -= runtime;

cfs_rq->runtime_remaining += runtime;
- cfs_rq->runtime_expires = expires;

/* we check whether we're throttled above */
if (cfs_rq->runtime_remaining > 0)
@@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
*/
static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
{
- u64 runtime, runtime_expires;
+ u64 runtime;
int throttled;

/* no need to continue the timer with no bandwidth constraint */
@@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
/* account preceding periods in which throttling occurred */
cfs_b->nr_throttled += overrun;

- runtime_expires = cfs_b->runtime_expires;
-
/*
* This check is repeated as we are holding onto the new bandwidth while
* we unthrottle. This can potentially race with an unthrottled group
@@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
cfs_b->distribute_running = 1;
raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
/* we can't nest cfs_b->lock while distributing bandwidth */
- runtime = distribute_cfs_runtime(cfs_b, runtime,
- runtime_expires);
+ runtime = distribute_cfs_runtime(cfs_b, runtime);
raw_spin_lock_irqsave(&cfs_b->lock, flags);

cfs_b->distribute_running = 0;
@@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
return;

raw_spin_lock(&cfs_b->lock);
- if (cfs_b->quota != RUNTIME_INF &&
- cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+ if (cfs_b->quota != RUNTIME_INF) {
cfs_b->runtime += slack_runtime;

/* we are under rq->lock, defer unthrottling using a timer */
@@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
{
u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
unsigned long flags;
- u64 expires;

/* confirm we're still not at a refresh boundary */
raw_spin_lock_irqsave(&cfs_b->lock, flags);
@@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
runtime = cfs_b->runtime;

- expires = cfs_b->runtime_expires;
if (runtime)
cfs_b->distribute_running = 1;

@@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
if (!runtime)
return;

- runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+ runtime = distribute_cfs_runtime(cfs_b, runtime);

raw_spin_lock_irqsave(&cfs_b->lock, flags);
- if (expires == cfs_b->runtime_expires)
- lsub_positive(&cfs_b->runtime, runtime);
cfs_b->distribute_running = 0;
raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
}
@@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)

cfs_b->period_active = 1;
overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
- cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
- cfs_b->expires_seq++;
hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..0c0ed23 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -341,8 +341,6 @@ struct cfs_bandwidth {
u64 quota;
u64 runtime;
s64 hierarchical_quota;
- u64 runtime_expires;
- int expires_seq;

short idle;
short period_active;
@@ -562,8 +560,6 @@ struct cfs_rq {

#ifdef CONFIG_CFS_BANDWIDTH
int runtime_enabled;
- int expires_seq;
- u64 runtime_expires;
s64 runtime_remaining;

u64 throttled_clock;
--
1.8.3.1

2019-06-27 19:49:59

by Dave Chiluk

[permalink] [raw]
Subject: [PATCH v5 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

Changelog v5
- Based on this comment from Ben Segall on v4
> If the cost of taking this global lock across all cpus without a
> ratelimit was somehow not a problem, I'd much prefer to just set
> min_cfs_rq_runtime = 0. (Assuming it is, I definitely prefer the "lie
> and sorta have 2x period 2x runtime" solution of removing expiration)
I'm resubmitting my v3 patchset, with the requested changes.
- Updated Commit log given review comments
- Update sched-bwc.txt given my new understanding of the slack timer.

Changelog v4
- Rewrote patchset around the concept of returning all of runtime_remaining
when cfs_b nears the end of available quota.

Changelog v3
- Reworked documentation to better describe behavior of slice expiration per
feedback from Peter Oskolkov

Changelog v2
- Fixed some checkpatch errors in the commit message.

2019-06-27 19:50:05

by Dave Chiluk

[permalink] [raw]
Subject: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

It has been observed, that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to threads being allocated per cpu bandwidth
slices, and then not fully using that slice within the period. At which
point the slice and quota expires. This expiration of unused slice
results in applications not being able to utilize the quota for which
they are allocated.

The expiration of per-cpu slices was recently fixed by
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this has been broken since
at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
added the following conditional which resulted in slices never being
expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
/* extend local deadline, drift is bounded above by 2 ticks */
cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has recently been fixed
and is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
that the mechanisms around expiring runtime should be removed
altogether.

This allows quota already allocated to per-cpu run-queues to live longer
than the period boundary. This allows threads on runqueues that do not
use much CPU to continue to use their remaining slice over a longer
period of time than cpu.cfs_period_us. However, this helps prevent the
above condition of hitting throttling while also not fully utilizing
your cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueue. This is typically no
more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
change nothing, as they should theoretically fully utilize all of their
quota in each period. For user-interactive tasks as described above this
provides a much better user/application experience as their cpu
utilization will more closely match the amount they requested when they
hit throttling. This means that cpu limits no longer strictly apply per
period for non-cpu bound applications, but that they are still accurate
over longer timeframes.

This greatly improves performance of high-thread-count, non-cpu bound
applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase (10ms/100ms of quota on
80 CPU machine), this commit resulted in almost 30x performance
improvement, while still maintaining correct cpu quota restrictions.
That testcase is available at https://github.com/indeedeng/fibtest.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <[email protected]>
---
Documentation/scheduler/sched-bwc.txt | 73 ++++++++++++++++++++++++++++-------
kernel/sched/fair.c | 71 +++-------------------------------
kernel/sched/sched.h | 4 --
3 files changed, 65 insertions(+), 83 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
index f6b1873..4c19c94 100644
--- a/Documentation/scheduler/sched-bwc.txt
+++ b/Documentation/scheduler/sched-bwc.txt
@@ -8,15 +8,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
specification of the maximum CPU bandwidth available to a group or hierarchy.

The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time. When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
-
-A group's unused runtime is globally tracked, being refreshed with quota units
-above at each period boundary. As threads consume this bandwidth it is
-transferred to cpu-local "silos" on a demand basis. The amount transferred
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time. That quota is assigned to per cpu run queues in
+slices as threads in the cgroup become runnable. Once all quota has been
+assigned any additional requests for quota will result in those threads being
+throttled. Throttled threads will not be able to run again until the next
+period when the quota is replenished.
+
+A group's unassigned quota is globally tracked, being refreshed back to
+cfs_quota units at each period boundary. As threads consume this bandwidth it
+is transferred to cpu-local "silos" on a demand basis. The amount transferred
within each of these updates is tunable and described as the "slice".

Management
@@ -33,12 +34,12 @@ The default values are:

A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
bandwidth restriction in place, such a group is described as an unconstrained
-bandwidth group. This represents the traditional work-conserving behavior for
+bandwidth group. This represents the traditional work-conserving behavior for
CFS.

Writing any (valid) positive value(s) will enact the specified bandwidth limit.
-The minimum quota allowed for the quota or period is 1ms. There is also an
-upper bound on the period length of 1s. Additional restrictions exist when
+The minimum quota allowed for the quota or period is 1ms. There is also an
+upper bound on the period length of 1s. Additional restrictions exist when
bandwidth limits are used in a hierarchical fashion, these are explained in
more detail below.

@@ -51,8 +52,8 @@ unthrottled if it is in a constrained state.
System wide settings
--------------------
For efficiency run-time is transferred between the global pool and CPU local
-"silos" in a batch fashion. This greatly reduces global accounting pressure
-on large systems. The amount transferred each time such an update is required
+"silos" in a batch fashion. This greatly reduces global accounting pressure
+on large systems. The amount transferred each time such an update is required
is described as the "slice".

This is tunable via procfs:
@@ -90,6 +91,50 @@ There are two ways in which a group may become throttled:
In case b) above, even though the child may have runtime remaining it will not
be allowed to until the parent's runtime is refreshed.

+Real-world behavior
+-------------------
+Once a slice is assigned to a cpu it does not expire, but all except 1ms of it
+may be returned to the global pool if all threads on that cpu become
+unrunnable. This is configured at compile time by the min_cfs_rq_runtime
+variable.
+
+The fact that cpu-local slices do not expire results in some interesting corner
+cases that should be understood.
+
+For cgroup cpu constrained applications that are cpu limited this is a
+relatively moot point because they will naturally consume the entirety of their
+quota as well as the entirety of each cpu-local slice in each period. As a
+result it is expected that nr_periods roughly equal nr_throttled, and that
+cpuacct.usage will increase roughly equal to cfs_quota_us in each period.
+
+However, in a worst-case scenario for highly-threaded, interactive/non-cpu-bound
+applications, this non-expiration nuance allows applications to briefly burst
+past their quota limits by the amount of unused slice on each cpu that the task
+group is running on (typically at most 1ms per cpu). This slight burst
+only applies if quota had been assigned to a cpu and then not fully used or
+returned in previous periods. This burst amount will not be transferred
+between cores. As a result, this mechanism still strictly limits the task
+group to quota average usage, albeit over a longer time window than period.
+This also limits the burst ability to no more than 1ms per cpu. This provides
+a better, more predictable user experience for highly threaded applications with
+small quota limits on high core count machines. It also eliminates the
+propensity to throttle these applications while simultaneously using less than
+quota amounts of cpu. Another way to say this is that by allowing the unused
+portion of a slice to remain valid across periods we have decreased the
+possibility of wasting quota on cpu-local silos that don't need a full slice's
+amount of cpu time.
+
+The interaction between cpu-bound and non-cpu-bound-interactive applications
+should also be considered, especially when single core usage hits 100%. If you
+gave each of these applications half of a cpu-core and they both got scheduled
+on the same CPU it is theoretically possible that the non-cpu bound application
+will use up to 1ms additional quota in some periods, thereby preventing the
+cpu-bound application from fully using its quota by that same amount. In these
+instances it will be up to the CFS algorithm (see sched-design-CFS.txt) to
+decide which application is chosen to run, as they will both be runnable and
+have remaining quota. This runtime discrepancy will be made up in the following
+periods when the interactive application idles.
+
Examples
--------
1. Limit a group to 1 CPU worth of runtime.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..a675c69 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)

now = sched_clock_cpu(smp_processor_id());
cfs_b->runtime = cfs_b->quota;
- cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
- cfs_b->expires_seq++;
}

static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
struct task_group *tg = cfs_rq->tg;
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
- u64 amount = 0, min_amount, expires;
- int expires_seq;
+ u64 amount = 0, min_amount;

/* note: this is a positive sum as runtime_remaining <= 0 */
min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
cfs_b->idle = 0;
}
}
- expires_seq = cfs_b->expires_seq;
- expires = cfs_b->runtime_expires;
raw_spin_unlock(&cfs_b->lock);

cfs_rq->runtime_remaining += amount;
- /*
- * we may have advanced our local expiration to account for allowed
- * spread between our sched_clock and the one on which runtime was
- * issued.
- */
- if (cfs_rq->expires_seq != expires_seq) {
- cfs_rq->expires_seq = expires_seq;
- cfs_rq->runtime_expires = expires;
- }

return cfs_rq->runtime_remaining > 0;
}

-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
- struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
- /* if the deadline is ahead of our clock, nothing to do */
- if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
- return;
-
- if (cfs_rq->runtime_remaining < 0)
- return;
-
- /*
- * If the local deadline has passed we have to consider the
- * possibility that our sched_clock is 'fast' and the global deadline
- * has not truly expired.
- *
- * Fortunately we can check determine whether this the case by checking
- * whether the global deadline(cfs_b->expires_seq) has advanced.
- */
- if (cfs_rq->expires_seq == cfs_b->expires_seq) {
- /* extend local deadline, drift is bounded above by 2 ticks */
- cfs_rq->runtime_expires += TICK_NSEC;
- } else {
- /* global deadline is ahead, expiration has passed */
- cfs_rq->runtime_remaining = 0;
- }
-}
-
static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
{
/* dock delta_exec before expiring quota (as it could span periods) */
cfs_rq->runtime_remaining -= delta_exec;
- expire_cfs_rq_runtime(cfs_rq);

if (likely(cfs_rq->runtime_remaining > 0))
return;
@@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
resched_curr(rq);
}

-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
- u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
{
struct cfs_rq *cfs_rq;
u64 runtime;
@@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
remaining -= runtime;

cfs_rq->runtime_remaining += runtime;
- cfs_rq->runtime_expires = expires;

/* we check whether we're throttled above */
if (cfs_rq->runtime_remaining > 0)
@@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
*/
static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
{
- u64 runtime, runtime_expires;
+ u64 runtime;
int throttled;

/* no need to continue the timer with no bandwidth constraint */
@@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
/* account preceding periods in which throttling occurred */
cfs_b->nr_throttled += overrun;

- runtime_expires = cfs_b->runtime_expires;
-
/*
* This check is repeated as we are holding onto the new bandwidth while
* we unthrottle. This can potentially race with an unthrottled group
@@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
cfs_b->distribute_running = 1;
raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
/* we can't nest cfs_b->lock while distributing bandwidth */
- runtime = distribute_cfs_runtime(cfs_b, runtime,
- runtime_expires);
+ runtime = distribute_cfs_runtime(cfs_b, runtime);
raw_spin_lock_irqsave(&cfs_b->lock, flags);

cfs_b->distribute_running = 0;
@@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
return;

raw_spin_lock(&cfs_b->lock);
- if (cfs_b->quota != RUNTIME_INF &&
- cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+ if (cfs_b->quota != RUNTIME_INF) {
cfs_b->runtime += slack_runtime;

/* we are under rq->lock, defer unthrottling using a timer */
@@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
{
u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
unsigned long flags;
- u64 expires;

/* confirm we're still not at a refresh boundary */
raw_spin_lock_irqsave(&cfs_b->lock, flags);
@@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
runtime = cfs_b->runtime;

- expires = cfs_b->runtime_expires;
if (runtime)
cfs_b->distribute_running = 1;

@@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
if (!runtime)
return;

- runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+ runtime = distribute_cfs_runtime(cfs_b, runtime);

raw_spin_lock_irqsave(&cfs_b->lock, flags);
- if (expires == cfs_b->runtime_expires)
- lsub_positive(&cfs_b->runtime, runtime);
cfs_b->distribute_running = 0;
raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
}
@@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)

cfs_b->period_active = 1;
overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
- cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
- cfs_b->expires_seq++;
hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..0c0ed23 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -341,8 +341,6 @@ struct cfs_bandwidth {
u64 quota;
u64 runtime;
s64 hierarchical_quota;
- u64 runtime_expires;
- int expires_seq;

short idle;
short period_active;
@@ -562,8 +560,6 @@ struct cfs_rq {

#ifdef CONFIG_CFS_BANDWIDTH
int runtime_enabled;
- int expires_seq;
- u64 runtime_expires;
s64 runtime_remaining;

u64 throttled_clock;
--
1.8.3.1

2019-06-27 20:19:12

by Benjamin Segall

[permalink] [raw]
Subject: Re: [PATCH v4 1/1] sched/fair: Return all runtime when cfs_b has very little remaining.

Dave Chiluk <[email protected]> writes:

> On Mon, Jun 24, 2019 at 10:33:07AM -0700, [email protected] wrote:
>> This still has a similar cost as reducing min_cfs_rq_runtime to 0 - we
>> now take a tg-global lock on every group se dequeue. Setting min=0 means
>> that we have to take it on both enqueue and dequeue, while baseline
>> means we take it once per min_cfs_rq_runtime in the worst case.
> Yep, it's only slightly better than simply setting min_cfs_rq_runtime=0.
> There's definitely a tradeoff of having to grab the lock every time.
>
> The other non-obvious benefit to this is that when the application
> starts hitting throttling the entire application starts hitting
> throttling closer to simultaneously. Previously, threads that don't do
> much could continue sipping on their 1ms of min_cfs_rq_runtime, while
> main threads were throttled. With this the application would more or
> less halt within 5ms of full quota usage.
>
>> In addition how much this helps is very dependent on the exact pattern
>> of sleep/wake - you can still strand all but 15ms of runtime with a
>> pretty reasonable pattern.
>
> I thought this change would have an upper bound stranded time of:
> NUMCPUs * min_cfs_rq_runtime - 3 * sched_cfs_bandwidth_slice().
> From the stranding of 1ms on each queue minues the amount that we'd
> return when we forcibly hit the 3 x bandwidth slice Am I missing
> something here? Additionally that's worst case, and would require that
> threads be scheduled on distinct cpus mostly sequentially, and then
> never run again.

Yeah, to be exact it's that. It's just that this is only relevant at all
when NUMCPUs*min_cfs_rq_runtime is at least a significant fraction of
quota, and when "scheduled on distinct cpus, and then never run again"
happens. How sequential it needs to be depends on the threads vs num
cpus.
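
For concreteness, a quick userspace sketch of that bound with illustrative
numbers (80 cpus, 1ms of min_cfs_rq_runtime, the default 5ms slice; plain
userspace C, not kernel code):

#include <stdio.h>

int main(void)
{
	unsigned int nr_cpus = 80;
	unsigned long long min_cfs_rq_runtime_us = 1000;	/* 1ms */
	unsigned long long slice_us = 5000;			/* 5ms */

	/* 1ms can be stranded on every cpu, minus the ~3 slices that get
	 * handed back once cfs_b nears the end of its quota. */
	unsigned long long stranded_us =
		nr_cpus * min_cfs_rq_runtime_us - 3 * slice_us;

	printf("worst-case stranded runtime: %llu us\n", stranded_us);
	return 0;
}

That works out to 80 * 1ms - 3 * 5ms = 65ms of stranded runtime in the worst
case, and only if the threads land on distinct cpus and then never run again.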


>
>> If the cost of taking this global lock across all cpus without a
>> ratelimit was somehow not a problem, I'd much prefer to just set
>> min_cfs_rq_runtime = 0. (Assuming it is, I definitely prefer the "lie
>> and sorta have 2x period 2x runtime" solution of removing expiration)
>
> Can I take this as a technical ack of the v3 patchset? Do you have any
> other comments for that patchset you'd like to see corrected before it
> gets integrated? If you are indeed good with that patchset, I'll clean
> up the comment and Documentation text and re-submit as a v5 patchset.
>
> I actually like that patchset best as well, for all the reasons already
> discussed. Especially the one where it's been running like this for 5
> years prior to 512ac999.
>
> Also given my new-found understanding of min_cfs_rq_runtime, doesn't
> this mean that if we don't expire runtime we could potentially have a
> worst case of 1ms * NUMCPUs of extra runtime instead of 2x runtime?
> This is because min_cfs_rq_runtime could theoretically live on a per-cpu runqueue
> indefinitely? Still this is actually much smaller than I previously had
> feared. Also it still holds that average runtime is still bounded by
> runtime/period on average.

Yeah, I think I was misremembering or thinking only about one example or
something. Meanwhile I'm thinking this is potentially /worse/ than just
2x runtime per 2x period, though obviously it depends on cpus vs quota.


One thing I'm trying now is if we made the slack timer go and collect
all the min_cfs_rq_runtime left behind, which would provide something
similar to the min=0 case for fixing not only this but other cases of
stranding bandwidth, but also some level of rate limiting.



>
> Thanks

2019-07-01 20:16:47

by Benjamin Segall

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

Alright, this prototype of "maybe we should just 100% avoid stranding
runtime for a full period" appears to fix the fibtest synthetic example,
and seems like a theoretically-reasonable approach.

Things that may want improvement or at least thought (but it's a holiday
week in the US and I wanted any opinions on the basic approach):

- I don't like cfs_rq->slack_list, since logically it's mutually
exclusive with throttled_list, but we have to iterate without
locks, so I don't know that we can avoid it.

- I previously was using _rcu for the slack list, like throttled, but
there is no list_for_each_entry_rcu_safe, so the list_del_init would
be invalid and we'd have to use another flag or opencode the
equivalent.

- (Actually, this just made me realize that distribute is sorta wrong if
the unthrottled task immediately runs and rethrottles; this would just
mean that we effectively restart the loop)

- We unconditionally start the slack timer, even if nothing is
throttled. We could instead have throttle start the timer in this case
(setting the timeout some definition of "appropriately"), but this
bookkeeping would be a big hassle.

- We could try to do better about deciding what cfs_rqs are idle than
"nr_running == 0", possibly requiring that to have been true for N<5
ms, and restarting the slack timer if we didn't clear everything.

- Taking up-to-every rq->lock is bad and expensive and 5ms may be too
short a delay for this. I haven't tried microbenchmarks on the cost of
this vs min_cfs_rq_runtime = 0 vs baseline.

- runtime_expires vs expires_seq choice is basically rand(), much like
the existing code. (I think the most consistent with existing code
would be runtime_expires, since cfs_b lock is held; I think most
consistent in general would change most of the existing ones as well
to be seq)


-- >8 --
Subject: [PATCH] sched: avoid stranding cfs_bandwidth runtime

We avoid contention on the per-tg cfs_b->lock by keeping 1ms of runtime on a
cfs_rq even when all tasks in that cfs_rq dequeue. This way tasks doing frequent
wake/sleep can't hit this cross-cpu lock more than once per ms. This however
means that up to 1ms of runtime per cpu can be lost if no task wakes up on
that cpu, which is leading to issues on cgroups with low quota, many available
cpus, and a combination of threads that run for very little time and ones that
want to run constantly.

This was previously hidden by runtime expiration being broken, which allowed
this stranded runtime to be kept indefinitely across period resets. Thus after
an initial period or two any long-running tasks could use an appropriate portion
of their group's quota. The issue was that the group could also potentially
burst for 1ms * cpus more than their quota allowed, and in these situations this
is a significant increase.

Fix this by having the group's slack timer (which runs at most once per 5ms)
remove all runtime from empty cfs_rqs, not just redistribute any runtime above
that 1ms that was returned immediately.

Signed-off-by: Ben Segall <[email protected]>
---
kernel/sched/fair.c | 66 +++++++++++++++++++++++++++++++++++---------
kernel/sched/sched.h | 2 ++
2 files changed, 55 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 300f2c54dea5..80b2198d9b29 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4745,23 +4745,32 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime;

- if (slack_runtime <= 0)
+ if (cfs_rq->runtime_remaining <= 0)
+ return;
+
+ if (slack_runtime <= 0 && !list_empty(&cfs_rq->slack_list))
return;

raw_spin_lock(&cfs_b->lock);
if (cfs_b->quota != RUNTIME_INF &&
- cfs_rq->runtime_expires == cfs_b->runtime_expires) {
- cfs_b->runtime += slack_runtime;
+ cfs_rq->expires_seq == cfs_b->expires_seq) {
+ if (slack_runtime > 0)
+ cfs_b->runtime += slack_runtime;
+ if (list_empty(&cfs_rq->slack_list))
+ list_add(&cfs_rq->slack_list, &cfs_b->slack_cfs_rq);

- /* we are under rq->lock, defer unthrottling using a timer */
- if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
- !list_empty(&cfs_b->throttled_cfs_rq))
- start_cfs_slack_bandwidth(cfs_b);
+ /*
+ * After a timeout, gather our remaining runtime so it can't get
+ * stranded. We need a timer anyways to distribute any of the
+ * runtime due to locking issues.
+ */
+ start_cfs_slack_bandwidth(cfs_b);
}
raw_spin_unlock(&cfs_b->lock);

- /* even if it's not valid for return we don't want to try again */
- cfs_rq->runtime_remaining -= slack_runtime;
+ if (slack_runtime > 0)
+ /* even if it's not valid for return we don't want to try again */
+ cfs_rq->runtime_remaining -= slack_runtime;
}

static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
@@ -4781,12 +4790,41 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
*/
static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
{
- u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
+ u64 runtime = 0;
unsigned long flags;
u64 expires;
+ struct cfs_rq *cfs_rq, *temp;
+ LIST_HEAD(temp_head);
+
+ local_irq_save(flags);
+
+ raw_spin_lock(&cfs_b->lock);
+ cfs_b->slack_started = false;
+ list_splice_init(&cfs_b->slack_cfs_rq, &temp_head);
+ raw_spin_unlock(&cfs_b->lock);
+
+
+ /* Gather all left over runtime from all rqs */
+ list_for_each_entry_safe(cfs_rq, temp, &temp_head, slack_list) {
+ struct rq *rq = rq_of(cfs_rq);
+ struct rq_flags rf;
+
+ rq_lock(rq, &rf);
+
+ raw_spin_lock(&cfs_b->lock);
+ list_del_init(&cfs_rq->slack_list);
+ if (!cfs_rq->nr_running && cfs_rq->runtime_remaining > 0 &&
+ cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+ cfs_b->runtime += cfs_rq->runtime_remaining;
+ cfs_rq->runtime_remaining = 0;
+ }
+ raw_spin_unlock(&cfs_b->lock);
+
+ rq_unlock(rq, &rf);
+ }

/* confirm we're still not at a refresh boundary */
- raw_spin_lock_irqsave(&cfs_b->lock, flags);
+ raw_spin_lock(&cfs_b->lock);
cfs_b->slack_started = false;
if (cfs_b->distribute_running) {
raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
@@ -4798,7 +4836,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
return;
}

- if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
+ if (cfs_b->quota != RUNTIME_INF)
runtime = cfs_b->runtime;

expires = cfs_b->runtime_expires;
@@ -4946,6 +4984,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
cfs_b->period = ns_to_ktime(default_cfs_period());

INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
+ INIT_LIST_HEAD(&cfs_b->slack_cfs_rq);
hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
cfs_b->period_timer.function = sched_cfs_period_timer;
hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
@@ -4958,6 +4997,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
cfs_rq->runtime_enabled = 0;
INIT_LIST_HEAD(&cfs_rq->throttled_list);
+ INIT_LIST_HEAD(&cfs_rq->slack_list);
}

void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b08dee29ef5e..3b272ee894fb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -345,6 +345,7 @@ struct cfs_bandwidth {
struct hrtimer period_timer;
struct hrtimer slack_timer;
struct list_head throttled_cfs_rq;
+ struct list_head slack_cfs_rq;

/* Statistics: */
int nr_periods;
@@ -566,6 +567,7 @@ struct cfs_rq {
int throttled;
int throttle_count;
struct list_head throttled_list;
+ struct list_head slack_list;
#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */
};
--
2.22.0.410.gd8fdbe21b5-goog

2019-07-11 10:21:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices


FWIW, good to see progress, still waiting for you guys to agree :-)

On Mon, Jul 01, 2019 at 01:15:44PM -0700, [email protected] wrote:

> - Taking up-to-every rq->lock is bad and expensive and 5ms may be too
> short a delay for this. I haven't tried microbenchmarks on the cost of
> this vs min_cfs_rq_runtime = 0 vs baseline.

Yes, that's tricky, SGI/HPE have definite ideas about that.

> @@ -4781,12 +4790,41 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> */
> static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> {
> - u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
> + u64 runtime = 0;
> unsigned long flags;
> u64 expires;
> + struct cfs_rq *cfs_rq, *temp;
> + LIST_HEAD(temp_head);
> +
> + local_irq_save(flags);
> +
> + raw_spin_lock(&cfs_b->lock);
> + cfs_b->slack_started = false;
> + list_splice_init(&cfs_b->slack_cfs_rq, &temp_head);
> + raw_spin_unlock(&cfs_b->lock);
> +
> +
> + /* Gather all left over runtime from all rqs */
> + list_for_each_entry_safe(cfs_rq, temp, &temp_head, slack_list) {
> + struct rq *rq = rq_of(cfs_rq);
> + struct rq_flags rf;
> +
> + rq_lock(rq, &rf);
> +
> + raw_spin_lock(&cfs_b->lock);
> + list_del_init(&cfs_rq->slack_list);
> + if (!cfs_rq->nr_running && cfs_rq->runtime_remaining > 0 &&
> + cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> + cfs_b->runtime += cfs_rq->runtime_remaining;
> + cfs_rq->runtime_remaining = 0;
> + }
> + raw_spin_unlock(&cfs_b->lock);
> +
> + rq_unlock(rq, &rf);
> + }

But worse still, you take possibly every rq->lock without ever
re-enabling IRQs.

>
> /* confirm we're still not at a refresh boundary */
> - raw_spin_lock_irqsave(&cfs_b->lock, flags);
> + raw_spin_lock(&cfs_b->lock);
> cfs_b->slack_started = false;
> if (cfs_b->distribute_running) {
> raw_spin_unlock_irqrestore(&cfs_b->lock, flags);

2019-07-11 18:22:10

by Benjamin Segall

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

Peter Zijlstra <[email protected]> writes:

> FWIW, good to see progress, still waiting for you guys to agree :-)
>
> On Mon, Jul 01, 2019 at 01:15:44PM -0700, [email protected] wrote:
>
>> - Taking up-to-every rq->lock is bad and expensive and 5ms may be too
>> short a delay for this. I haven't tried microbenchmarks on the cost of
>> this vs min_cfs_rq_runtime = 0 vs baseline.
>
> Yes, that's tricky, SGI/HPE have definite ideas about that.
>
>> @@ -4781,12 +4790,41 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>> */
>> static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>> {
>> - u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
>> + u64 runtime = 0;
>> unsigned long flags;
>> u64 expires;
>> + struct cfs_rq *cfs_rq, *temp;
>> + LIST_HEAD(temp_head);
>> +
>> + local_irq_save(flags);
>> +
>> + raw_spin_lock(&cfs_b->lock);
>> + cfs_b->slack_started = false;
>> + list_splice_init(&cfs_b->slack_cfs_rq, &temp_head);
>> + raw_spin_unlock(&cfs_b->lock);
>> +
>> +
>> + /* Gather all left over runtime from all rqs */
>> + list_for_each_entry_safe(cfs_rq, temp, &temp_head, slack_list) {
>> + struct rq *rq = rq_of(cfs_rq);
>> + struct rq_flags rf;
>> +
>> + rq_lock(rq, &rf);
>> +
>> + raw_spin_lock(&cfs_b->lock);
>> + list_del_init(&cfs_rq->slack_list);
>> + if (!cfs_rq->nr_running && cfs_rq->runtime_remaining > 0 &&
>> + cfs_rq->runtime_expires == cfs_b->runtime_expires) {
>> + cfs_b->runtime += cfs_rq->runtime_remaining;
>> + cfs_rq->runtime_remaining = 0;
>> + }
>> + raw_spin_unlock(&cfs_b->lock);
>> +
>> + rq_unlock(rq, &rf);
>> + }
>
> But worse still, you take possibly every rq->lock without ever
> re-enabling IRQs.
>

Yeah, I'm not sure why I did that, it isn't correctness.
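
A sketch of the loop with a per-rq irqsave section instead of one
local_irq_save() around the whole walk (untested; the cfs_b->lock
take/release around the list splice and the refresh-boundary check would
then need their own irqsave/irqrestore):

	/* Gather all left over runtime from all rqs */
	list_for_each_entry_safe(cfs_rq, temp, &temp_head, slack_list) {
		struct rq *rq = rq_of(cfs_rq);
		struct rq_flags rf;

		/* IRQs are only disabled while this one rq is locked */
		rq_lock_irqsave(rq, &rf);

		raw_spin_lock(&cfs_b->lock);
		list_del_init(&cfs_rq->slack_list);
		if (!cfs_rq->nr_running && cfs_rq->runtime_remaining > 0 &&
		    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
			cfs_b->runtime += cfs_rq->runtime_remaining;
			cfs_rq->runtime_remaining = 0;
		}
		raw_spin_unlock(&cfs_b->lock);

		rq_unlock_irqrestore(rq, &rf);
	}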

2019-07-12 18:02:20

by Benjamin Segall

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

Dave Chiluk <[email protected]> writes:

> So I spent some more time testing this new patch as is *(interrupts disabled). I know I probably should have fixed the patch, but it's hard to get time on big test hardware sometimes, and I was already well along my way with testing.
>
> In regards to the quota usage overage I was seeing earlier: I have a
> theory as to what might be happening here, and I'm pretty sure it's
> related to the IRQs being disabled during the rq->lock walk. I think
> that the main fast thread was able to use an excess amount of quota
> because the timer interrupt meant to stop it wasn't being handled
> timely due to the interrupts being disabled. On my 8 core machine this
> resulted in a what looked like simply improved usage of the quota, but
> when I ran the test on an 80 core machine I saw a massive overage of
> cpu usage when running fibtest. Specifically when running fibtest for
> 5 seconds with 50ms quota/100ms period expecting ~2500ms of quota
> usage; I got 3731 ms of cpu usage which was an unexpected overage of
> 1231ms. Is that a reasonable theory?

That doesn't seem likely - taking 1ms would be way longer than I'd expect
to begin with, and runtime_remaining can go negative for that sort of
reason anyways assuming the irq time is even counted towards the task.
Also I don't think that the enable-irqs version will help for the scheduler
tick at least without rt patchsets.

That is still also too much for what I was thinking of though. I'll have
to look into this more.

>
> I'll try to get some time again tomorrow to test with IRQs disabled before the walk. Ben if you have a chance to fix and resend the patch that'd help.
>
> I'm really starting to think that simply removing the quota expiration
> may be the best solution here. Mathematically it works out, it makes
> the code simpler, it doesn't have any of the lock walk issues, it
> doesn't add extra latency or overhead due to the slack timer,

It works out _for the job that is supposed to be throttled_. If the job
then gets a burst of actually-expensive work on many threads it can then
use up to NCPUs * 1ms of extra runtime, adding latency to any other job on the system. Given
that it's still only 1ms on each runqueue, maybe this isn't the end of
the world, but the fail case does exist.

(We have to do exactly the same locking stuff on distribute, both more
rarely on the period timer, and on the currently existing slack timer)

> and that behavior is exactly what the kernel was doing for 5 years with few complaints about overage afaik.
>
> Either way, I'm very glad that we are getting to the end of this one, and all solutions appear to solve the core of the problem. I thank you all for the work you guys have put into this.
>
> On Thu, Jul 11, 2019 at 12:46 PM <[email protected]> wrote:
>
> Peter Zijlstra <[email protected]> writes:
>
> > FWIW, good to see progress, still waiting for you guys to agree :-)
> >
> > On Mon, Jul 01, 2019 at 01:15:44PM -0700, [email protected] wrote:
> >
> >> - Taking up-to-every rq->lock is bad and expensive and 5ms may be too
> >> short a delay for this. I haven't tried microbenchmarks on the cost of
> >> this vs min_cfs_rq_runtime = 0 vs baseline.
> >
> > Yes, that's tricky, SGI/HPE have definite ideas about that.
> >
> >> @@ -4781,12 +4790,41 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> >> */
> >> static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> >> {
> >> - u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
> >> + u64 runtime = 0;
> >> unsigned long flags;
> >> u64 expires;
> >> + struct cfs_rq *cfs_rq, *temp;
> >> + LIST_HEAD(temp_head);
> >> +
> >> + local_irq_save(flags);
> >> +
> >> + raw_spin_lock(&cfs_b->lock);
> >> + cfs_b->slack_started = false;
> >> + list_splice_init(&cfs_b->slack_cfs_rq, &temp_head);
> >> + raw_spin_unlock(&cfs_b->lock);
> >> +
> >> +
> >> + /* Gather all left over runtime from all rqs */
> >> + list_for_each_entry_safe(cfs_rq, temp, &temp_head, slack_list) {
> >> + struct rq *rq = rq_of(cfs_rq);
> >> + struct rq_flags rf;
> >> +
> >> + rq_lock(rq, &rf);
> >> +
> >> + raw_spin_lock(&cfs_b->lock);
> >> + list_del_init(&cfs_rq->slack_list);
> >> + if (!cfs_rq->nr_running && cfs_rq->runtime_remaining > 0 &&
> >> + cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> >> + cfs_b->runtime += cfs_rq->runtime_remaining;
> >> + cfs_rq->runtime_remaining = 0;
> >> + }
> >> + raw_spin_unlock(&cfs_b->lock);
> >> +
> >> + rq_unlock(rq, &rf);
> >> + }
> >
> > But worse still, you take possibly every rq->lock without ever
> > re-enabling IRQs.
> >
>
> Yeah, I'm not sure why I did that, it isn't correctness.

2019-07-12 22:10:47

by Benjamin Segall

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

Dave Chiluk <[email protected]> writes:

> So I spent some more time testing this new patch as is *(interrupts disabled). I know I probably should have fixed the patch, but it's hard to get time on big test hardware sometimes, and I was already well along my way with testing.
>
> In regards to the quota usage overage I was seeing earlier: I have a theory as to what might be happening here, and I'm pretty sure it's related to the IRQs being disabled during the rq->lock walk. I think that the main fast thread was able to use an excess amount
> of quota because the timer interrupt meant to stop it wasn't being handled timely due to the interrupts being disabled. On my 8 core machine this resulted in what looked like simply improved usage of the quota, but when I ran the test on an 80 core machine I
> saw a massive overage of cpu usage when running fibtest. Specifically when running fibtest for 5 seconds with 50ms quota/100ms period expecting ~2500ms of quota usage; I got 3731 ms of cpu usage which was an unexpected overage of 1231ms. Is that a
> reasonable theory?

I think I've figured out what's going on here (and a related issue
that gave me some inconsistency when trying to debug it): other "slow"
threads can wake up while the slack timer is in distribute and
double-spend some runtime. Since we lsub_positive rather than allow
cfs_b->runtime to be negative this double-spending is permanent, and can
go on indefinitely.
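
To spell that out, lsub_positive() on cfs_b->runtime behaves roughly like
the simplified model below (a sketch, not the kernel macro): subtract, but
clamp at zero rather than going negative, so runtime pulled concurrently by
waking threads while distribute runs is never charged back.

/* Simplified model of lsub_positive(): subtract, clamping at zero. */
static void lsub_positive_model(unsigned long long *val, unsigned long long sub)
{
	*val -= (sub < *val) ? sub : *val;
}

/*
 * Example of the double-spend: the pool holds 5ms, distribute hands out all
 * 5ms with cfs_b->lock dropped, and waking "slow" threads concurrently pull
 * another 3ms via assign_cfs_rq_runtime (pool now 2ms). lsub_positive(&pool,
 * 5ms) then clamps the pool to 0 instead of -3ms, so the 3ms overspend is
 * never charged against future periods.
 */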

In addition, if things fall out in a slightly different way, all the
"slow" threads can wind up getting on cpu and claiming slices of runtime
before the "fast" thread, and then it just has to wait another slack
period to hope that the ordering winds up better that time. This just
depends on things like IPI latency and maybe what order things happened
to happen at the start of the period.

Ugh. Maybe we /do/ just give up and say that most people don't seem to
be using cfs_b in a way that expiration of the leftover 1ms matters.

2019-07-15 15:48:25

by Dave Chiluk

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

On Fri, Jul 12, 2019 at 5:10 PM <[email protected]> wrote:
> Ugh. Maybe we /do/ just give up and say that most people don't seem to
> be using cfs_b in a way that expiration of the leftover 1ms matters.

That was my conclusion as well. Does this mean you want to proceed
with my patch set? Do you have any changes you want made to the
proposed documentation changes, or any other changes for that matter?

2019-07-16 19:58:47

by Benjamin Segall

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

Dave Chiluk <[email protected]> writes:

> It has been observed, that highly-threaded, non-cpu-bound applications
> running under cpu.cfs_quota_us constraints can hit a high percentage of
> periods throttled while simultaneously not consuming the allocated
> amount of quota. This use case is typical of user-interactive non-cpu
> bound applications, such as those running in kubernetes or mesos when
> run on multiple cpu cores.
>
> This has been root caused to threads being allocated per cpu bandwidth
> slices, and then not fully using that slice within the period. At which
> point the slice and quota expires. This expiration of unused slice
> results in applications not being able to utilize the quota for which
> they are allocated.
>
> The expiration of per-cpu slices was recently fixed by
> 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> condition")'. Prior to that it appears that this has been broken since
> at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
> added the following conditional which resulted in slices never being
> expired.
>
> if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> /* extend local deadline, drift is bounded above by 2 ticks */
> cfs_rq->runtime_expires += TICK_NSEC;
>
> Because this was broken for nearly 5 years, and has recently been fixed
> and is now being noticed by many users running kubernetes
> (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
> that the mechanisms around expiring runtime should be removed
> altogether.
>
> This allows quota already allocated to per-cpu run-queues to live longer
> than the period boundary. This allows threads on runqueues that do not
> use much CPU to continue to use their remaining slice over a longer
> period of time than cpu.cfs_period_us. However, this helps prevent the
> above condition of hitting throttling while also not fully utilizing
> your cpu quota.
>
> This theoretically allows a machine to use slightly more than its
> allotted quota in some periods. This overflow would be bounded by the
> remaining quota left on each per-cpu runqueue. This is typically no
> more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
> change nothing, as they should theoretically fully utilize all of their
> quota in each period. For user-interactive tasks as described above this
> provides a much better user/application experience as their cpu
> utilization will more closely match the amount they requested when they
> hit throttling. This means that cpu limits no longer strictly apply per
> period for non-cpu bound applications, but that they are still accurate
> over longer timeframes.
>
> This greatly improves performance of high-thread-count, non-cpu bound
> applications with low cfs_quota_us allocation on high-core-count
> machines. In the case of an artificial testcase (10ms/100ms of quota on
> 80 CPU machine), this commit resulted in almost 30x performance
> improvement, while still maintaining correct cpu quota restrictions.
> That testcase is available at https://github.com/indeedeng/fibtest.
>
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Signed-off-by: Dave Chiluk <[email protected]>
> ---
> Documentation/scheduler/sched-bwc.txt | 73 ++++++++++++++++++++++++++++-------
> kernel/sched/fair.c | 71 +++-------------------------------
> kernel/sched/sched.h | 4 --
> 3 files changed, 65 insertions(+), 83 deletions(-)
>
> [...]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f..a675c69 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> if (!runtime)
> return;
>
> - runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
> + runtime = distribute_cfs_runtime(cfs_b, runtime);
>
> raw_spin_lock_irqsave(&cfs_b->lock, flags);
> - if (expires == cfs_b->runtime_expires)
> - lsub_positive(&cfs_b->runtime, runtime);

The lsub_positive is still needed, just get rid of the if.
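
That is, the end of do_sched_cfs_slack_timer() would presumably become
(sketch, assuming the rest of the v5 patch as posted):

	runtime = distribute_cfs_runtime(cfs_b, runtime);

	raw_spin_lock_irqsave(&cfs_b->lock, flags);
	/* charge the distributed amount back to the global pool, clamped at 0 */
	lsub_positive(&cfs_b->runtime, runtime);
	cfs_b->distribute_running = 0;
	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);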


Other than that,
Reviewed-by: Ben Segall <[email protected]>

2019-07-24 02:25:32

by Dave Chiluk

[permalink] [raw]
Subject: [PATCH v6 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

Changelog v6
- Added back missing call to lsub_positive(&cfs_b->runtime, runtime);
- Added Reviewed-by: Ben Segall <[email protected]>
- Fix some grammar in the Documentation, and change some wording.
- Updated documentation due to the .rst change

Changelog v5
- Based on this comment from Ben Segall on v4
> If the cost of taking this global lock across all cpus without a
> ratelimit was somehow not a problem, I'd much prefer to just set
> min_cfs_rq_runtime = 0. (Assuming it is, I definitely prefer the "lie
> and sorta have 2x period 2x runtime" solution of removing expiration)
I'm resubmitting my v3 patchset, with the requested changes.
- Updated Commit log given review comments
- Update sched-bwc.txt given my new understanding of the slack timer.

Changelog v4
- Rewrote patchset around the concept of returning all of runtime_remaining
when cfs_b nears the end of available quota.

Changelog v3
- Reworked documentation to better describe behavior of slice expiration per
feedback from Peter Oskolkov

Changelog v2
- Fixed some checkpatch errors in the commit message.