2019-05-29 19:10:24

by Dave Chiluk

Subject: [PATCH v3 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

It has been observed that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive,
non-cpu-bound applications, such as those running in kubernetes or
mesos when run on multiple cpu cores.

This has been root caused to threads being allocated per-cpu bandwidth
slices, and then not fully using that slice within the period, at which
point the slice and quota expire. This expiration of unused slice
results in applications not being able to utilize the quota which they
have been allocated.
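
As an illustration (numbers for exposition only): consider
cpu.cfs_quota_us=10000 with the default 100ms period on an 8-cpu
machine, with one cpu-bound thread and seven threads that briefly wake,
do a little work, and sleep again. Each waking thread pulls a slice of
quota to its cpu and keeps min_cfs_rq_runtime (1ms) there when it
sleeps. Several ms of the 10ms global quota can thus end up stranded in
cpu-local silos, where it expires unused, while the cpu-bound thread is
throttled well before consuming the full 10ms.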

The expiration of per-cpu slices was recently fixed by
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that, it appears this had been broken since
at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")', which was introduced in v3.16-rc1 in 2014. That
commit added the following conditional, which resulted in slices never
being expired:

    if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
            /* extend local deadline, drift is bounded above by 2 ticks */
            cfs_rq->runtime_expires += TICK_NSEC;
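
(Note the inverted sense relative to the eventual fix: with this check
the local deadline was extended exactly when the global deadline had
advanced, i.e. when the slice should have been expired, so in practice
expiration never happened.)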

Because this was broken for nearly 5 years, and has only recently been
fixed and noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577), it is my opinion
that the mechanisms around expiring runtime should be removed
altogether.

This allows per-cpu slices, and only per-cpu slices, to live longer than
the period boundary. Threads on runqueues that do not use much CPU can
therefore continue to use their remaining slice over a longer period of
time than cpu.cfs_period_us. More importantly, this prevents the above
condition of hitting throttling while not fully utilizing the cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining per-cpu slice that was left unused in the previous period.
For CPU bound tasks this will change nothing, as they should
theoretically fully utilize all of their quota and slices in each
period. For user-interactive tasks as described above this provides a
much better user/application experience as their cpu utilization will
more closely match the amount they requested when they hit throttling.
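
The worst-case burst in a single period is bounded by roughly
nr_cpus * sched_cfs_bandwidth_slice_us of carried-over runtime (with the
default 5ms slice, about 40ms on an 8-cpu machine), and average usage
over successive periods still converges to the configured quota.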

This greatly improves performance of high-thread-count, non-cpu-bound
applications with low cfs_quota_us allocation on high-core-count
machines. With an artificial testcase, an almost 30x performance
improvement has been observed, while still maintaining correct cpu quota
restrictions, albeit over longer time intervals than cpu.cfs_period_us.
That testcase is available at https://github.com/indeedeng/fibtest.
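
For illustration, below is a minimal userspace sketch of the run/sleep
pattern that triggers this behavior. It is an approximation only, not
the fibtest source; the file name, thread count, and timing constants
are arbitrary. Run it inside a cgroup with a small cpu.cfs_quota_us and
watch nr_throttled in cpu.stat climb while usage stays well below quota:

/*
 * sketch.c - hypothetical reproducer sketch (not the fibtest source).
 * Each worker burns ~1ms of cpu, then sleeps, repeatedly pulling a
 * per-cpu bandwidth slice and leaving most of it stranded.
 * Build with: gcc -O2 -pthread sketch.c -o sketch
 */
#include <pthread.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static long elapsed_ns(const struct timespec *a, const struct timespec *b)
{
        return (b->tv_sec - a->tv_sec) * 1000000000L +
               (b->tv_nsec - a->tv_nsec);
}

static void *worker(void *arg)
{
        struct timespec start, now;

        (void)arg;
        for (;;) {
                /* ~1ms of spinning: enough to pull a cpu-local slice */
                clock_gettime(CLOCK_MONOTONIC, &start);
                do {
                        clock_gettime(CLOCK_MONOTONIC, &now);
                } while (elapsed_ns(&start, &now) < 1000000L);
                /* sleep, stranding the unused remainder of the slice */
                usleep(10000);
        }
        return NULL;
}

int main(int argc, char **argv)
{
        int i, nthreads = argc > 1 ? atoi(argv[1]) : 8;
        pthread_t tid;

        for (i = 0; i < nthreads; i++)
                pthread_create(&tid, NULL, worker, NULL);
        pause();        /* workers run forever; watch cpu.stat meanwhile */
        return 0;
}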

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <[email protected]>
---
Documentation/scheduler/sched-bwc.txt | 56 ++++++++++++++++++++++-----
kernel/sched/fair.c | 71 +++--------------------------------
kernel/sched/sched.h | 4 --
3 files changed, 53 insertions(+), 78 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
index f6b1873..260fd65 100644
--- a/Documentation/scheduler/sched-bwc.txt
+++ b/Documentation/scheduler/sched-bwc.txt
@@ -8,15 +8,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
specification of the maximum CPU bandwidth available to a group or hierarchy.

The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time. When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
-
-A group's unused runtime is globally tracked, being refreshed with quota units
-above at each period boundary. As threads consume this bandwidth it is
-transferred to cpu-local "silos" on a demand basis. The amount transferred
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time. That quota is assigned to per cpu run queues in
+slices as threads in the cgroup become runnable. Once all quota has been
+assigned, any additional requests for quota will result in those threads being
+throttled. Throttled threads will not be able to run again until the next
+period when the quota is replenished.
+
+A group's unassigned quota is globally tracked, being refreshed back to
+cfs_quota units at each period boundary. As threads consume this bandwidth it
+is transferred to cpu-local "silos" on a demand basis. The amount transferred
within each of these updates is tunable and described as the "slice".

Management
@@ -90,6 +91,43 @@ There are two ways in which a group may become throttled:
In case b) above, even though the child may have runtime remaining it will not
be allowed to until the parent's runtime is refreshed.

+Real-world behavior of slice non-expiration
+-------------------------------------------
+The fact that cpu-local slices do not expire results in some interesting corner
+cases that should be understood.
+
+For applications that are truly cpu-bound against their cgroup limit this is a
+relatively moot point because they will naturally consume the entirety of their
+quota as well as the entirety of each cpu-local slice in each period. As a
+result it is expected that nr_periods roughly equals nr_throttled, and that
+cpuacct.usage will increase by roughly cfs_quota_us in each period.
+
+However, in the worst case of highly-threaded, interactive/non-cpu-bound
+applications, this non-expiration nuance allows applications to briefly burst
+past their quota limits by the amount of unused slice on each cpu that the task
+group is running on. This slight burst requires that quota had been assigned
+and then not fully used in previous periods. This burst amount will not be
+transferred between cores. As a result, this mechanism still strictly limits
+the task group to quota average usage, albeit over a longer time window than
+period. This provides a better, more predictable user experience for highly
+threaded applications with small quota limits on high-core-count machines. It
+also eliminates the propensity to throttle these applications while
+simultaneously using less than quota amounts of cpu. Another way to say this
+is that by allowing the unused portion of a slice to remain valid across
+periods we have decreased the possibility of wasting quota on cpu-local silos
+that don't need a full slice's amount of cpu time.
+
+The interaction between cpu-bound and non-cpu-bound-interactive applications
+should also be considered, especially when single core usage hits 100%. If you
+gave each of these applications half of a cpu-core and they both got scheduled
+on the same CPU it is theoretically possible that the non-cpu bound application
+will use up to sched_cfs_bandwidth_slice_us additional quota in some periods,
+thereby preventing the cpu-bound application from fully using it's quota by
+that same amount. In these instances it will be up to the CFS algorithm (see
+sched-design-CFS.txt) to decide which application is chosen to run, as they
+will both be runnable and have remaining quota. This runtime discrepancy will
+should made up in the following periods when the interactive application idles.
+
Examples
--------
1. Limit a group to 1 CPU worth of runtime.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..a675c69 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)

now = sched_clock_cpu(smp_processor_id());
cfs_b->runtime = cfs_b->quota;
- cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
- cfs_b->expires_seq++;
}

static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
struct task_group *tg = cfs_rq->tg;
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
- u64 amount = 0, min_amount, expires;
- int expires_seq;
+ u64 amount = 0, min_amount;

/* note: this is a positive sum as runtime_remaining <= 0 */
min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
cfs_b->idle = 0;
}
}
- expires_seq = cfs_b->expires_seq;
- expires = cfs_b->runtime_expires;
raw_spin_unlock(&cfs_b->lock);

cfs_rq->runtime_remaining += amount;
- /*
- * we may have advanced our local expiration to account for allowed
- * spread between our sched_clock and the one on which runtime was
- * issued.
- */
- if (cfs_rq->expires_seq != expires_seq) {
- cfs_rq->expires_seq = expires_seq;
- cfs_rq->runtime_expires = expires;
- }

return cfs_rq->runtime_remaining > 0;
}

-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
- struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
- /* if the deadline is ahead of our clock, nothing to do */
- if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
- return;
-
- if (cfs_rq->runtime_remaining < 0)
- return;
-
- /*
- * If the local deadline has passed we have to consider the
- * possibility that our sched_clock is 'fast' and the global deadline
- * has not truly expired.
- *
- * Fortunately we can check determine whether this the case by checking
- * whether the global deadline(cfs_b->expires_seq) has advanced.
- */
- if (cfs_rq->expires_seq == cfs_b->expires_seq) {
- /* extend local deadline, drift is bounded above by 2 ticks */
- cfs_rq->runtime_expires += TICK_NSEC;
- } else {
- /* global deadline is ahead, expiration has passed */
- cfs_rq->runtime_remaining = 0;
- }
-}
-
static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
{
/* dock delta_exec before expiring quota (as it could span periods) */
cfs_rq->runtime_remaining -= delta_exec;
- expire_cfs_rq_runtime(cfs_rq);

if (likely(cfs_rq->runtime_remaining > 0))
return;
@@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
resched_curr(rq);
}

-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
- u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
{
struct cfs_rq *cfs_rq;
u64 runtime;
@@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
remaining -= runtime;

cfs_rq->runtime_remaining += runtime;
- cfs_rq->runtime_expires = expires;

/* we check whether we're throttled above */
if (cfs_rq->runtime_remaining > 0)
@@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
*/
static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
{
- u64 runtime, runtime_expires;
+ u64 runtime;
int throttled;

/* no need to continue the timer with no bandwidth constraint */
@@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
/* account preceding periods in which throttling occurred */
cfs_b->nr_throttled += overrun;

- runtime_expires = cfs_b->runtime_expires;
-
/*
* This check is repeated as we are holding onto the new bandwidth while
* we unthrottle. This can potentially race with an unthrottled group
@@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
cfs_b->distribute_running = 1;
raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
/* we can't nest cfs_b->lock while distributing bandwidth */
- runtime = distribute_cfs_runtime(cfs_b, runtime,
- runtime_expires);
+ runtime = distribute_cfs_runtime(cfs_b, runtime);
raw_spin_lock_irqsave(&cfs_b->lock, flags);

cfs_b->distribute_running = 0;
@@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
return;

raw_spin_lock(&cfs_b->lock);
- if (cfs_b->quota != RUNTIME_INF &&
- cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+ if (cfs_b->quota != RUNTIME_INF) {
cfs_b->runtime += slack_runtime;

/* we are under rq->lock, defer unthrottling using a timer */
@@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
{
u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
unsigned long flags;
- u64 expires;

/* confirm we're still not at a refresh boundary */
raw_spin_lock_irqsave(&cfs_b->lock, flags);
@@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
runtime = cfs_b->runtime;

- expires = cfs_b->runtime_expires;
if (runtime)
cfs_b->distribute_running = 1;

@@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
if (!runtime)
return;

- runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+ runtime = distribute_cfs_runtime(cfs_b, runtime);

raw_spin_lock_irqsave(&cfs_b->lock, flags);
- if (expires == cfs_b->runtime_expires)
- lsub_positive(&cfs_b->runtime, runtime);
cfs_b->distribute_running = 0;
raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
}
@@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)

cfs_b->period_active = 1;
overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
- cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
- cfs_b->expires_seq++;
hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..0c0ed23 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -341,8 +341,6 @@ struct cfs_bandwidth {
u64 quota;
u64 runtime;
s64 hierarchical_quota;
- u64 runtime_expires;
- int expires_seq;

short idle;
short period_active;
@@ -562,8 +560,6 @@ struct cfs_rq {

#ifdef CONFIG_CFS_BANDWIDTH
int runtime_enabled;
- int expires_seq;
- u64 runtime_expires;
s64 runtime_remaining;

u64 throttled_clock;
--
1.8.3.1


2019-05-29 19:30:46

by Phil Auld

Subject: Re: [PATCH v3 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

On Wed, May 29, 2019 at 02:08:46PM -0500 Dave Chiluk wrote:
[...]
> +The interaction between cpu-bound and non-cpu-bound-interactive applications
> +should also be considered, especially when single core usage hits 100%. If you
> +gave each of these applications half of a cpu-core and they both got scheduled
> +on the same CPU it is theoretically possible that the non-cpu bound application
> +will use up to sched_cfs_bandwidth_slice_us additional quota in some periods,
> +thereby preventing the cpu-bound application from fully using it's quota by


"its quota"


> +that same amount. In these instances it will be up to the CFS algorithm (see
> +sched-design-CFS.txt) to decide which application is chosen to run, as they
> +will both be runnable and have remaining quota. This runtime discrepancy will
> +should made up in the following periods when the interactive application idles.
> +


"discrepancy will be made" or "descrepancy should be made" but not both :)



Otherwise, fwiw,

Acked-by: Phil Auld <[email protected]>



Cheers,
Phil



2019-05-29 19:51:53

by Benjamin Segall

Subject: Re: [PATCH v3 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

Dave Chiluk <[email protected]> writes:

> It has been observed that highly-threaded, non-cpu-bound applications
> running under cpu.cfs_quota_us constraints can hit a high percentage of
> periods throttled while simultaneously not consuming the allocated
> amount of quota. This use case is typical of user-interactive,
> non-cpu-bound applications, such as those running in kubernetes or
> mesos when run on multiple cpu cores.
>
> This has been root caused to threads being allocated per-cpu bandwidth
> slices, and then not fully using that slice within the period, at which
> point the slice and quota expire. This expiration of unused slice
> results in applications not being able to utilize the quota which they
> have been allocated.
>

So most of the bandwidth is supposed to be returned, leaving only 1ms.
I'm setting up to try your test now, but there was a bug with the
unthrottle part of this where it could be continuously pushed into the
future. You might try this patch instead, possibly along with lowering
min_cfs_rq_runtime (currently not runtime tunable) if you see that
costing more than the cost of the extra locking to acquire runtime.





From: Ben Segall <[email protected]>
Date: Tue, 28 May 2019 12:30:25 -0700
Subject: [PATCH] sched/fair: don't push cfs_bandwidth slack timers forward

When a cfs_rq sleeps and returns its quota, we delay for 5ms before waking any
throttled cfs_rqs to coalesce with other cfs_rqs going to sleep, as this has
to be done outside of the rq lock we hold. The current code waits for 5ms
without any sleeps, instead of waiting for 5ms from the first sleep, which can
delay the unthrottle more than we want. Switch this around.

Change-Id: Ife536bb54af633863de486ba6ec0d20e55ada8c9
Signed-off-by: Ben Segall <[email protected]>
---
kernel/sched/fair.c | 7 +++++++
kernel/sched/sched.h | 1 +
2 files changed, 8 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8213ff6e365d..2ead252cfa32 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4729,6 +4729,11 @@ static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
if (runtime_refresh_within(cfs_b, min_left))
return;

+ /* don't push forwards an existing deferred unthrottle */
+ if (cfs_b->slack_started)
+ return;
+ cfs_b->slack_started = true;
+
hrtimer_start(&cfs_b->slack_timer,
ns_to_ktime(cfs_bandwidth_slack_period),
HRTIMER_MODE_REL);
@@ -4782,6 +4787,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)

/* confirm we're still not at a refresh boundary */
raw_spin_lock_irqsave(&cfs_b->lock, flags);
+ cfs_b->slack_started = false;
if (cfs_b->distribute_running) {
raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
return;
@@ -4920,6 +4926,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
cfs_b->slack_timer.function = sched_cfs_slack_timer;
cfs_b->distribute_running = 0;
+ cfs_b->slack_started = false;
}

static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index efa686eeff26..60219acda94b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -356,6 +356,7 @@ struct cfs_bandwidth {
u64 throttled_time;

bool distribute_running;
+ bool slack_started;
#endif
};

--
2.22.0.rc1.257.g3120a18244-goog

2019-05-29 21:08:01

by Benjamin Segall

Subject: Re: [PATCH v3 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

Dave Chiluk <[email protected]> writes:

> It has been observed that highly-threaded, non-cpu-bound applications
> running under cpu.cfs_quota_us constraints can hit a high percentage of
> periods throttled while simultaneously not consuming the allocated
> amount of quota. This use case is typical of user-interactive,
> non-cpu-bound applications, such as those running in kubernetes or
> mesos when run on multiple cpu cores.
>
> This has been root caused to threads being allocated per-cpu bandwidth
> slices, and then not fully using that slice within the period, at which
> point the slice and quota expire. This expiration of unused slice
> results in applications not being able to utilize the quota which they
> have been allocated.
>
> The expiration of per-cpu slices was recently fixed by
> 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> condition")'. Prior to that, it appears this had been broken since
> at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> cfs_b->quota/period")', which was introduced in v3.16-rc1 in 2014. That
> commit added the following conditional, which resulted in slices never
> being expired.


Yeah, having run the test, stranding only 1 ms per cpu rather than 5
doesn't help if you only have 10 ms of quota and even 10 threads/cpus.
The slack timer isn't important in this test, though I think it probably
should be changed.

Decreasing min_cfs_rq_runtime helps, but would mean that we have to pull
quota more often / always. The worst case here I think is where you
run/sleep for ~1ns, so you wind up taking the lock twice every
min_cfs_rq_runtime: once for assign and once to return all but min,
which you then use up doing short run/sleep. I suppose that determines
how much we care about this overhead at all.

Removing expiration means that in the worst case period and quota can be
effectively twice what the user specified, but only on very particular
workloads.

I think we should at least think about instead lowering
min_cfs_rq_runtime to some smaller value.

2019-05-30 17:56:26

by Dave Chiluk

Subject: Re: [PATCH v3 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

On Wed, May 29, 2019 at 02:05:55PM -0700, [email protected] wrote:
> Dave Chiluk <[email protected]> writes:
>
> Yeah, having run the test, stranding only 1 ms per cpu rather than 5
> doesn't help if you only have 10 ms of quota and even 10 threads/cpus.
> The slack timer isn't important in this test, though I think it probably
> should be changed.
My min_cfs_rq_runtime was already set to 1ms.

Additionally, raising the amount of quota from 10ms to 50ms or even
100ms still results in throttling without full quota usage.

> Decreasing min_cfs_rq_runtime helps, but would mean that we have to pull
> quota more often / always. The worst case here I think is where you
> run/sleep for ~1ns, so you wind up taking the lock twice every
> min_cfs_rq_runtime: once for assign and once to return all but min,
> which you then use up doing short run/sleep. I suppose that determines
> how much we care about this overhead at all.
I'm not so concerned about how inefficiently the user-space application
runs, as that's up to the individual developer. The fibtest testcase is
purely my approximation of what a java application with lots of worker
threads might do, as I didn't have a great deterministic java
reproducer, and I feared posting java to LKML. I'm more concerned with
the fact that the user requested 10ms/period or 100ms/period and they
hit throttling while simultaneously not seeing that amount of cpu usage.
i.e. on an 8 core machine if I
$ ./runfibtest 1
Iterations Completed(M): 1886
Throttled for: 51
CPU Usage (msecs) = 507
$ ./runfibtest 8
Iterations Completed(M): 1274
Throttled for: 52
CPU Usage (msecs) = 380

You see that in the 8-core case, where we have 7 do-nothing threads on
cpus 1-7, we see only 380 ms of usage and 52 periods of throttling,
when we should have received ~500ms of cpu usage.
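
(If, as these numbers suggest, the test ran for about 5 seconds at 10ms
of quota per 100ms period, then ~50 periods at 10ms each yields the
~500ms expectation, which also matches the ~51-52 throttled periods
reported above.)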

Looking more closely at the __return_cfs_rq_runtime logic I noticed

    if (cfs_b->quota != RUNTIME_INF &&
        cfs_rq->runtime_expires == cfs_b->runtime_expires) {

Which is awfully similar to the logic that was fixed by 512ac999. Is it
possible that we are just not ever returning runtime back to the cfs_b
because of the runtime_expires comparison here?

Early on in my testing I created a patch that would report via sysfs the
amount of quota that was expired at the end of a period in
expire_cfs_rq_runtime, and it roughly equalled the difference in quota
between the actual cpu usage and the allotted quota. That leads me to
believe that something is not going quite correctly in the slack return
logic __return_cfs_rq_runtime. I'll attach that patch at the bottom of
this e-mail in case you'd like to reproduce my tests.
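
(For reference: the patch below exposes the counter as
cpu.cfs_expired_runtime in the cgroup directory, reported in
microseconds, and writing 0 to it resets the accumulated value.)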

> Removing expiration means that in the worst case period and quota can be
> effectively twice what the user specified, but only on very particular
> workloads.
I'm only removing expiration of slices that have already been assigned
to individual cfs_rq. My understanding is that there is at most one
cfs_rq per cpu, and each of those can have at most one slice of
available runtime. So the worst-case burst is slice_ms * cpus. Please
help me understand how you get to twice the user-specified quota and
period, as it's not obvious to me (I've only been looking at this for a
few months).

> I think we should at least think about instead lowering
> min_cfs_rq_runtime to some smaller value
Do you mean lower than 1ms?

Thanks
Dave.

From: Dave Chiluk <[email protected]>
Date: Thu, 30 May 2019 12:47:12 -0500
Subject: [PATCH] Add expired_time sysfs entry

Signed-off-by: Dave Chiluk <[email protected]>
---
kernel/sched/core.c | 41 +++++++++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 4 ++++
kernel/sched/sched.h | 1 +
3 files changed, 46 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427..3c06df9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6655,6 +6655,30 @@ static long tg_get_cfs_period(struct task_group *tg)
return cfs_period_us;
}

+static int tg_set_cfs_expired_runtime(struct task_group *tg, long slice_expiration)
+{
+ struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+
+ if (slice_expiration == 0)
+ {
+ raw_spin_lock_irq(&cfs_b->lock);
+ cfs_b->expired_runtime = 0;
+ raw_spin_unlock_irq(&cfs_b->lock);
+ return 0;
+ }
+ return 1;
+}
+
+static u64 tg_get_cfs_expired_runtime(struct task_group *tg)
+{
+ u64 expired_runtime;
+
+ expired_runtime = tg->cfs_bandwidth.expired_runtime;
+ do_div(expired_runtime, NSEC_PER_USEC);
+
+ return expired_runtime;
+}
+
static s64 cpu_cfs_quota_read_s64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
@@ -6679,6 +6703,18 @@ static int cpu_cfs_period_write_u64(struct cgroup_subsys_state *css,
return tg_set_cfs_period(css_tg(css), cfs_period_us);
}

+static u64 cpu_cfs_expired_runtime_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return tg_get_cfs_expired_runtime(css_tg(css));
+}
+
+static int cpu_cfs_expired_runtime_write_s64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, s64 cfs_slice_expiration_us)
+{
+ return tg_set_cfs_expired_runtime(css_tg(css), cfs_slice_expiration_us);
+}
+
struct cfs_schedulable_data {
struct task_group *tg;
u64 period, quota;
@@ -6832,6 +6868,11 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
.write_u64 = cpu_cfs_period_write_u64,
},
{
+ .name = "cfs_expired_runtime",
+ .read_u64 = cpu_cfs_expired_runtime_read_u64,
+ .write_s64 = cpu_cfs_expired_runtime_write_s64,
+ },
+ {
.name = "stat",
.seq_show = cpu_cfs_stat_show,
},
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..bcbd4ef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4382,6 +4382,9 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
cfs_rq->runtime_expires += TICK_NSEC;
} else {
/* global deadline is ahead, expiration has passed */
+ raw_spin_lock_irq(&cfs_b->lock);
+ cfs_b->expired_runtime += cfs_rq->runtime_remaining;
+ raw_spin_unlock_irq(&cfs_b->lock);
cfs_rq->runtime_remaining = 0;
}
}
@@ -4943,6 +4946,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
cfs_b->runtime = 0;
cfs_b->quota = RUNTIME_INF;
cfs_b->period = ns_to_ktime(default_cfs_period());
+ cfs_b->expired_runtime = 0;

INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..499d2e2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -343,6 +343,7 @@ struct cfs_bandwidth {
s64 hierarchical_quota;
u64 runtime_expires;
int expires_seq;
+ u64 expired_runtime;

short idle;
short period_active;

--
1.8.3.1

2019-05-30 20:46:18

by Benjamin Segall

Subject: Re: [PATCH v3 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

Dave Chiluk <[email protected]> writes:

> On Wed, May 29, 2019 at 02:05:55PM -0700, [email protected] wrote:
>> Dave Chiluk <[email protected]> writes:
>>
>> Yeah, having run the test, stranding only 1 ms per cpu rather than 5
>> doesn't help if you only have 10 ms of quota and even 10 threads/cpus.
>> The slack timer isn't important in this test, though I think it probably
>> should be changed.
> My min_cfs_rq_runtime was already set to 1ms.

Yeah, I meant min_cfs_rq_runtime vs the 5ms if the slack stuff was
broken.

>
> Additionally, raising the amount of quota from 10ms to 50ms or even
> 100ms still results in throttling without full quota usage.
>
>> Decreasing min_cfs_rq_runtime helps, but would mean that we have to pull
>> quota more often / always. The worst case here I think is where you
>> run/sleep for ~1ns, so you wind up taking the lock twice every
>> min_cfs_rq_runtime: once for assign and once to return all but min,
>> which you then use up doing short run/sleep. I suppose that determines
>> how much we care about this overhead at all.
> I'm not so concerned about how inefficiently the user-space application
> runs, as that's up to the individual developer.

Increasing scheduler overhead, which is something we generally try to
prevent, is what I was worried about.

> The fibtest testcase is
> purely my approximation of what a java application with lots of worker
> threads might do, as I didn't have a great deterministic java
> reproducer, and I feared posting java to LKML. I'm more concerned with
> the fact that the user requested 10ms/period or 100ms/period and they
> hit throttling while simultaneously not seeing that amount of cpu usage.
> i.e. on an 8 core machine if I
> $ ./runfibtest 1
> Iterations Completed(M): 1886
> Throttled for: 51
> CPU Usage (msecs) = 507
> $ ./runfibtest 8
> Iterations Completed(M): 1274
> Throttled for: 52
> CPU Usage (msecs) = 380
>
> You see that in the 8-core case, where we have 7 do-nothing threads on
> cpus 1-7, we see only 380 ms of usage and 52 periods of throttling,
> when we should have received ~500ms of cpu usage.
>
> Looking more closely at the __return_cfs_rq_runtime logic I noticed
>
>     if (cfs_b->quota != RUNTIME_INF &&
>         cfs_rq->runtime_expires == cfs_b->runtime_expires) {
>
> Which is awfully similar to the logic that was fixed by 512ac999. Is it
> possible that we are just not ever returning runtime back to the cfs_b
> because of the runtime_expires comparison here?

The relevant issue that patch fixes is that the old conditional was
backwards. Also lowering min_cfs_rq_runtime to 0 fixes your testcase, so
it's working.

>
>> Removing expiration means that in the worst case period and quota can be
>> effectively twice what the user specified, but only on very particular
>> workloads.
> I'm only removing expiration of slices that have already been assigned
> to individual cfs_rq. My understanding is that there is at most one
> cfs_rq per cpu, and each of those can have at most one slice of
> available runtime. So the worst-case burst is slice_ms * cpus. Please
> help me understand how you get to twice the user-specified quota and
> period, as it's not obvious to me (I've only been looking at this for a
> few months).

The reason that this effect is so significant is because slice_ms * cpus
is roughly 100% of the quota (e.g. 1ms stranded on each of 10 cpus is the
entire 10ms quota). So yes, it's roughly the same thing.
Unfortunately, if there are more spare cpus on the system, just doubling
quota and period (keeping the same ratio) would not fix your issue,
while removing expiration does, though it also potentially has that
doubling effect.

>
>> I think we should at least think about instead lowering
>> min_cfs_rq_runtime to some smaller value
> Do you mean lower than 1ms?

Yes