2019-05-17 20:23:49

by Dave Chiluk

Subject: [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu slices

It has been observed that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to threads being allocated per-cpu bandwidth
slices, and then not fully using those slices within the period, at
which point the slice and the quota it represents expire. This
expiration of unused slices results in applications not being able to
utilize the quota they are allocated.
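
To make the failure mode concrete, here is a rough userspace model of
the accounting; this is not kernel code, the 5ms slice is merely the
sched_cfs_bandwidth_slice_us default, and the quota, cpu count and
per-thread work below are made-up numbers for illustration:

  /*
   * Back-of-the-envelope model of the problem, not kernel code.
   * Assumed numbers: 5ms slice, 50ms quota per period, and 8 worker
   * threads that each wake on their own cpu and do ~1ms of work.
   */
  #include <stdio.h>

  int main(void)
  {
          const unsigned int slice_us = 5000;   /* runtime pulled per cpu, on demand */
          const unsigned int quota_us = 50000;  /* cpu.cfs_quota_us                  */
          const unsigned int nr_cpus  = 8;      /* cpus the worker threads wake on   */
          const unsigned int work_us  = 1000;   /* work each thread actually does    */

          unsigned int pulled   = nr_cpus * slice_us;  /* 40ms leaves the global pool */
          unsigned int used     = nr_cpus * work_us;   /*  8ms of real work           */
          unsigned int stranded = pulled - used;       /* 32ms parked on local silos  */
          unsigned int pool     = quota_us - pulled;   /* 10ms left globally          */

          printf("%uus of %uus quota pulled into cpu-local silos, %uus used\n",
                 pulled, quota_us, used);
          printf("%uus stranded on silos, only %uus left in the global pool\n",
                 stranded, pool);
          return 0;
  }

With expiration, the stranded runtime is simply discarded at the period
boundary, so threads that keep running can only draw on what is left in
the global pool and are throttled even though the group has done far
less work than its quota allows; without expiration that stranded
runtime remains usable.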

The expiration of per-cpu slices was recently fixed by commit
512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition").
Prior to that it appears that this had been broken since at least commit
51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period"), which was introduced in v3.16-rc1 in 2014. That
commit added the following conditional, which resulted in slices never
being expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
/* extend local deadline, drift is bounded above by 2 ticks */
cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has only recently been
fixed, the effects are now being noticed by many users running
kubernetes (https://github.com/kubernetes/kubernetes/issues/67577). It
is my opinion that the mechanisms around expiring runtime should be
removed altogether.

This change allows only per-cpu slices to live longer than the period
boundary. Threads on runqueues that do not use much CPU can therefore
continue to use their remaining slice over a longer period of time than
cpu.cfs_period_us. In turn, this helps prevent the above condition of
hitting throttling while also not fully utilizing the cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining per-cpu slice that was left unused in the previous period.
For CPU-bound tasks this will change nothing, as they should
theoretically fully utilize all of their quota and slices in each
period. For the user-interactive tasks described above this provides a
much better user/application experience, as their cpu utilization will
more closely match the amount they requested when they hit throttling.
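
To give a feel for how large that overflow can get, here is a small
sketch with assumed numbers (8 cpus, the default 5ms slice, 100ms of
quota); the bound is simply one leftover slice per cpu at most:

  /* Worst-case per-period overrun sketch; all numbers are assumptions. */
  #include <stdio.h>

  int main(void)
  {
          const unsigned int slice_us = 5000;    /* default bandwidth slice     */
          const unsigned int nr_cpus  = 8;       /* cpus the task group runs on */
          const unsigned int quota_us = 100000;  /* cpu.cfs_quota_us            */

          /* at most one unused slice per cpu can carry over from the last period */
          unsigned int worst_burst = nr_cpus * slice_us;

          printf("worst-case extra runtime in a period: %uus (%u%% of quota)\n",
                 worst_burst, 100 * worst_burst / quota_us);
          return 0;
  }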

This greatly improves the performance of high-thread-count, non-cpu
bound applications with a low cfs_quota_us allocation on high-core-count
machines. With an artificial testcase, this has been observed to yield
almost a 30x performance improvement, while still maintaining correct
cpu quota restrictions, albeit over longer time intervals than
cpu.cfs_period_us. That testcase is available at
https://github.com/indeedeng/fibtest.
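
For anyone wanting to reproduce or observe this, the knobs involved are
the usual cgroup-v1 cpu controller files. A minimal sketch follows; it
assumes the controller is mounted at /sys/fs/cgroup/cpu and that a group
named "test" already exists, so adjust the paths for your setup:

  /* Minimal userspace sketch: set a quota/period and dump cpu.stat. */
  #include <stdio.h>

  static int write_val(const char *path, long val)
  {
          FILE *f = fopen(path, "w");

          if (!f)
                  return -1;
          fprintf(f, "%ld\n", val);
          return fclose(f);
  }

  int main(void)
  {
          /* 100ms period, 50ms quota: the group may use 0.5 cpu on average */
          write_val("/sys/fs/cgroup/cpu/test/cpu.cfs_period_us", 100000);
          write_val("/sys/fs/cgroup/cpu/test/cpu.cfs_quota_us", 50000);

          /* nr_periods/nr_throttled/throttled_time show how often the limit is hit */
          FILE *f = fopen("/sys/fs/cgroup/cpu/test/cpu.stat", "r");
          char line[128];

          if (f) {
                  while (fgets(line, sizeof(line), f))
                          fputs(line, stdout);
                  fclose(f);
          }
          return 0;
  }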

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <[email protected]>
---
Documentation/scheduler/sched-bwc.txt | 29 +++++++++++---
kernel/sched/fair.c | 71 +++--------------------------------
kernel/sched/sched.h | 4 --
3 files changed, 29 insertions(+), 75 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
index f6b1873..4ded8ae 100644
--- a/Documentation/scheduler/sched-bwc.txt
+++ b/Documentation/scheduler/sched-bwc.txt
@@ -8,16 +8,33 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
specification of the maximum CPU bandwidth available to a group or hierarchy.

The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time. When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time. When the CPU bandwidth consumption of a group
+exceeds this limit (for that period), the tasks belonging to its hierarchy will
+be throttled and are not allowed to run again until the next period.

A group's unused runtime is globally tracked, being refreshed with quota units
above at each period boundary. As threads consume this bandwidth it is
transferred to cpu-local "silos" on a demand basis. The amount transferred
-within each of these updates is tunable and described as the "slice".
+within each of these updates is tunable and described as the "slice". Slices
+that are allocated to cpu-local silos do not expire at the end of the period,
+but unallocated quota does. This doesn't affect cpu-bound applications as they
+by definition consume all of their bandwidth in each period.
+
+However, for highly-threaded, user-interactive/non-cpu bound applications this
+non-expiration nuance allows applications to burst past their quota limits by
+the amount of unused slice per cpu that the task group is running on.
+This slight burst requires that quota had gone unused in previous periods.
+Additionally, this burst amount is limited to the size of a slice for every cpu
+a task group is run on. As a result, this mechanism still strictly limits the
+task group to its quota on average over longer time windows. This provides a
+better, more predictable user experience for highly threaded applications with
+small quota limits on high core count machines. It also eliminates the
+propensity to throttle these applications while simultaneously using less than
+quota amounts of cpu. Another way to say this is that by allowing the unused
+portion of a slice to be used in following periods we have decreased the
+possibility of wasting unused quota on cpu-local silos that don't need much cpu
+time.

Management
----------
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..a675c69 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)

now = sched_clock_cpu(smp_processor_id());
cfs_b->runtime = cfs_b->quota;
- cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
- cfs_b->expires_seq++;
}

static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
struct task_group *tg = cfs_rq->tg;
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
- u64 amount = 0, min_amount, expires;
- int expires_seq;
+ u64 amount = 0, min_amount;

/* note: this is a positive sum as runtime_remaining <= 0 */
min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
cfs_b->idle = 0;
}
}
- expires_seq = cfs_b->expires_seq;
- expires = cfs_b->runtime_expires;
raw_spin_unlock(&cfs_b->lock);

cfs_rq->runtime_remaining += amount;
- /*
- * we may have advanced our local expiration to account for allowed
- * spread between our sched_clock and the one on which runtime was
- * issued.
- */
- if (cfs_rq->expires_seq != expires_seq) {
- cfs_rq->expires_seq = expires_seq;
- cfs_rq->runtime_expires = expires;
- }

return cfs_rq->runtime_remaining > 0;
}

-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
- struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
- /* if the deadline is ahead of our clock, nothing to do */
- if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
- return;
-
- if (cfs_rq->runtime_remaining < 0)
- return;
-
- /*
- * If the local deadline has passed we have to consider the
- * possibility that our sched_clock is 'fast' and the global deadline
- * has not truly expired.
- *
- * Fortunately we can check determine whether this the case by checking
- * whether the global deadline(cfs_b->expires_seq) has advanced.
- */
- if (cfs_rq->expires_seq == cfs_b->expires_seq) {
- /* extend local deadline, drift is bounded above by 2 ticks */
- cfs_rq->runtime_expires += TICK_NSEC;
- } else {
- /* global deadline is ahead, expiration has passed */
- cfs_rq->runtime_remaining = 0;
- }
-}
-
static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
{
/* dock delta_exec before expiring quota (as it could span periods) */
cfs_rq->runtime_remaining -= delta_exec;
- expire_cfs_rq_runtime(cfs_rq);

if (likely(cfs_rq->runtime_remaining > 0))
return;
@@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
resched_curr(rq);
}

-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
- u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
{
struct cfs_rq *cfs_rq;
u64 runtime;
@@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
remaining -= runtime;

cfs_rq->runtime_remaining += runtime;
- cfs_rq->runtime_expires = expires;

/* we check whether we're throttled above */
if (cfs_rq->runtime_remaining > 0)
@@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
*/
static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
{
- u64 runtime, runtime_expires;
+ u64 runtime;
int throttled;

/* no need to continue the timer with no bandwidth constraint */
@@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
/* account preceding periods in which throttling occurred */
cfs_b->nr_throttled += overrun;

- runtime_expires = cfs_b->runtime_expires;
-
/*
* This check is repeated as we are holding onto the new bandwidth while
* we unthrottle. This can potentially race with an unthrottled group
@@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
cfs_b->distribute_running = 1;
raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
/* we can't nest cfs_b->lock while distributing bandwidth */
- runtime = distribute_cfs_runtime(cfs_b, runtime,
- runtime_expires);
+ runtime = distribute_cfs_runtime(cfs_b, runtime);
raw_spin_lock_irqsave(&cfs_b->lock, flags);

cfs_b->distribute_running = 0;
@@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
return;

raw_spin_lock(&cfs_b->lock);
- if (cfs_b->quota != RUNTIME_INF &&
- cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+ if (cfs_b->quota != RUNTIME_INF) {
cfs_b->runtime += slack_runtime;

/* we are under rq->lock, defer unthrottling using a timer */
@@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
{
u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
unsigned long flags;
- u64 expires;

/* confirm we're still not at a refresh boundary */
raw_spin_lock_irqsave(&cfs_b->lock, flags);
@@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
runtime = cfs_b->runtime;

- expires = cfs_b->runtime_expires;
if (runtime)
cfs_b->distribute_running = 1;

@@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
if (!runtime)
return;

- runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+ runtime = distribute_cfs_runtime(cfs_b, runtime);

raw_spin_lock_irqsave(&cfs_b->lock, flags);
- if (expires == cfs_b->runtime_expires)
- lsub_positive(&cfs_b->runtime, runtime);
cfs_b->distribute_running = 0;
raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
}
@@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)

cfs_b->period_active = 1;
overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
- cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
- cfs_b->expires_seq++;
hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..0c0ed23 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -341,8 +341,6 @@ struct cfs_bandwidth {
u64 quota;
u64 runtime;
s64 hierarchical_quota;
- u64 runtime_expires;
- int expires_seq;

short idle;
short period_active;
@@ -562,8 +560,6 @@ struct cfs_rq {

#ifdef CONFIG_CFS_BANDWIDTH
int runtime_enabled;
- int expires_seq;
- u64 runtime_expires;
s64 runtime_remaining;

u64 throttled_clock;
--
1.8.3.1


2019-05-23 18:46:16

by Dave Chiluk

Subject: [PATCH v2 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

Changelog v2
- Fixed some checkpatch errors in the commit message.

2019-05-23 18:47:29

by Dave Chiluk

Subject: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

It has been observed that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to threads being allocated per-cpu bandwidth
slices, and then not fully using those slices within the period, at
which point the slice and the quota it represents expire. This
expiration of unused slices results in applications not being able to
utilize the quota they are allocated.

The expiration of per-cpu slices was recently fixed by
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this had been broken since
at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")', which was introduced in v3.16-rc1 in 2014. That
commit added the following conditional, which resulted in slices never
being expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
/* extend local deadline, drift is bounded above by 2 ticks */
cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has only recently been
fixed, the effects are now being noticed by many users running
kubernetes (https://github.com/kubernetes/kubernetes/issues/67577). It
is my opinion that the mechanisms around expiring runtime should be
removed altogether.

This change allows only per-cpu slices to live longer than the period
boundary. Threads on runqueues that do not use much CPU can therefore
continue to use their remaining slice over a longer period of time than
cpu.cfs_period_us. In turn, this helps prevent the above condition of
hitting throttling while also not fully utilizing the cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining per-cpu slice that was left unused in the previous period.
For CPU-bound tasks this will change nothing, as they should
theoretically fully utilize all of their quota and slices in each
period. For the user-interactive tasks described above this provides a
much better user/application experience, as their cpu utilization will
more closely match the amount they requested when they hit throttling.

This greatly improves the performance of high-thread-count, non-cpu
bound applications with a low cfs_quota_us allocation on high-core-count
machines. With an artificial testcase, this has been observed to yield
almost a 30x performance improvement, while still maintaining correct
cpu quota restrictions, albeit over longer time intervals than
cpu.cfs_period_us. That testcase is available at
https://github.com/indeedeng/fibtest.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <[email protected]>
---
Documentation/scheduler/sched-bwc.txt | 29 +++++++++++---
kernel/sched/fair.c | 71 +++--------------------------------
kernel/sched/sched.h | 4 --
3 files changed, 29 insertions(+), 75 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
index f6b1873..4ded8ae 100644
--- a/Documentation/scheduler/sched-bwc.txt
+++ b/Documentation/scheduler/sched-bwc.txt
@@ -8,16 +8,33 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
specification of the maximum CPU bandwidth available to a group or hierarchy.

The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time. When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time. When the CPU bandwidth consumption of a group
+exceeds this limit (for that period), the tasks belonging to its hierarchy will
+be throttled and are not allowed to run again until the next period.

A group's unused runtime is globally tracked, being refreshed with quota units
above at each period boundary. As threads consume this bandwidth it is
transferred to cpu-local "silos" on a demand basis. The amount transferred
-within each of these updates is tunable and described as the "slice".
+within each of these updates is tunable and described as the "slice". Slices
+that are allocated to cpu-local silos do not expire at the end of the period,
+but unallocated quota does. This doesn't affect cpu-bound applications as they
+by definition consume all of their bandwidth in each period.
+
+However, for highly-threaded, user-interactive/non-cpu bound applications this
+non-expiration nuance allows applications to burst past their quota limits by
+the amount of unused slice per cpu that the task group is running on.
+This slight burst requires that quota had gone unused in previous periods.
+Additionally, this burst amount is limited to the size of a slice for every cpu
+a task group is run on. As a result, this mechanism still strictly limits the
+task group to its quota on average over longer time windows. This provides a
+better, more predictable user experience for highly threaded applications with
+small quota limits on high core count machines. It also eliminates the
+propensity to throttle these applications while simultaneously using less than
+quota amounts of cpu. Another way to say this is that by allowing the unused
+portion of a slice to be used in following periods we have decreased the
+possibility of wasting unused quota on cpu-local silos that don't need much cpu
+time.

Management
----------
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..a675c69 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)

now = sched_clock_cpu(smp_processor_id());
cfs_b->runtime = cfs_b->quota;
- cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
- cfs_b->expires_seq++;
}

static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
struct task_group *tg = cfs_rq->tg;
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
- u64 amount = 0, min_amount, expires;
- int expires_seq;
+ u64 amount = 0, min_amount;

/* note: this is a positive sum as runtime_remaining <= 0 */
min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
cfs_b->idle = 0;
}
}
- expires_seq = cfs_b->expires_seq;
- expires = cfs_b->runtime_expires;
raw_spin_unlock(&cfs_b->lock);

cfs_rq->runtime_remaining += amount;
- /*
- * we may have advanced our local expiration to account for allowed
- * spread between our sched_clock and the one on which runtime was
- * issued.
- */
- if (cfs_rq->expires_seq != expires_seq) {
- cfs_rq->expires_seq = expires_seq;
- cfs_rq->runtime_expires = expires;
- }

return cfs_rq->runtime_remaining > 0;
}

-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
- struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
- /* if the deadline is ahead of our clock, nothing to do */
- if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
- return;
-
- if (cfs_rq->runtime_remaining < 0)
- return;
-
- /*
- * If the local deadline has passed we have to consider the
- * possibility that our sched_clock is 'fast' and the global deadline
- * has not truly expired.
- *
- * Fortunately we can check determine whether this the case by checking
- * whether the global deadline(cfs_b->expires_seq) has advanced.
- */
- if (cfs_rq->expires_seq == cfs_b->expires_seq) {
- /* extend local deadline, drift is bounded above by 2 ticks */
- cfs_rq->runtime_expires += TICK_NSEC;
- } else {
- /* global deadline is ahead, expiration has passed */
- cfs_rq->runtime_remaining = 0;
- }
-}
-
static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
{
/* dock delta_exec before expiring quota (as it could span periods) */
cfs_rq->runtime_remaining -= delta_exec;
- expire_cfs_rq_runtime(cfs_rq);

if (likely(cfs_rq->runtime_remaining > 0))
return;
@@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
resched_curr(rq);
}

-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
- u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
{
struct cfs_rq *cfs_rq;
u64 runtime;
@@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
remaining -= runtime;

cfs_rq->runtime_remaining += runtime;
- cfs_rq->runtime_expires = expires;

/* we check whether we're throttled above */
if (cfs_rq->runtime_remaining > 0)
@@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
*/
static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
{
- u64 runtime, runtime_expires;
+ u64 runtime;
int throttled;

/* no need to continue the timer with no bandwidth constraint */
@@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
/* account preceding periods in which throttling occurred */
cfs_b->nr_throttled += overrun;

- runtime_expires = cfs_b->runtime_expires;
-
/*
* This check is repeated as we are holding onto the new bandwidth while
* we unthrottle. This can potentially race with an unthrottled group
@@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
cfs_b->distribute_running = 1;
raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
/* we can't nest cfs_b->lock while distributing bandwidth */
- runtime = distribute_cfs_runtime(cfs_b, runtime,
- runtime_expires);
+ runtime = distribute_cfs_runtime(cfs_b, runtime);
raw_spin_lock_irqsave(&cfs_b->lock, flags);

cfs_b->distribute_running = 0;
@@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
return;

raw_spin_lock(&cfs_b->lock);
- if (cfs_b->quota != RUNTIME_INF &&
- cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+ if (cfs_b->quota != RUNTIME_INF) {
cfs_b->runtime += slack_runtime;

/* we are under rq->lock, defer unthrottling using a timer */
@@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
{
u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
unsigned long flags;
- u64 expires;

/* confirm we're still not at a refresh boundary */
raw_spin_lock_irqsave(&cfs_b->lock, flags);
@@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
runtime = cfs_b->runtime;

- expires = cfs_b->runtime_expires;
if (runtime)
cfs_b->distribute_running = 1;

@@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
if (!runtime)
return;

- runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+ runtime = distribute_cfs_runtime(cfs_b, runtime);

raw_spin_lock_irqsave(&cfs_b->lock, flags);
- if (expires == cfs_b->runtime_expires)
- lsub_positive(&cfs_b->runtime, runtime);
cfs_b->distribute_running = 0;
raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
}
@@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)

cfs_b->period_active = 1;
overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
- cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
- cfs_b->expires_seq++;
hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..0c0ed23 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -341,8 +341,6 @@ struct cfs_bandwidth {
u64 quota;
u64 runtime;
s64 hierarchical_quota;
- u64 runtime_expires;
- int expires_seq;

short idle;
short period_active;
@@ -562,8 +560,6 @@ struct cfs_rq {

#ifdef CONFIG_CFS_BANDWIDTH
int runtime_enabled;
- int expires_seq;
- u64 runtime_expires;
s64 runtime_remaining;

u64 throttled_clock;
--
1.8.3.1

2019-05-23 21:04:02

by Peter Oskolkov

Subject: Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

On Thu, May 23, 2019 at 11:44 AM Dave Chiluk <[email protected]> wrote:
>
> It has been observed, that highly-threaded, non-cpu-bound applications
> running under cpu.cfs_quota_us constraints can hit a high percentage of
> periods throttled while simultaneously not consuming the allocated
> amount of quota. This use case is typical of user-interactive non-cpu
> bound applications, such as those running in kubernetes or mesos when
> run on multiple cpu cores.
>
> This has been root caused to threads being allocated per cpu bandwidth
> slices, and then not fully using that slice within the period. At which
> point the slice and quota expires. This expiration of unused slice
> results in applications not being able to utilize the quota for which
> they are allocated.
>
> The expiration of per-cpu slices was recently fixed by
> 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> condition")'. Prior to that it appears that this has been broken since
> at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
> added the following conditional which resulted in slices never being
> expired.
>
> if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> /* extend local deadline, drift is bounded above by 2 ticks */
> cfs_rq->runtime_expires += TICK_NSEC;
>
> Because this was broken for nearly 5 years, and has recently been fixed
> and is now being noticed by many users running kubernetes
> (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
> that the mechanisms around expiring runtime should be removed
> altogether.
>
> This allows only per-cpu slices to live longer than the period boundary.
> This allows threads on runqueues that do not use much CPU to continue to
> use their remaining slice over a longer period of time than
> cpu.cfs_period_us. However, this helps prevents the above condition of
> hitting throttling while also not fully utilizing your cpu quota.
>
> This theoretically allows a machine to use slightly more than it's
> allotted quota in some periods. This overflow would be bounded by the
> remaining per-cpu slice that was left un-used in the previous period.
> For CPU bound tasks this will change nothing, as they should
> theoretically fully utilize all of their quota and slices in each
> period. For user-interactive tasks as described above this provides a
> much better user/application experience as their cpu utilization will
> more closely match the amount they requested when they hit throttling.
>
> This greatly improves performance of high-thread-count, non-cpu bound
> applications with low cfs_quota_us allocation on high-core-count
> machines. In the case of an artificial testcase, this performance
> discrepancy has been observed to be almost 30x performance improvement,
> while still maintaining correct cpu quota restrictions albeit over
> longer time intervals than cpu.cfs_period_us.

If the machine runs at/close to capacity, won't the overallocation
of the quota to bursty tasks necessarily negatively impact every other
task? Should the "unused" quota be available only on idle CPUs?
(Or maybe this is the behavior achieved here, and only the comment and
the commit message should be fixed...)

> That testcase is
> available at https://github.com/indeedeng/fibtest.
>
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Signed-off-by: Dave Chiluk <[email protected]>
> ---
> Documentation/scheduler/sched-bwc.txt | 29 +++++++++++---
> kernel/sched/fair.c | 71 +++--------------------------------
> kernel/sched/sched.h | 4 --
> 3 files changed, 29 insertions(+), 75 deletions(-)
>
> diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
> index f6b1873..4ded8ae 100644
> --- a/Documentation/scheduler/sched-bwc.txt
> +++ b/Documentation/scheduler/sched-bwc.txt
> @@ -8,16 +8,33 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
> specification of the maximum CPU bandwidth available to a group or hierarchy.
>
> The bandwidth allowed for a group is specified using a quota and period. Within
> -each given "period" (microseconds), a group is allowed to consume only up to
> -"quota" microseconds of CPU time. When the CPU bandwidth consumption of a
> -group exceeds this limit (for that period), the tasks belonging to its
> -hierarchy will be throttled and are not allowed to run again until the next
> -period.
> +each given "period" (microseconds), a task group is allocated up to "quota"
> +microseconds of CPU time. When the CPU bandwidth consumption of a group
> +exceeds this limit (for that period), the tasks belonging to its hierarchy will
> +be throttled and are not allowed to run again until the next period.
>
> A group's unused runtime is globally tracked, being refreshed with quota units
> above at each period boundary. As threads consume this bandwidth it is
> transferred to cpu-local "silos" on a demand basis. The amount transferred
> -within each of these updates is tunable and described as the "slice".
> +within each of these updates is tunable and described as the "slice". Slices
> +that are allocated to cpu-local silos do not expire at the end of the period,
> +but unallocated quota does. This doesn't affect cpu-bound applications as they
> +by definition consume all of their bandwidth in each each period.
> +
> +However for highly-threaded user-interactive/non-cpu bound applications this
> +non-expiration nuance allows applications to burst past their quota limits
> +equal to the amount of unused slice per cpu that the task group is running on.
> +This slight burst requires that quota had gone unused in previous periods.
> +Additionally this burst amount is limited to the size of a slice for every cpu
> +a task group is run on. As a result, this mechanism still strictly limits the
> +task group to quota average usage over a longer time windows. This provides
> +better more predictable user experience for highly threaded applications with
> +small quota limits on high core count machines. It also eliminates the
> +propensity to throttle these applications while simultanously using less than
> +quota amounts of cpu. Another way to say this, is that by allowing the unused
> +portion of a slice to be used in following periods we have decreased the
> +possibility of wasting unused quota on cpu-local silos that don't need much cpu
> +time.
>
> Management
> ----------
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f..a675c69 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>
> now = sched_clock_cpu(smp_processor_id());
> cfs_b->runtime = cfs_b->quota;
> - cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
> - cfs_b->expires_seq++;
> }
>
> static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> @@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> {
> struct task_group *tg = cfs_rq->tg;
> struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> - u64 amount = 0, min_amount, expires;
> - int expires_seq;
> + u64 amount = 0, min_amount;
>
> /* note: this is a positive sum as runtime_remaining <= 0 */
> min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
> @@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> cfs_b->idle = 0;
> }
> }
> - expires_seq = cfs_b->expires_seq;
> - expires = cfs_b->runtime_expires;
> raw_spin_unlock(&cfs_b->lock);
>
> cfs_rq->runtime_remaining += amount;
> - /*
> - * we may have advanced our local expiration to account for allowed
> - * spread between our sched_clock and the one on which runtime was
> - * issued.
> - */
> - if (cfs_rq->expires_seq != expires_seq) {
> - cfs_rq->expires_seq = expires_seq;
> - cfs_rq->runtime_expires = expires;
> - }
>
> return cfs_rq->runtime_remaining > 0;
> }
>
> -/*
> - * Note: This depends on the synchronization provided by sched_clock and the
> - * fact that rq->clock snapshots this value.
> - */
> -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> -{
> - struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> -
> - /* if the deadline is ahead of our clock, nothing to do */
> - if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
> - return;
> -
> - if (cfs_rq->runtime_remaining < 0)
> - return;
> -
> - /*
> - * If the local deadline has passed we have to consider the
> - * possibility that our sched_clock is 'fast' and the global deadline
> - * has not truly expired.
> - *
> - * Fortunately we can check determine whether this the case by checking
> - * whether the global deadline(cfs_b->expires_seq) has advanced.
> - */
> - if (cfs_rq->expires_seq == cfs_b->expires_seq) {
> - /* extend local deadline, drift is bounded above by 2 ticks */
> - cfs_rq->runtime_expires += TICK_NSEC;
> - } else {
> - /* global deadline is ahead, expiration has passed */
> - cfs_rq->runtime_remaining = 0;
> - }
> -}
> -
> static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
> {
> /* dock delta_exec before expiring quota (as it could span periods) */
> cfs_rq->runtime_remaining -= delta_exec;
> - expire_cfs_rq_runtime(cfs_rq);
>
> if (likely(cfs_rq->runtime_remaining > 0))
> return;
> @@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> resched_curr(rq);
> }
>
> -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> - u64 remaining, u64 expires)
> +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
> {
> struct cfs_rq *cfs_rq;
> u64 runtime;
> @@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> remaining -= runtime;
>
> cfs_rq->runtime_remaining += runtime;
> - cfs_rq->runtime_expires = expires;
>
> /* we check whether we're throttled above */
> if (cfs_rq->runtime_remaining > 0)
> @@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> */
> static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
> {
> - u64 runtime, runtime_expires;
> + u64 runtime;
> int throttled;
>
> /* no need to continue the timer with no bandwidth constraint */
> @@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
> /* account preceding periods in which throttling occurred */
> cfs_b->nr_throttled += overrun;
>
> - runtime_expires = cfs_b->runtime_expires;
> -
> /*
> * This check is repeated as we are holding onto the new bandwidth while
> * we unthrottle. This can potentially race with an unthrottled group
> @@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
> cfs_b->distribute_running = 1;
> raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> /* we can't nest cfs_b->lock while distributing bandwidth */
> - runtime = distribute_cfs_runtime(cfs_b, runtime,
> - runtime_expires);
> + runtime = distribute_cfs_runtime(cfs_b, runtime);
> raw_spin_lock_irqsave(&cfs_b->lock, flags);
>
> cfs_b->distribute_running = 0;
> @@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> return;
>
> raw_spin_lock(&cfs_b->lock);
> - if (cfs_b->quota != RUNTIME_INF &&
> - cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> + if (cfs_b->quota != RUNTIME_INF) {
> cfs_b->runtime += slack_runtime;
>
> /* we are under rq->lock, defer unthrottling using a timer */
> @@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> {
> u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
> unsigned long flags;
> - u64 expires;
>
> /* confirm we're still not at a refresh boundary */
> raw_spin_lock_irqsave(&cfs_b->lock, flags);
> @@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
> runtime = cfs_b->runtime;
>
> - expires = cfs_b->runtime_expires;
> if (runtime)
> cfs_b->distribute_running = 1;
>
> @@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> if (!runtime)
> return;
>
> - runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
> + runtime = distribute_cfs_runtime(cfs_b, runtime);
>
> raw_spin_lock_irqsave(&cfs_b->lock, flags);
> - if (expires == cfs_b->runtime_expires)
> - lsub_positive(&cfs_b->runtime, runtime);
> cfs_b->distribute_running = 0;
> raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> }
> @@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>
> cfs_b->period_active = 1;
> overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> - cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
> - cfs_b->expires_seq++;
> hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b52ed1a..0c0ed23 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -341,8 +341,6 @@ struct cfs_bandwidth {
> u64 quota;
> u64 runtime;
> s64 hierarchical_quota;
> - u64 runtime_expires;
> - int expires_seq;
>
> short idle;
> short period_active;
> @@ -562,8 +560,6 @@ struct cfs_rq {
>
> #ifdef CONFIG_CFS_BANDWIDTH
> int runtime_enabled;
> - int expires_seq;
> - u64 runtime_expires;
> s64 runtime_remaining;
>
> u64 throttled_clock;
> --
> 1.8.3.1
>

2019-05-24 08:59:49

by Peter Zijlstra

Subject: Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices


(it always helps to Cc the people who actually wrote the code)

Ben, can you have a look at this?

On Thu, May 23, 2019 at 01:44:47PM -0500, Dave Chiluk wrote:
> It has been observed, that highly-threaded, non-cpu-bound applications
> running under cpu.cfs_quota_us constraints can hit a high percentage of
> periods throttled while simultaneously not consuming the allocated
> amount of quota. This use case is typical of user-interactive non-cpu
> bound applications, such as those running in kubernetes or mesos when
> run on multiple cpu cores.
>
> This has been root caused to threads being allocated per cpu bandwidth
> slices, and then not fully using that slice within the period. At which
> point the slice and quota expires. This expiration of unused slice
> results in applications not being able to utilize the quota for which
> they are allocated.
>
> The expiration of per-cpu slices was recently fixed by
> 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> condition")'. Prior to that it appears that this has been broken since
> at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
> added the following conditional which resulted in slices never being
> expired.
>
> if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> /* extend local deadline, drift is bounded above by 2 ticks */
> cfs_rq->runtime_expires += TICK_NSEC;
>
> Because this was broken for nearly 5 years, and has recently been fixed
> and is now being noticed by many users running kubernetes
> (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
> that the mechanisms around expiring runtime should be removed
> altogether.
>
> This allows only per-cpu slices to live longer than the period boundary.
> This allows threads on runqueues that do not use much CPU to continue to
> use their remaining slice over a longer period of time than
> cpu.cfs_period_us. However, this helps prevents the above condition of
> hitting throttling while also not fully utilizing your cpu quota.
>
> This theoretically allows a machine to use slightly more than it's
> allotted quota in some periods. This overflow would be bounded by the
> remaining per-cpu slice that was left un-used in the previous period.
> For CPU bound tasks this will change nothing, as they should
> theoretically fully utilize all of their quota and slices in each
> period. For user-interactive tasks as described above this provides a
> much better user/application experience as their cpu utilization will
> more closely match the amount they requested when they hit throttling.
>
> This greatly improves performance of high-thread-count, non-cpu bound
> applications with low cfs_quota_us allocation on high-core-count
> machines. In the case of an artificial testcase, this performance
> discrepancy has been observed to be almost 30x performance improvement,
> while still maintaining correct cpu quota restrictions albeit over
> longer time intervals than cpu.cfs_period_us. That testcase is
> available at https://github.com/indeedeng/fibtest.
>
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Signed-off-by: Dave Chiluk <[email protected]>
> ---
> Documentation/scheduler/sched-bwc.txt | 29 +++++++++++---
> kernel/sched/fair.c | 71 +++--------------------------------
> kernel/sched/sched.h | 4 --
> 3 files changed, 29 insertions(+), 75 deletions(-)
>
> diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
> index f6b1873..4ded8ae 100644
> --- a/Documentation/scheduler/sched-bwc.txt
> +++ b/Documentation/scheduler/sched-bwc.txt
> @@ -8,16 +8,33 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
> specification of the maximum CPU bandwidth available to a group or hierarchy.
>
> The bandwidth allowed for a group is specified using a quota and period. Within
> -each given "period" (microseconds), a group is allowed to consume only up to
> -"quota" microseconds of CPU time. When the CPU bandwidth consumption of a
> -group exceeds this limit (for that period), the tasks belonging to its
> -hierarchy will be throttled and are not allowed to run again until the next
> -period.
> +each given "period" (microseconds), a task group is allocated up to "quota"
> +microseconds of CPU time. When the CPU bandwidth consumption of a group
> +exceeds this limit (for that period), the tasks belonging to its hierarchy will
> +be throttled and are not allowed to run again until the next period.
>
> A group's unused runtime is globally tracked, being refreshed with quota units
> above at each period boundary. As threads consume this bandwidth it is
> transferred to cpu-local "silos" on a demand basis. The amount transferred
> -within each of these updates is tunable and described as the "slice".
> +within each of these updates is tunable and described as the "slice". Slices
> +that are allocated to cpu-local silos do not expire at the end of the period,
> +but unallocated quota does. This doesn't affect cpu-bound applications as they
> +by definition consume all of their bandwidth in each each period.
> +
> +However for highly-threaded user-interactive/non-cpu bound applications this
> +non-expiration nuance allows applications to burst past their quota limits
> +equal to the amount of unused slice per cpu that the task group is running on.
> +This slight burst requires that quota had gone unused in previous periods.
> +Additionally this burst amount is limited to the size of a slice for every cpu
> +a task group is run on. As a result, this mechanism still strictly limits the
> +task group to quota average usage over a longer time windows. This provides
> +better more predictable user experience for highly threaded applications with
> +small quota limits on high core count machines. It also eliminates the
> +propensity to throttle these applications while simultanously using less than
> +quota amounts of cpu. Another way to say this, is that by allowing the unused
> +portion of a slice to be used in following periods we have decreased the
> +possibility of wasting unused quota on cpu-local silos that don't need much cpu
> +time.
>
> Management
> ----------
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f..a675c69 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>
> now = sched_clock_cpu(smp_processor_id());
> cfs_b->runtime = cfs_b->quota;
> - cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
> - cfs_b->expires_seq++;
> }
>
> static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> @@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> {
> struct task_group *tg = cfs_rq->tg;
> struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> - u64 amount = 0, min_amount, expires;
> - int expires_seq;
> + u64 amount = 0, min_amount;
>
> /* note: this is a positive sum as runtime_remaining <= 0 */
> min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
> @@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> cfs_b->idle = 0;
> }
> }
> - expires_seq = cfs_b->expires_seq;
> - expires = cfs_b->runtime_expires;
> raw_spin_unlock(&cfs_b->lock);
>
> cfs_rq->runtime_remaining += amount;
> - /*
> - * we may have advanced our local expiration to account for allowed
> - * spread between our sched_clock and the one on which runtime was
> - * issued.
> - */
> - if (cfs_rq->expires_seq != expires_seq) {
> - cfs_rq->expires_seq = expires_seq;
> - cfs_rq->runtime_expires = expires;
> - }
>
> return cfs_rq->runtime_remaining > 0;
> }
>
> -/*
> - * Note: This depends on the synchronization provided by sched_clock and the
> - * fact that rq->clock snapshots this value.
> - */
> -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> -{
> - struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> -
> - /* if the deadline is ahead of our clock, nothing to do */
> - if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
> - return;
> -
> - if (cfs_rq->runtime_remaining < 0)
> - return;
> -
> - /*
> - * If the local deadline has passed we have to consider the
> - * possibility that our sched_clock is 'fast' and the global deadline
> - * has not truly expired.
> - *
> - * Fortunately we can check determine whether this the case by checking
> - * whether the global deadline(cfs_b->expires_seq) has advanced.
> - */
> - if (cfs_rq->expires_seq == cfs_b->expires_seq) {
> - /* extend local deadline, drift is bounded above by 2 ticks */
> - cfs_rq->runtime_expires += TICK_NSEC;
> - } else {
> - /* global deadline is ahead, expiration has passed */
> - cfs_rq->runtime_remaining = 0;
> - }
> -}
> -
> static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
> {
> /* dock delta_exec before expiring quota (as it could span periods) */
> cfs_rq->runtime_remaining -= delta_exec;
> - expire_cfs_rq_runtime(cfs_rq);
>
> if (likely(cfs_rq->runtime_remaining > 0))
> return;
> @@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> resched_curr(rq);
> }
>
> -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> - u64 remaining, u64 expires)
> +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
> {
> struct cfs_rq *cfs_rq;
> u64 runtime;
> @@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> remaining -= runtime;
>
> cfs_rq->runtime_remaining += runtime;
> - cfs_rq->runtime_expires = expires;
>
> /* we check whether we're throttled above */
> if (cfs_rq->runtime_remaining > 0)
> @@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> */
> static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
> {
> - u64 runtime, runtime_expires;
> + u64 runtime;
> int throttled;
>
> /* no need to continue the timer with no bandwidth constraint */
> @@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
> /* account preceding periods in which throttling occurred */
> cfs_b->nr_throttled += overrun;
>
> - runtime_expires = cfs_b->runtime_expires;
> -
> /*
> * This check is repeated as we are holding onto the new bandwidth while
> * we unthrottle. This can potentially race with an unthrottled group
> @@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
> cfs_b->distribute_running = 1;
> raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> /* we can't nest cfs_b->lock while distributing bandwidth */
> - runtime = distribute_cfs_runtime(cfs_b, runtime,
> - runtime_expires);
> + runtime = distribute_cfs_runtime(cfs_b, runtime);
> raw_spin_lock_irqsave(&cfs_b->lock, flags);
>
> cfs_b->distribute_running = 0;
> @@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> return;
>
> raw_spin_lock(&cfs_b->lock);
> - if (cfs_b->quota != RUNTIME_INF &&
> - cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> + if (cfs_b->quota != RUNTIME_INF) {
> cfs_b->runtime += slack_runtime;
>
> /* we are under rq->lock, defer unthrottling using a timer */
> @@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> {
> u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
> unsigned long flags;
> - u64 expires;
>
> /* confirm we're still not at a refresh boundary */
> raw_spin_lock_irqsave(&cfs_b->lock, flags);
> @@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
> runtime = cfs_b->runtime;
>
> - expires = cfs_b->runtime_expires;
> if (runtime)
> cfs_b->distribute_running = 1;
>
> @@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> if (!runtime)
> return;
>
> - runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
> + runtime = distribute_cfs_runtime(cfs_b, runtime);
>
> raw_spin_lock_irqsave(&cfs_b->lock, flags);
> - if (expires == cfs_b->runtime_expires)
> - lsub_positive(&cfs_b->runtime, runtime);
> cfs_b->distribute_running = 0;
> raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> }
> @@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>
> cfs_b->period_active = 1;
> overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> - cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
> - cfs_b->expires_seq++;
> hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b52ed1a..0c0ed23 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -341,8 +341,6 @@ struct cfs_bandwidth {
> u64 quota;
> u64 runtime;
> s64 hierarchical_quota;
> - u64 runtime_expires;
> - int expires_seq;
>
> short idle;
> short period_active;
> @@ -562,8 +560,6 @@ struct cfs_rq {
>
> #ifdef CONFIG_CFS_BANDWIDTH
> int runtime_enabled;
> - int expires_seq;
> - u64 runtime_expires;
> s64 runtime_remaining;
>
> u64 throttled_clock;
> --
> 1.8.3.1
>

2019-05-24 14:35:14

by Phil Auld

Subject: Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:
> On Thu, May 23, 2019 at 11:44 AM Dave Chiluk <[email protected]> wrote:
> >
> > It has been observed, that highly-threaded, non-cpu-bound applications
> > running under cpu.cfs_quota_us constraints can hit a high percentage of
> > periods throttled while simultaneously not consuming the allocated
> > amount of quota. This use case is typical of user-interactive non-cpu
> > bound applications, such as those running in kubernetes or mesos when
> > run on multiple cpu cores.
> >
> > This has been root caused to threads being allocated per cpu bandwidth
> > slices, and then not fully using that slice within the period. At which
> > point the slice and quota expires. This expiration of unused slice
> > results in applications not being able to utilize the quota for which
> > they are allocated.
> >
> > The expiration of per-cpu slices was recently fixed by
> > 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> > condition")'. Prior to that it appears that this has been broken since
> > at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> > cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
> > added the following conditional which resulted in slices never being
> > expired.
> >
> > if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> > /* extend local deadline, drift is bounded above by 2 ticks */
> > cfs_rq->runtime_expires += TICK_NSEC;
> >
> > Because this was broken for nearly 5 years, and has recently been fixed
> > and is now being noticed by many users running kubernetes
> > (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
> > that the mechanisms around expiring runtime should be removed
> > altogether.
> >
> > This allows only per-cpu slices to live longer than the period boundary.
> > This allows threads on runqueues that do not use much CPU to continue to
> > use their remaining slice over a longer period of time than
> > cpu.cfs_period_us. However, this helps prevents the above condition of
> > hitting throttling while also not fully utilizing your cpu quota.
> >
> > This theoretically allows a machine to use slightly more than it's
> > allotted quota in some periods. This overflow would be bounded by the
> > remaining per-cpu slice that was left un-used in the previous period.
> > For CPU bound tasks this will change nothing, as they should
> > theoretically fully utilize all of their quota and slices in each
> > period. For user-interactive tasks as described above this provides a
> > much better user/application experience as their cpu utilization will
> > more closely match the amount they requested when they hit throttling.
> >
> > This greatly improves performance of high-thread-count, non-cpu bound
> > applications with low cfs_quota_us allocation on high-core-count
> > machines. In the case of an artificial testcase, this performance
> > discrepancy has been observed to be almost 30x performance improvement,
> > while still maintaining correct cpu quota restrictions albeit over
> > longer time intervals than cpu.cfs_period_us.
>
> If the machine runs at/close to capacity, won't the overallocation
> of the quota to bursty tasks necessarily negatively impact every other
> task? Should the "unused" quota be available only on idle CPUs?
> (Or maybe this is the behavior achieved here, and only the comment and
> the commit message should be fixed...)
>

It's bounded by the amount left unused from the previous period. So
theoretically a process could use almost twice its quota. But then it
would have nothing left over in the next period. To repeat that burst it
would have to not use any quota in that next period. Over a longer number
of periods it's the same amount of CPU usage.
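
A tiny simulation of that carry-over argument, treating (as above)
whatever went unused as carried over on the cpu-local silos; all the
numbers here are assumptions:

  #include <stdio.h>

  int main(void)
  {
          const unsigned int quota_us = 10000;   /* per-period quota */
          unsigned int carry = 0, total = 0, periods = 8;

          for (unsigned int p = 0; p < periods; p++) {
                  /* odd periods are busy and also drain the carry-over */
                  unsigned int want  = (p & 1) ? quota_us + carry : 2000;
                  unsigned int avail = quota_us + carry;
                  unsigned int used  = want < avail ? want : avail;

                  carry  = avail - used;  /* leftover usable next period */
                  total += used;
                  printf("period %u: used %5uus (carry now %uus)\n", p, used, carry);
          }
          printf("average per period: %uus vs quota %uus\n",
                 total / periods, quota_us);
          return 0;
  }

The busy periods run over quota, but only by what the idle periods left
behind, so the long-run average still converges to the quota.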

I think that is more fair than throttling a process that has never used
its full quota.

And it removes complexity.

Cheers,
Phil

> > That testcase is
> > available at https://github.com/indeedeng/fibtest.
> >
> > Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> > Signed-off-by: Dave Chiluk <[email protected]>
> > ---
> > Documentation/scheduler/sched-bwc.txt | 29 +++++++++++---
> > kernel/sched/fair.c | 71 +++--------------------------------
> > kernel/sched/sched.h | 4 --
> > 3 files changed, 29 insertions(+), 75 deletions(-)
> >
> > diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
> > index f6b1873..4ded8ae 100644
> > --- a/Documentation/scheduler/sched-bwc.txt
> > +++ b/Documentation/scheduler/sched-bwc.txt
> > @@ -8,16 +8,33 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
> > specification of the maximum CPU bandwidth available to a group or hierarchy.
> >
> > The bandwidth allowed for a group is specified using a quota and period. Within
> > -each given "period" (microseconds), a group is allowed to consume only up to
> > -"quota" microseconds of CPU time. When the CPU bandwidth consumption of a
> > -group exceeds this limit (for that period), the tasks belonging to its
> > -hierarchy will be throttled and are not allowed to run again until the next
> > -period.
> > +each given "period" (microseconds), a task group is allocated up to "quota"
> > +microseconds of CPU time. When the CPU bandwidth consumption of a group
> > +exceeds this limit (for that period), the tasks belonging to its hierarchy will
> > +be throttled and are not allowed to run again until the next period.
> >
> > A group's unused runtime is globally tracked, being refreshed with quota units
> > above at each period boundary. As threads consume this bandwidth it is
> > transferred to cpu-local "silos" on a demand basis. The amount transferred
> > -within each of these updates is tunable and described as the "slice".
> > +within each of these updates is tunable and described as the "slice". Slices
> > +that are allocated to cpu-local silos do not expire at the end of the period,
> > +but unallocated quota does. This doesn't affect cpu-bound applications as they
> > +by definition consume all of their bandwidth in each each period.
> > +
> > +However for highly-threaded user-interactive/non-cpu bound applications this
> > +non-expiration nuance allows applications to burst past their quota limits
> > +equal to the amount of unused slice per cpu that the task group is running on.
> > +This slight burst requires that quota had gone unused in previous periods.
> > +Additionally this burst amount is limited to the size of a slice for every cpu
> > +a task group is run on. As a result, this mechanism still strictly limits the
> > +task group to quota average usage over a longer time windows. This provides
> > +better more predictable user experience for highly threaded applications with
> > +small quota limits on high core count machines. It also eliminates the
> > +propensity to throttle these applications while simultanously using less than
> > +quota amounts of cpu. Another way to say this, is that by allowing the unused
> > +portion of a slice to be used in following periods we have decreased the
> > +possibility of wasting unused quota on cpu-local silos that don't need much cpu
> > +time.
> >
> > Management
> > ----------
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index f35930f..a675c69 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
> >
> > now = sched_clock_cpu(smp_processor_id());
> > cfs_b->runtime = cfs_b->quota;
> > - cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
> > - cfs_b->expires_seq++;
> > }
> >
> > static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> > @@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> > {
> > struct task_group *tg = cfs_rq->tg;
> > struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> > - u64 amount = 0, min_amount, expires;
> > - int expires_seq;
> > + u64 amount = 0, min_amount;
> >
> > /* note: this is a positive sum as runtime_remaining <= 0 */
> > min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
> > @@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> > cfs_b->idle = 0;
> > }
> > }
> > - expires_seq = cfs_b->expires_seq;
> > - expires = cfs_b->runtime_expires;
> > raw_spin_unlock(&cfs_b->lock);
> >
> > cfs_rq->runtime_remaining += amount;
> > - /*
> > - * we may have advanced our local expiration to account for allowed
> > - * spread between our sched_clock and the one on which runtime was
> > - * issued.
> > - */
> > - if (cfs_rq->expires_seq != expires_seq) {
> > - cfs_rq->expires_seq = expires_seq;
> > - cfs_rq->runtime_expires = expires;
> > - }
> >
> > return cfs_rq->runtime_remaining > 0;
> > }
> >
> > -/*
> > - * Note: This depends on the synchronization provided by sched_clock and the
> > - * fact that rq->clock snapshots this value.
> > - */
> > -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> > -{
> > - struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> > -
> > - /* if the deadline is ahead of our clock, nothing to do */
> > - if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
> > - return;
> > -
> > - if (cfs_rq->runtime_remaining < 0)
> > - return;
> > -
> > - /*
> > - * If the local deadline has passed we have to consider the
> > - * possibility that our sched_clock is 'fast' and the global deadline
> > - * has not truly expired.
> > - *
> > - * Fortunately we can check determine whether this the case by checking
> > - * whether the global deadline(cfs_b->expires_seq) has advanced.
> > - */
> > - if (cfs_rq->expires_seq == cfs_b->expires_seq) {
> > - /* extend local deadline, drift is bounded above by 2 ticks */
> > - cfs_rq->runtime_expires += TICK_NSEC;
> > - } else {
> > - /* global deadline is ahead, expiration has passed */
> > - cfs_rq->runtime_remaining = 0;
> > - }
> > -}
> > -
> > static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
> > {
> > /* dock delta_exec before expiring quota (as it could span periods) */
> > cfs_rq->runtime_remaining -= delta_exec;
> > - expire_cfs_rq_runtime(cfs_rq);
> >
> > if (likely(cfs_rq->runtime_remaining > 0))
> > return;
> > @@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> > resched_curr(rq);
> > }
> >
> > -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> > - u64 remaining, u64 expires)
> > +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
> > {
> > struct cfs_rq *cfs_rq;
> > u64 runtime;
> > @@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> > remaining -= runtime;
> >
> > cfs_rq->runtime_remaining += runtime;
> > - cfs_rq->runtime_expires = expires;
> >
> > /* we check whether we're throttled above */
> > if (cfs_rq->runtime_remaining > 0)
> > @@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> > */
> > static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
> > {
> > - u64 runtime, runtime_expires;
> > + u64 runtime;
> > int throttled;
> >
> > /* no need to continue the timer with no bandwidth constraint */
> > @@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
> > /* account preceding periods in which throttling occurred */
> > cfs_b->nr_throttled += overrun;
> >
> > - runtime_expires = cfs_b->runtime_expires;
> > -
> > /*
> > * This check is repeated as we are holding onto the new bandwidth while
> > * we unthrottle. This can potentially race with an unthrottled group
> > @@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
> > cfs_b->distribute_running = 1;
> > raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> > /* we can't nest cfs_b->lock while distributing bandwidth */
> > - runtime = distribute_cfs_runtime(cfs_b, runtime,
> > - runtime_expires);
> > + runtime = distribute_cfs_runtime(cfs_b, runtime);
> > raw_spin_lock_irqsave(&cfs_b->lock, flags);
> >
> > cfs_b->distribute_running = 0;
> > @@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> > return;
> >
> > raw_spin_lock(&cfs_b->lock);
> > - if (cfs_b->quota != RUNTIME_INF &&
> > - cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> > + if (cfs_b->quota != RUNTIME_INF) {
> > cfs_b->runtime += slack_runtime;
> >
> > /* we are under rq->lock, defer unthrottling using a timer */
> > @@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> > {
> > u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
> > unsigned long flags;
> > - u64 expires;
> >
> > /* confirm we're still not at a refresh boundary */
> > raw_spin_lock_irqsave(&cfs_b->lock, flags);
> > @@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> > if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
> > runtime = cfs_b->runtime;
> >
> > - expires = cfs_b->runtime_expires;
> > if (runtime)
> > cfs_b->distribute_running = 1;
> >
> > @@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> > if (!runtime)
> > return;
> >
> > - runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
> > + runtime = distribute_cfs_runtime(cfs_b, runtime);
> >
> > raw_spin_lock_irqsave(&cfs_b->lock, flags);
> > - if (expires == cfs_b->runtime_expires)
> > - lsub_positive(&cfs_b->runtime, runtime);
> > cfs_b->distribute_running = 0;
> > raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> > }
> > @@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> >
> > cfs_b->period_active = 1;
> > overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> > - cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
> > - cfs_b->expires_seq++;
> > hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> > }
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index b52ed1a..0c0ed23 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -341,8 +341,6 @@ struct cfs_bandwidth {
> > u64 quota;
> > u64 runtime;
> > s64 hierarchical_quota;
> > - u64 runtime_expires;
> > - int expires_seq;
> >
> > short idle;
> > short period_active;
> > @@ -562,8 +560,6 @@ struct cfs_rq {
> >
> > #ifdef CONFIG_CFS_BANDWIDTH
> > int runtime_enabled;
> > - int expires_seq;
> > - u64 runtime_expires;
> > s64 runtime_remaining;
> >
> > u64 throttled_clock;
> > --
> > 1.8.3.1
> >

--

2019-05-24 15:18:10

by Dave Chiluk

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

On Fri, May 24, 2019 at 9:32 AM Phil Auld <[email protected]> wrote:
> On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:

> > If the machine runs at/close to capacity, won't the overallocation
> > of the quota to bursty tasks necessarily negatively impact every other
> > task? Should the "unused" quota be available only on idle CPUs?
> > (Or maybe this is the behavior achieved here, and only the comment and
> > the commit message should be fixed...)
> >
>
> It's bounded by the amount left unused from the previous period. So
> theoretically a process could use almost twice its quota. But then it
> would have nothing left over in the next period. To repeat it would have
> to not use any that next period. Over a longer number of periods it's the
> same amount of CPU usage.
>
> I think that is more fair than throttling a process that has never used
> its full quota.
>
> And it removes complexity.
>
> Cheers,
> Phil

Actually it's not even that bad. The overallocation of quota to a
bursty task in a period is limited to at most one slice per cpu, and
that slice must not have been used in the previous periods. The slice
size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and
defaults to 5ms. If a bursty task goes from underutilizing quota to
using its entire quota, it will not be able to burst in the
subsequent periods. Therefore in an absolute worst case contrived
scenario, a bursty task can add at most 5ms to the latency of other
threads on the same CPU. I think this worst case 5ms tradeoff is
entirely worth it.

This does mean that, theoretically, a poorly written massively
threaded application on an 80 core box that spreads itself onto 80
cpu run queues can overutilize its quota in a single period by at
most 5ms * 80 CPUs (slice * number of runqueues the application is
running on). But that means that each of those threads would have had
to leave its slice unused in a previous period, and it also means
that the application would have to be carefully written to exacerbate
this behavior.

Additionally, if cpu bound threads underutilize a slice of their
quota in a period due to CFS choosing a bursty task to run, they
should theoretically be able to make it up in the following periods
when the bursty task is unable to "burst".

Please be careful here: quota and slices are being treated
differently. Quota does not roll over between periods; only slices of
quota that have already been allocated to per-cpu run queues do. If
you allocate 100ms of quota per period to an application, but it only
spreads onto 3 cpu run queues, that means it can in the worst case use
an extra 3 x slice size = 15ms in periods following underutilization.
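
Spelling out the arithmetic of that worst case (slice_us hardcodes the
5ms default mentioned above, and worst_case_burst_us() is just an
illustrative helper, not a kernel function):

#include <stdio.h>

/* worst case: one previously unused slice carried per cpu runqueue */
static long worst_case_burst_us(long slice_us, int nr_runqueues)
{
        return slice_us * nr_runqueues;
}

int main(void)
{
        const long slice_us = 5000;  /* default sched_cfs_bandwidth_slice_us */

        printf("80 runqueues: up to %ld us of one-time burst\n",
               worst_case_burst_us(slice_us, 80));
        printf(" 3 runqueues: up to %ld us of one-time burst\n",
               worst_case_burst_us(slice_us, 3));
        return 0;
}

That prints 400000 us (the 5ms * 80 CPUs case) and 15000 us (the
3 runqueue case), matching the numbers above.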

So why does this matter? Well, applications that use thread pools
(*cough* java *cough*) with lots of tiny little worker threads tend to
spread themselves out onto a lot of run queues. These worker threads
grab quota slices in order to run, then rarely use all of their slice
(1 or 2ms out of the 5ms). This results in those worker threads
starving the main application of quota, and then expiring the
remainder of that quota slice on the per-cpu runqueue. Going back to
my earlier 100ms quota / 80 cpu example: that means only
100ms / cfs_bandwidth_slice_us (5ms) = 20 slices are available in a
period, so only 20 out of these 80 cpus ever get a slice allocated to
them. By allowing these per-cpu run queues to use their remaining
slice in following periods, these worker threads do not need to be
allocated additional slices, and thereby the main threads are actually
able to use the allocated cpu quota.
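
The slice-starvation arithmetic from that example, written down
(quota_us, slice_us and nr_cpus are just the numbers quoted above;
nothing is read from the kernel):

#include <stdio.h>

int main(void)
{
        const long quota_us = 100000;  /* cpu.cfs_quota_us = 100ms */
        const long slice_us = 5000;    /* bandwidth slice  =   5ms */
        const int nr_cpus = 80;

        long slices_per_period = quota_us / slice_us;

        printf("%ld slices available per period\n", slices_per_period);
        printf("only %ld of %d cpus can get a slice in any one period\n",
               slices_per_period, nr_cpus);
        return 0;
}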

This can be experienced by running fibtest available at
https://github.com/indeedeng/fibtest/.
$ runfibtest 1
runs a single fast thread taskset to cpu 0
$ runfibtest 8
Runs a single fast thread taskset to cpu 0, and 7 slow threads taskset
to cpus 1-7. This run is expected to show fewer iterations, but the
worse problem is that the cpu usage is far less than the 500ms that it
should have received.
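
If you want to eyeball the symptom while fibtest runs, something like
the following is enough to dump the throttling counters and total
usage. The cgroup v1 mount point and the group name "fibtest" are
assumptions here; adjust both for your system:

#include <stdio.h>

static void dump(const char *path)
{
        char buf[256];
        FILE *f = fopen(path, "r");

        if (!f) {
                perror(path);
                return;
        }
        printf("%s:\n", path);
        while (fgets(buf, sizeof(buf), f))
                printf("  %s", buf);
        fclose(f);
}

int main(void)
{
        /* hypothetical cgroup v1 paths -- adjust mount and group name */
        dump("/sys/fs/cgroup/cpu,cpuacct/fibtest/cpu.stat");
        dump("/sys/fs/cgroup/cpu,cpuacct/fibtest/cpuacct.usage");
        return 0;
}

A low cpuacct.usage together with a high nr_throttled count in
cpu.stat is the signature of the problem being described here.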

Thanks for the engagement on this,
Dave Chiluk

2019-05-24 16:00:48

by Phil Auld

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

On Fri, May 24, 2019 at 10:14:36AM -0500 Dave Chiluk wrote:
> On Fri, May 24, 2019 at 9:32 AM Phil Auld <[email protected]> wrote:
> > On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:
>
> > > If the machine runs at/close to capacity, won't the overallocation
> > > of the quota to bursty tasks necessarily negatively impact every other
> > > task? Should the "unused" quota be available only on idle CPUs?
> > > (Or maybe this is the behavior achieved here, and only the comment and
> > > the commit message should be fixed...)
> > >
> >
> > It's bounded by the amount left unused from the previous period. So
> > theoretically a process could use almost twice its quota. But then it
> > would have nothing left over in the next period. To repeat it would have
> > to not use any that next period. Over a longer number of periods it's the
> > same amount of CPU usage.
> >
> > I think that is more fair than throttling a process that has never used
> > its full quota.
> >
> > And it removes complexity.
> >
> > Cheers,
> > Phil
>
> Actually it's not even that bad. The overallocation of quota to a
> bursty task in a period is limited to at most one slice per cpu, and
> that slice must not have been used in the previous periods. The slice
> size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and
> defaults to 5ms. If a bursty task goes from underutilizing quota to
> using it's entire quota, it will not be able to burst in the
> subsequent periods. Therefore in an absolute worst case contrived
> scenario, a bursty task can add at most 5ms to the latency of other
> threads on the same CPU. I think this worst case 5ms tradeoff is
> entirely worth it.
>
> This does mean that a theoretically a poorly written massively
> threaded application on an 80 core box, that spreads itself onto 80
> cpu run queues, can overutilize it's quota in a period by at most 5ms
> * 80 CPUs in a sincle period (slice * number of runqueues the
> application is running on). But that means that each of those threads
> would have had to not be use their quota in a previous period, and it
> also means that the application would have to be carefully written to
> exacerbate this behavior.
>
> Additionally if cpu bound threads underutilize a slice of their quota
> in a period due to the cfs choosing a bursty task to run, they should
> theoretically be able to make it up in the following periods when the
> bursty task is unable to "burst".
>
> Please be careful here quota and slice are being treated differently.
> Quota does not roll-over between periods, only slices of quota that
> has already been allocated to per cpu run queues. If you allocate
> 100ms of quota per period to an application, but it only spreads onto
> 3 cpu run queues that means it can in the worst case use 3 x slice
> size = 15ms in periods following underutilization.
>
> So why does this matter. Well applications that use thread pools
> *(*cough* java *cough*) with lots of tiny little worker threads, tend
> to spread themselves out onto a lot of run queues. These worker
> threads grab quota slices in order to run, then rarely use all of
> their slice (1 or 2ms out of the 5ms). This results in those worker
> threads starving the main application of quota, and then expiring the
> remainder of that quota slice on the per-cpu. Going back to my
> earlier 100ms quota / 80 cpu example. That means only
> 100ms/cfs_bandwidth_slice_us(5ms) = 20 slices are available in a
> period. So only 20 out of these 80 cpus ever get a slice allocated to
> them. By allowing these per-cpu run queues to use their remaining
> slice in following periods these worker threads do not need to be
> allocated additional slice, and thereby the main threads are actually
> able to use the allocated cpu quota.
>
> This can be experienced by running fibtest available at
> https://github.com/indeedeng/fibtest/.
> $ runfibtest 1
> runs a single fast thread taskset to cpu 0
> $ runfibtest 8
> Runs a single fast thread taskset to cpu 0, and 7 slow threads taskset
> to cpus 1-7. This run is expected to show less iterations, but the
> worse problem is that the cpu usage is far less than the 500ms that it
> should have received.
>
> Thanks for the engagement on this,
> Dave Chiluk

Thanks for the clarification. This is an even better explanation.

Fwiw, I ran some of my cfs throttling tests with this, none of which are
designed to measure or hit this particular issue. They are more focused
on starvation and hard lockups that I've hit. But I did not see any issues
or oddities with this patch.

Cheers,
Phil

--

2019-05-24 16:31:21

by Peter Oskolkov

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

On Fri, May 24, 2019 at 8:15 AM Dave Chiluk <[email protected]> wrote:
>
> On Fri, May 24, 2019 at 9:32 AM Phil Auld <[email protected]> wrote:
> > On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:
>
> > > If the machine runs at/close to capacity, won't the overallocation
> > > of the quota to bursty tasks necessarily negatively impact every other
> > > task? Should the "unused" quota be available only on idle CPUs?
> > > (Or maybe this is the behavior achieved here, and only the comment and
> > > the commit message should be fixed...)
> > >
> >
> > It's bounded by the amount left unused from the previous period. So
> > theoretically a process could use almost twice its quota. But then it
> > would have nothing left over in the next period. To repeat it would have
> > to not use any that next period. Over a longer number of periods it's the
> > same amount of CPU usage.
> >
> > I think that is more fair than throttling a process that has never used
> > its full quota.
> >
> > And it removes complexity.
> >
> > Cheers,
> > Phil
>
> Actually it's not even that bad. The overallocation of quota to a
> bursty task in a period is limited to at most one slice per cpu, and
> that slice must not have been used in the previous periods. The slice
> size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and
> defaults to 5ms. If a bursty task goes from underutilizing quota to
> using it's entire quota, it will not be able to burst in the
> subsequent periods. Therefore in an absolute worst case contrived
> scenario, a bursty task can add at most 5ms to the latency of other
> threads on the same CPU. I think this worst case 5ms tradeoff is
> entirely worth it.
>
> This does mean that a theoretically a poorly written massively
> threaded application on an 80 core box, that spreads itself onto 80
> cpu run queues, can overutilize it's quota in a period by at most 5ms
> * 80 CPUs in a sincle period (slice * number of runqueues the
> application is running on). But that means that each of those threads
> would have had to not be use their quota in a previous period, and it
> also means that the application would have to be carefully written to
> exacerbate this behavior.
>
> Additionally if cpu bound threads underutilize a slice of their quota
> in a period due to the cfs choosing a bursty task to run, they should
> theoretically be able to make it up in the following periods when the
> bursty task is unable to "burst".

OK, so it is indeed possible that CPU bound threads will underutilize a slice
of their quota in a period as a result of this patch. This should probably
be clearly stated in the code comments and in the commit message.

In addition, I believe that although many workloads will indeed be
indifferent to getting their fair share "later", some latency-sensitive
workloads will definitely be negatively affected by this temporary
CPU quota stealing by bursty antagonists. So there should probably be
a way to limit this behavior; for example, by making it tunable
per cgroup.

>
> Please be careful here quota and slice are being treated differently.
> Quota does not roll-over between periods, only slices of quota that
> has already been allocated to per cpu run queues. If you allocate
> 100ms of quota per period to an application, but it only spreads onto
> 3 cpu run queues that means it can in the worst case use 3 x slice
> size = 15ms in periods following underutilization.
>
> So why does this matter. Well applications that use thread pools
> *(*cough* java *cough*) with lots of tiny little worker threads, tend
> to spread themselves out onto a lot of run queues. These worker
> threads grab quota slices in order to run, then rarely use all of
> their slice (1 or 2ms out of the 5ms). This results in those worker
> threads starving the main application of quota, and then expiring the
> remainder of that quota slice on the per-cpu. Going back to my
> earlier 100ms quota / 80 cpu example. That means only
> 100ms/cfs_bandwidth_slice_us(5ms) = 20 slices are available in a
> period. So only 20 out of these 80 cpus ever get a slice allocated to
> them. By allowing these per-cpu run queues to use their remaining
> slice in following periods these worker threads do not need to be
> allocated additional slice, and thereby the main threads are actually
> able to use the allocated cpu quota.
>
> This can be experienced by running fibtest available at
> https://github.com/indeedeng/fibtest/.
> $ runfibtest 1
> runs a single fast thread taskset to cpu 0
> $ runfibtest 8
> Runs a single fast thread taskset to cpu 0, and 7 slow threads taskset
> to cpus 1-7. This run is expected to show less iterations, but the
> worse problem is that the cpu usage is far less than the 500ms that it
> should have received.
>
> Thanks for the engagement on this,
> Dave Chiluk

2019-05-24 21:37:14

by Dave Chiluk

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

On Fri, May 24, 2019 at 11:28 AM Peter Oskolkov <[email protected]> wrote:
>
> On Fri, May 24, 2019 at 8:15 AM Dave Chiluk <[email protected]> wrote:
> >
> > On Fri, May 24, 2019 at 9:32 AM Phil Auld <[email protected]> wrote:
> > > On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:
> >
> > > > If the machine runs at/close to capacity, won't the overallocation
> > > > of the quota to bursty tasks necessarily negatively impact every other
> > > > task? Should the "unused" quota be available only on idle CPUs?
> > > > (Or maybe this is the behavior achieved here, and only the comment and
> > > > the commit message should be fixed...)
> > > >
> > >
> > > It's bounded by the amount left unused from the previous period. So
> > > theoretically a process could use almost twice its quota. But then it
> > > would have nothing left over in the next period. To repeat it would have
> > > to not use any that next period. Over a longer number of periods it's the
> > > same amount of CPU usage.
> > >
> > > I think that is more fair than throttling a process that has never used
> > > its full quota.
> > >
> > > And it removes complexity.
> > >
> > > Cheers,
> > > Phil
> >
> > Actually it's not even that bad. The overallocation of quota to a
> > bursty task in a period is limited to at most one slice per cpu, and
> > that slice must not have been used in the previous periods. The slice
> > size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and
> > defaults to 5ms. If a bursty task goes from underutilizing quota to
> > using it's entire quota, it will not be able to burst in the
> > subsequent periods. Therefore in an absolute worst case contrived
> > scenario, a bursty task can add at most 5ms to the latency of other
> > threads on the same CPU. I think this worst case 5ms tradeoff is
> > entirely worth it.
> >
> > This does mean that a theoretically a poorly written massively
> > threaded application on an 80 core box, that spreads itself onto 80
> > cpu run queues, can overutilize it's quota in a period by at most 5ms
> > * 80 CPUs in a sincle period (slice * number of runqueues the
> > application is running on). But that means that each of those threads
> > would have had to not be use their quota in a previous period, and it
> > also means that the application would have to be carefully written to
> > exacerbate this behavior.
> >
> > Additionally if cpu bound threads underutilize a slice of their quota
> > in a period due to the cfs choosing a bursty task to run, they should
> > theoretically be able to make it up in the following periods when the
> > bursty task is unable to "burst".
>
> OK, so it is indeed possible that CPU bound threads will underutilize a slice
> of their quota in a period as a result of this patch. This should probably
> be clearly stated in the code comments and in the commit message.
>
> In addition, I believe that although many workloads will indeed be
> indifferent to getting their fair share "later", some latency-sensitive
> workloads will definitely be negatively affected by this temporary
> CPU quota stealing by bursty antagonists. So there should probably be
> a way to limit this behavior; for example, by making it tunable
> per cgroup.
>
This patch restores the behavior that existed from at least
v3.16..v4.18, and the current Redhat 7 kernels. So you are kind of
championing a moot point as no one has noticed this "bursting"
behavior in over 5 years. By removing this slice expiration
altogether we restore the behavior and also solve the root issue of
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'.

Since 512ac999d275, many people are now noticing the slice expiration
and are very displeased with the behavior change.
see: https://github.com/kubernetes/kubernetes/issues/67577

I would love to hear if you know of a single complaint during that 5
year time window, where someone noticed this bursting and reported
that it negatively affected their application.

As for the documentation, I thought about documenting the possible
adverse side-effect, but I didn't feel it was worthwhile since no one
had noticed that corner case in the 5 years. I also could not figure
out a concise way of describing the slight corner case issue without
overly complicating the documentation. I felt that adding that corner
case would have detracted rather than added to the documentation's
usefulness. As far as commenting in code, considering most of this
commit removes lines, there's not a really great place for it. That's
why I did my best to describe the behavior in the documentation.
Smart people can draw conclusions from there as you have done. Please
keep in mind that this "bursting" is again limited to a single slice
on each per-cpu run-queue.

Thank you,
Dave

2019-05-24 22:09:02

by Peter Oskolkov

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

On Fri, May 24, 2019 at 2:35 PM Dave Chiluk <[email protected]> wrote:
>
> On Fri, May 24, 2019 at 11:28 AM Peter Oskolkov <[email protected]> wrote:
> >
> > On Fri, May 24, 2019 at 8:15 AM Dave Chiluk <[email protected]> wrote:
> > >
> > > On Fri, May 24, 2019 at 9:32 AM Phil Auld <[email protected]> wrote:
> > > > On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:
> > >
> > > > > If the machine runs at/close to capacity, won't the overallocation
> > > > > of the quota to bursty tasks necessarily negatively impact every other
> > > > > task? Should the "unused" quota be available only on idle CPUs?
> > > > > (Or maybe this is the behavior achieved here, and only the comment and
> > > > > the commit message should be fixed...)
> > > > >
> > > >
> > > > It's bounded by the amount left unused from the previous period. So
> > > > theoretically a process could use almost twice its quota. But then it
> > > > would have nothing left over in the next period. To repeat it would have
> > > > to not use any that next period. Over a longer number of periods it's the
> > > > same amount of CPU usage.
> > > >
> > > > I think that is more fair than throttling a process that has never used
> > > > its full quota.
> > > >
> > > > And it removes complexity.
> > > >
> > > > Cheers,
> > > > Phil
> > >
> > > Actually it's not even that bad. The overallocation of quota to a
> > > bursty task in a period is limited to at most one slice per cpu, and
> > > that slice must not have been used in the previous periods. The slice
> > > size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and
> > > defaults to 5ms. If a bursty task goes from underutilizing quota to
> > > using it's entire quota, it will not be able to burst in the
> > > subsequent periods. Therefore in an absolute worst case contrived
> > > scenario, a bursty task can add at most 5ms to the latency of other
> > > threads on the same CPU. I think this worst case 5ms tradeoff is
> > > entirely worth it.
> > >
> > > This does mean that a theoretically a poorly written massively
> > > threaded application on an 80 core box, that spreads itself onto 80
> > > cpu run queues, can overutilize it's quota in a period by at most 5ms
> > > * 80 CPUs in a sincle period (slice * number of runqueues the
> > > application is running on). But that means that each of those threads
> > > would have had to not be use their quota in a previous period, and it
> > > also means that the application would have to be carefully written to
> > > exacerbate this behavior.
> > >
> > > Additionally if cpu bound threads underutilize a slice of their quota
> > > in a period due to the cfs choosing a bursty task to run, they should
> > > theoretically be able to make it up in the following periods when the
> > > bursty task is unable to "burst".
> >
> > OK, so it is indeed possible that CPU bound threads will underutilize a slice
> > of their quota in a period as a result of this patch. This should probably
> > be clearly stated in the code comments and in the commit message.
> >
> > In addition, I believe that although many workloads will indeed be
> > indifferent to getting their fair share "later", some latency-sensitive
> > workloads will definitely be negatively affected by this temporary
> > CPU quota stealing by bursty antagonists. So there should probably be
> > a way to limit this behavior; for example, by making it tunable
> > per cgroup.
> >
> This patch restores the behavior that existed from at least
> v3.16..v4.18, and the current Redhat 7 kernels. So you are kind of
> championing a moot point as no one has noticed this "bursting"
> behavior in over 5 years. By removing this slice expiration
> altogether we restore the behavior and also solve the root issue of
> 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> condition")'.
>
> Since 512ac999d275, many people are now noticing the slice expiration
> and very displeased with the behavior change.
> see: https://github.com/kubernetes/kubernetes/issues/67577
>
> I would love to hear if you know of a single complaint during that 5
> year time window, where someone noticed this bursting and reported
> that it negatively affected their application.

Linux CPU scheduling tail latency is a well-known issue and a major
pain point in some workloads:
https://www.google.com/search?q=linux+cpu+scheduling+tail+latency

Even assuming that nobody noticed this particular cause
of CPU scheduling latencies, it does not mean the problem should be waved
away. At least it should be documented, if at this point it is decided
that it is difficult to address in a meaningful way. And, preferably, a way
to address the issue later on should be discussed and hopefully agreed to.

>
> As for the documentation, I thought about documenting the possible
> adverse side-effect, but I didn't feel it was worthwhile since no one
> had noticed that corner case in the 5 years. I also could not figure
> out a concise way of describing the slight corner case issue without
> overly complicating the documentation. I felt that adding that corner
> case would have detracted rather than added to the documentation's
> usefulness. As far as commenting in code, considering most of this
> commit removes lines, there's not a really great place for it. That's
> why I did my best to describe the behavior in the documentation.
> Smart people can draw conclusions from there as you have done. Please
> keep in mind that this "bursting" is again limited to a single slice
> on each per-cpu run-queue.
>
> Thank you,
> Dave

2019-05-28 22:29:14

by Dave Chiluk

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

On Fri, May 24, 2019 at 5:07 PM Peter Oskolkov <[email protected]> wrote:
> Linux CPU scheduling tail latency is a well-known issue and a major
> pain point in some workloads:
> https://www.google.com/search?q=linux+cpu+scheduling+tail+latency
>
> Even assuming that nobody noticed this particular cause
> of CPU scheduling latencies, it does not mean the problem should be waved
> away. At least it should be documented, if at this point it decided that
> it is difficult to address it in a meaningful way. And, preferably, a way
> to address the issue later on should be discussed and hopefully agreed to.

Reducing tail latencies for our web applications is the precise
reason I created this patch set. Applications that previously
responded in 20ms at the 95th percentile were now taking 220ms. Those
applications were correctly sized prior to 512ac999; after it, they
started seeing massive increases in their latencies due to hitting
throttling while using less than their quota of cpu.

I'll see if I can rework the documentation. Any specific
suggestions for how that can be worded would be appreciated.

2019-05-29 19:10:48

by Dave Chiluk

[permalink] [raw]
Subject: [PATCH v3 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

Changelog v3
- Reworked documentation to better describe behavior of slice expiration
per feedback from Peter Oskolkov

Changelog v2
- Fixed some checkpatch errors in the commit message.

2019-06-24 19:49:25

by Dave Chiluk

[permalink] [raw]
Subject:


Changelog v4
- Rewrote patch to return all quota when cfs_b has very little.
- Removed documentation changes, as bursting is no longer possible with this
new solution.

After the suggestion from Ben Segall to set min_cfs_rq_runtime=0, I came up
with this in an attempt to balance the desire to leave runtime on the per-cpu
queues with the desire to use this quota on other per-cpu rqs.

Basically we now check the cfs_b on each return, and decide whether all the
remaining time should be returned or min_cfs_rq_runtime should be left on the
per-cpu queue, based on how much time is remaining on the cfs_b. As a result
this mostly gives us the best of both worlds.
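
For readability, the decision described above can also be restated as
a standalone userspace sketch (simplified types, no locking, and the
runtime_expires check omitted; the 1ms/5ms constants mirror the
defaults discussed in this thread, and slack_to_return() is only an
illustrative name):

#include <stdio.h>
#include <stdint.h>

#define RUNTIME_INF ((uint64_t)~0ULL)

static const int64_t min_cfs_rq_runtime = 1000000;      /* 1ms in ns */

static uint64_t sched_cfs_bandwidth_slice(void)
{
        return 5000000;         /* 5ms in ns, the default slice */
}

/* how much of a cfs_rq's leftover runtime goes back to the global pool */
static int64_t slack_to_return(uint64_t cfs_b_quota, uint64_t cfs_b_runtime,
                               int64_t runtime_remaining)
{
        int64_t slack = runtime_remaining;

        if (slack <= 0 || cfs_b_quota == RUNTIME_INF)
                return 0;

        /* global pool still healthy: keep min_cfs_rq_runtime cpu-locally */
        if (cfs_b_runtime >= sched_cfs_bandwidth_slice() * 3 &&
            slack > min_cfs_rq_runtime)
                slack -= min_cfs_rq_runtime;

        /* otherwise the pool is nearly empty and everything goes back */
        return slack;
}

int main(void)
{
        /* plenty of global runtime left: 1ms stays on the cpu */
        printf("%lld\n",
               (long long)slack_to_return(100000000, 50000000, 4000000));
        /* global pool nearly drained: the full 4ms goes back */
        printf("%lld\n",
               (long long)slack_to_return(100000000, 10000000, 4000000));
        return 0;
}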

2019-06-24 19:49:34

by Dave Chiluk

[permalink] [raw]
Subject: [PATCH v4 1/1] sched/fair: Return all runtime when cfs_b has very little remaining.

It has been observed, that highly-threaded, user-interactive
applications running under cpu.cfs_quota_us constraints can hit a high
percentage of periods throttled while simultaneously not consuming the
allocated amount of quota. This impacts user-interactive non-cpu bound
applications, such as those running in kubernetes or mesos when run on
multiple cores.

This has been root caused to threads being allocated per cpu bandwidth
slices, and then not fully using that slice within the period. This
results in min_cfs_rq_runtime remaining on each per-cpu cfs_rq. At the
end of the period this remaining quota goes unused and expires. This
expiration of unused time on per-cpu runqueues results in applications
under-utilizing their quota while simultaneously hitting throttling.

The solution is to return all spare cfs_rq->runtime_remaining when
cfs_b->runtime nears the sched_cfs_bandwidth_slice. This balances the
desire to prevent cfs_rq from always pulling quota with the desire to
allow applications to fully utilize their quota.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <[email protected]>
---
kernel/sched/fair.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..4894eda 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4695,7 +4695,9 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
return 1;
}

-/* a cfs_rq won't donate quota below this amount */
+/* a cfs_rq won't donate quota below this amount unless cfs_b has very little
+ * remaining runtime.
+ */
static const u64 min_cfs_rq_runtime = 1 * NSEC_PER_MSEC;
/* minimum remaining period time to redistribute slack quota */
static const u64 min_bandwidth_expiration = 2 * NSEC_PER_MSEC;
@@ -4743,16 +4745,27 @@ static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
- s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime;
+ s64 slack_runtime = cfs_rq->runtime_remaining;

+ /* There is no runtime to return. */
if (slack_runtime <= 0)
return;

raw_spin_lock(&cfs_b->lock);
if (cfs_b->quota != RUNTIME_INF &&
cfs_rq->runtime_expires == cfs_b->runtime_expires) {
- cfs_b->runtime += slack_runtime;
+ /* As we near 0 quota remaining on cfs_b start returning all
+ * remaining runtime. This avoids stranding and then expiring
+ * runtime on per-cpu cfs_rq.
+ *
+ * If cfs_b still has plenty of runtime, leave min_cfs_rq_runtime of
+ * runtime on this cfs_rq.
+ */
+ if (cfs_b->runtime >= sched_cfs_bandwidth_slice() * 3 &&
+ slack_runtime > min_cfs_rq_runtime)
+ slack_runtime -= min_cfs_rq_runtime;

+ cfs_b->runtime += slack_runtime;
/* we are under rq->lock, defer unthrottling using a timer */
if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
!list_empty(&cfs_b->throttled_cfs_rq))
--
1.8.3.1

2019-06-24 20:53:12

by Benjamin Segall

[permalink] [raw]
Subject: Re: [PATCH v4 1/1] sched/fair: Return all runtime when cfs_b has very little remaining.

Dave Chiluk <[email protected]> writes:

> It has been observed, that highly-threaded, user-interactive
> applications running under cpu.cfs_quota_us constraints can hit a high
> percentage of periods throttled while simultaneously not consuming the
> allocated amount of quota. This impacts user-interactive non-cpu bound
> applications, such as those running in kubernetes or mesos when run on
> multiple cores.
>
> This has been root caused to threads being allocated per cpu bandwidth
> slices, and then not fully using that slice within the period. This
> results in min_cfs_rq_runtime remaining on each per-cpu cfs_rq. At the
> end of the period this remaining quota goes unused and expires. This
> expiration of unused time on per-cpu runqueues results in applications
> under-utilizing their quota while simultaneously hitting throttling.
>
> The solution is to return all spare cfs_rq->runtime_remaining when
> cfs_b->runtime nears the sched_cfs_bandwidth_slice. This balances the
> desire to prevent cfs_rq from always pulling quota with the desire to
> allow applications to fully utilize their quota.
>
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Signed-off-by: Dave Chiluk <[email protected]>
> ---
> kernel/sched/fair.c | 19 ++++++++++++++++---
> 1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f..4894eda 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4695,7 +4695,9 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
> return 1;
> }
>
> -/* a cfs_rq won't donate quota below this amount */
> +/* a cfs_rq won't donate quota below this amount unless cfs_b has very little
> + * remaining runtime.
> + */
> static const u64 min_cfs_rq_runtime = 1 * NSEC_PER_MSEC;
> /* minimum remaining period time to redistribute slack quota */
> static const u64 min_bandwidth_expiration = 2 * NSEC_PER_MSEC;
> @@ -4743,16 +4745,27 @@ static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
> static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> {
> struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> - s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime;
> + s64 slack_runtime = cfs_rq->runtime_remaining;
>
> + /* There is no runtime to return. */
> if (slack_runtime <= 0)
> return;
>
> raw_spin_lock(&cfs_b->lock);
> if (cfs_b->quota != RUNTIME_INF &&
> cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> - cfs_b->runtime += slack_runtime;
> + /* As we near 0 quota remaining on cfs_b start returning all
> + * remaining runtime. This avoids stranding and then expiring
> + * runtime on per-cpu cfs_rq.
> + *
> + * cfs->b has plenty of runtime leave min_cfs_rq_runtime of
> + * runtime on this cfs_rq.
> + */
> + if (cfs_b->runtime >= sched_cfs_bandwidth_slice() * 3 &&
> + slack_runtime > min_cfs_rq_runtime)
> + slack_runtime -= min_cfs_rq_runtime;
>
> + cfs_b->runtime += slack_runtime;
> /* we are under rq->lock, defer unthrottling using a timer */
> if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
> !list_empty(&cfs_b->throttled_cfs_rq))


This still has a similar cost to reducing min_cfs_rq_runtime to 0 - we
now take a tg-global lock on every group se dequeue. Setting min=0 means
that we have to take it on both enqueue and dequeue, while baseline
means we take it once per min_cfs_rq_runtime in the worst case.

In addition how much this helps is very dependent on the exact pattern
of sleep/wake - you can still strand all but 15ms of runtime with a
pretty reasonable pattern.

If the cost of taking this global lock across all cpus without a
ratelimit was somehow not a problem, I'd much prefer to just set
min_cfs_rq_runtime = 0. (Assuming it is, I definitely prefer the "lie
and sorta have 2x period 2x runtime" solution of removing expiration)

2019-06-26 22:11:09

by Dave Chiluk

[permalink] [raw]
Subject: Re: [PATCH v4 1/1] sched/fair: Return all runtime when cfs_b has very little remaining.

On Mon, Jun 24, 2019 at 10:33:07AM -0700, [email protected] wrote:
> This still has a similar cost as reducing min_cfs_rq_runtime to 0 - we
> now take a tg-global lock on every group se dequeue. Setting min=0 means
> that we have to take it on both enqueue and dequeue, while baseline
> means we take it once per min_cfs_rq_runtime in the worst case.
Yep, it's only slightly better than simply setting min_cfs_rq_runtime=0.
There's definitely a tradeoff of having to grab the lock every time.

The other non-obvious benefit to this is that when the application
starts hitting throttling, the whole application hits it closer to
simultaneously. Previously, threads that don't do much could continue
sipping on their 1ms of min_cfs_rq_runtime while the main threads were
throttled. With this the application would more or less halt within
5ms of reaching full quota usage.

> In addition how much this helps is very dependent on the exact pattern
> of sleep/wake - you can still strand all but 15ms of runtime with a
> pretty reasonable pattern.

I thought this change would have an upper bound on stranded time of:
NUMCPUs * min_cfs_rq_runtime - 3 * sched_cfs_bandwidth_slice().
That comes from the stranding of 1ms on each queue minus the amount
that we'd return once we forcibly hit the 3 x bandwidth slice
threshold. Am I missing something here? Additionally that's the worst
case, and would require that threads be scheduled on distinct cpus
mostly sequentially, and then never run again.
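
Just plugging the defaults mentioned in this thread (1ms
min_cfs_rq_runtime, 5ms slice, 80 cpus) into that formula as written;
this only evaluates the expression, it does not show that the formula
itself is right:

#include <stdio.h>

int main(void)
{
        const long min_cfs_rq_runtime_us = 1000;  /* 1ms kept per cpu rq */
        const long slice_us = 5000;               /* 5ms bandwidth slice */
        const int nr_cpus = 80;

        long stranded_us = (long)nr_cpus * min_cfs_rq_runtime_us
                           - 3 * slice_us;

        printf("upper bound on stranded runtime: %ld us\n", stranded_us);
        return 0;
}

That works out to 65ms for this configuration.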

> If the cost of taking this global lock across all cpus without a
> ratelimit was somehow not a problem, I'd much prefer to just set
> min_cfs_rq_runtime = 0. (Assuming it is, I definitely prefer the "lie
> and sorta have 2x period 2x runtime" solution of removing expiration)

Can I take this as a technical ack of the v3 patchset? Do you have any
other comments on that patchset you'd like to see corrected before it
gets integrated? If you are indeed good with that patchset, I'll clean
up the comment and Documentation text and re-submit as a v5 patchset.
I actually like that patchset best as well, for all the reasons
already discussed, especially the fact that it ran like this for 5
years prior to 512ac999.

Also, given my new-found understanding of min_cfs_rq_runtime, doesn't
this mean that if we don't expire runtime we could potentially have a
worst case of 1ms * NUMCPUs of extra runtime instead of 2x runtime?
This is because min_cfs_rq_runtime could theoretically live on a
per-cpu runqueue indefinitely. Still, this is actually much smaller
than I had previously feared. It also still holds that usage is
bounded by quota/period on average.

Thanks

2019-06-27 19:09:55

by Dave Chiluk

[permalink] [raw]
Subject: [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

It has been observed, that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to threads being allocated per cpu bandwidth
slices, and then not fully using that slice within the period. At which
point the slice and quota expires. This expiration of unused slice
results in applications not being able to utilize the quota for which
they are allocated.

The expiration of per-cpu slices was recently fixed by
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this has been broken since
at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
added the following conditional which resulted in slices never being
expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
/* extend local deadline, drift is bounded above by 2 ticks */
cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has recently been fixed
and is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
that the mechanisms around expiring runtime should be removed
altogether.

This allows quota already allocated to per-cpu run-queues to live longer
than the period boundary. This allows threads on runqueues that do not
use much CPU to continue to use their remaining slice over a longer
period of time than cpu.cfs_period_us. However, this helps prevent the
above condition of hitting throttling while also not fully utilizing
your cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueue. This is typically no
more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
change nothing, as they should theoretically fully utilize all of their
quota in each period. For user-interactive tasks as described above this
provides a much better user/application experience as their cpu
utilization will more closely match the amount they requested when they
hit throttling. This means that cpu limits no longer strictly apply per
period for non-cpu bound applications, but that they are still accurate
over longer timeframes.

This greatly improves performance of high-thread-count, non-cpu bound
applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase (10ms/100ms of quota on
80 CPU machine), this commit resulted in almost 30x performance
improvement, while still maintaining correct cpu quota restrictions.
That testcase is available at https://github.com/indeedeng/fibtest.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <[email protected]>
---
Documentation/scheduler/sched-bwc.txt | 73 ++++++++++++++++++++++++++++-------
kernel/sched/fair.c | 71 +++-------------------------------
kernel/sched/sched.h | 4 --
3 files changed, 65 insertions(+), 83 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
index f6b1873..4c19c94 100644
--- a/Documentation/scheduler/sched-bwc.txt
+++ b/Documentation/scheduler/sched-bwc.txt
@@ -8,15 +8,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
specification of the maximum CPU bandwidth available to a group or hierarchy.

The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time. When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
-
-A group's unused runtime is globally tracked, being refreshed with quota units
-above at each period boundary. As threads consume this bandwidth it is
-transferred to cpu-local "silos" on a demand basis. The amount transferred
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time. That quota is assigned to per cpu run queues in
+slices as threads in the cgroup become runnable. Once all quota has been
+assigned any additional requests for quota will result in those threads being
+throttled. Throttled threads will not be able to run again until the next
+period when the quota is replenished.
+
+A group's unassigned quota is globally tracked, being refreshed back to
+cfs_quota units at each period boundary. As threads consume this bandwidth it
+is transferred to cpu-local "silos" on a demand basis. The amount transferred
within each of these updates is tunable and described as the "slice".

Management
@@ -33,12 +34,12 @@ The default values are:

A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
bandwidth restriction in place, such a group is described as an unconstrained
-bandwidth group. This represents the traditional work-conserving behavior for
+bandwidth group. This represents the traditional work-conserving behavior for
CFS.

Writing any (valid) positive value(s) will enact the specified bandwidth limit.
-The minimum quota allowed for the quota or period is 1ms. There is also an
-upper bound on the period length of 1s. Additional restrictions exist when
+The minimum quota allowed for the quota or period is 1ms. There is also an
+upper bound on the period length of 1s. Additional restrictions exist when
bandwidth limits are used in a hierarchical fashion, these are explained in
more detail below.

@@ -51,8 +52,8 @@ unthrottled if it is in a constrained state.
System wide settings
--------------------
For efficiency run-time is transferred between the global pool and CPU local
-"silos" in a batch fashion. This greatly reduces global accounting pressure
-on large systems. The amount transferred each time such an update is required
+"silos" in a batch fashion. This greatly reduces global accounting pressure
+on large systems. The amount transferred each time such an update is required
is described as the "slice".

This is tunable via procfs:
@@ -90,6 +91,50 @@ There are two ways in which a group may become throttled:
In case b) above, even though the child may have runtime remaining it will not
be allowed to until the parent's runtime is refreshed.

+Real-world behavior
+-------------------
+Once a slice is assigned to a cpu it does not expire, but all except 1ms of it
+may be returned to the global pool if all threads on that cpu become
+unrunnable. This is configured at compile time by the min_cfs_rq_runtime
+variable.
+
+The fact that cpu-local slices do not expire results in some interesting corner
+cases that should be understood.
+
+For cpu-constrained applications in a cgroup this is a relatively moot point
+because they will naturally consume the entirety of their quota as well as the
+entirety of each cpu-local slice in each period. As a result it is expected
+that nr_periods roughly equals nr_throttled, and that cpuacct.usage will
+increase by roughly cfs_quota_us in each period.
+
+However, in a worst-case scenario for highly-threaded, interactive/non-cpu
+bound applications, this non-expiration nuance allows applications to briefly
+burst past their quota limits by the amount of unused slice on each cpu that
+the task group is running on (typically at most 1ms per cpu). This slight
+burst only applies if quota had been assigned to a cpu and then not fully used
+or returned in previous periods. This burst amount will not be transferred
+between cores. As a result, this mechanism still strictly limits the task
+group to its quota on average, albeit over a longer time window than the
+period. This also limits the burst ability to no more than 1ms per cpu. This
+provides a better, more predictable user experience for highly threaded
+applications with small quota limits on high core count machines. It also
+eliminates the propensity to throttle these applications while simultaneously
+using less than quota amounts of cpu. Another way to say this is that by
+allowing the unused portion of a slice to remain valid across periods we have
+decreased the possibility of wasting quota on cpu-local silos that don't need
+a full slice's amount of cpu time.
+
+The interaction between cpu-bound and non-cpu-bound-interactive applications
+should also be considered, especially when single core usage hits 100%. If you
+gave each of these applications half of a cpu-core and they both got scheduled
+on the same CPU it is theoretically possible that the non-cpu bound application
+will use up to 1ms additional quota in some periods, thereby preventing the
+cpu-bound application from fully using its quota by that same amount. In these
+instances it will be up to the CFS algorithm (see sched-design-CFS.txt) to
+decide which application is chosen to run, as they will both be runnable and
+have remaining quota. This runtime discrepancy will be made up in the following
+periods when the interactive application idles.
+
Examples
--------
1. Limit a group to 1 CPU worth of runtime.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..a675c69 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)

now = sched_clock_cpu(smp_processor_id());
cfs_b->runtime = cfs_b->quota;
- cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
- cfs_b->expires_seq++;
}

static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
struct task_group *tg = cfs_rq->tg;
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
- u64 amount = 0, min_amount, expires;
- int expires_seq;
+ u64 amount = 0, min_amount;

/* note: this is a positive sum as runtime_remaining <= 0 */
min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
cfs_b->idle = 0;
}
}
- expires_seq = cfs_b->expires_seq;
- expires = cfs_b->runtime_expires;
raw_spin_unlock(&cfs_b->lock);

cfs_rq->runtime_remaining += amount;
- /*
- * we may have advanced our local expiration to account for allowed
- * spread between our sched_clock and the one on which runtime was
- * issued.
- */
- if (cfs_rq->expires_seq != expires_seq) {
- cfs_rq->expires_seq = expires_seq;
- cfs_rq->runtime_expires = expires;
- }

return cfs_rq->runtime_remaining > 0;
}

-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
- struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
- /* if the deadline is ahead of our clock, nothing to do */
- if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
- return;
-
- if (cfs_rq->runtime_remaining < 0)
- return;
-
- /*
- * If the local deadline has passed we have to consider the
- * possibility that our sched_clock is 'fast' and the global deadline
- * has not truly expired.
- *
- * Fortunately we can check determine whether this the case by checking
- * whether the global deadline(cfs_b->expires_seq) has advanced.
- */
- if (cfs_rq->expires_seq == cfs_b->expires_seq) {
- /* extend local deadline, drift is bounded above by 2 ticks */
- cfs_rq->runtime_expires += TICK_NSEC;
- } else {
- /* global deadline is ahead, expiration has passed */
- cfs_rq->runtime_remaining = 0;
- }
-}
-
static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
{
/* dock delta_exec before expiring quota (as it could span periods) */
cfs_rq->runtime_remaining -= delta_exec;
- expire_cfs_rq_runtime(cfs_rq);

if (likely(cfs_rq->runtime_remaining > 0))
return;
@@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
resched_curr(rq);
}

-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
- u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
{
struct cfs_rq *cfs_rq;
u64 runtime;
@@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
remaining -= runtime;

cfs_rq->runtime_remaining += runtime;
- cfs_rq->runtime_expires = expires;

/* we check whether we're throttled above */
if (cfs_rq->runtime_remaining > 0)
@@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
*/
static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
{
- u64 runtime, runtime_expires;
+ u64 runtime;
int throttled;

/* no need to continue the timer with no bandwidth constraint */
@@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
/* account preceding periods in which throttling occurred */
cfs_b->nr_throttled += overrun;

- runtime_expires = cfs_b->runtime_expires;
-
/*
* This check is repeated as we are holding onto the new bandwidth while
* we unthrottle. This can potentially race with an unthrottled group
@@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
cfs_b->distribute_running = 1;
raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
/* we can't nest cfs_b->lock while distributing bandwidth */
- runtime = distribute_cfs_runtime(cfs_b, runtime,
- runtime_expires);
+ runtime = distribute_cfs_runtime(cfs_b, runtime);
raw_spin_lock_irqsave(&cfs_b->lock, flags);

cfs_b->distribute_running = 0;
@@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
return;

raw_spin_lock(&cfs_b->lock);
- if (cfs_b->quota != RUNTIME_INF &&
- cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+ if (cfs_b->quota != RUNTIME_INF) {
cfs_b->runtime += slack_runtime;

/* we are under rq->lock, defer unthrottling using a timer */
@@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
{
u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
unsigned long flags;
- u64 expires;

/* confirm we're still not at a refresh boundary */
raw_spin_lock_irqsave(&cfs_b->lock, flags);
@@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
runtime = cfs_b->runtime;

- expires = cfs_b->runtime_expires;
if (runtime)
cfs_b->distribute_running = 1;

@@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
if (!runtime)
return;

- runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+ runtime = distribute_cfs_runtime(cfs_b, runtime);

raw_spin_lock_irqsave(&cfs_b->lock, flags);
- if (expires == cfs_b->runtime_expires)
- lsub_positive(&cfs_b->runtime, runtime);
cfs_b->distribute_running = 0;
raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
}
@@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)

cfs_b->period_active = 1;
overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
- cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
- cfs_b->expires_seq++;
hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..0c0ed23 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -341,8 +341,6 @@ struct cfs_bandwidth {
u64 quota;
u64 runtime;
s64 hierarchical_quota;
- u64 runtime_expires;
- int expires_seq;

short idle;
short period_active;
@@ -562,8 +560,6 @@ struct cfs_rq {

#ifdef CONFIG_CFS_BANDWIDTH
int runtime_enabled;
- int expires_seq;
- u64 runtime_expires;
s64 runtime_remaining;

u64 throttled_clock;
--
1.8.3.1

2019-06-27 19:49:59

by Dave Chiluk

[permalink] [raw]
Subject: [PATCH v5 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

Changelog v5
- Based on this comment from Ben Segall on v4
> If the cost of taking this global lock across all cpus without a
> ratelimit was somehow not a problem, I'd much prefer to just set
> min_cfs_rq_runtime = 0. (Assuming it is, I definitely prefer the "lie
> and sorta have 2x period 2x runtime" solution of removing expiration)
I'm resubmitting my v3 patchset, with the requested changes.
- Updated Commit log given review comments
- Update sched-bwc.txt given my new understanding of the slack timer.

Changelog v4
- Rewrote patchset around the concept of returning all of runtime_remaining
when cfs_b nears the end of available quota.

Changelog v3
- Reworked documentation to better describe behavior of slice expiration per
feedback from Peter Oskolkov

Changelog v2
- Fixed some checkpatch errors in the commit message.

2019-06-27 19:50:05

by Dave Chiluk

[permalink] [raw]
Subject: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

It has been observed, that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to threads being allocated per cpu bandwidth
slices, and then not fully using that slice within the period. At which
point the slice and quota expires. This expiration of unused slice
results in applications not being able to utilize the quota for which
they are allocated.

The expiration of per-cpu slices was recently fixed by
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this has been broken since
at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
added the following conditional which resulted in slices never being
expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
/* extend local deadline, drift is bounded above by 2 ticks */
cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has recently been fixed
and is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
that the mechanisms around expiring runtime should be removed
altogether.

This allows quota already allocated to per-cpu run-queues to live longer
than the period boundary. This allows threads on runqueues that do not
use much CPU to continue to use their remaining slice over a longer
period of time than cpu.cfs_period_us. However, this helps prevent the
above condition of hitting throttling while also not fully utilizing
your cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueue. This is typically no
more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
change nothing, as they should theoretically fully utilize all of their
quota in each period. For user-interactive tasks as described above this
provides a much better user/application experience as their cpu
utilization will more closely match the amount they requested when they
hit throttling. This means that cpu limits no longer strictly apply per
period for non-cpu bound applications, but that they are still accurate
over longer timeframes.

This greatly improves performance of high-thread-count, non-cpu bound
applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase (10ms/100ms of quota on
80 CPU machine), this commit resulted in almost 30x performance
improvement, while still maintaining correct cpu quota restrictions.
That testcase is available at https://github.com/indeedeng/fibtest.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <[email protected]>
---
Documentation/scheduler/sched-bwc.txt | 73 ++++++++++++++++++++++++++++-------
kernel/sched/fair.c | 71 +++-------------------------------
kernel/sched/sched.h | 4 --
3 files changed, 65 insertions(+), 83 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
index f6b1873..4c19c94 100644
--- a/Documentation/scheduler/sched-bwc.txt
+++ b/Documentation/scheduler/sched-bwc.txt
@@ -8,15 +8,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
specification of the maximum CPU bandwidth available to a group or hierarchy.

The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time. When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
-
-A group's unused runtime is globally tracked, being refreshed with quota units
-above at each period boundary. As threads consume this bandwidth it is
-transferred to cpu-local "silos" on a demand basis. The amount transferred
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time. That quota is assigned to per cpu run queues in
+slices as threads in the cgroup become runnable. Once all quota has been
+assigned any additional requests for quota will result in those threads being
+throttled. Throttled threads will not be able to run again until the next
+period when the quota is replenished.
+
+A group's unassigned quota is globally tracked, being refreshed back to
+cfs_quota units at each period boundary. As threads consume this bandwidth it
+is transferred to cpu-local "silos" on a demand basis. The amount transferred
within each of these updates is tunable and described as the "slice".

Management
@@ -33,12 +34,12 @@ The default values are:

A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
bandwidth restriction in place, such a group is described as an unconstrained
-bandwidth group. This represents the traditional work-conserving behavior for
+bandwidth group. This represents the traditional work-conserving behavior for
CFS.

Writing any (valid) positive value(s) will enact the specified bandwidth limit.
-The minimum quota allowed for the quota or period is 1ms. There is also an
-upper bound on the period length of 1s. Additional restrictions exist when
+The minimum quota allowed for the quota or period is 1ms. There is also an
+upper bound on the period length of 1s. Additional restrictions exist when
bandwidth limits are used in a hierarchical fashion, these are explained in
more detail below.

@@ -51,8 +52,8 @@ unthrottled if it is in a constrained state.
System wide settings
--------------------
For efficiency run-time is transferred between the global pool and CPU local
-"silos" in a batch fashion. This greatly reduces global accounting pressure
-on large systems. The amount transferred each time such an update is required
+"silos" in a batch fashion. This greatly reduces global accounting pressure
+on large systems. The amount transferred each time such an update is required
is described as the "slice".

This is tunable via procfs:
@@ -90,6 +91,50 @@ There are two ways in which a group may become throttled:
In case b) above, even though the child may have runtime remaining it will not
be allowed to until the parent's runtime is refreshed.

+Real-world behavior
+-------------------
+Once a slice is assigned to a cpu it does not expire, but all except 1ms of it
+may be returned to the global pool if all threads on that cpu become
+unrunnable. This is configured at compile time by the min_cfs_rq_runtime
+variable.
+
+The fact that cpu-local slices do not expire results in some interesting corner
+cases that should be understood.
+
+For cgroup cpu constrained applications that are cpu limited this is a
+relatively moot point because they will naturally consume the entirety of their
+quota as well as the entirety of each cpu-local slice in each period. As a
+result it is expected that nr_periods roughly equal nr_throttled, and that
+cpuacct.usage will increase roughly equal to cfs_quota_us in each period.
+
+However, in a worst-case scenario for highly-threaded, interactive/non-cpu-bound
+applications, this non-expiration nuance allows applications to briefly burst
+past their quota limits by the amount of unused slice on each cpu that the task
+group is running on (typically at most 1ms per cpu). This slight burst
+only applies if quota had been assigned to a cpu and then not fully used or
+returned in previous periods. This burst amount will not be transferred
+between cores. As a result, this mechanism still strictly limits the task
+group to quota average usage, albeit over a longer time window than period.
+This also limits the burst ability to no more than 1ms per cpu. This provides
+a better, more predictable user experience for highly threaded applications with
+small quota limits on high core count machines. It also eliminates the
+propensity to throttle these applications while simultaneously using less than
+quota amounts of cpu. Another way to say this is that by allowing the unused
+portion of a slice to remain valid across periods we have decreased the
+possibility of wasting quota on cpu-local silos that don't need a full slice's
+amount of cpu time.
+
+The interaction between cpu-bound and non-cpu-bound-interactive applications
+should also be considered, especially when single core usage hits 100%. If you
+gave each of these applications half of a cpu-core and they both got scheduled
+on the same CPU it is theoretically possible that the non-cpu bound application
+will use up to 1ms additional quota in some periods, thereby preventing the
+cpu-bound application from fully using its quota by that same amount. In these
+instances it will be up to the CFS algorithm (see sched-design-CFS.txt) to
+decide which application is chosen to run, as they will both be runnable and
+have remaining quota. This runtime discrepancy will be made up in the following
+periods when the interactive application idles.
+
Examples
--------
1. Limit a group to 1 CPU worth of runtime.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..a675c69 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)

now = sched_clock_cpu(smp_processor_id());
cfs_b->runtime = cfs_b->quota;
- cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
- cfs_b->expires_seq++;
}

static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
struct task_group *tg = cfs_rq->tg;
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
- u64 amount = 0, min_amount, expires;
- int expires_seq;
+ u64 amount = 0, min_amount;

/* note: this is a positive sum as runtime_remaining <= 0 */
min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
cfs_b->idle = 0;
}
}
- expires_seq = cfs_b->expires_seq;
- expires = cfs_b->runtime_expires;
raw_spin_unlock(&cfs_b->lock);

cfs_rq->runtime_remaining += amount;
- /*
- * we may have advanced our local expiration to account for allowed
- * spread between our sched_clock and the one on which runtime was
- * issued.
- */
- if (cfs_rq->expires_seq != expires_seq) {
- cfs_rq->expires_seq = expires_seq;
- cfs_rq->runtime_expires = expires;
- }

return cfs_rq->runtime_remaining > 0;
}

-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
- struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
- /* if the deadline is ahead of our clock, nothing to do */
- if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
- return;
-
- if (cfs_rq->runtime_remaining < 0)
- return;
-
- /*
- * If the local deadline has passed we have to consider the
- * possibility that our sched_clock is 'fast' and the global deadline
- * has not truly expired.
- *
- * Fortunately we can check determine whether this the case by checking
- * whether the global deadline(cfs_b->expires_seq) has advanced.
- */
- if (cfs_rq->expires_seq == cfs_b->expires_seq) {
- /* extend local deadline, drift is bounded above by 2 ticks */
- cfs_rq->runtime_expires += TICK_NSEC;
- } else {
- /* global deadline is ahead, expiration has passed */
- cfs_rq->runtime_remaining = 0;
- }
-}
-
static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
{
/* dock delta_exec before expiring quota (as it could span periods) */
cfs_rq->runtime_remaining -= delta_exec;
- expire_cfs_rq_runtime(cfs_rq);

if (likely(cfs_rq->runtime_remaining > 0))
return;
@@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
resched_curr(rq);
}

-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
- u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
{
struct cfs_rq *cfs_rq;
u64 runtime;
@@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
remaining -= runtime;

cfs_rq->runtime_remaining += runtime;
- cfs_rq->runtime_expires = expires;

/* we check whether we're throttled above */
if (cfs_rq->runtime_remaining > 0)
@@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
*/
static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
{
- u64 runtime, runtime_expires;
+ u64 runtime;
int throttled;

/* no need to continue the timer with no bandwidth constraint */
@@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
/* account preceding periods in which throttling occurred */
cfs_b->nr_throttled += overrun;

- runtime_expires = cfs_b->runtime_expires;
-
/*
* This check is repeated as we are holding onto the new bandwidth while
* we unthrottle. This can potentially race with an unthrottled group
@@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
cfs_b->distribute_running = 1;
raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
/* we can't nest cfs_b->lock while distributing bandwidth */
- runtime = distribute_cfs_runtime(cfs_b, runtime,
- runtime_expires);
+ runtime = distribute_cfs_runtime(cfs_b, runtime);
raw_spin_lock_irqsave(&cfs_b->lock, flags);

cfs_b->distribute_running = 0;
@@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
return;

raw_spin_lock(&cfs_b->lock);
- if (cfs_b->quota != RUNTIME_INF &&
- cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+ if (cfs_b->quota != RUNTIME_INF) {
cfs_b->runtime += slack_runtime;

/* we are under rq->lock, defer unthrottling using a timer */
@@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
{
u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
unsigned long flags;
- u64 expires;

/* confirm we're still not at a refresh boundary */
raw_spin_lock_irqsave(&cfs_b->lock, flags);
@@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
runtime = cfs_b->runtime;

- expires = cfs_b->runtime_expires;
if (runtime)
cfs_b->distribute_running = 1;

@@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
if (!runtime)
return;

- runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+ runtime = distribute_cfs_runtime(cfs_b, runtime);

raw_spin_lock_irqsave(&cfs_b->lock, flags);
- if (expires == cfs_b->runtime_expires)
- lsub_positive(&cfs_b->runtime, runtime);
cfs_b->distribute_running = 0;
raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
}
@@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)

cfs_b->period_active = 1;
overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
- cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
- cfs_b->expires_seq++;
hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..0c0ed23 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -341,8 +341,6 @@ struct cfs_bandwidth {
u64 quota;
u64 runtime;
s64 hierarchical_quota;
- u64 runtime_expires;
- int expires_seq;

short idle;
short period_active;
@@ -562,8 +560,6 @@ struct cfs_rq {

#ifdef CONFIG_CFS_BANDWIDTH
int runtime_enabled;
- int expires_seq;
- u64 runtime_expires;
s64 runtime_remaining;

u64 throttled_clock;
--
1.8.3.1

2019-06-27 20:19:12

by Benjamin Segall

[permalink] [raw]
Subject: Re: [PATCH v4 1/1] sched/fair: Return all runtime when cfs_b has very little remaining.

Dave Chiluk <[email protected]> writes:

> On Mon, Jun 24, 2019 at 10:33:07AM -0700, [email protected] wrote:
>> This still has a similar cost as reducing min_cfs_rq_runtime to 0 - we
>> now take a tg-global lock on every group se dequeue. Setting min=0 means
>> that we have to take it on both enqueue and dequeue, while baseline
>> means we take it once per min_cfs_rq_runtime in the worst case.
> Yep, it's only slightly better than simply setting min_cfs_rq_runtime=0.
> There's definitely a tradeoff of having to grab the lock every time.
>
> The other non-obvious benefit to this is that when the application
> starts hitting throttling the entire application starts hitting
> throttling closer to simultaneously. Previously, threads that don't do
> much could continue sipping on their 1ms of min_cfs_rq_runtime, while
> main threads were throttled. With this the application would more or
> less halt within 5ms of full quota usage.
>
>> In addition how much this helps is very dependent on the exact pattern
>> of sleep/wake - you can still strand all but 15ms of runtime with a
>> pretty reasonable pattern.
>
> I thought this change would have an upper bound stranded time of:
> NUMCPUs * min_cfs_rq_runtime - 3 * sched_cfs_bandwidth_slice().
> From the stranding of 1ms on each queue minues the amount that we'd
> return when we forcibly hit the 3 x bandwidth slice Am I missing
> something here? Additionally that's worst case, and would require that
> threads be scheduled on distinct cpus mostly sequentially, and then
> never run again.

Yeah, to be exact it's that. It's just that this is only relevant at all
when NUMCPUs*min_cfs_rq_runtime is at least a significant fraction of
quota, and when "scheduled on distinct cpus, and then never run again"
happens. How sequential it needs to be depends on the threads vs num
cpus.
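
For concreteness, a quick userspace sketch of that bound with illustrative
numbers (80 cpus, 1ms of min_cfs_rq_runtime, the default 5ms slice; plain
userspace C, not kernel code):

#include <stdio.h>

int main(void)
{
	unsigned int nr_cpus = 80;
	unsigned long long min_cfs_rq_runtime_us = 1000;	/* 1ms */
	unsigned long long slice_us = 5000;			/* 5ms */

	/* 1ms can be stranded on every cpu, minus the ~3 slices that get
	 * handed back once cfs_b nears the end of its quota. */
	unsigned long long stranded_us =
		nr_cpus * min_cfs_rq_runtime_us - 3 * slice_us;

	printf("worst-case stranded runtime: %llu us\n", stranded_us);
	return 0;
}

That works out to 80 * 1ms - 3 * 5ms = 65ms of stranded runtime in the worst
case, and only if the threads land on distinct cpus and then never run again.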


>
>> If the cost of taking this global lock across all cpus without a
>> ratelimit was somehow not a problem, I'd much prefer to just set
>> min_cfs_rq_runtime = 0. (Assuming it is, I definitely prefer the "lie
>> and sorta have 2x period 2x runtime" solution of removing expiration)
>
> Can I take this as a technical ack of the v3 patchset? Do you have any
> other comments for that patchset you'd like to see corrected before it
> gets integrated? If you are indeed good with that patchset, I'll clean
> up the comment and Documentation text and re-submit as a v5 patchset.
>
> I actually like that patchset best as well, for all the reasons already
> discussed. Especially the one where it's been running like this for 5
> years prior to 512ac999.
>
> Also given my new-found understanding of min_cfs_rq_runtime, doesn't
> this mean that if we don't expire runtime we could potentially have a
> worst case of 1ms * NUMCPUs of extra runtime instead of 2x runtime?
> This is because min_cfs_rq_runtime could theoretically live on a per-cpu runqueue
> indefinitely? Still this is actually much smaller than I previously had
> feared. Also it still holds that average runtime is still bounded by
> runtime/period on average.

Yeah, I think I was misremembering or thinking only about one example or
something. Meanwhile I'm thinking this is potentially /worse/ than just
2x runtime per 2x period, though obviously it depends on cpus vs quota.


One thing I'm trying now is if we made the slack timer go and collect
all the min_cfs_rq_runtime left behind, which would provide something
similar to the min=0 case for fixing not only this but other cases of
stranding bandwidth, but also some level of rate limiting.



>
> Thanks

2019-07-01 20:16:47

by Benjamin Segall

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

Alright, this prototype of "maybe we should just 100% avoid stranding
runtime for a full period" appears to fix the fibtest synthetic example,
and seems like a theoretically-reasonable approach.

Things that may want improvement or at least thought (but it's a holiday
week in the US and I wanted any opinions on the basic approach):

- I don't like cfs_rq->slack_list, since logically it's mutually
exclusive with throttled_list, but we have to iterate without
locks, so I don't know that we can avoid it.

- I previously was using _rcu for the slack list, like throttled, but
there is no list_for_each_entry_rcu_safe, so the list_del_init would
be invalid and we'd have to use another flag or opencode the
equivalent.

- (Actually, this just made me realize that distribute is sorta wrong if
the unthrottled task immediately runs and rethrottles; this would just
mean that we effectively restart the loop)

- We unconditionally start the slack timer, even if nothing is
throttled. We could instead have throttle start the timer in this case
(setting the timeout some definition of "appropriately"), but this
bookkeeping would be a big hassle.

- We could try to do better about deciding what cfs_rqs are idle than
"nr_running == 0", possibly requiring that to have been true for N<5
ms, and restarting the slack timer if we didn't clear everything.

- Taking up-to-every rq->lock is bad and expensive and 5ms may be too
short a delay for this. I haven't tried microbenchmarks on the cost of
this vs min_cfs_rq_runtime = 0 vs baseline.

- runtime_expires vs expires_seq choice is basically rand(), much like
the existing code. (I think the most consistent with existing code
would be runtime_expires, since cfs_b lock is held; I think most
consistent in general would change most of the existing ones as well
to be seq)


-- >8 --
Subject: [PATCH] sched: avoid stranding cfs_bandwidth runtime

We avoid contention on the per-tg cfs_b->lock by keeping 1ms of runtime on a
cfs_rq even when all tasks in that cfs_rq dequeue. This way tasks doing frequent
wake/sleep can't hit this cross-cpu lock more than once per ms. This however
means that up to 1ms of runtime per cpu can be lost if no task wakes up on
that cpu, which is leading to issues on cgroups with low quota, many available
cpus, and a combination of threads that run for very little time and ones that
want to run constantly.

This was previously hidden by runtime expiration being broken, which allowed
this stranded runtime to be kept indefinitely across period resets. Thus after
an initial period or two any long-running tasks could use an appropriate portion
of their group's quota. The issue was that the group could also potentially
burst for 1ms * cpus more than their quota allowed, and in these situations this
is a significant increase.

Fix this by having the group's slack timer (which runs at most once per 5ms)
remove all runtime from empty cfs_rqs, not just redistribute any runtime above
that 1ms that was returned immediately.

Signed-off-by: Ben Segall <[email protected]>
---
kernel/sched/fair.c | 66 +++++++++++++++++++++++++++++++++++---------
kernel/sched/sched.h | 2 ++
2 files changed, 55 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 300f2c54dea5..80b2198d9b29 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4745,23 +4745,32 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime;

- if (slack_runtime <= 0)
+ if (cfs_rq->runtime_remaining <= 0)
+ return;
+
+ if (slack_runtime <= 0 && !list_empty(&cfs_rq->slack_list))
return;

raw_spin_lock(&cfs_b->lock);
if (cfs_b->quota != RUNTIME_INF &&
- cfs_rq->runtime_expires == cfs_b->runtime_expires) {
- cfs_b->runtime += slack_runtime;
+ cfs_rq->expires_seq == cfs_b->expires_seq) {
+ if (slack_runtime > 0)
+ cfs_b->runtime += slack_runtime;
+ if (list_empty(&cfs_rq->slack_list))
+ list_add(&cfs_rq->slack_list, &cfs_b->slack_cfs_rq);

- /* we are under rq->lock, defer unthrottling using a timer */
- if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
- !list_empty(&cfs_b->throttled_cfs_rq))
- start_cfs_slack_bandwidth(cfs_b);
+ /*
+ * After a timeout, gather our remaining runtime so it can't get
+ * stranded. We need a timer anyways to distribute any of the
+ * runtime due to locking issues.
+ */
+ start_cfs_slack_bandwidth(cfs_b);
}
raw_spin_unlock(&cfs_b->lock);

- /* even if it's not valid for return we don't want to try again */
- cfs_rq->runtime_remaining -= slack_runtime;
+ if (slack_runtime > 0)
+ /* even if it's not valid for return we don't want to try again */
+ cfs_rq->runtime_remaining -= slack_runtime;
}

static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
@@ -4781,12 +4790,41 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
*/
static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
{
- u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
+ u64 runtime = 0;
unsigned long flags;
u64 expires;
+ struct cfs_rq *cfs_rq, *temp;
+ LIST_HEAD(temp_head);
+
+ local_irq_save(flags);
+
+ raw_spin_lock(&cfs_b->lock);
+ cfs_b->slack_started = false;
+ list_splice_init(&cfs_b->slack_cfs_rq, &temp_head);
+ raw_spin_unlock(&cfs_b->lock);
+
+
+ /* Gather all left over runtime from all rqs */
+ list_for_each_entry_safe(cfs_rq, temp, &temp_head, slack_list) {
+ struct rq *rq = rq_of(cfs_rq);
+ struct rq_flags rf;
+
+ rq_lock(rq, &rf);
+
+ raw_spin_lock(&cfs_b->lock);
+ list_del_init(&cfs_rq->slack_list);
+ if (!cfs_rq->nr_running && cfs_rq->runtime_remaining > 0 &&
+ cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+ cfs_b->runtime += cfs_rq->runtime_remaining;
+ cfs_rq->runtime_remaining = 0;
+ }
+ raw_spin_unlock(&cfs_b->lock);
+
+ rq_unlock(rq, &rf);
+ }

/* confirm we're still not at a refresh boundary */
- raw_spin_lock_irqsave(&cfs_b->lock, flags);
+ raw_spin_lock(&cfs_b->lock);
cfs_b->slack_started = false;
if (cfs_b->distribute_running) {
raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
@@ -4798,7 +4836,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
return;
}

- if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
+ if (cfs_b->quota != RUNTIME_INF)
runtime = cfs_b->runtime;

expires = cfs_b->runtime_expires;
@@ -4946,6 +4984,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
cfs_b->period = ns_to_ktime(default_cfs_period());

INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
+ INIT_LIST_HEAD(&cfs_b->slack_cfs_rq);
hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
cfs_b->period_timer.function = sched_cfs_period_timer;
hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
@@ -4958,6 +4997,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
cfs_rq->runtime_enabled = 0;
INIT_LIST_HEAD(&cfs_rq->throttled_list);
+ INIT_LIST_HEAD(&cfs_rq->slack_list);
}

void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b08dee29ef5e..3b272ee894fb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -345,6 +345,7 @@ struct cfs_bandwidth {
struct hrtimer period_timer;
struct hrtimer slack_timer;
struct list_head throttled_cfs_rq;
+ struct list_head slack_cfs_rq;

/* Statistics: */
int nr_periods;
@@ -566,6 +567,7 @@ struct cfs_rq {
int throttled;
int throttle_count;
struct list_head throttled_list;
+ struct list_head slack_list;
#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */
};
--
2.22.0.410.gd8fdbe21b5-goog

2019-07-11 10:21:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices


FWIW, good to see progress, still waiting for you guys to agree :-)

On Mon, Jul 01, 2019 at 01:15:44PM -0700, [email protected] wrote:

> - Taking up-to-every rq->lock is bad and expensive and 5ms may be too
> short a delay for this. I haven't tried microbenchmarks on the cost of
> this vs min_cfs_rq_runtime = 0 vs baseline.

Yes, that's tricky, SGI/HPE have definite ideas about that.

> @@ -4781,12 +4790,41 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> */
> static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> {
> - u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
> + u64 runtime = 0;
> unsigned long flags;
> u64 expires;
> + struct cfs_rq *cfs_rq, *temp;
> + LIST_HEAD(temp_head);
> +
> + local_irq_save(flags);
> +
> + raw_spin_lock(&cfs_b->lock);
> + cfs_b->slack_started = false;
> + list_splice_init(&cfs_b->slack_cfs_rq, &temp_head);
> + raw_spin_unlock(&cfs_b->lock);
> +
> +
> + /* Gather all left over runtime from all rqs */
> + list_for_each_entry_safe(cfs_rq, temp, &temp_head, slack_list) {
> + struct rq *rq = rq_of(cfs_rq);
> + struct rq_flags rf;
> +
> + rq_lock(rq, &rf);
> +
> + raw_spin_lock(&cfs_b->lock);
> + list_del_init(&cfs_rq->slack_list);
> + if (!cfs_rq->nr_running && cfs_rq->runtime_remaining > 0 &&
> + cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> + cfs_b->runtime += cfs_rq->runtime_remaining;
> + cfs_rq->runtime_remaining = 0;
> + }
> + raw_spin_unlock(&cfs_b->lock);
> +
> + rq_unlock(rq, &rf);
> + }

But worse still, you take possibly every rq->lock without ever
re-enabling IRQs.

>
> /* confirm we're still not at a refresh boundary */
> - raw_spin_lock_irqsave(&cfs_b->lock, flags);
> + raw_spin_lock(&cfs_b->lock);
> cfs_b->slack_started = false;
> if (cfs_b->distribute_running) {
> raw_spin_unlock_irqrestore(&cfs_b->lock, flags);

2019-07-11 18:22:10

by Benjamin Segall

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

Peter Zijlstra <[email protected]> writes:

> FWIW, good to see progress, still waiting for you guys to agree :-)
>
> On Mon, Jul 01, 2019 at 01:15:44PM -0700, [email protected] wrote:
>
>> - Taking up-to-every rq->lock is bad and expensive and 5ms may be too
>> short a delay for this. I haven't tried microbenchmarks on the cost of
>> this vs min_cfs_rq_runtime = 0 vs baseline.
>
> Yes, that's tricky, SGI/HPE have definite ideas about that.
>
>> @@ -4781,12 +4790,41 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>> */
>> static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>> {
>> - u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
>> + u64 runtime = 0;
>> unsigned long flags;
>> u64 expires;
>> + struct cfs_rq *cfs_rq, *temp;
>> + LIST_HEAD(temp_head);
>> +
>> + local_irq_save(flags);
>> +
>> + raw_spin_lock(&cfs_b->lock);
>> + cfs_b->slack_started = false;
>> + list_splice_init(&cfs_b->slack_cfs_rq, &temp_head);
>> + raw_spin_unlock(&cfs_b->lock);
>> +
>> +
>> + /* Gather all left over runtime from all rqs */
>> + list_for_each_entry_safe(cfs_rq, temp, &temp_head, slack_list) {
>> + struct rq *rq = rq_of(cfs_rq);
>> + struct rq_flags rf;
>> +
>> + rq_lock(rq, &rf);
>> +
>> + raw_spin_lock(&cfs_b->lock);
>> + list_del_init(&cfs_rq->slack_list);
>> + if (!cfs_rq->nr_running && cfs_rq->runtime_remaining > 0 &&
>> + cfs_rq->runtime_expires == cfs_b->runtime_expires) {
>> + cfs_b->runtime += cfs_rq->runtime_remaining;
>> + cfs_rq->runtime_remaining = 0;
>> + }
>> + raw_spin_unlock(&cfs_b->lock);
>> +
>> + rq_unlock(rq, &rf);
>> + }
>
> But worse still, you take possibly every rq->lock without ever
> re-enabling IRQs.
>

Yeah, I'm not sure why I did that, it isn't correctness.
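
A sketch of the loop with a per-rq irqsave section instead of one
local_irq_save() around the whole walk (untested; the cfs_b->lock
take/release around the list splice and the refresh-boundary check would
then need their own irqsave/irqrestore):

	/* Gather all left over runtime from all rqs */
	list_for_each_entry_safe(cfs_rq, temp, &temp_head, slack_list) {
		struct rq *rq = rq_of(cfs_rq);
		struct rq_flags rf;

		/* IRQs are only disabled while this one rq is locked */
		rq_lock_irqsave(rq, &rf);

		raw_spin_lock(&cfs_b->lock);
		list_del_init(&cfs_rq->slack_list);
		if (!cfs_rq->nr_running && cfs_rq->runtime_remaining > 0 &&
		    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
			cfs_b->runtime += cfs_rq->runtime_remaining;
			cfs_rq->runtime_remaining = 0;
		}
		raw_spin_unlock(&cfs_b->lock);

		rq_unlock_irqrestore(rq, &rf);
	}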

2019-07-12 18:02:20

by Benjamin Segall

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

Dave Chiluk <[email protected]> writes:

> So I spent some more time testing this new patch as is *(interrupts disabled). I know I probably should have fixed the patch, but it's hard to get time on big test hardware sometimes, and I was already well along my way with testing.
>
> In regards to the quota usage overage I was seeing earlier: I have a
> theory as to what might be happening here, and I'm pretty sure it's
> related to the IRQs being disabled during the rq->lock walk. I think
> that the main fast thread was able to use an excess amount of quota
> because the timer interrupt meant to stop it wasn't being handled
> timely due to the interrupts being disabled. On my 8 core machine this
> resulted in a what looked like simply improved usage of the quota, but
> when I ran the test on an 80 core machine I saw a massive overage of
> cpu usage when running fibtest. Specifically when running fibtest for
> 5 seconds with 50ms quota/100ms period expecting ~2500ms of quota
> usage; I got 3731 ms of cpu usage which was an unexpected overage of
> 1231ms. Is that a reasonable theory?

That doesn't seem likely - taking 1ms would be way longer than I'd expect
to begin with, and runtime_remaining can go negative for that sort of
reason anyways assuming the irq time is even counted towards the task.
Also I don't think that the enable-irqs version will help for the scheduler
tick at least without rt patchsets.

That is still also too much for what I was thinking of though. I'll have
to look into this more.

>
> I'll try to get some time again tomorrow to test with IRQs disabled before the walk. Ben if you have a chance to fix and resend the patch that'd help.
>
> I'm really starting to think that simply removing the quota expiration
> may be the best solution here. Mathematically it works out, it makes
> the code simpler, it doesn't have any of the lock walk issues, it
> doesn't add extra latency or overhead due to the slack timer,

It works out _for the job that is supposed to be throttled_. If the job
then gets a burst of actually-expensive work on many threads it can then
use up to NCPUs * 1ms of extra runtime, adding latency to any other job on the system. Given
that it's still only 1ms on each runqueue, maybe this isn't the end of
the world, but the fail case does exist.

(We have to do exactly the same locking stuff on distribute, both more
rarely on the period timer, and on the currently existing slack timer)

> and that behavior is exactly what the kernel was doing for 5 years with few complaints about overage afaik.
>
> Either way, I'm very glad that we are getting to the end of this one, and all solutions appear to solve the core of the problem. I thank you all for the work you guys have put into this.
>
> On Thu, Jul 11, 2019 at 12:46 PM <[email protected]> wrote:
>
> Peter Zijlstra <[email protected]> writes:
>
> > FWIW, good to see progress, still waiting for you guys to agree :-)
> >
> > On Mon, Jul 01, 2019 at 01:15:44PM -0700, [email protected] wrote:
> >
> >> - Taking up-to-every rq->lock is bad and expensive and 5ms may be too
> >> short a delay for this. I haven't tried microbenchmarks on the cost of
> >> this vs min_cfs_rq_runtime = 0 vs baseline.
> >
> > Yes, that's tricky, SGI/HPE have definite ideas about that.
> >
> >> @@ -4781,12 +4790,41 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> >> */
> >> static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> >> {
> >> - u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
> >> + u64 runtime = 0;
> >> unsigned long flags;
> >> u64 expires;
> >> + struct cfs_rq *cfs_rq, *temp;
> >> + LIST_HEAD(temp_head);
> >> +
> >> + local_irq_save(flags);
> >> +
> >> + raw_spin_lock(&cfs_b->lock);
> >> + cfs_b->slack_started = false;
> >> + list_splice_init(&cfs_b->slack_cfs_rq, &temp_head);
> >> + raw_spin_unlock(&cfs_b->lock);
> >> +
> >> +
> >> + /* Gather all left over runtime from all rqs */
> >> + list_for_each_entry_safe(cfs_rq, temp, &temp_head, slack_list) {
> >> + struct rq *rq = rq_of(cfs_rq);
> >> + struct rq_flags rf;
> >> +
> >> + rq_lock(rq, &rf);
> >> +
> >> + raw_spin_lock(&cfs_b->lock);
> >> + list_del_init(&cfs_rq->slack_list);
> >> + if (!cfs_rq->nr_running && cfs_rq->runtime_remaining > 0 &&
> >> + cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> >> + cfs_b->runtime += cfs_rq->runtime_remaining;
> >> + cfs_rq->runtime_remaining = 0;
> >> + }
> >> + raw_spin_unlock(&cfs_b->lock);
> >> +
> >> + rq_unlock(rq, &rf);
> >> + }
> >
> > But worse still, you take possibly every rq->lock without ever
> > re-enabling IRQs.
> >
>
> Yeah, I'm not sure why I did that, it isn't correctness.

2019-07-12 22:10:47

by Benjamin Segall

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

Dave Chiluk <[email protected]> writes:

> So I spent some more time testing this new patch as is *(interrupts disabled). I know I probably should have fixed the patch, but it's hard to get time on big test hardware sometimes, and I was already well along my way with testing.
>
> In regards to the quota usage overage I was seeing earlier: I have a theory as to what might be happening here, and I'm pretty sure it's related to the IRQs being disabled during the rq->lock walk. I think that the main fast thread was able to use an excess amount
> of quota because the timer interrupt meant to stop it wasn't being handled timely due to the interrupts being disabled. On my 8 core machine this resulted in what looked like simply improved usage of the quota, but when I ran the test on an 80 core machine I
> saw a massive overage of cpu usage when running fibtest. Specifically when running fibtest for 5 seconds with 50ms quota/100ms period expecting ~2500ms of quota usage; I got 3731 ms of cpu usage which was an unexpected overage of 1231ms. Is that a
> reasonable theory?

I think I've figured out what's going on here (and a related issue
that gave me some inconsistency when trying to debug it): other "slow"
threads can wake up while the slack timer is in distribute and
double-spend some runtime. Since we lsub_positive rather than allow
cfs_b->runtime to be negative this double-spending is permanent, and can
go on indefinitely.
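
To spell that out, lsub_positive() on cfs_b->runtime behaves roughly like
the simplified model below (a sketch, not the kernel macro): subtract, but
clamp at zero rather than going negative, so runtime pulled concurrently by
waking threads while distribute runs is never charged back.

/* Simplified model of lsub_positive(): subtract, clamping at zero. */
static void lsub_positive_model(unsigned long long *val, unsigned long long sub)
{
	*val -= (sub < *val) ? sub : *val;
}

/*
 * Example of the double-spend: the pool holds 5ms, distribute hands out all
 * 5ms with cfs_b->lock dropped, and waking "slow" threads concurrently pull
 * another 3ms via assign_cfs_rq_runtime (pool now 2ms). lsub_positive(&pool,
 * 5ms) then clamps the pool to 0 instead of -3ms, so the 3ms overspend is
 * never charged against future periods.
 */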

In addition, if things fall out in a slightly different way, all the
"slow" threads can wind up getting on cpu and claiming slices of runtime
before the "fast" thread, and then it just has to wait another slack
period to hope that the ordering winds up better that time. This just
depends on things like IPI latency and maybe what order things happened
to happen at the start of the period.

Ugh. Maybe we /do/ just give up and say that most people don't seem to
be using cfs_b in a way that expiration of the leftover 1ms matters.

2019-07-15 15:48:25

by Dave Chiluk

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

On Fri, Jul 12, 2019 at 5:10 PM <[email protected]> wrote:
> Ugh. Maybe we /do/ just give up and say that most people don't seem to
> be using cfs_b in a way that expiration of the leftover 1ms matters.

That was my conclusion as well. Does this mean you want to proceed
with my patch set? Do you have any changes you want made to the
proposed documentation changes, or any other changes for that matter?

2019-07-16 19:58:47

by Benjamin Segall

[permalink] [raw]
Subject: Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

Dave Chiluk <[email protected]> writes:

> It has been observed, that highly-threaded, non-cpu-bound applications
> running under cpu.cfs_quota_us constraints can hit a high percentage of
> periods throttled while simultaneously not consuming the allocated
> amount of quota. This use case is typical of user-interactive non-cpu
> bound applications, such as those running in kubernetes or mesos when
> run on multiple cpu cores.
>
> This has been root caused to threads being allocated per cpu bandwidth
> slices, and then not fully using that slice within the period. At which
> point the slice and quota expires. This expiration of unused slice
> results in applications not being able to utilize the quota for which
> they are allocated.
>
> The expiration of per-cpu slices was recently fixed by
> 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> condition")'. Prior to that it appears that this has been broken since
> at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
> added the following conditional which resulted in slices never being
> expired.
>
> if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> /* extend local deadline, drift is bounded above by 2 ticks */
> cfs_rq->runtime_expires += TICK_NSEC;
>
> Because this was broken for nearly 5 years, and has recently been fixed
> and is now being noticed by many users running kubernetes
> (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
> that the mechanisms around expiring runtime should be removed
> altogether.
>
> This allows quota already allocated to per-cpu run-queues to live longer
> than the period boundary. This allows threads on runqueues that do not
> use much CPU to continue to use their remaining slice over a longer
> period of time than cpu.cfs_period_us. However, this helps prevent the
> above condition of hitting throttling while also not fully utilizing
> your cpu quota.
>
> This theoretically allows a machine to use slightly more than its
> allotted quota in some periods. This overflow would be bounded by the
> remaining quota left on each per-cpu runqueue. This is typically no
> more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
> change nothing, as they should theoretically fully utilize all of their
> quota in each period. For user-interactive tasks as described above this
> provides a much better user/application experience as their cpu
> utilization will more closely match the amount they requested when they
> hit throttling. This means that cpu limits no longer strictly apply per
> period for non-cpu bound applications, but that they are still accurate
> over longer timeframes.
>
> This greatly improves performance of high-thread-count, non-cpu bound
> applications with low cfs_quota_us allocation on high-core-count
> machines. In the case of an artificial testcase (10ms/100ms of quota on
> 80 CPU machine), this commit resulted in almost 30x performance
> improvement, while still maintaining correct cpu quota restrictions.
> That testcase is available at https://github.com/indeedeng/fibtest.
>
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Signed-off-by: Dave Chiluk <[email protected]>
> ---
> Documentation/scheduler/sched-bwc.txt | 73 ++++++++++++++++++++++++++++-------
> kernel/sched/fair.c | 71 +++-------------------------------
> kernel/sched/sched.h | 4 --
> 3 files changed, 65 insertions(+), 83 deletions(-)
>
> [...]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f..a675c69 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> if (!runtime)
> return;
>
> - runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
> + runtime = distribute_cfs_runtime(cfs_b, runtime);
>
> raw_spin_lock_irqsave(&cfs_b->lock, flags);
> - if (expires == cfs_b->runtime_expires)
> - lsub_positive(&cfs_b->runtime, runtime);

The lsub_positive is still needed, just get rid of the if.
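
That is, the end of do_sched_cfs_slack_timer() would presumably become
(sketch, assuming the rest of the v5 patch as posted):

	runtime = distribute_cfs_runtime(cfs_b, runtime);

	raw_spin_lock_irqsave(&cfs_b->lock, flags);
	/* charge the distributed amount back to the global pool, clamped at 0 */
	lsub_positive(&cfs_b->runtime, runtime);
	cfs_b->distribute_running = 0;
	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);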


Other than that,
Reviewed-by: Ben Segall <[email protected]>

2019-07-24 02:25:32

by Dave Chiluk

[permalink] [raw]
Subject: [PATCH v6 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

Changelog v6
- Added back missing call to lsub_positive(&cfs_b->runtime, runtime);
- Added Reviewed-by: Ben Segall <[email protected]>
- Fix some grammar in the Documentation, and change some wording.
- Updated documentation due to the .rst change

Changelog v5
- Based on this comment from Ben Segall on v4
> If the cost of taking this global lock across all cpus without a
> ratelimit was somehow not a problem, I'd much prefer to just set
> min_cfs_rq_runtime = 0. (Assuming it is, I definitely prefer the "lie
> and sorta have 2x period 2x runtime" solution of removing expiration)
I'm resubmitting my v3 patchset, with the requested changes.
- Updated Commit log given review comments
- Update sched-bwc.txt given my new understanding of the slack timer.

Changelog v4
- Rewrote patchset around the concept of returning all of runtime_remaining
when cfs_b nears the end of available quota.

Changelog v3
- Reworked documentation to better describe behavior of slice expiration per
feedback from Peter Oskolkov

Changelog v2
- Fixed some checkpatch errors in the commit message.