2011-02-16 03:20:10

by Paul Turner

Subject: [CFS Bandwidth Control v4 2/7] sched: accumulate per-cfs_rq cpu usage

Introduce account_cfs_rq_quota() to account bandwidth usage on the cfs_rq
level versus task_groups for which bandwidth has been assigned. This is
tracked by whether the local cfs_rq->quota_assigned is finite or infinite
(RUNTIME_INF).

For cfs_rq's that belong to a bandwidth constrained task_group we introduce
tg_request_cfs_quota() which attempts to allocate quota from the global pool
for use locally. Updates involving the global pool are currently protected
under cfs_bandwidth->lock, local pools are protected by rq->lock.

This patch only attempts to assign and track quota, no action is taken in the
case that cfs_rq->quota_used exceeds cfs_rq->quota_assigned.

Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Nikhil Rao <[email protected]>
Signed-off-by: Bharata B Rao <[email protected]>
---
include/linux/sched.h | 4 +++
kernel/sched_fair.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sysctl.c | 10 ++++++++
3 files changed, 76 insertions(+)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -95,6 +95,13 @@ unsigned int __read_mostly sysctl_sched_
* default: 0.5s, units: nanoseconds
*/
static u64 sched_cfs_bandwidth_period = 500000000ULL;
+
+/*
+ * default slice of quota to allocate from global tg to local cfs_rq pool on
+ * each refresh
+ * default: 10ms, units: microseconds
+ */
+unsigned int sysctl_sched_cfs_bandwidth_slice = 10000UL;
#endif

static const struct sched_class fair_sched_class;
@@ -313,6 +320,21 @@ find_matching_se(struct sched_entity **s

#endif /* CONFIG_FAIR_GROUP_SCHED */

+#ifdef CONFIG_CFS_BANDWIDTH
+static inline u64 sched_cfs_bandwidth_slice(void)
+{
+ return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
+}
+
+static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
+{
+ return &tg->cfs_bandwidth;
+}
+
+static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
+ unsigned long delta_exec);
+#endif
+

/**************************************************************
* Scheduling class tree data structure manipulation methods:
@@ -609,6 +631,9 @@ static void update_curr(struct cfs_rq *c
cpuacct_charge(curtask, delta_exec);
account_group_exec_runtime(curtask, delta_exec);
}
+#ifdef CONFIG_CFS_BANDWIDTH
+ account_cfs_rq_quota(cfs_rq, delta_exec);
+#endif
}

static inline void
@@ -1382,6 +1407,43 @@ static void dequeue_task_fair(struct rq
}

#ifdef CONFIG_CFS_BANDWIDTH
+static u64 tg_request_cfs_quota(struct task_group *tg)
+{
+ struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+ u64 delta = 0;
+
+ if (cfs_b->runtime > 0 || cfs_b->quota == RUNTIME_INF) {
+ raw_spin_lock(&cfs_b->lock);
+ /*
+ * it's possible a bandwidth update has changed the global
+ * pool.
+ */
+ if (cfs_b->quota == RUNTIME_INF)
+ delta = sched_cfs_bandwidth_slice();
+ else {
+ delta = min(cfs_b->runtime,
+ sched_cfs_bandwidth_slice());
+ cfs_b->runtime -= delta;
+ }
+ raw_spin_unlock(&cfs_b->lock);
+ }
+ return delta;
+}
+
+static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
+ unsigned long delta_exec)
+{
+ if (cfs_rq->quota_assigned == RUNTIME_INF)
+ return;
+
+ cfs_rq->quota_used += delta_exec;
+
+ if (cfs_rq->quota_used < cfs_rq->quota_assigned)
+ return;
+
+ cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
+}
+
static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
{
return 1;
Index: tip/kernel/sysctl.c
===================================================================
--- tip.orig/kernel/sysctl.c
+++ tip/kernel/sysctl.c
@@ -361,6 +361,16 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = sched_rt_handler,
},
+#ifdef CONFIG_CFS_BANDWIDTH
+ {
+ .procname = "sched_cfs_bandwidth_slice_us",
+ .data = &sysctl_sched_cfs_bandwidth_slice,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &one,
+ },
+#endif
#ifdef CONFIG_SCHED_AUTOGROUP
{
.procname = "sched_autogroup_enabled",
Index: tip/include/linux/sched.h
===================================================================
--- tip.orig/include/linux/sched.h
+++ tip/include/linux/sched.h
@@ -1943,6 +1943,10 @@ int sched_rt_handler(struct ctl_table *t
void __user *buffer, size_t *lenp,
loff_t *ppos);

+#ifdef CONFIG_CFS_BANDWIDTH
+extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+#endif
+
#ifdef CONFIG_SCHED_AUTOGROUP
extern unsigned int sysctl_sched_autogroup_enabled;
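
As an illustration of the new knob (not part of the patch itself), a minimal
userspace sketch that writes the sched_cfs_bandwidth_slice_us sysctl added
above; the 5ms value is only an example:

/*
 * Example only: shrink the per-cfs_rq slice from its 10ms default to 5ms.
 * Units are microseconds, matching sysctl_sched_cfs_bandwidth_slice.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/sched_cfs_bandwidth_slice_us", "w");

	if (!f) {
		perror("sched_cfs_bandwidth_slice_us");
		return 1;
	}
	fprintf(f, "%u\n", 5000u);
	return fclose(f) ? 1 : 0;
}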



2011-02-16 17:45:37

by Balbir Singh

Subject: Re: [CFS Bandwidth Control v4 2/7] sched: accumulate per-cfs_rq cpu usage

* Paul Turner <[email protected]> [2011-02-15 19:18:33]:

> Introduce account_cfs_rq_quota() to account bandwidth usage on the cfs_rq
> level versus task_groups for which bandwidth has been assigned. This is
> tracked by whether the local cfs_rq->quota_assigned is finite or infinite
> (RUNTIME_INF).
>
> For cfs_rq's that belong to a bandwidth constrained task_group we introduce
> tg_request_cfs_quota() which attempts to allocate quota from the global pool
> for use locally. Updates involving the global pool are currently protected
> under cfs_bandwidth->lock, local pools are protected by rq->lock.
>
> This patch only attempts to assign and track quota, no action is taken in the
> case that cfs_rq->quota_used exceeds cfs_rq->quota_assigned.
>
> Signed-off-by: Paul Turner <[email protected]>
> Signed-off-by: Nikhil Rao <[email protected]>
> Signed-off-by: Bharata B Rao <[email protected]>
> ---


Acked-by: Balbir Singh <[email protected]>


--
Three Cheers,
Balbir

2011-02-23 13:32:40

by Peter Zijlstra

Subject: Re: [CFS Bandwidth Control v4 2/7] sched: accumulate per-cfs_rq cpu usage

On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:

> @@ -609,6 +631,9 @@ static void update_curr(struct cfs_rq *c
> cpuacct_charge(curtask, delta_exec);
> account_group_exec_runtime(curtask, delta_exec);
> }
> +#ifdef CONFIG_CFS_BANDWIDTH
> + account_cfs_rq_quota(cfs_rq, delta_exec);
> +#endif
> }

Not too hard to make the #ifdef'ery go away I'd guess.

> static inline void
> @@ -1382,6 +1407,43 @@ static void dequeue_task_fair(struct rq
> }
>
> #ifdef CONFIG_CFS_BANDWIDTH
> +static u64 tg_request_cfs_quota(struct task_group *tg)
> +{
> + struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> + u64 delta = 0;
> +
> + if (cfs_b->runtime > 0 || cfs_b->quota == RUNTIME_INF) {
> + raw_spin_lock(&cfs_b->lock);
> + /*
> + * it's possible a bandwidth update has changed the global
> + * pool.
> + */
> + if (cfs_b->quota == RUNTIME_INF)
> + delta = sched_cfs_bandwidth_slice();

Why do we bother at all when there's infinite time? Shouldn't the action
that sets it to infinite also set cfs_rq->quota_assigned to
RUNTIME_INF, in which case the below check will make it all go away?

> + else {
> + delta = min(cfs_b->runtime,
> + sched_cfs_bandwidth_slice());
> + cfs_b->runtime -= delta;
> + }
> + raw_spin_unlock(&cfs_b->lock);
> + }
> + return delta;
> +}

Also, shouldn't this all try and steal time from other cpus when the
global limit ran out? Suppose you have say 24 cpus, and you had a short
burst where all 24 cpus had some runtime, so you distribute 240ms. But
23 of those cpus only ran for 0.5ms, leaving 23.5ms of unused time on 23
cpus while your one active cpu will then throttle.

I would much rather see all the accounting tight first and optimize
later than start with leaky stuff and try and plug holes later.

> +
> +static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
> + unsigned long delta_exec)
> +{
> + if (cfs_rq->quota_assigned == RUNTIME_INF)
> + return;
> +
> + cfs_rq->quota_used += delta_exec;
> +
> + if (cfs_rq->quota_used < cfs_rq->quota_assigned)
> + return;
> +
> + cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
> +}

So why isn't this hierarchical? Also, all this positive quota stuff
looks weird, why not decrement and try to supplement when negative?

2011-02-25 03:34:27

by Paul Turner

Subject: Re: [CFS Bandwidth Control v4 2/7] sched: accumulate per-cfs_rq cpu usage

On Wed, Feb 23, 2011 at 5:32 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:
>
>> @@ -609,6 +631,9 @@ static void update_curr(struct cfs_rq *c
>> 		cpuacct_charge(curtask, delta_exec);
>> 		account_group_exec_runtime(curtask, delta_exec);
>> 	}
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +	account_cfs_rq_quota(cfs_rq, delta_exec);
>> +#endif
>>  }
>
> Not too hard to make the #ifdef'ery go away I'd guess.
>

Done

>>  static inline void
>> @@ -1382,6 +1407,43 @@ static void dequeue_task_fair(struct rq
>>  }
>>
>>  #ifdef CONFIG_CFS_BANDWIDTH
>> +static u64 tg_request_cfs_quota(struct task_group *tg)
>> +{
>> +	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
>> +	u64 delta = 0;
>> +
>> +	if (cfs_b->runtime > 0 || cfs_b->quota == RUNTIME_INF) {
>> +		raw_spin_lock(&cfs_b->lock);
>> +		/*
>> +		 * it's possible a bandwidth update has changed the global
>> +		 * pool.
>> +		 */
>> +		if (cfs_b->quota == RUNTIME_INF)
>> +			delta = sched_cfs_bandwidth_slice();
>
> Why do we bother at all when there's infinite time? Shouldn't the action
> that sets it to infinite also set cfs_rq->quota_assigned to
> RUNTIME_INF, in which case the below check will make it all go away?
>

A bandwidth update might be racing with an entity's request for
bandwidth.

e.g. someone updates cpu.cfs_quota_us to infinite while there's
bandwidth distribution in flight.

In this case we need to return some sane value so that the thread
requesting bandwidth can complete that operation (releasing the lock
which will then be taken to set quota_assigned to INF).

But more importantly we don't want to decrement the value doled out
FROM cfs_b->runtime since that would change it from the magic
RUNTIME_INF. That's why the check exists.


>> +		else {
>> +			delta = min(cfs_b->runtime,
>> +				    sched_cfs_bandwidth_slice());
>> +			cfs_b->runtime -= delta;
>> +		}
>> +		raw_spin_unlock(&cfs_b->lock);
>> +	}
>> +	return delta;
>> +}
>
> Also, shouldn't this all try and steal time from other cpus when the
> global limit ran out? Suppose you have say 24 cpus, and you had a short
> burst where all 24 cpus had some runtime, so you distribute 240ms. But
> 23 of those cpus only ran for 0.5ms, leaving 23.5ms of unused time on 23
> cpus while your one active cpu will then throttle.
>

In practice this only affects the first period since that slightly
stale bandwidth is then available on those other 23 cpus the next time
a micro-burst occurs. In testing this has resulted in very stable
performance and "smooth" perturbations that resemble hardcapping by
affinity (for integer points).

The notion of stealing could certainly be introduced, the juncture of
reaching the zero point here would be a possible place to consider
that (although we would need to do a steal that avoids the asymptotic
convergence problem of RT).

I think returning (most of) the bandwidth to the global pool on
(voluntary) dequeue is a more scalable solution.
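
As an illustrative sketch only (not from any posted patch), returning surplus
local quota on voluntary dequeue might look roughly like the following;
return_cfs_rq_quota() is a hypothetical helper, and the quota-generation
tracking discussed below is deliberately omitted:

static void return_cfs_rq_quota(struct cfs_rq *cfs_rq)
{
	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
	s64 surplus = cfs_rq->quota_assigned - cfs_rq->quota_used;

	if (cfs_rq->quota_assigned == RUNTIME_INF || surplus <= 0)
		return;

	/* hand unused local quota back to the global pool */
	raw_spin_lock(&cfs_b->lock);
	if (cfs_b->quota != RUNTIME_INF)
		cfs_b->runtime += surplus;
	raw_spin_unlock(&cfs_b->lock);

	cfs_rq->quota_assigned -= surplus;
}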

> I would much rather see all the accounting tight first and optimize
> later than start with leaky stuff and try and plug holes later.
>

The complexity this introduces is non-trivial since the idea of
returning quota to the pool means you need to introduce the notion of
when that quota came to life (otherwise you get leaks in the reverse
direction!) -- as well as reversing some of the lock ordering.

While generations do this they don't greatly increase the efficacy and
I think there is value in performing the detailed review we are doing
now in isolation of that.

It's also still consistent regarding leakage, in that in any N
consecutive periods the maximum additional quota (by a user abusing
this) that can be received is N+1. Does the complexity trade-off
merit improving this bound at this point?

>> +
>> +static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
>> +		unsigned long delta_exec)
>> +{
>> +	if (cfs_rq->quota_assigned == RUNTIME_INF)
>> +		return;
>> +
>> +	cfs_rq->quota_used += delta_exec;
>> +
>> +	if (cfs_rq->quota_used < cfs_rq->quota_assigned)
>> +		return;
>> +
>> +	cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
>> +}
>
> So why isn't this hierarchical?

It is, naturally (since charging occurs within the existing hierarchical
accounting).

> also all this positive quota stuff
> looks weird, why not decrement and try to supplement when negative?
>

I thought I had a good reason for this.. at least I remember at one
point I think I did.. but I cannot see any need for it in the current
form. I will revise it to a single decremented quota_remaining.
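
A minimal sketch of that single-counter form, assuming a hypothetical s64
quota_remaining field in cfs_rq and detecting the unconstrained case via the
group's quota rather than the local counter:

static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
				 unsigned long delta_exec)
{
	if (tg_cfs_bandwidth(cfs_rq->tg)->quota == RUNTIME_INF)
		return;

	cfs_rq->quota_remaining -= delta_exec;

	/* only go back to the global pool once we have run negative */
	if (cfs_rq->quota_remaining > 0)
		return;

	cfs_rq->quota_remaining += tg_request_cfs_quota(cfs_rq->tg);
}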

2011-02-25 12:31:37

by Peter Zijlstra

Subject: Re: [CFS Bandwidth Control v4 2/7] sched: accumulate per-cfs_rq cpu usage

On Thu, 2011-02-24 at 19:33 -0800, Paul Turner wrote:

> >> #ifdef CONFIG_CFS_BANDWIDTH
> >> +static u64 tg_request_cfs_quota(struct task_group *tg)
> >> +{
> >> + struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> >> + u64 delta = 0;
> >> +
> >> + if (cfs_b->runtime > 0 || cfs_b->quota == RUNTIME_INF) {
> >> + raw_spin_lock(&cfs_b->lock);
> >> + /*
> >> + * it's possible a bandwidth update has changed the global
> >> + * pool.
> >> + */
> >> + if (cfs_b->quota == RUNTIME_INF)
> >> + delta = sched_cfs_bandwidth_slice();
> >
> > Why do we bother at all when there's infinite time? Shouldn't the action
> > that sets it to infinite also set cfs_rq->quota_assigned to
> > RUNTIME_INF, in which case the below check will make it all go away?

> But more importantly we don't want to decrement the value doled out
> FROM cfs_b->runtime since that would change it from the magic
> RUNTIME_INF. That's why the check exists.

Ah, quite so.

> >> + else {
> >> + delta = min(cfs_b->runtime,
> >> + sched_cfs_bandwidth_slice());
> >> + cfs_b->runtime -= delta;
> >> + }
> >> + raw_spin_unlock(&cfs_b->lock);
> >> + }
> >> + return delta;
> >> +}
> >
> > Also, shouldn't this all try and steal time from other cpus when the
> > global limit ran out? Suppose you have say 24 cpus, and you had a short
> > burst where all 24 cpus had some runtime, so you distribute 240ms. But
> > 23 of those cpus only ran for 0.5ms, leaving 23.5ms of unused time on 23
> > cpus while your one active cpu will then throttle.
> >
>
> In practice this only affects the first period since that slightly
> stale bandwidth is then available on those other 23 cpus the next time
> a micro-burst occurs. In testing this has resulted in very stable
> performance and "smooth" perturbations that resemble hardcapping by
> affinity (for integer points).
>
> The notion of stealing could certainly be introduced, the juncture of
> reaching the zero point here would be a possible place to consider
> that (although we would need to do a steal that avoids the asymptotic
> convergence problem of RT).
>
> I think returning (most of) the bandwidth to the global pool on
> (voluntary) dequeue is a more scalable solution.
>
> > I would much rather see all the accounting tight first and optimize
> > later than start with leaky stuff and try and plug holes later.
> >
>
> The complexity this introduces is non-trivial since the idea of
> returning quota to the pool means you need to introduce the notion of
> when that quota came to life (otherwise you get leaks in the reverse
> direction!) -- as well as reversing some of the lock ordering.

Right, nasty that, RT doesn't suffer this because of the lack of
over-commit.

> While generations do this they don't greatly increase the efficacy and
> I think there is value in performing the detailed review we are doing
> now in isolation of that.
>
> It's also still consistent regarding leakage, in that in any N
> consecutive periods the maximum additional quota (by a user abusing
> this) that can be received is N+1. Does the complexity trade-off
> merit improving this bound at this point?

Well, something, yes; with N being potentially very large indeed these
days we need some feedback.

One idea would be to keep a cpu mask in the bandwidth thing, set the
cpu's bit when it claims bandwidth from the global pool, and iterate and
clear the complete mask on each tick.

That also limits the scope on where to look for stealing time.
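
A rough sketch of that bookkeeping, assuming a hypothetical claimed_cpus
cpumask field in struct cfs_bandwidth (not part of the posted series); the
helper names are made up for illustration:

/* when a cpu pulls a slice (cfs_b->lock held in tg_request_cfs_quota()): */
static inline void cfs_b_mark_claim(struct cfs_bandwidth *cfs_b)
{
	cpumask_set_cpu(smp_processor_id(), &cfs_b->claimed_cpus);
}

/* from do_sched_cfs_period_timer(): walk only cpus that took quota, then reset */
static void cfs_b_scan_claims(struct cfs_bandwidth *cfs_b)
{
	int cpu;

	for_each_cpu(cpu, &cfs_b->claimed_cpus) {
		/* candidate cpu that may hold reclaimable local quota */
	}
	cpumask_clear(&cfs_b->claimed_cpus);
}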

> >> +static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
> >> + unsigned long delta_exec)
> >> +{
> >> + if (cfs_rq->quota_assigned == RUNTIME_INF)
> >> + return;
> >> +
> >> + cfs_rq->quota_used += delta_exec;
> >> +
> >> + if (cfs_rq->quota_used < cfs_rq->quota_assigned)
> >> + return;
> >> +
> >> + cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
> >> +}
> >
> > So why isn't this hierarchical?
>
> It is, naturally (since charging occurs within the existing hierarchical
> accounting).

D'oh yes.. somehow I totally missed that.