2010-01-05 07:57:35

by Bharata B Rao

Subject: [RFC v5 PATCH 0/8] CFS Hard limits - v5

Hi,

This is the v5 post of CFS hard limits. In this patchset, I have
pulled out bandwidth and runtime handling code from RT into sched.c
so that the same code can be used by CFS hard limits.

Also I have addressed the review comments given by Peter Zijlstra for v4.

Changes
-------
RFC v5:
- Make RT bandwidth and runtime handling code generic and use it in CFS also.
- Remove the *_locked() version from sched_fair.c by simplifying the locking.
This fixes the unlock imbalance bug seen by Jarek Dylag, who observed it
while using CFS hard limits with Linux vserver.

RFC v4:
- http://lkml.org/lkml/2009/11/17/191
- Reclaim runtimes lent to other cpus when a cpu goes
offline. (Kamalesh Babulal)
- Fixed a few bugs.
- Some cleanups.

RFC v3:
- http://lkml.org/lkml/2009/11/9/65
- Till v2, I was updating rq->nr_running when tasks went off and came back on
the runqueue during throttling and unthrottling. Don't do this any more.
- With the above change, quite a bit of code simplification is achieved.
Runtime-related fields of cfs_rq are now protected by a per-cfs_rq lock
instead of the per-rq lock. With this, it looks more similar to rt.
- Remove the control file cpu.cfs_hard_limit which enabled/disabled hard limits
for groups. Now hard limiting is enabled by setting a non-zero runtime.
- Don't explicitly prevent movement of tasks into throttled groups during
load balancing, since throttled entities are anyway prevented from being
enqueued in enqueue_task_fair().
- Moved to 2.6.32-rc6

RFC v2:
- http://lkml.org/lkml/2009/9/30/115
- Upgraded to 2.6.31.
- Added CFS runtime borrowing.
- New locking scheme
The hard limit specific fields of cfs_rq (cfs_runtime, cfs_time and
cfs_throttled) were being protected by rq->lock. This simple scheme will
not work when runtime rebalancing is introduced, since that requires
looking at these fields on other CPUs, which in turn means acquiring the
rq->lock of those CPUs. This is not feasible from update_curr().
Hence introduce a separate lock (rq->runtime_lock) to protect these
fields of all cfs_rqs under it.
- Handle the task wakeup in a throttled group correctly.
- Make CFS_HARD_LIMITS dependent on CGROUP_SCHED (Thanks to Andrea Righi)

RFC v1:
- First version of the patches with minimal features was posted at
http://lkml.org/lkml/2009/8/25/128

RFC v0:
- The CFS hard limits proposal was first posted at
http://lkml.org/lkml/2009/6/4/24

Patches description
-------------------
This post has the following patches:

1/8 sched: Rename struct rt_bandwidth to sched_bandwidth
2/8 sched: Make rt bandwidth timer and runtime related code generic
3/8 sched: Bandwidth initialization for fair task groups
4/8 sched: Enforce hard limits by throttling
5/8 sched: Unthrottle the throttled tasks
6/8 sched: Add throttle time statistics to /proc/sched_debug
7/8 sched: CFS runtime borrowing
8/8 sched: Hard limits documentation

Documentation/scheduler/sched-cfs-hard-limits.txt | 48 +
include/linux/sched.h | 6
init/Kconfig | 13
kernel/sched.c | 591 +++++++++++++++++---
kernel/sched_debug.c | 23
kernel/sched_fair.c | 354 +++++++++++
kernel/sched_rt.c | 268 +--------
7 files changed, 964 insertions(+), 339 deletions(-)

Regards,
Bharata.


2010-01-05 07:58:49

by Bharata B Rao

Subject: [RFC v5 PATCH 1/8] sched: Rename struct rt_bandwidth to sched_bandwidth

sched: Rename struct rt_bandwidth to sched_bandwidth

From: Dhaval Giani <[email protected]>

Rename struct rt_bandwidth to sched_bandwidth and rename some of the
routines to generic names (s/rt_/sched_/) so that they can be used
by the CFS hard limits code in subsequent patches.

No functionality change by this patch.

Signed-off-by: Dhaval Giani <[email protected]>
Signed-off-by: Bharata B Rao <[email protected]>
---
kernel/sched.c | 127 ++++++++++++++++++++++++++---------------------------
kernel/sched_rt.c | 46 ++++++++++---------
2 files changed, 86 insertions(+), 87 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index c535cc4..21cf0d5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -139,50 +139,50 @@ struct rt_prio_array {
struct list_head queue[MAX_RT_PRIO];
};

-struct rt_bandwidth {
+struct sched_bandwidth {
/* nests inside the rq lock: */
- raw_spinlock_t rt_runtime_lock;
- ktime_t rt_period;
- u64 rt_runtime;
- struct hrtimer rt_period_timer;
+ raw_spinlock_t runtime_lock;
+ ktime_t period;
+ u64 runtime;
+ struct hrtimer period_timer;
};

-static struct rt_bandwidth def_rt_bandwidth;
+static struct sched_bandwidth def_rt_bandwidth;

-static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun);
+static int do_sched_rt_period_timer(struct sched_bandwidth *sched_b, int overrun);

static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
{
- struct rt_bandwidth *rt_b =
- container_of(timer, struct rt_bandwidth, rt_period_timer);
+ struct sched_bandwidth *sched_b =
+ container_of(timer, struct sched_bandwidth, period_timer);
ktime_t now;
int overrun;
int idle = 0;

for (;;) {
now = hrtimer_cb_get_time(timer);
- overrun = hrtimer_forward(timer, now, rt_b->rt_period);
+ overrun = hrtimer_forward(timer, now, sched_b->period);

if (!overrun)
break;

- idle = do_sched_rt_period_timer(rt_b, overrun);
+ idle = do_sched_rt_period_timer(sched_b, overrun);
}

return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
}

-static
-void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime)
+static void init_sched_bandwidth(struct sched_bandwidth *sched_b, u64 period,
+ u64 runtime, enum hrtimer_restart (*period_timer)(struct hrtimer *))
{
- rt_b->rt_period = ns_to_ktime(period);
- rt_b->rt_runtime = runtime;
+ sched_b->period = ns_to_ktime(period);
+ sched_b->runtime = runtime;

- raw_spin_lock_init(&rt_b->rt_runtime_lock);
+ raw_spin_lock_init(&sched_b->runtime_lock);

- hrtimer_init(&rt_b->rt_period_timer,
+ hrtimer_init(&sched_b->period_timer,
CLOCK_MONOTONIC, HRTIMER_MODE_REL);
- rt_b->rt_period_timer.function = sched_rt_period_timer;
+ sched_b->period_timer.function = *period_timer;
}

static inline int rt_bandwidth_enabled(void)
@@ -190,42 +190,40 @@ static inline int rt_bandwidth_enabled(void)
return sysctl_sched_rt_runtime >= 0;
}

-static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+static void start_sched_bandwidth(struct sched_bandwidth *sched_b)
{
ktime_t now;

- if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
+ if (!rt_bandwidth_enabled() || sched_b->runtime == RUNTIME_INF)
return;

- if (hrtimer_active(&rt_b->rt_period_timer))
+ if (hrtimer_active(&sched_b->period_timer))
return;

- raw_spin_lock(&rt_b->rt_runtime_lock);
+ raw_spin_lock(&sched_b->runtime_lock);
for (;;) {
unsigned long delta;
ktime_t soft, hard;

- if (hrtimer_active(&rt_b->rt_period_timer))
+ if (hrtimer_active(&sched_b->period_timer))
break;

- now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
- hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
+ now = hrtimer_cb_get_time(&sched_b->period_timer);
+ hrtimer_forward(&sched_b->period_timer, now, sched_b->period);

- soft = hrtimer_get_softexpires(&rt_b->rt_period_timer);
- hard = hrtimer_get_expires(&rt_b->rt_period_timer);
+ soft = hrtimer_get_softexpires(&sched_b->period_timer);
+ hard = hrtimer_get_expires(&sched_b->period_timer);
delta = ktime_to_ns(ktime_sub(hard, soft));
- __hrtimer_start_range_ns(&rt_b->rt_period_timer, soft, delta,
+ __hrtimer_start_range_ns(&sched_b->period_timer, soft, delta,
HRTIMER_MODE_ABS_PINNED, 0);
}
- raw_spin_unlock(&rt_b->rt_runtime_lock);
+ raw_spin_unlock(&sched_b->runtime_lock);
}

-#ifdef CONFIG_RT_GROUP_SCHED
-static void destroy_rt_bandwidth(struct rt_bandwidth *rt_b)
+static void destroy_sched_bandwidth(struct sched_bandwidth *sched_b)
{
- hrtimer_cancel(&rt_b->rt_period_timer);
+ hrtimer_cancel(&sched_b->period_timer);
}
-#endif

/*
* sched_domains_mutex serializes calls to arch_init_sched_domains,
@@ -263,7 +261,7 @@ struct task_group {
struct sched_rt_entity **rt_se;
struct rt_rq **rt_rq;

- struct rt_bandwidth rt_bandwidth;
+ struct sched_bandwidth rt_bandwidth;
#endif

struct rcu_head rcu;
@@ -6344,7 +6342,7 @@ recheck:
* assigned.
*/
if (rt_bandwidth_enabled() && rt_policy(policy) &&
- task_group(p)->rt_bandwidth.rt_runtime == 0)
+ task_group(p)->rt_bandwidth.runtime == 0)
return -EPERM;
#endif

@@ -9435,7 +9433,7 @@ static void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
init_rt_rq(rt_rq, rq);
rt_rq->tg = tg;
rt_rq->rt_se = rt_se;
- rt_rq->rt_runtime = tg->rt_bandwidth.rt_runtime;
+ rt_rq->rt_runtime = tg->rt_bandwidth.runtime;
if (add)
list_add(&rt_rq->leaf_rt_rq_list, &rq->leaf_rt_rq_list);

@@ -9516,15 +9514,15 @@ void __init sched_init(void)
init_defrootdomain();
#endif

- init_rt_bandwidth(&def_rt_bandwidth,
- global_rt_period(), global_rt_runtime());
+ init_sched_bandwidth(&def_rt_bandwidth, global_rt_period(),
+ global_rt_runtime(), &sched_rt_period_timer);

#ifdef CONFIG_RT_GROUP_SCHED
- init_rt_bandwidth(&init_task_group.rt_bandwidth,
- global_rt_period(), global_rt_runtime());
+ init_sched_bandwidth(&init_task_group.rt_bandwidth, global_rt_period(),
+ global_rt_runtime(), &sched_rt_period_timer);
#ifdef CONFIG_USER_SCHED
- init_rt_bandwidth(&root_task_group.rt_bandwidth,
- global_rt_period(), RUNTIME_INF);
+ init_sched_bandwidth(&root_task_group.rt_bandwidth, global_rt_period(),
+ RUNTIME_INF, &sched_rt_period_timer);
#endif /* CONFIG_USER_SCHED */
#endif /* CONFIG_RT_GROUP_SCHED */

@@ -9599,7 +9597,7 @@ void __init sched_init(void)
#endif
#endif /* CONFIG_FAIR_GROUP_SCHED */

- rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
+ rq->rt.rt_runtime = def_rt_bandwidth.runtime;
#ifdef CONFIG_RT_GROUP_SCHED
INIT_LIST_HEAD(&rq->leaf_rt_rq_list);
#ifdef CONFIG_CGROUP_SCHED
@@ -9920,7 +9918,7 @@ static void free_rt_sched_group(struct task_group *tg)
{
int i;

- destroy_rt_bandwidth(&tg->rt_bandwidth);
+ destroy_sched_bandwidth(&tg->rt_bandwidth);

for_each_possible_cpu(i) {
if (tg->rt_rq)
@@ -9948,8 +9946,9 @@ int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
if (!tg->rt_se)
goto err;

- init_rt_bandwidth(&tg->rt_bandwidth,
- ktime_to_ns(def_rt_bandwidth.rt_period), 0);
+ init_sched_bandwidth(&tg->rt_bandwidth,
+ ktime_to_ns(def_rt_bandwidth.period), 0,
+ &sched_rt_period_timer);

for_each_possible_cpu(i) {
rq = cpu_rq(i);
@@ -10248,8 +10247,8 @@ static int tg_schedulable(struct task_group *tg, void *data)
unsigned long total, sum = 0;
u64 period, runtime;

- period = ktime_to_ns(tg->rt_bandwidth.rt_period);
- runtime = tg->rt_bandwidth.rt_runtime;
+ period = ktime_to_ns(tg->rt_bandwidth.period);
+ runtime = tg->rt_bandwidth.runtime;

if (tg == d->tg) {
period = d->rt_period;
@@ -10287,8 +10286,8 @@ static int tg_schedulable(struct task_group *tg, void *data)
* The sum of our children's runtime should not exceed our own.
*/
list_for_each_entry_rcu(child, &tg->children, siblings) {
- period = ktime_to_ns(child->rt_bandwidth.rt_period);
- runtime = child->rt_bandwidth.rt_runtime;
+ period = ktime_to_ns(child->rt_bandwidth.period);
+ runtime = child->rt_bandwidth.runtime;

if (child == d->tg) {
period = d->rt_period;
@@ -10326,9 +10325,9 @@ static int tg_set_bandwidth(struct task_group *tg,
if (err)
goto unlock;

- raw_spin_lock_irq(&tg->rt_bandwidth.rt_runtime_lock);
- tg->rt_bandwidth.rt_period = ns_to_ktime(rt_period);
- tg->rt_bandwidth.rt_runtime = rt_runtime;
+ raw_spin_lock_irq(&tg->rt_bandwidth.runtime_lock);
+ tg->rt_bandwidth.period = ns_to_ktime(rt_period);
+ tg->rt_bandwidth.runtime = rt_runtime;

for_each_possible_cpu(i) {
struct rt_rq *rt_rq = tg->rt_rq[i];
@@ -10337,7 +10336,7 @@ static int tg_set_bandwidth(struct task_group *tg,
rt_rq->rt_runtime = rt_runtime;
raw_spin_unlock(&rt_rq->rt_runtime_lock);
}
- raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
+ raw_spin_unlock_irq(&tg->rt_bandwidth.runtime_lock);
unlock:
read_unlock(&tasklist_lock);
mutex_unlock(&rt_constraints_mutex);
@@ -10349,7 +10348,7 @@ int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
{
u64 rt_runtime, rt_period;

- rt_period = ktime_to_ns(tg->rt_bandwidth.rt_period);
+ rt_period = ktime_to_ns(tg->rt_bandwidth.period);
rt_runtime = (u64)rt_runtime_us * NSEC_PER_USEC;
if (rt_runtime_us < 0)
rt_runtime = RUNTIME_INF;
@@ -10361,10 +10360,10 @@ long sched_group_rt_runtime(struct task_group *tg)
{
u64 rt_runtime_us;

- if (tg->rt_bandwidth.rt_runtime == RUNTIME_INF)
+ if (tg->rt_bandwidth.runtime == RUNTIME_INF)
return -1;

- rt_runtime_us = tg->rt_bandwidth.rt_runtime;
+ rt_runtime_us = tg->rt_bandwidth.runtime;
do_div(rt_runtime_us, NSEC_PER_USEC);
return rt_runtime_us;
}
@@ -10374,7 +10373,7 @@ int sched_group_set_rt_period(struct task_group *tg, long rt_period_us)
u64 rt_runtime, rt_period;

rt_period = (u64)rt_period_us * NSEC_PER_USEC;
- rt_runtime = tg->rt_bandwidth.rt_runtime;
+ rt_runtime = tg->rt_bandwidth.runtime;

if (rt_period == 0)
return -EINVAL;
@@ -10386,7 +10385,7 @@ long sched_group_rt_period(struct task_group *tg)
{
u64 rt_period_us;

- rt_period_us = ktime_to_ns(tg->rt_bandwidth.rt_period);
+ rt_period_us = ktime_to_ns(tg->rt_bandwidth.period);
do_div(rt_period_us, NSEC_PER_USEC);
return rt_period_us;
}
@@ -10420,7 +10419,7 @@ static int sched_rt_global_constraints(void)
int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
{
/* Don't accept realtime tasks when there is no way for them to run */
- if (rt_task(tsk) && tg->rt_bandwidth.rt_runtime == 0)
+ if (rt_task(tsk) && tg->rt_bandwidth.runtime == 0)
return 0;

return 1;
@@ -10442,7 +10441,7 @@ static int sched_rt_global_constraints(void)
if (sysctl_sched_rt_runtime == 0)
return -EBUSY;

- raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags);
+ raw_spin_lock_irqsave(&def_rt_bandwidth.runtime_lock, flags);
for_each_possible_cpu(i) {
struct rt_rq *rt_rq = &cpu_rq(i)->rt;

@@ -10450,7 +10449,7 @@ static int sched_rt_global_constraints(void)
rt_rq->rt_runtime = global_rt_runtime();
raw_spin_unlock(&rt_rq->rt_runtime_lock);
}
- raw_spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags);
+ raw_spin_unlock_irqrestore(&def_rt_bandwidth.runtime_lock, flags);

return 0;
}
@@ -10476,8 +10475,8 @@ int sched_rt_handler(struct ctl_table *table, int write,
sysctl_sched_rt_period = old_period;
sysctl_sched_rt_runtime = old_runtime;
} else {
- def_rt_bandwidth.rt_runtime = global_rt_runtime();
- def_rt_bandwidth.rt_period =
+ def_rt_bandwidth.runtime = global_rt_runtime();
+ def_rt_bandwidth.period =
ns_to_ktime(global_rt_period());
}
}
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index f48328a..1827a10 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -180,7 +180,7 @@ static inline u64 sched_rt_runtime(struct rt_rq *rt_rq)

static inline u64 sched_rt_period(struct rt_rq *rt_rq)
{
- return ktime_to_ns(rt_rq->tg->rt_bandwidth.rt_period);
+ return ktime_to_ns(rt_rq->tg->rt_bandwidth.period);
}

#define for_each_leaf_rt_rq(rt_rq, rq) \
@@ -248,12 +248,12 @@ static inline const struct cpumask *sched_rt_period_mask(void)
#endif

static inline
-struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
+struct rt_rq *sched_rt_period_rt_rq(struct sched_bandwidth *rt_b, int cpu)
{
return container_of(rt_b, struct task_group, rt_bandwidth)->rt_rq[cpu];
}

-static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
+static inline struct sched_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
{
return &rt_rq->tg->rt_bandwidth;
}
@@ -267,7 +267,7 @@ static inline u64 sched_rt_runtime(struct rt_rq *rt_rq)

static inline u64 sched_rt_period(struct rt_rq *rt_rq)
{
- return ktime_to_ns(def_rt_bandwidth.rt_period);
+ return ktime_to_ns(def_rt_bandwidth.period);
}

#define for_each_leaf_rt_rq(rt_rq, rq) \
@@ -302,12 +302,12 @@ static inline const struct cpumask *sched_rt_period_mask(void)
}

static inline
-struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
+struct rt_rq *sched_rt_period_rt_rq(struct sched_bandwidth *rt_b, int cpu)
{
return &cpu_rq(cpu)->rt;
}

-static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
+static inline struct sched_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
{
return &def_rt_bandwidth;
}
@@ -320,15 +320,15 @@ static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
*/
static int do_balance_runtime(struct rt_rq *rt_rq)
{
- struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
+ struct sched_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
int i, weight, more = 0;
u64 rt_period;

weight = cpumask_weight(rd->span);

- raw_spin_lock(&rt_b->rt_runtime_lock);
- rt_period = ktime_to_ns(rt_b->rt_period);
+ raw_spin_lock(&rt_b->runtime_lock);
+ rt_period = ktime_to_ns(rt_b->period);
for_each_cpu(i, rd->span) {
struct rt_rq *iter = sched_rt_period_rt_rq(rt_b, i);
s64 diff;
@@ -365,7 +365,7 @@ static int do_balance_runtime(struct rt_rq *rt_rq)
next:
raw_spin_unlock(&iter->rt_runtime_lock);
}
- raw_spin_unlock(&rt_b->rt_runtime_lock);
+ raw_spin_unlock(&rt_b->runtime_lock);

return more;
}
@@ -382,11 +382,11 @@ static void __disable_runtime(struct rq *rq)
return;

for_each_leaf_rt_rq(rt_rq, rq) {
- struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
+ struct sched_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
s64 want;
int i;

- raw_spin_lock(&rt_b->rt_runtime_lock);
+ raw_spin_lock(&rt_b->runtime_lock);
raw_spin_lock(&rt_rq->rt_runtime_lock);
/*
* Either we're all inf and nobody needs to borrow, or we're
@@ -394,7 +394,7 @@ static void __disable_runtime(struct rq *rq)
* exactly the right amount of runtime to take out.
*/
if (rt_rq->rt_runtime == RUNTIME_INF ||
- rt_rq->rt_runtime == rt_b->rt_runtime)
+ rt_rq->rt_runtime == rt_b->runtime)
goto balanced;
raw_spin_unlock(&rt_rq->rt_runtime_lock);

@@ -403,7 +403,7 @@ static void __disable_runtime(struct rq *rq)
* and what we current have, that's the amount of runtime
* we lend and now have to reclaim.
*/
- want = rt_b->rt_runtime - rt_rq->rt_runtime;
+ want = rt_b->runtime - rt_rq->rt_runtime;

/*
* Greedy reclaim, take back as much as we can.
@@ -446,7 +446,7 @@ balanced:
*/
rt_rq->rt_runtime = RUNTIME_INF;
raw_spin_unlock(&rt_rq->rt_runtime_lock);
- raw_spin_unlock(&rt_b->rt_runtime_lock);
+ raw_spin_unlock(&rt_b->runtime_lock);
}
}

@@ -470,15 +470,15 @@ static void __enable_runtime(struct rq *rq)
* Reset each runqueue's bandwidth settings
*/
for_each_leaf_rt_rq(rt_rq, rq) {
- struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
+ struct sched_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);

- raw_spin_lock(&rt_b->rt_runtime_lock);
+ raw_spin_lock(&rt_b->runtime_lock);
raw_spin_lock(&rt_rq->rt_runtime_lock);
- rt_rq->rt_runtime = rt_b->rt_runtime;
+ rt_rq->rt_runtime = rt_b->runtime;
rt_rq->rt_time = 0;
rt_rq->rt_throttled = 0;
raw_spin_unlock(&rt_rq->rt_runtime_lock);
- raw_spin_unlock(&rt_b->rt_runtime_lock);
+ raw_spin_unlock(&rt_b->runtime_lock);
}
}

@@ -510,12 +510,12 @@ static inline int balance_runtime(struct rt_rq *rt_rq)
}
#endif /* CONFIG_SMP */

-static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
+static int do_sched_rt_period_timer(struct sched_bandwidth *rt_b, int overrun)
{
int i, idle = 1;
const struct cpumask *span;

- if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
+ if (!rt_bandwidth_enabled() || rt_b->runtime == RUNTIME_INF)
return 1;

span = sched_rt_period_mask();
@@ -753,7 +753,7 @@ inc_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
rt_rq->rt_nr_boosted++;

if (rt_rq->tg)
- start_rt_bandwidth(&rt_rq->tg->rt_bandwidth);
+ start_sched_bandwidth(&rt_rq->tg->rt_bandwidth);
}

static void
@@ -770,7 +770,7 @@ dec_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
static void
inc_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
{
- start_rt_bandwidth(&def_rt_bandwidth);
+ start_sched_bandwidth(&def_rt_bandwidth);
}

static inline

2010-01-05 07:59:59

by Bharata B Rao

Subject: [RFC v5 PATCH 2/8] sched: Make rt bandwidth timer and runtime related code generic

sched: Make rt bandwidth timer and runtime related code generic

From: Bharata B Rao <[email protected]>

CFS hard limits require most of the rt bandwidth timer and runtime related
code. Hence make it generic (move the generic parts from sched_rt.c to
sched.c) and make it available for CFS to use.

- Separate out the runtime related fields of rt_rq (rt_throttled, rt_time,
rt_runtime, rt_runtime_lock) into a new generic structure rq_bandwidth.
- Rename sched_rt_period_mask() to sched_bw_period_mask() and move it to
sched.c
- Make start_sched_bandwidth() generic so that it can be used by both
rt and cfs.
- Make disable[enable]_runtime() generic and move them to sched.c so that
they can be used by cfs also.
- Make rt runtime balancing code generic and move it to sched.c so that
cfs can make use of it.

No functionality change by this patch.
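
To illustrate the runtime-balancing arithmetic that is being made generic
here, below is a minimal user-space sketch (plain C, not kernel code; the
struct and field names are simplified stand-ins for rq_bandwidth) of what
do_balance_runtime() does: a runqueue that has used up its local runtime
takes 1/n of each neighbour's spare runtime, but never more than a full
period.

#include <stdio.h>
#include <stdint.h>

struct rq_bw {
        int64_t runtime;        /* local quota for this runqueue */
        int64_t time;           /* runtime consumed in this period */
};

static void balance(struct rq_bw *rqs, int nr, int me, int64_t period)
{
        for (int i = 0; i < nr; i++) {
                int64_t spare, diff;

                if (i == me)
                        continue;
                spare = rqs[i].runtime - rqs[i].time;
                if (spare <= 0)
                        continue;
                diff = spare / nr;                      /* take 1/n part */
                if (rqs[me].runtime + diff > period)    /* but no more than */
                        diff = period - rqs[me].runtime;/* a full period */
                rqs[i].runtime -= diff;
                rqs[me].runtime += diff;
                if (rqs[me].runtime == period)
                        break;
        }
}

int main(void)
{
        /* 4 CPUs, 100ms quota each in a 500ms period; CPU 0 has used it up */
        struct rq_bw rqs[4] = {
                { 100, 100 }, { 100, 20 }, { 100, 60 }, { 100, 100 },
        };

        balance(rqs, 4, 0, 500);
        /* CPU 0 ends up with 100 + 80/4 + 40/4 = 130ms for this period */
        printf("cpu0 runtime after borrowing: %lld ms\n",
                (long long)rqs[0].runtime);
        return 0;
}

The real code additionally skips runqueues whose runtime is RUNTIME_INF and
takes the relevant runtime_locks; those details are omitted above.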

Signed-off-by: Bharata B Rao <[email protected]>
---
kernel/sched.c | 283 ++++++++++++++++++++++++++++++++++++++++++++++----
kernel/sched_debug.c | 6 +
kernel/sched_rt.c | 256 ++++++---------------------------------------
3 files changed, 298 insertions(+), 247 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 21cf0d5..4a24d62 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -151,7 +151,7 @@ static struct sched_bandwidth def_rt_bandwidth;

static int do_sched_rt_period_timer(struct sched_bandwidth *sched_b, int overrun);

-static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
+static enum hrtimer_restart sched_period_timer(struct hrtimer *timer, int rt)
{
struct sched_bandwidth *sched_b =
container_of(timer, struct sched_bandwidth, period_timer);
@@ -166,12 +166,18 @@ static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
if (!overrun)
break;

- idle = do_sched_rt_period_timer(sched_b, overrun);
+ if (rt)
+ idle = do_sched_rt_period_timer(sched_b, overrun);
}

return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
}

+static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
+{
+ return sched_period_timer(timer, 1);
+}
+
static void init_sched_bandwidth(struct sched_bandwidth *sched_b, u64 period,
u64 runtime, enum hrtimer_restart (*period_timer)(struct hrtimer *))
{
@@ -190,11 +196,14 @@ static inline int rt_bandwidth_enabled(void)
return sysctl_sched_rt_runtime >= 0;
}

-static void start_sched_bandwidth(struct sched_bandwidth *sched_b)
+static void start_sched_bandwidth(struct sched_bandwidth *sched_b, int rt)
{
ktime_t now;

- if (!rt_bandwidth_enabled() || sched_b->runtime == RUNTIME_INF)
+ if (rt && !rt_bandwidth_enabled())
+ return;
+
+ if (sched_b->runtime == RUNTIME_INF)
return;

if (hrtimer_active(&sched_b->period_timer))
@@ -220,10 +229,12 @@ static void start_sched_bandwidth(struct sched_bandwidth *sched_b)
raw_spin_unlock(&sched_b->runtime_lock);
}

+#if defined CONFIG_RT_GROUP_SCHED || defined CONFIG_FAIR_GROUP_SCHED
static void destroy_sched_bandwidth(struct sched_bandwidth *sched_b)
{
hrtimer_cancel(&sched_b->period_timer);
}
+#endif

/*
* sched_domains_mutex serializes calls to arch_init_sched_domains,
@@ -383,6 +394,14 @@ static inline struct task_group *task_group(struct task_struct *p)

#endif /* CONFIG_GROUP_SCHED */

+struct rq_bandwidth {
+ int throttled;
+ u64 time;
+ u64 runtime;
+ /* Nests inside the rq lock: */
+ raw_spinlock_t runtime_lock;
+};
+
/* CFS-related fields in a runqueue */
struct cfs_rq {
struct load_weight load;
@@ -464,11 +483,7 @@ struct rt_rq {
int overloaded;
struct plist_head pushable_tasks;
#endif
- int rt_throttled;
- u64 rt_time;
- u64 rt_runtime;
- /* Nests inside the rq lock: */
- raw_spinlock_t rt_runtime_lock;
+ struct rq_bandwidth rq_bandwidth;

#ifdef CONFIG_RT_GROUP_SCHED
unsigned long rt_nr_boosted;
@@ -1832,6 +1847,234 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
#endif
}

+
+#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_FAIR_GROUP_SCHED)
+
+#ifdef CONFIG_SMP
+static inline const struct cpumask *sched_bw_period_mask(void)
+{
+ return cpu_rq(smp_processor_id())->rd->span;
+}
+#else /* !CONFIG_SMP */
+static inline const struct cpumask *sched_bw_period_mask(void)
+{
+ return cpu_online_mask;
+}
+#endif /* CONFIG_SMP */
+
+#else
+static inline const struct cpumask *sched_bw_period_mask(void)
+{
+ return cpu_online_mask;
+}
+
+#endif
+
+static void init_rq_bandwidth(struct rq_bandwidth *rq_b, u64 runtime)
+{
+ rq_b->time = 0;
+ rq_b->throttled = 0;
+ rq_b->runtime = runtime;
+ raw_spin_lock_init(&rq_b->runtime_lock);
+}
+
+#ifdef CONFIG_RT_GROUP_SCHED
+
+static inline
+struct rt_rq *sched_rt_period_rt_rq(struct sched_bandwidth *rt_b, int cpu)
+{
+ return container_of(rt_b, struct task_group, rt_bandwidth)->rt_rq[cpu];
+}
+
+#else
+
+static inline
+struct rt_rq *sched_rt_period_rt_rq(struct sched_bandwidth *rt_b, int cpu)
+{
+ return &cpu_rq(cpu)->rt;
+}
+
+#endif
+
+#ifdef CONFIG_SMP
+
+void __disable_runtime(struct rq *rq, struct sched_bandwidth *sched_b,
+ struct rq_bandwidth *rq_b, int rt)
+{
+ struct root_domain *rd = rq->rd;
+ s64 want;
+ int i;
+
+ raw_spin_lock(&sched_b->runtime_lock);
+ raw_spin_lock(&rq_b->runtime_lock);
+
+ /*
+ * Either we're all inf and nobody needs to borrow, or we're
+ * already disabled and thus have nothing to do, or we have
+ * exactly the right amount of runtime to take out.
+ */
+ if (rq_b->runtime == RUNTIME_INF || rq_b->runtime == sched_b->runtime)
+ goto balanced;
+
+ raw_spin_unlock(&rq_b->runtime_lock);
+
+ /*
+ * Calculate the difference between what we started out with
+ * and what we current have, that's the amount of runtime
+ * we lend and now have to reclaim.
+ */
+ want = sched_b->runtime - rq_b->runtime;
+
+ /*
+ * Greedy reclaim, take back as much as we can.
+ */
+ for_each_cpu(i, rd->span) {
+ struct rq_bandwidth *iter;
+ s64 diff;
+
+ if (rt)
+ iter = &(sched_rt_period_rt_rq(sched_b, i)->rq_bandwidth);
+ /*
+ * Can't reclaim from ourselves or disabled runqueues.
+ */
+ if (iter == rq_b || iter->runtime == RUNTIME_INF)
+ continue;
+
+ raw_spin_lock(&iter->runtime_lock);
+ if (want > 0) {
+ diff = min_t(s64, iter->runtime, want);
+ iter->runtime -= diff;
+ want -= diff;
+ } else {
+ iter->runtime -= want;
+ want -= want;
+ }
+ raw_spin_unlock(&iter->runtime_lock);
+
+ if (!want)
+ break;
+ }
+
+ raw_spin_lock(&rq_b->runtime_lock);
+ /*
+ * We cannot be left wanting - that would mean some runtime
+ * leaked out of the system.
+ */
+ BUG_ON(want);
+
+balanced:
+ /*
+ * Disable all the borrow logic by pretending we have inf
+ * runtime - in which case borrowing doesn't make sense.
+ */
+ rq_b->runtime = RUNTIME_INF;
+ raw_spin_unlock(&rq_b->runtime_lock);
+ raw_spin_unlock(&sched_b->runtime_lock);
+}
+
+void disable_runtime_rt(struct rq *rq);
+static void disable_runtime(struct rq *rq)
+{
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&rq->lock, flags);
+ disable_runtime_rt(rq);
+ raw_spin_unlock_irqrestore(&rq->lock, flags);
+}
+
+void __enable_runtime(struct sched_bandwidth *sched_b,
+ struct rq_bandwidth *rq_b)
+{
+ raw_spin_lock(&sched_b->runtime_lock);
+ raw_spin_lock(&rq_b->runtime_lock);
+ rq_b->runtime = sched_b->runtime;
+ rq_b->time = 0;
+ rq_b->throttled = 0;
+ raw_spin_unlock(&rq_b->runtime_lock);
+ raw_spin_unlock(&sched_b->runtime_lock);
+}
+
+void enable_runtime_rt(struct rq *rq);
+static void enable_runtime(struct rq *rq)
+{
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&rq->lock, flags);
+ enable_runtime_rt(rq);
+ raw_spin_unlock_irqrestore(&rq->lock, flags);
+}
+
+/*
+ * We ran out of runtime, see if we can borrow some from our neighbours.
+ */
+static void do_balance_runtime(struct rq_bandwidth *rq_b,
+ struct sched_bandwidth *sched_b, int rt)
+{
+ struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
+ int i, weight;
+ u64 period;
+
+ weight = cpumask_weight(rd->span);
+
+ raw_spin_lock(&sched_b->runtime_lock);
+ period = ktime_to_ns(sched_b->period);
+ for_each_cpu(i, rd->span) {
+ struct rq_bandwidth *iter;
+ s64 diff;
+
+ if (rt)
+ iter = &(sched_rt_period_rt_rq(sched_b, i)->rq_bandwidth);
+ if (iter == rq_b)
+ continue;
+
+ raw_spin_lock(&iter->runtime_lock);
+ /*
+ * Either all rqs have inf runtime and there's nothing to steal
+ * or __disable_runtime() below sets a specific rq to inf to
+ * indicate its been disabled and disalow stealing.
+ */
+ if (iter->runtime == RUNTIME_INF)
+ goto next;
+
+ /*
+ * From runqueues with spare time, take 1/n part of their
+ * spare time, but no more than our period.
+ */
+ diff = iter->runtime - iter->time;
+ if (diff > 0) {
+ diff = div_u64((u64)diff, weight);
+ if (rq_b->runtime + diff > period)
+ diff = period - rq_b->runtime;
+ iter->runtime -= diff;
+ rq_b->runtime += diff;
+ if (rq_b->runtime == period) {
+ raw_spin_unlock(&iter->runtime_lock);
+ break;
+ }
+ }
+next:
+ raw_spin_unlock(&iter->runtime_lock);
+ }
+ raw_spin_unlock(&sched_b->runtime_lock);
+}
+
+static void balance_runtime(struct rq_bandwidth *rq_b,
+ struct sched_bandwidth *sched_b, int rt)
+{
+ if (rq_b->time > rq_b->runtime) {
+ raw_spin_unlock(&rq_b->runtime_lock);
+ do_balance_runtime(rq_b, sched_b, rt);
+ raw_spin_lock(&rq_b->runtime_lock);
+ }
+}
+#else /* !CONFIG_SMP */
+static inline void balance_runtime(struct rq_bandwidth *rq_b,
+ struct sched_bandwidth *sched_b, int rt)
+{
+ return;
+}
+#endif /* CONFIG_SMP */
+
#include "sched_stats.h"
#include "sched_idletask.c"
#include "sched_fair.c"
@@ -9381,11 +9624,7 @@ static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
rt_rq->overloaded = 0;
plist_head_init_raw(&rt_rq->pushable_tasks, &rq->lock);
#endif
-
- rt_rq->rt_time = 0;
- rt_rq->rt_throttled = 0;
- rt_rq->rt_runtime = 0;
- raw_spin_lock_init(&rt_rq->rt_runtime_lock);
+ init_rq_bandwidth(&rt_rq->rq_bandwidth, 0);

#ifdef CONFIG_RT_GROUP_SCHED
rt_rq->rt_nr_boosted = 0;
@@ -9433,7 +9672,7 @@ static void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
init_rt_rq(rt_rq, rq);
rt_rq->tg = tg;
rt_rq->rt_se = rt_se;
- rt_rq->rt_runtime = tg->rt_bandwidth.runtime;
+ rt_rq->rq_bandwidth.runtime = tg->rt_bandwidth.runtime;
if (add)
list_add(&rt_rq->leaf_rt_rq_list, &rq->leaf_rt_rq_list);

@@ -9597,7 +9836,7 @@ void __init sched_init(void)
#endif
#endif /* CONFIG_FAIR_GROUP_SCHED */

- rq->rt.rt_runtime = def_rt_bandwidth.runtime;
+ rq->rt.rq_bandwidth.runtime = def_rt_bandwidth.runtime;
#ifdef CONFIG_RT_GROUP_SCHED
INIT_LIST_HEAD(&rq->leaf_rt_rq_list);
#ifdef CONFIG_CGROUP_SCHED
@@ -10332,9 +10571,9 @@ static int tg_set_bandwidth(struct task_group *tg,
for_each_possible_cpu(i) {
struct rt_rq *rt_rq = tg->rt_rq[i];

- raw_spin_lock(&rt_rq->rt_runtime_lock);
- rt_rq->rt_runtime = rt_runtime;
- raw_spin_unlock(&rt_rq->rt_runtime_lock);
+ raw_spin_lock(&rt_rq->rq_bandwidth.runtime_lock);
+ rt_rq->rq_bandwidth.runtime = rt_runtime;
+ raw_spin_unlock(&rt_rq->rq_bandwidth.runtime_lock);
}
raw_spin_unlock_irq(&tg->rt_bandwidth.runtime_lock);
unlock:
@@ -10445,9 +10684,9 @@ static int sched_rt_global_constraints(void)
for_each_possible_cpu(i) {
struct rt_rq *rt_rq = &cpu_rq(i)->rt;

- raw_spin_lock(&rt_rq->rt_runtime_lock);
- rt_rq->rt_runtime = global_rt_runtime();
- raw_spin_unlock(&rt_rq->rt_runtime_lock);
+ raw_spin_lock(&rt_rq->rq_bandwidth.runtime_lock);
+ rt_rq->rq_bandwidth.runtime = global_rt_runtime();
+ raw_spin_unlock(&rt_rq->rq_bandwidth.runtime_lock);
}
raw_spin_unlock_irqrestore(&def_rt_bandwidth.runtime_lock, flags);

diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 67f95aa..1b67698 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -238,9 +238,9 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", #x, SPLIT_NS(rt_rq->x))

P(rt_nr_running);
- P(rt_throttled);
- PN(rt_time);
- PN(rt_runtime);
+ P(rq_bandwidth.throttled);
+ PN(rq_bandwidth.time);
+ PN(rq_bandwidth.runtime);

#undef PN
#undef P
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 1827a10..7531d0f 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -175,7 +175,7 @@ static inline u64 sched_rt_runtime(struct rt_rq *rt_rq)
if (!rt_rq->tg)
return RUNTIME_INF;

- return rt_rq->rt_runtime;
+ return rt_rq->rq_bandwidth.runtime;
}

static inline u64 sched_rt_period(struct rt_rq *rt_rq)
@@ -220,7 +220,7 @@ static void sched_rt_rq_dequeue(struct rt_rq *rt_rq)

static inline int rt_rq_throttled(struct rt_rq *rt_rq)
{
- return rt_rq->rt_throttled && !rt_rq->rt_nr_boosted;
+ return rt_rq->rq_bandwidth.throttled && !rt_rq->rt_nr_boosted;
}

static int rt_se_boosted(struct sched_rt_entity *rt_se)
@@ -235,24 +235,6 @@ static int rt_se_boosted(struct sched_rt_entity *rt_se)
return p->prio != p->normal_prio;
}

-#ifdef CONFIG_SMP
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
- return cpu_rq(smp_processor_id())->rd->span;
-}
-#else
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
- return cpu_online_mask;
-}
-#endif
-
-static inline
-struct rt_rq *sched_rt_period_rt_rq(struct sched_bandwidth *rt_b, int cpu)
-{
- return container_of(rt_b, struct task_group, rt_bandwidth)->rt_rq[cpu];
-}
-
static inline struct sched_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
{
return &rt_rq->tg->rt_bandwidth;
@@ -262,7 +244,7 @@ static inline struct sched_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)

static inline u64 sched_rt_runtime(struct rt_rq *rt_rq)
{
- return rt_rq->rt_runtime;
+ return rt_rq->rq_bandwidth.runtime;
}

static inline u64 sched_rt_period(struct rt_rq *rt_rq)
@@ -293,18 +275,7 @@ static inline void sched_rt_rq_dequeue(struct rt_rq *rt_rq)

static inline int rt_rq_throttled(struct rt_rq *rt_rq)
{
- return rt_rq->rt_throttled;
-}
-
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
- return cpu_online_mask;
-}
-
-static inline
-struct rt_rq *sched_rt_period_rt_rq(struct sched_bandwidth *rt_b, int cpu)
-{
- return &cpu_rq(cpu)->rt;
+ return rt_rq->rq_bandwidth.throttled;
}

static inline struct sched_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
@@ -315,151 +286,24 @@ static inline struct sched_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
#endif /* CONFIG_RT_GROUP_SCHED */

#ifdef CONFIG_SMP
-/*
- * We ran out of runtime, see if we can borrow some from our neighbours.
- */
-static int do_balance_runtime(struct rt_rq *rt_rq)
-{
- struct sched_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
- struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
- int i, weight, more = 0;
- u64 rt_period;
-
- weight = cpumask_weight(rd->span);
-
- raw_spin_lock(&rt_b->runtime_lock);
- rt_period = ktime_to_ns(rt_b->period);
- for_each_cpu(i, rd->span) {
- struct rt_rq *iter = sched_rt_period_rt_rq(rt_b, i);
- s64 diff;
-
- if (iter == rt_rq)
- continue;
-
- raw_spin_lock(&iter->rt_runtime_lock);
- /*
- * Either all rqs have inf runtime and there's nothing to steal
- * or __disable_runtime() below sets a specific rq to inf to
- * indicate its been disabled and disalow stealing.
- */
- if (iter->rt_runtime == RUNTIME_INF)
- goto next;
-
- /*
- * From runqueues with spare time, take 1/n part of their
- * spare time, but no more than our period.
- */
- diff = iter->rt_runtime - iter->rt_time;
- if (diff > 0) {
- diff = div_u64((u64)diff, weight);
- if (rt_rq->rt_runtime + diff > rt_period)
- diff = rt_period - rt_rq->rt_runtime;
- iter->rt_runtime -= diff;
- rt_rq->rt_runtime += diff;
- more = 1;
- if (rt_rq->rt_runtime == rt_period) {
- raw_spin_unlock(&iter->rt_runtime_lock);
- break;
- }
- }
-next:
- raw_spin_unlock(&iter->rt_runtime_lock);
- }
- raw_spin_unlock(&rt_b->runtime_lock);
-
- return more;
-}

/*
* Ensure this RQ takes back all the runtime it lend to its neighbours.
*/
-static void __disable_runtime(struct rq *rq)
+void disable_runtime_rt(struct rq *rq)
{
- struct root_domain *rd = rq->rd;
struct rt_rq *rt_rq;

if (unlikely(!scheduler_running))
return;

for_each_leaf_rt_rq(rt_rq, rq) {
- struct sched_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
- s64 want;
- int i;
-
- raw_spin_lock(&rt_b->runtime_lock);
- raw_spin_lock(&rt_rq->rt_runtime_lock);
- /*
- * Either we're all inf and nobody needs to borrow, or we're
- * already disabled and thus have nothing to do, or we have
- * exactly the right amount of runtime to take out.
- */
- if (rt_rq->rt_runtime == RUNTIME_INF ||
- rt_rq->rt_runtime == rt_b->runtime)
- goto balanced;
- raw_spin_unlock(&rt_rq->rt_runtime_lock);
-
- /*
- * Calculate the difference between what we started out with
- * and what we current have, that's the amount of runtime
- * we lend and now have to reclaim.
- */
- want = rt_b->runtime - rt_rq->rt_runtime;
-
- /*
- * Greedy reclaim, take back as much as we can.
- */
- for_each_cpu(i, rd->span) {
- struct rt_rq *iter = sched_rt_period_rt_rq(rt_b, i);
- s64 diff;
-
- /*
- * Can't reclaim from ourselves or disabled runqueues.
- */
- if (iter == rt_rq || iter->rt_runtime == RUNTIME_INF)
- continue;
-
- raw_spin_lock(&iter->rt_runtime_lock);
- if (want > 0) {
- diff = min_t(s64, iter->rt_runtime, want);
- iter->rt_runtime -= diff;
- want -= diff;
- } else {
- iter->rt_runtime -= want;
- want -= want;
- }
- raw_spin_unlock(&iter->rt_runtime_lock);
-
- if (!want)
- break;
- }
-
- raw_spin_lock(&rt_rq->rt_runtime_lock);
- /*
- * We cannot be left wanting - that would mean some runtime
- * leaked out of the system.
- */
- BUG_ON(want);
-balanced:
- /*
- * Disable all the borrow logic by pretending we have inf
- * runtime - in which case borrowing doesn't make sense.
- */
- rt_rq->rt_runtime = RUNTIME_INF;
- raw_spin_unlock(&rt_rq->rt_runtime_lock);
- raw_spin_unlock(&rt_b->runtime_lock);
+ struct sched_bandwidth *sched_b = sched_rt_bandwidth(rt_rq);
+ __disable_runtime(rq, sched_b, &rt_rq->rq_bandwidth, 1);
}
}

-static void disable_runtime(struct rq *rq)
-{
- unsigned long flags;
-
- raw_spin_lock_irqsave(&rq->lock, flags);
- __disable_runtime(rq);
- raw_spin_unlock_irqrestore(&rq->lock, flags);
-}
-
-static void __enable_runtime(struct rq *rq)
+void enable_runtime_rt(struct rq *rq)
{
struct rt_rq *rt_rq;

@@ -470,45 +314,12 @@ static void __enable_runtime(struct rq *rq)
* Reset each runqueue's bandwidth settings
*/
for_each_leaf_rt_rq(rt_rq, rq) {
- struct sched_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
-
- raw_spin_lock(&rt_b->runtime_lock);
- raw_spin_lock(&rt_rq->rt_runtime_lock);
- rt_rq->rt_runtime = rt_b->runtime;
- rt_rq->rt_time = 0;
- rt_rq->rt_throttled = 0;
- raw_spin_unlock(&rt_rq->rt_runtime_lock);
- raw_spin_unlock(&rt_b->runtime_lock);
+ struct sched_bandwidth *sched_b = sched_rt_bandwidth(rt_rq);
+ __enable_runtime(sched_b, &rt_rq->rq_bandwidth);
}
}

-static void enable_runtime(struct rq *rq)
-{
- unsigned long flags;
-
- raw_spin_lock_irqsave(&rq->lock, flags);
- __enable_runtime(rq);
- raw_spin_unlock_irqrestore(&rq->lock, flags);
-}
-
-static int balance_runtime(struct rt_rq *rt_rq)
-{
- int more = 0;
-
- if (rt_rq->rt_time > rt_rq->rt_runtime) {
- raw_spin_unlock(&rt_rq->rt_runtime_lock);
- more = do_balance_runtime(rt_rq);
- raw_spin_lock(&rt_rq->rt_runtime_lock);
- }
-
- return more;
-}
-#else /* !CONFIG_SMP */
-static inline int balance_runtime(struct rt_rq *rt_rq)
-{
- return 0;
-}
-#endif /* CONFIG_SMP */
+#endif

static int do_sched_rt_period_timer(struct sched_bandwidth *rt_b, int overrun)
{
@@ -518,28 +329,29 @@ static int do_sched_rt_period_timer(struct sched_bandwidth *rt_b, int overrun)
if (!rt_bandwidth_enabled() || rt_b->runtime == RUNTIME_INF)
return 1;

- span = sched_rt_period_mask();
+ span = sched_bw_period_mask();
for_each_cpu(i, span) {
int enqueue = 0;
struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, i);
struct rq *rq = rq_of_rt_rq(rt_rq);

raw_spin_lock(&rq->lock);
- if (rt_rq->rt_time) {
+ if (rt_rq->rq_bandwidth.time) {
u64 runtime;

- raw_spin_lock(&rt_rq->rt_runtime_lock);
- if (rt_rq->rt_throttled)
- balance_runtime(rt_rq);
- runtime = rt_rq->rt_runtime;
- rt_rq->rt_time -= min(rt_rq->rt_time, overrun*runtime);
- if (rt_rq->rt_throttled && rt_rq->rt_time < runtime) {
- rt_rq->rt_throttled = 0;
+ raw_spin_lock(&rt_rq->rq_bandwidth.runtime_lock);
+ if (rt_rq->rq_bandwidth.throttled)
+ balance_runtime(&rt_rq->rq_bandwidth,
+ sched_rt_bandwidth(rt_rq), 1);
+ runtime = rt_rq->rq_bandwidth.runtime;
+ rt_rq->rq_bandwidth.time -= min(rt_rq->rq_bandwidth.time, overrun*runtime);
+ if (rt_rq->rq_bandwidth.throttled && rt_rq->rq_bandwidth.time < runtime) {
+ rt_rq->rq_bandwidth.throttled = 0;
enqueue = 1;
}
- if (rt_rq->rt_time || rt_rq->rt_nr_running)
+ if (rt_rq->rq_bandwidth.time || rt_rq->rt_nr_running)
idle = 0;
- raw_spin_unlock(&rt_rq->rt_runtime_lock);
+ raw_spin_unlock(&rt_rq->rq_bandwidth.runtime_lock);
} else if (rt_rq->rt_nr_running)
idle = 0;

@@ -567,19 +379,19 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
{
u64 runtime = sched_rt_runtime(rt_rq);

- if (rt_rq->rt_throttled)
+ if (rt_rq->rq_bandwidth.throttled)
return rt_rq_throttled(rt_rq);

if (sched_rt_runtime(rt_rq) >= sched_rt_period(rt_rq))
return 0;

- balance_runtime(rt_rq);
+ balance_runtime(&rt_rq->rq_bandwidth, sched_rt_bandwidth(rt_rq), 1);
runtime = sched_rt_runtime(rt_rq);
if (runtime == RUNTIME_INF)
return 0;

- if (rt_rq->rt_time > runtime) {
- rt_rq->rt_throttled = 1;
+ if (rt_rq->rq_bandwidth.time > runtime) {
+ rt_rq->rq_bandwidth.throttled = 1;
if (rt_rq_throttled(rt_rq)) {
sched_rt_rq_dequeue(rt_rq);
return 1;
@@ -624,11 +436,11 @@ static void update_curr_rt(struct rq *rq)
rt_rq = rt_rq_of_se(rt_se);

if (sched_rt_runtime(rt_rq) != RUNTIME_INF) {
- raw_spin_lock(&rt_rq->rt_runtime_lock);
- rt_rq->rt_time += delta_exec;
+ raw_spin_lock(&rt_rq->rq_bandwidth.runtime_lock);
+ rt_rq->rq_bandwidth.time += delta_exec;
if (sched_rt_runtime_exceeded(rt_rq))
resched_task(curr);
- raw_spin_unlock(&rt_rq->rt_runtime_lock);
+ raw_spin_unlock(&rt_rq->rq_bandwidth.runtime_lock);
}
}
}
@@ -753,7 +565,7 @@ inc_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
rt_rq->rt_nr_boosted++;

if (rt_rq->tg)
- start_sched_bandwidth(&rt_rq->tg->rt_bandwidth);
+ start_sched_bandwidth(&rt_rq->tg->rt_bandwidth, 1);
}

static void
@@ -770,7 +582,7 @@ dec_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
static void
inc_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
{
- start_sched_bandwidth(&def_rt_bandwidth);
+ start_sched_bandwidth(&def_rt_bandwidth, 1);
}

static inline
@@ -1551,7 +1363,7 @@ static void rq_online_rt(struct rq *rq)
if (rq->rt.overloaded)
rt_set_overload(rq);

- __enable_runtime(rq);
+ enable_runtime_rt(rq);

cpupri_set(&rq->rd->cpupri, rq->cpu, rq->rt.highest_prio.curr);
}
@@ -1562,7 +1374,7 @@ static void rq_offline_rt(struct rq *rq)
if (rq->rt.overloaded)
rt_clear_overload(rq);

- __disable_runtime(rq);
+ disable_runtime_rt(rq);

cpupri_set(&rq->rd->cpupri, rq->cpu, CPUPRI_INVALID);
}

2010-01-05 08:01:09

by Bharata B Rao

Subject: [RFC v5 PATCH 3/8] sched: Bandwidth initialization for fair task groups

sched: Bandwidth initialization for fair task groups.

From: Bharata B Rao <[email protected]>

Introduce the notion of hard limiting for CFS groups by bringing in
the concept of runtime and period for them. Add cgroup files to control
runtime and period.
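
As a usage sketch (not part of the patch): with the cgroup filesystem
mounted, the two new control files could be used roughly as below to cap a
group at 250ms of CPU time per 500ms period, i.e. at most half a CPU's worth
of time. The mount point (/cgroup) and group name (limited) are assumptions;
the file names follow the cftype entries added by this patch.

#include <stdio.h>

static int write_val(const char *path, long long val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return -1;
        }
        fprintf(f, "%lld\n", val);
        return fclose(f);
}

int main(void)
{
        /* set the period first, then the runtime allowed within that period */
        write_val("/cgroup/limited/cpu.cfs_period_us", 500000);
        write_val("/cgroup/limited/cpu.cfs_runtime_us", 250000);
        return 0;
}

Writing -1 to cpu.cfs_runtime_us goes back to the default of no limiting,
since tg_set_cfs_runtime() maps negative values to RUNTIME_INF.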

Signed-off-by: Bharata B Rao <[email protected]>
---
init/Kconfig | 13 ++++
kernel/sched.c | 164 +++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched_fair.c | 18 ++++++
3 files changed, 195 insertions(+), 0 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index a23da9f..6b76df4 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -486,6 +486,19 @@ config CGROUP_SCHED

endchoice

+config CFS_HARD_LIMITS
+ bool "Hard Limits for CFS Group Scheduler"
+ depends on EXPERIMENTAL
+ depends on FAIR_GROUP_SCHED && CGROUP_SCHED
+ default n
+ help
+ This option enables hard limiting of CPU time obtained by
+ a fair task group. Use this if you want to throttle a group of tasks
+ based on its CPU usage. For more details refer to
+ Documentation/scheduler/sched-cfs-hard-limits.txt
+
+ Say N if unsure.
+
menuconfig CGROUPS
boolean "Control Group support"
help
diff --git a/kernel/sched.c b/kernel/sched.c
index 4a24d62..48d5483 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -151,6 +151,14 @@ static struct sched_bandwidth def_rt_bandwidth;

static int do_sched_rt_period_timer(struct sched_bandwidth *sched_b, int overrun);

+/*
+ * Nothing much to do now. Will be populated in subsequent hard limit patches.
+ */
+static int do_sched_cfs_period_timer(struct sched_bandwidth *sched_b, int overrun)
+{
+ return 0;
+}
+
static enum hrtimer_restart sched_period_timer(struct hrtimer *timer, int rt)
{
struct sched_bandwidth *sched_b =
@@ -168,6 +176,8 @@ static enum hrtimer_restart sched_period_timer(struct hrtimer *timer, int rt)

if (rt)
idle = do_sched_rt_period_timer(sched_b, overrun);
+ else
+ idle = do_sched_cfs_period_timer(sched_b, overrun);
}

return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
@@ -266,6 +276,7 @@ struct task_group {
/* runqueue "owned" by this group on each cpu */
struct cfs_rq **cfs_rq;
unsigned long shares;
+ struct sched_bandwidth cfs_bandwidth;
#endif

#ifdef CONFIG_RT_GROUP_SCHED
@@ -463,6 +474,7 @@ struct cfs_rq {
unsigned long rq_weight;
#endif
#endif
+ struct rq_bandwidth rq_bandwidth;
};

/* Real-Time classes' related field in a runqueue: */
@@ -2075,6 +2087,38 @@ static inline void balance_runtime(struct rq_bandwidth *rq_b,
}
#endif /* CONFIG_SMP */

+/*
+ * Runtime allowed for a cfs group before it is hard limited.
+ * default: Infinite which means no hard limiting.
+ */
+u64 sched_cfs_runtime = RUNTIME_INF;
+
+/*
+ * period over which we hard limit the cfs group's bandwidth.
+ * default: 0.5s
+ */
+u64 sched_cfs_period = 500000;
+
+static inline u64 global_cfs_period(void)
+{
+ return sched_cfs_period * NSEC_PER_USEC;
+}
+
+static inline u64 global_cfs_runtime(void)
+{
+ return RUNTIME_INF;
+}
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+/*
+ * Refresh the runtimes of the throttled groups.
+ */
+static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
+{
+ return sched_period_timer(timer, 0);
+}
+#endif
+
#include "sched_stats.h"
#include "sched_idletask.c"
#include "sched_fair.c"
@@ -9640,6 +9684,7 @@ static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
struct rq *rq = cpu_rq(cpu);
tg->cfs_rq[cpu] = cfs_rq;
init_cfs_rq(cfs_rq, rq);
+ init_rq_bandwidth(&cfs_rq->rq_bandwidth, tg->cfs_bandwidth.runtime);
cfs_rq->tg = tg;
if (add)
list_add(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
@@ -9765,6 +9810,12 @@ void __init sched_init(void)
#endif /* CONFIG_USER_SCHED */
#endif /* CONFIG_RT_GROUP_SCHED */

+#ifdef CONFIG_FAIR_GROUP_SCHED
+ init_sched_bandwidth(&init_task_group.cfs_bandwidth,
+ global_cfs_period(), global_cfs_runtime(),
+ &sched_cfs_period_timer);
+#endif
+
#ifdef CONFIG_GROUP_SCHED
list_add(&init_task_group.list, &task_groups);
INIT_LIST_HEAD(&init_task_group.children);
@@ -9791,6 +9842,8 @@ void __init sched_init(void)
init_cfs_rq(&rq->cfs, rq);
init_rt_rq(&rq->rt, rq);
#ifdef CONFIG_FAIR_GROUP_SCHED
+ init_rq_bandwidth(&rq->cfs.rq_bandwidth,
+ init_task_group.cfs_bandwidth.runtime);
init_task_group.shares = init_task_group_load;
INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
#ifdef CONFIG_CGROUP_SCHED
@@ -10070,6 +10123,7 @@ static void free_fair_sched_group(struct task_group *tg)
{
int i;

+ destroy_sched_bandwidth(&tg->cfs_bandwidth);
for_each_possible_cpu(i) {
if (tg->cfs_rq)
kfree(tg->cfs_rq[i]);
@@ -10096,6 +10150,8 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
if (!tg->se)
goto err;

+ init_sched_bandwidth(&tg->cfs_bandwidth, global_cfs_period(),
+ global_cfs_runtime(), &sched_cfs_period_timer);
tg->shares = NICE_0_LOAD;

for_each_possible_cpu(i) {
@@ -10824,6 +10880,102 @@ static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)

return (u64) tg->shares;
}
+
+#ifdef CONFIG_CFS_HARD_LIMITS
+
+static int tg_set_cfs_bandwidth(struct task_group *tg,
+ u64 cfs_period, u64 cfs_runtime)
+{
+ int i;
+
+ if (tg == &init_task_group)
+ return -EINVAL;
+
+ raw_spin_lock_irq(&tg->cfs_bandwidth.runtime_lock);
+ tg->cfs_bandwidth.period = ns_to_ktime(cfs_period);
+ tg->cfs_bandwidth.runtime = cfs_runtime;
+
+ for_each_possible_cpu(i) {
+ struct cfs_rq *cfs_rq = tg->cfs_rq[i];
+
+ raw_spin_lock(&cfs_rq->rq_bandwidth.runtime_lock);
+ cfs_rq->rq_bandwidth.runtime = cfs_runtime;
+ raw_spin_unlock(&cfs_rq->rq_bandwidth.runtime_lock);
+ }
+
+ raw_spin_unlock_irq(&tg->cfs_bandwidth.runtime_lock);
+ return 0;
+}
+
+int tg_set_cfs_runtime(struct task_group *tg, long cfs_runtime_us)
+{
+ u64 cfs_runtime, cfs_period;
+
+ cfs_period = ktime_to_ns(tg->cfs_bandwidth.period);
+ cfs_runtime = (u64)cfs_runtime_us * NSEC_PER_USEC;
+ if (cfs_runtime_us < 0)
+ cfs_runtime = RUNTIME_INF;
+
+ return tg_set_cfs_bandwidth(tg, cfs_period, cfs_runtime);
+}
+
+long tg_get_cfs_runtime(struct task_group *tg)
+{
+ u64 cfs_runtime_us;
+
+ if (tg->cfs_bandwidth.runtime == RUNTIME_INF)
+ return -1;
+
+ cfs_runtime_us = tg->cfs_bandwidth.runtime;
+ do_div(cfs_runtime_us, NSEC_PER_USEC);
+ return cfs_runtime_us;
+}
+
+int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
+{
+ u64 cfs_runtime, cfs_period;
+
+ cfs_period = (u64)cfs_period_us * NSEC_PER_USEC;
+ cfs_runtime = tg->cfs_bandwidth.runtime;
+
+ if (cfs_period == 0)
+ return -EINVAL;
+
+ return tg_set_cfs_bandwidth(tg, cfs_period, cfs_runtime);
+}
+
+long tg_get_cfs_period(struct task_group *tg)
+{
+ u64 cfs_period_us;
+
+ cfs_period_us = ktime_to_ns(tg->cfs_bandwidth.period);
+ do_div(cfs_period_us, NSEC_PER_USEC);
+ return cfs_period_us;
+}
+
+static s64 cpu_cfs_runtime_read_s64(struct cgroup *cgrp, struct cftype *cft)
+{
+ return tg_get_cfs_runtime(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_runtime_write_s64(struct cgroup *cgrp, struct cftype *cftype,
+ s64 cfs_runtime_us)
+{
+ return tg_set_cfs_runtime(cgroup_tg(cgrp), cfs_runtime_us);
+}
+
+static u64 cpu_cfs_period_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+ return tg_get_cfs_period(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
+ u64 cfs_period_us)
+{
+ return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
+}
+
+#endif /* CONFIG_CFS_HARD_LIMITS */
#endif /* CONFIG_FAIR_GROUP_SCHED */

#ifdef CONFIG_RT_GROUP_SCHED
@@ -10857,6 +11009,18 @@ static struct cftype cpu_files[] = {
.read_u64 = cpu_shares_read_u64,
.write_u64 = cpu_shares_write_u64,
},
+#ifdef CONFIG_CFS_HARD_LIMITS
+ {
+ .name = "cfs_runtime_us",
+ .read_s64 = cpu_cfs_runtime_read_s64,
+ .write_s64 = cpu_cfs_runtime_write_s64,
+ },
+ {
+ .name = "cfs_period_us",
+ .read_u64 = cpu_cfs_period_read_u64,
+ .write_u64 = cpu_cfs_period_write_u64,
+ },
+#endif /* CONFIG_CFS_HARD_LIMITS */
#endif
#ifdef CONFIG_RT_GROUP_SCHED
{
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 42ac3c9..0dfb7a5 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -205,6 +205,18 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
}
}

+static inline struct sched_bandwidth *sched_cfs_bandwidth(struct cfs_rq *cfs_rq)
+{
+ return &cfs_rq->tg->cfs_bandwidth;
+}
+
+static inline void start_cfs_bandwidth(struct cfs_rq *cfs_rq)
+{
+ if (cfs_rq->tg)
+ start_sched_bandwidth(sched_cfs_bandwidth(cfs_rq), 0);
+ return;
+}
+
#else /* !CONFIG_FAIR_GROUP_SCHED */

static inline struct task_struct *task_of(struct sched_entity *se)
@@ -265,6 +277,11 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
{
}

+static inline void start_cfs_bandwidth(struct cfs_rq *cfs_rq)
+{
+ return;
+}
+
#endif /* CONFIG_FAIR_GROUP_SCHED */


@@ -360,6 +377,7 @@ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)

rb_link_node(&se->run_node, parent, link);
rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
+ start_cfs_bandwidth(cfs_rq);
}

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)

2010-01-05 08:01:50

by Bharata B Rao

Subject: [RFC v5 PATCH 4/8] sched: Enforce hard limits by throttling

sched: Enforce hard limits by throttling.

From: Bharata B Rao <[email protected]>

Throttle the task-groups which exceed the runtime allocated to them.
Throttled group entities are removed from the run queue.
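
A toy model (plain C, not kernel code) of the check this patch adds in
sched_cfs_runtime_exceeded(): execution time is charged to the group's
bandwidth, and once it exceeds the allowed runtime the group is marked
throttled; the real code then reschedules and keeps the throttled group
entity off the runqueue until the period timer refreshes its runtime.

#include <stdio.h>
#include <stdint.h>

struct group_bw {
        uint64_t runtime_ns;    /* runtime allowed per period */
        uint64_t time_ns;       /* runtime consumed so far in this period */
        int throttled;
};

/* Charge delta_exec to the group; return 1 if it just got throttled. */
static int charge(struct group_bw *g, uint64_t delta_exec)
{
        g->time_ns += delta_exec;
        if (!g->throttled && g->time_ns > g->runtime_ns) {
                g->throttled = 1;       /* real code: resched + dequeue entity */
                return 1;
        }
        return 0;
}

int main(void)
{
        struct group_bw g = { .runtime_ns = 250000000, .time_ns = 0 };
        uint64_t tick = 100000000;      /* pretend 100ms of execution per tick */
        int i;

        for (i = 1; i <= 4; i++)
                if (charge(&g, tick))
                        printf("throttled after %d ticks (%llu ns used)\n",
                                i, (unsigned long long)g.time_ns);
        return 0;
}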

Signed-off-by: Bharata B Rao <[email protected]>
---
kernel/sched.c | 5 +
kernel/sched_fair.c | 205 +++++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 183 insertions(+), 27 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 48d5483..c91158d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1633,6 +1633,7 @@ static void update_group_shares_cpu(struct task_group *tg, int cpu,
}
}

+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
/*
* Re-compute the task group their per cpu shares over the given domain.
* This needs to be done in a bottom-up fashion because the rq weight of a
@@ -1661,8 +1662,10 @@ static int tg_shares_up(struct task_group *tg, void *data)
* If there are currently no tasks on the cpu pretend there
* is one of average load so that when a new task gets to
* run here it will not get delayed by group starvation.
+ * Also if the group is throttled on this cpu, pretend that
+ * it has no tasks.
*/
- if (!weight)
+ if (!weight || cfs_rq_throttled(tg->cfs_rq[i]))
weight = NICE_0_LOAD;

sum_weight += weight;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 0dfb7a5..d1ee88e 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -217,7 +217,66 @@ static inline void start_cfs_bandwidth(struct cfs_rq *cfs_rq)
return;
}

-#else /* !CONFIG_FAIR_GROUP_SCHED */
+#ifdef CONFIG_CFS_HARD_LIMITS
+
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+ return cfs_rq->rq_bandwidth.throttled;
+}
+
+/*
+ * Check if group entity exceeded its runtime. If so, mark the cfs_rq as
+ * throttled and mark the current task for rescheduling.
+ */
+static void sched_cfs_runtime_exceeded(struct sched_entity *se,
+ struct task_struct *tsk_curr, unsigned long delta_exec)
+{
+ struct cfs_rq *cfs_rq;
+
+ cfs_rq = group_cfs_rq(se);
+
+ if (cfs_rq->rq_bandwidth.runtime == RUNTIME_INF)
+ return;
+
+ cfs_rq->rq_bandwidth.time += delta_exec;
+
+ if (cfs_rq_throttled(cfs_rq))
+ return;
+
+ if (cfs_rq->rq_bandwidth.time > cfs_rq->rq_bandwidth.runtime) {
+ cfs_rq->rq_bandwidth.throttled = 1;
+ resched_task(tsk_curr);
+ }
+}
+
+static inline void update_curr_group(struct sched_entity *curr,
+ unsigned long delta_exec, struct task_struct *tsk_curr)
+{
+ sched_cfs_runtime_exceeded(curr, tsk_curr, delta_exec);
+}
+
+#else
+
+static inline void update_curr_group(struct sched_entity *curr,
+ unsigned long delta_exec, struct task_struct *tsk_curr)
+{
+ return;
+}
+
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+ return 0;
+}
+
+#endif /* CONFIG_CFS_HARD_LIMITS */
+
+#else /* CONFIG_FAIR_GROUP_SCHED */
+
+static inline void update_curr_group(struct sched_entity *curr,
+ unsigned long delta_exec, struct task_struct *tsk_curr)
+{
+ return;
+}

static inline struct task_struct *task_of(struct sched_entity *se)
{
@@ -282,6 +341,11 @@ static inline void start_cfs_bandwidth(struct cfs_rq *cfs_rq)
return;
}

+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+ return 0;
+}
+
#endif /* CONFIG_FAIR_GROUP_SCHED */


@@ -533,14 +597,25 @@ __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
update_min_vruntime(cfs_rq);
}

-static void update_curr(struct cfs_rq *cfs_rq)
+static void update_curr_task(struct sched_entity *curr,
+ unsigned long delta_exec)
+{
+ struct task_struct *curtask = task_of(curr);
+
+ trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
+ cpuacct_charge(curtask, delta_exec);
+ account_group_exec_runtime(curtask, delta_exec);
+}
+
+static int update_curr_common(struct cfs_rq *cfs_rq, unsigned long *delta)
{
struct sched_entity *curr = cfs_rq->curr;
- u64 now = rq_of(cfs_rq)->clock;
+ struct rq *rq = rq_of(cfs_rq);
+ u64 now = rq->clock;
unsigned long delta_exec;

if (unlikely(!curr))
- return;
+ return 1;

/*
* Get the amount of time the current task was running
@@ -549,17 +624,31 @@ static void update_curr(struct cfs_rq *cfs_rq)
*/
delta_exec = (unsigned long)(now - curr->exec_start);
if (!delta_exec)
- return;
+ return 1;

__update_curr(cfs_rq, curr, delta_exec);
curr->exec_start = now;
+ *delta = delta_exec;
+ return 0;
+}

- if (entity_is_task(curr)) {
- struct task_struct *curtask = task_of(curr);
-
- trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
- cpuacct_charge(curtask, delta_exec);
- account_group_exec_runtime(curtask, delta_exec);
+static void update_curr(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *curr = cfs_rq->curr;
+ struct rq *rq = rq_of(cfs_rq);
+ unsigned long delta_exec;
+ struct rq_bandwidth *rq_b;
+
+ if (update_curr_common(cfs_rq, &delta_exec))
+ return ;
+
+ if (entity_is_task(curr))
+ update_curr_task(curr, delta_exec);
+ else {
+ rq_b = &group_cfs_rq(curr)->rq_bandwidth;
+ raw_spin_lock(&rq_b->runtime_lock);
+ update_curr_group(curr, delta_exec, rq->curr);
+ raw_spin_unlock(&rq_b->runtime_lock);
}
}

@@ -787,6 +876,22 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
#define ENQUEUE_WAKEUP 1
#define ENQUEUE_MIGRATE 2

+static void enqueue_entity_common(struct cfs_rq *cfs_rq,
+ struct sched_entity *se, int flags)
+{
+ account_entity_enqueue(cfs_rq, se);
+
+ if (flags & ENQUEUE_WAKEUP) {
+ place_entity(cfs_rq, se, 0);
+ enqueue_sleeper(cfs_rq, se);
+ }
+
+ update_stats_enqueue(cfs_rq, se);
+ check_spread(cfs_rq, se);
+ if (se != cfs_rq->curr)
+ __enqueue_entity(cfs_rq, se);
+}
+
static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
@@ -801,17 +906,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
- account_entity_enqueue(cfs_rq, se);
-
- if (flags & ENQUEUE_WAKEUP) {
- place_entity(cfs_rq, se, 0);
- enqueue_sleeper(cfs_rq, se);
- }
-
- update_stats_enqueue(cfs_rq, se);
- check_spread(cfs_rq, se);
- if (se != cfs_rq->curr)
- __enqueue_entity(cfs_rq, se);
+ enqueue_entity_common(cfs_rq, se, flags);
}

static void __clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -959,6 +1054,28 @@ static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
return se;
}

+/*
+ * Called from put_prev_entity()
+ * If a group entity (@se) is found to be throttled, it will not be put back
+ * on @cfs_rq, which is equivalent to dequeueing it.
+ */
+static int dequeue_throttled_entity(struct cfs_rq *cfs_rq,
+ struct sched_entity *se)
+{
+ struct cfs_rq *gcfs_rq = group_cfs_rq(se);
+
+ if (entity_is_task(se))
+ return 0;
+
+ if (!cfs_rq_throttled(gcfs_rq) && gcfs_rq->nr_running)
+ return 0;
+
+ __clear_buddies(cfs_rq, se);
+ account_entity_dequeue(cfs_rq, se);
+ cfs_rq->curr = NULL;
+ return 1;
+}
+
static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
{
/*
@@ -970,6 +1087,8 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)

check_spread(cfs_rq, prev);
if (prev->on_rq) {
+ if (dequeue_throttled_entity(cfs_rq, prev))
+ return;
update_stats_wait_start(cfs_rq, prev);
/* Put 'current' back into the tree. */
__enqueue_entity(cfs_rq, prev);
@@ -1066,10 +1185,26 @@ static inline void hrtick_update(struct rq *rq)
}
#endif

+static int enqueue_group_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
+ int flags)
+{
+ struct cfs_rq *gcfs_rq = group_cfs_rq(se);
+ int ret = 0;
+
+ if (cfs_rq_throttled(gcfs_rq)) {
+ ret = 1;
+ goto out;
+ }
+ enqueue_entity(cfs_rq, se, flags);
+out:
+ return ret;
+}
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
* then put the task into the rbtree:
+ * Don't enqueue a throttled entity further into the hierarchy.
*/
static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
{
@@ -1085,11 +1220,15 @@ static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
for_each_sched_entity(se) {
if (se->on_rq)
break;
+
cfs_rq = cfs_rq_of(se);
- enqueue_entity(cfs_rq, se, flags);
+ if (entity_is_task(se))
+ enqueue_entity(cfs_rq, se, flags);
+ else
+ if (enqueue_group_entity(cfs_rq, se, flags))
+ break;
flags = ENQUEUE_WAKEUP;
}
-
hrtick_update(rq);
}

@@ -1109,6 +1248,13 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
/* Don't dequeue parent if it has other entities besides us */
if (cfs_rq->load.weight)
break;
+
+ /*
+ * If this cfs_rq is throttled, then it is already
+ * dequeued.
+ */
+ if (cfs_rq_throttled(cfs_rq))
+ break;
sleep = 1;
}

@@ -1907,9 +2053,10 @@ load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
u64 rem_load, moved_load;

/*
- * empty group
+ * empty group or throttled group
*/
- if (!busiest_cfs_rq->task_weight)
+ if (!busiest_cfs_rq->task_weight ||
+ cfs_rq_throttled(busiest_cfs_rq))
continue;

rem_load = (u64)rem_load_move * busiest_weight;
@@ -1958,6 +2105,12 @@ move_one_task_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,

for_each_leaf_cfs_rq(busiest, busy_cfs_rq) {
/*
+ * Don't move task from a throttled cfs_rq
+ */
+ if (cfs_rq_throttled(busy_cfs_rq))
+ continue;
+
+ /*
* pass busy_cfs_rq argument into
* load_balance_[start|next]_fair iterators
*/

2010-01-05 08:02:44

by Bharata B Rao

[permalink] [raw]
Subject: [RFC v5 PATCH 5/8] sched: Unthrottle the throttled tasks

sched: Unthrottle the throttled tasks.

From: Bharata B Rao <[email protected]>

Refresh runtimes when group's period expires. Unthrottle any
throttled groups at that time. Refreshing runtimes is driven through
a periodic timer.

Signed-off-by: Bharata B Rao <[email protected]>
---
kernel/sched.c | 27 ++++++++++++++-----
kernel/sched_fair.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 89 insertions(+), 9 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index c91158d..c4ab583 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -150,14 +150,7 @@ struct sched_bandwidth {
static struct sched_bandwidth def_rt_bandwidth;

static int do_sched_rt_period_timer(struct sched_bandwidth *sched_b, int overrun);
-
-/*
- * Nothing much to do now. Will be populated in subsequent hard limit patches.
- */
-static int do_sched_cfs_period_timer(struct sched_bandwidth *sched_b, int overrun)
-{
- return 0;
-}
+static int do_sched_cfs_period_timer(struct sched_bandwidth *sched_b, int overrun);

static enum hrtimer_restart sched_period_timer(struct hrtimer *timer, int rt)
{
@@ -1911,6 +1904,24 @@ struct rt_rq *sched_rt_period_rt_rq(struct sched_bandwidth *rt_b, int cpu)

#endif

+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline
+struct cfs_rq *sched_cfs_period_cfs_rq(struct sched_bandwidth *cfs_b, int cpu)
+{
+ return container_of(cfs_b, struct task_group,
+ cfs_bandwidth)->cfs_rq[cpu];
+}
+
+#else
+
+static inline
+struct cfs_rq *sched_cfs_period_cfs_rq(struct sched_bandwidth *cfs_b, int cpu)
+{
+ return &cpu_rq(cpu)->cfs;
+}
+
+#endif
+
#ifdef CONFIG_SMP

void __disable_runtime(struct rq *rq, struct sched_bandwidth *sched_b,
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index d1ee88e..f791332 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -255,6 +255,66 @@ static inline void update_curr_group(struct sched_entity *curr,
sched_cfs_runtime_exceeded(curr, tsk_curr, delta_exec);
}

+static void enqueue_entity(struct cfs_rq *cfs_rq,
+ struct sched_entity *se, int wakeup);
+
+static void enqueue_throttled_entity(struct rq *rq, struct sched_entity *se)
+{
+ for_each_sched_entity(se) {
+ struct cfs_rq *gcfs_rq = group_cfs_rq(se);
+
+ if (se->on_rq || cfs_rq_throttled(gcfs_rq) ||
+ !gcfs_rq->nr_running)
+ break;
+ enqueue_entity(cfs_rq_of(se), se, 0);
+ }
+}
+
+/*
+ * Refresh runtimes of all cfs_rqs in this group, i.e.,
+ * refresh runtimes of the representative cfs_rq of this
+ * tg on all cpus. Enqueue any throttled entity back.
+ */
+static int do_sched_cfs_period_timer(struct sched_bandwidth *cfs_b, int overrun)
+{
+ int i, idle = 1;
+ const struct cpumask *span;
+
+ if (cfs_b->runtime == RUNTIME_INF)
+ return 1;
+
+ span = sched_bw_period_mask();
+ for_each_cpu(i, span) {
+ int enqueue = 0;
+ struct rq *rq = cpu_rq(i);
+ struct cfs_rq *cfs_rq = sched_cfs_period_cfs_rq(cfs_b, i);
+ struct sched_entity *se = cfs_rq->tg->se[i];
+
+ raw_spin_lock(&rq->lock);
+ if (cfs_rq->rq_bandwidth.time) {
+ u64 runtime;
+
+ raw_spin_lock(&cfs_rq->rq_bandwidth.runtime_lock);
+ runtime = cfs_rq->rq_bandwidth.runtime;
+ cfs_rq->rq_bandwidth.time -= min(cfs_rq->rq_bandwidth.time, overrun*runtime);
+ if (cfs_rq_throttled(cfs_rq) &&
+ cfs_rq->rq_bandwidth.time < runtime) {
+ cfs_rq->rq_bandwidth.throttled = 0;
+ enqueue = 1;
+ }
+ if (cfs_rq->rq_bandwidth.time || cfs_rq->nr_running)
+ idle = 0;
+ raw_spin_unlock(&cfs_rq->rq_bandwidth.runtime_lock);
+ } else if (cfs_rq->nr_running)
+ idle = 0;
+
+ if (enqueue)
+ enqueue_throttled_entity(rq, se);
+ raw_spin_unlock(&rq->lock);
+ }
+ return idle;
+}
+
#else

static inline void update_curr_group(struct sched_entity *curr,
@@ -268,6 +328,11 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
return 0;
}

+static int do_sched_cfs_period_timer(struct sched_bandwidth *cfs_b, int overrun)
+{
+ return 0;
+}
+
#endif /* CONFIG_CFS_HARD_LIMITS */

#else /* CONFIG_FAIR_GROUP_SCHED */
@@ -346,8 +411,12 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
return 0;
}

-#endif /* CONFIG_FAIR_GROUP_SCHED */
+static int do_sched_cfs_period_timer(struct sched_bandwidth *cfs_b, int overrun)
+{
+ return 0;
+}

+#endif /* CONFIG_FAIR_GROUP_SCHED */

/**************************************************************
* Scheduling class tree data structure manipulation methods:

2010-01-05 08:03:28

by Bharata B Rao

[permalink] [raw]
Subject: [RFC v5 PATCH 6/8] sched: Add throttle time statistics to /proc/sched_debug

sched: Add throttle time statistics to /proc/sched_debug

From: Bharata B Rao <[email protected]>

With hard limits, provide stats about throttle time, throttle count
and max throttle time for group sched entities in /proc/sched_debug.
Throttle stats are collected only for group entities.

Signed-off-by: Bharata B Rao <[email protected]>
---
include/linux/sched.h | 6 ++++++
kernel/sched_debug.c | 17 ++++++++++++++++-
kernel/sched_fair.c | 20 ++++++++++++++++++++
3 files changed, 42 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f2f842d..025a594 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1191,6 +1191,12 @@ struct sched_entity {
u64 nr_wakeups_affine_attempts;
u64 nr_wakeups_passive;
u64 nr_wakeups_idle;
+#ifdef CONFIG_CFS_HARD_LIMITS
+ u64 throttle_start;
+ u64 throttle_max;
+ u64 throttle_count;
+ u64 throttle_sum;
+#endif
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 1b67698..c833aa2 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -80,6 +80,11 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu,
PN(se->wait_max);
PN(se->wait_sum);
P(se->wait_count);
+#ifdef CONFIG_CFS_HARD_LIMITS
+ PN(se->throttle_max);
+ PN(se->throttle_sum);
+ P(se->throttle_count);
+#endif
#endif
P(se->load.weight);
#undef PN
@@ -214,6 +219,16 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
#ifdef CONFIG_SMP
SEQ_printf(m, " .%-30s: %lu\n", "shares", cfs_rq->shares);
#endif
+#ifdef CONFIG_CFS_HARD_LIMITS
+ raw_spin_lock_irqsave(&rq->lock, flags);
+ SEQ_printf(m, " .%-30s: %d\n", "rq_bandwidth.throttled",
+ cfs_rq->rq_bandwidth.throttled);
+ SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "rq_bandwidth.time",
+ SPLIT_NS(cfs_rq->rq_bandwidth.time));
+ SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "rq_bandwidth.runtime",
+ SPLIT_NS(cfs_rq->rq_bandwidth.runtime));
+ raw_spin_unlock_irqrestore(&rq->lock, flags);
+#endif /* CONFIG_CFS_HARD_LIMITS */
print_cfs_group_stats(m, cpu, cfs_rq->tg);
#endif
}
@@ -320,7 +335,7 @@ static int sched_debug_show(struct seq_file *m, void *v)
u64 now = ktime_to_ns(ktime_get());
int cpu;

- SEQ_printf(m, "Sched Debug Version: v0.09, %s %.*s\n",
+ SEQ_printf(m, "Sched Debug Version: v0.10, %s %.*s\n",
init_utsname()->release,
(int)strcspn(init_utsname()->version, " "),
init_utsname()->version);
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f791332..16ed209 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -219,6 +219,23 @@ static inline void start_cfs_bandwidth(struct cfs_rq *cfs_rq)

#ifdef CONFIG_CFS_HARD_LIMITS

+static inline void update_stats_throttle_start(struct cfs_rq *cfs_rq,
+ struct sched_entity *se)
+{
+ schedstat_set(se->throttle_start, rq_of(cfs_rq)->clock);
+}
+
+static inline void update_stats_throttle_end(struct cfs_rq *cfs_rq,
+ struct sched_entity *se)
+{
+ schedstat_set(se->throttle_max, max(se->throttle_max,
+ rq_of(cfs_rq)->clock - se->throttle_start));
+ schedstat_set(se->throttle_count, se->throttle_count + 1);
+ schedstat_set(se->throttle_sum, se->throttle_sum +
+ rq_of(cfs_rq)->clock - se->throttle_start);
+ schedstat_set(se->throttle_start, 0);
+}
+
static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
{
return cfs_rq->rq_bandwidth.throttled;
@@ -245,6 +262,7 @@ static void sched_cfs_runtime_exceeded(struct sched_entity *se,

if (cfs_rq->rq_bandwidth.time > cfs_rq->rq_bandwidth.runtime) {
cfs_rq->rq_bandwidth.throttled = 1;
+ update_stats_throttle_start(cfs_rq, se);
resched_task(tsk_curr);
}
}
@@ -300,6 +318,8 @@ static int do_sched_cfs_period_timer(struct sched_bandwidth *cfs_b, int overrun)
if (cfs_rq_throttled(cfs_rq) &&
cfs_rq->rq_bandwidth.time < runtime) {
cfs_rq->rq_bandwidth.throttled = 0;
+ update_rq_clock(rq);
+ update_stats_throttle_end(cfs_rq, se);
enqueue = 1;
}
if (cfs_rq->rq_bandwidth.time || cfs_rq->nr_running)

2010-01-05 08:04:14

by Bharata B Rao

[permalink] [raw]
Subject: [RFC v5 PATCH 7/8] sched: CFS runtime borrowing

sched: CFS runtime borrowing

From: Bharata B Rao <[email protected]>

Before throttling a group on a CPU, try to borrow runtime from other CPUs where the group has unused runtime.

To start with, a group will get equal runtime on every cpu. If the group doesn't
have tasks on all cpus, it might get throttled on some cpus while it still has
runtime left on other cpus where it doesn't have any tasks to consume that
runtime. Hence there is a chance to move runtime from such cpus/cfs_rqs to the
cpus/cfs_rqs where it is required.
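
As a rough illustration with made-up numbers: if each of 4 CPUs starts a 500ms
period with 250ms of runtime for the group, and the group runs only a single
CPU hog, that one CPU gets throttled after 250ms while 750ms of the group's
runtime sits unused on the other three CPUs. Borrowing lets the busy CPU top
itself up from that unused runtime (bounded by the period) instead of staying
throttled for the rest of the period.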

CHECK: RT seems to be handling runtime initialization/reclaim during hotplug
from multiple places (migration_call, update_runtime). Need to check if CFS
also needs to do the same.

Signed-off-by: Kamalesh Babulal <[email protected]>
Signed-off-by: Bharata B Rao <[email protected]>
---
kernel/sched.c | 13 +++++++++++++
kernel/sched_fair.c | 42 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index c4ab583..c9922ec 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1960,6 +1960,8 @@ void __disable_runtime(struct rq *rq, struct sched_bandwidth *sched_b,

if (rt)
iter = &(sched_rt_period_rt_rq(sched_b, i)->rq_bandwidth);
+ else
+ iter = &(sched_cfs_period_cfs_rq(sched_b, i)->rq_bandwidth);
/*
* Can't reclaim from ourselves or disabled runqueues.
*/
@@ -1999,12 +2001,16 @@ balanced:
}

void disable_runtime_rt(struct rq *rq);
+void disable_runtime_cfs(struct rq *rq);
static void disable_runtime(struct rq *rq)
{
unsigned long flags;

raw_spin_lock_irqsave(&rq->lock, flags);
disable_runtime_rt(rq);
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_CFS_HARD_LIMITS)
+ disable_runtime_cfs(rq);
+#endif
raw_spin_unlock_irqrestore(&rq->lock, flags);
}

@@ -2021,12 +2027,16 @@ void __enable_runtime(struct sched_bandwidth *sched_b,
}

void enable_runtime_rt(struct rq *rq);
+void enable_runtime_cfs(struct rq *rq);
static void enable_runtime(struct rq *rq)
{
unsigned long flags;

raw_spin_lock_irqsave(&rq->lock, flags);
enable_runtime_rt(rq);
+#if defined(config_fair_group_sched) && defined(config_cfs_hard_limits)
+ enable_runtime_cfs(rq);
+#endif
raw_spin_unlock_irqrestore(&rq->lock, flags);
}

@@ -2050,6 +2060,9 @@ static void do_balance_runtime(struct rq_bandwidth *rq_b,

if (rt)
iter = &(sched_rt_period_rt_rq(sched_b, i)->rq_bandwidth);
+ else
+ iter = &(sched_cfs_period_cfs_rq(sched_b, i)->rq_bandwidth);
+
if (iter == rq_b)
continue;

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 16ed209..dcd093b 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -241,6 +241,41 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
return cfs_rq->rq_bandwidth.throttled;
}

+#ifdef CONFIG_SMP
+/*
+ * Ensure this RQ takes back all the runtime it lent to its neighbours.
+ */
+void disable_runtime_cfs(struct rq *rq)
+{
+ struct cfs_rq *cfs_rq;
+
+ if (unlikely(!scheduler_running))
+ return;
+
+ for_each_leaf_cfs_rq(rq, cfs_rq) {
+ struct sched_bandwidth *sched_b = sched_cfs_bandwidth(cfs_rq);
+ __disable_runtime(rq, sched_b, &cfs_rq->rq_bandwidth, 0);
+ }
+}
+
+void enable_runtime_cfs(struct rq *rq)
+{
+ struct cfs_rq *cfs_rq;
+
+ if (unlikely(!scheduler_running))
+ return;
+
+ /*
+ * Reset each runqueue's bandwidth settings
+ */
+ for_each_leaf_cfs_rq(rq, cfs_rq) {
+ struct sched_bandwidth *sched_b = sched_cfs_bandwidth(cfs_rq);
+ __enable_runtime(sched_b, &cfs_rq->rq_bandwidth);
+ }
+}
+
+#endif /* CONFIG_SMP */
+
/*
 * Check if the group entity exceeded its runtime. If so, mark the cfs_rq as
 * throttled and mark the current task for rescheduling.
@@ -260,6 +295,10 @@ static void sched_cfs_runtime_exceeded(struct sched_entity *se,
if (cfs_rq_throttled(cfs_rq))
return;

+ if (cfs_rq->rq_bandwidth.time > cfs_rq->rq_bandwidth.runtime)
+ balance_runtime(&cfs_rq->rq_bandwidth,
+ sched_cfs_bandwidth(cfs_rq), 0);
+
if (cfs_rq->rq_bandwidth.time > cfs_rq->rq_bandwidth.runtime) {
cfs_rq->rq_bandwidth.throttled = 1;
update_stats_throttle_start(cfs_rq, se);
@@ -313,6 +352,9 @@ static int do_sched_cfs_period_timer(struct sched_bandwidth *cfs_b, int overrun)
u64 runtime;

raw_spin_lock(&cfs_rq->rq_bandwidth.runtime_lock);
+ if (cfs_rq_throttled(cfs_rq))
+ balance_runtime(&cfs_rq->rq_bandwidth,
+ sched_cfs_bandwidth(cfs_rq), 0);
runtime = cfs_rq->rq_bandwidth.runtime;
cfs_rq->rq_bandwidth.time -= min(cfs_rq->rq_bandwidth.time, overrun*runtime);
if (cfs_rq_throttled(cfs_rq) &&

2010-01-05 08:05:07

by Bharata B Rao

[permalink] [raw]
Subject: [RFC v5 PATCH 8/8] sched: Hard limits documentation

sched: Hard limits documentation

From: Bharata B Rao <[email protected]>

Documentation for hard limits feature.

Signed-off-by: Bharata B Rao <[email protected]>
---
Documentation/scheduler/sched-cfs-hard-limits.txt | 48 +++++++++++++++++++++
1 files changed, 48 insertions(+), 0 deletions(-)
create mode 100644 Documentation/scheduler/sched-cfs-hard-limits.txt

diff --git a/Documentation/scheduler/sched-cfs-hard-limits.txt b/Documentation/scheduler/sched-cfs-hard-limits.txt
new file mode 100644
index 0000000..d6387af
--- /dev/null
+++ b/Documentation/scheduler/sched-cfs-hard-limits.txt
@@ -0,0 +1,48 @@
+CPU HARD LIMITS FOR CFS GROUPS
+==============================
+
+1. Overview
+2. Interface
+3. Examples
+
+1. Overview
+-----------
+
+CFS is a proportional share scheduler which tries to divide the CPU time
+proportionately between tasks or groups of tasks (task group/cgroup) depending
+on the priority/weight of the task or shares assigned to groups of tasks.
+In CFS, a task/task group can get more than its share of CPU if there are
+enough idle CPU cycles available in the system, due to the work conserving
+nature of the scheduler. However in certain scenarios (like pay-per-use),
+it is desirable not to provide extra time to a group even in the presence
+of idle CPU cycles. This is where hard limiting can be of use.
+
+Hard limits for task groups can be set by specifying how much CPU runtime a
+group can consume within a given period. If the group consumes more CPU time
+than the runtime in a given period, it gets throttled. None of the tasks of
+the throttled group gets to run until the runtime of the group gets refreshed
+at the beginning of the next period.
+
+2. Interface
+------------
+
+The hard limit feature adds 2 cgroup files for the CFS group scheduler:
+
+cfs_runtime_us: Hard limit for the group in microseconds.
+
+cfs_period_us: Time period in microseconds within which the hard limit is
+enforced.
+
+A group gets created with default values for runtime (infinite runtime which
+means hard limits disabled) and period (0.5s). Each group can set its own
+values for runtime and period independent of other groups in the system.
+
+3. Examples
+-----------
+
+# mount -t cgroup -ocpu none /cgroups/
+# cd /cgroups
+# mkdir 1
+# cd 1/
+# echo 250000 > cfs_runtime_us /* set a 250ms runtime or limit */
+# echo 500000 > cfs_period_us /* set a 500ms period */
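
A follow-on usage note (not part of the patch above): the limit only applies to
tasks placed in the group, so, assuming the standard cgroup "tasks" file,
something like the following is also needed:

# echo $$ > tasks /* move the current shell into the limited group */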

2010-01-05 08:06:36

by Bharata B Rao

[permalink] [raw]
Subject: Re: [RFC v5 PATCH 0/8] CFS Hard limits - v5

On Tue, Jan 05, 2010 at 01:27:03PM +0530, Bharata B Rao wrote:
> Hi,
>
> This is the v5 post of CFS hard limits. In this patchset, I have
> pulled out bandwidth and runtime handling code from RT into sched.c
> so that the same code can be used by CFS hard limits.
>
> Also I have addressed the review comments given by Peter Zijlstra for v4.

And I forgot to mention that this set applies on 2.6.33-rc2.

Regards,
Bharata.

2010-01-06 05:02:49

by Bharata B Rao

[permalink] [raw]
Subject: Re: [RFC v5 PATCH 7/8] sched: CFS runtime borrowing

On Tue, Jan 05, 2010 at 01:33:46PM +0530, Bharata B Rao wrote:
> sched: CFS runtime borrowing
>
> static void enable_runtime(struct rq *rq)
> {
> unsigned long flags;
>
> raw_spin_lock_irqsave(&rq->lock, flags);
> enable_runtime_rt(rq);
> +#if defined(config_fair_group_sched) && defined(config_cfs_hard_limits)

Got the above config options wrong. Resending this patch again with correction.

Regards,
Bharata.

sched: CFS runtime borrowing

From: Bharata B Rao <[email protected]>

Before throttling a group on a CPU, try to borrow runtime from other CPUs where the group has unused runtime.

To start with, a group will get equal runtime on every cpu. If the group doesn't
have tasks on all cpus, it might get throttled on some cpus while it still has
runtime left on other cpus where it doesn't have any tasks to consume that
runtime. Hence there is a chance to move runtime from such cpus/cfs_rqs to the
cpus/cfs_rqs where it is required.

CHECK: RT seems to be handling runtime initialization/reclaim during hotplug
from multiple places (migration_call, update_runtime). Need to check if CFS
also needs to do the same.

Signed-off-by: Kamalesh Babulal <[email protected]>
Signed-off-by: Bharata B Rao <[email protected]>
---
kernel/sched.c | 13 +++++++++++++
kernel/sched_fair.c | 42 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index c4ab583..857e567 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1960,6 +1960,8 @@ void __disable_runtime(struct rq *rq, struct sched_bandwidth *sched_b,

if (rt)
iter = &(sched_rt_period_rt_rq(sched_b, i)->rq_bandwidth);
+ else
+ iter = &(sched_cfs_period_cfs_rq(sched_b, i)->rq_bandwidth);
/*
* Can't reclaim from ourselves or disabled runqueues.
*/
@@ -1999,12 +2001,16 @@ balanced:
}

void disable_runtime_rt(struct rq *rq);
+void disable_runtime_cfs(struct rq *rq);
static void disable_runtime(struct rq *rq)
{
unsigned long flags;

raw_spin_lock_irqsave(&rq->lock, flags);
disable_runtime_rt(rq);
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_CFS_HARD_LIMITS)
+ disable_runtime_cfs(rq);
+#endif
raw_spin_unlock_irqrestore(&rq->lock, flags);
}

@@ -2021,12 +2027,16 @@ void __enable_runtime(struct sched_bandwidth *sched_b,
}

void enable_runtime_rt(struct rq *rq);
+void enable_runtime_cfs(struct rq *rq);
static void enable_runtime(struct rq *rq)
{
unsigned long flags;

raw_spin_lock_irqsave(&rq->lock, flags);
enable_runtime_rt(rq);
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_CFS_HARD_LIMITS)
+ enable_runtime_cfs(rq);
+#endif
raw_spin_unlock_irqrestore(&rq->lock, flags);
}

@@ -2050,6 +2060,9 @@ static void do_balance_runtime(struct rq_bandwidth *rq_b,

if (rt)
iter = &(sched_rt_period_rt_rq(sched_b, i)->rq_bandwidth);
+ else
+ iter = &(sched_cfs_period_cfs_rq(sched_b, i)->rq_bandwidth);
+
if (iter == rq_b)
continue;

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 16ed209..dcd093b 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -241,6 +241,41 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
return cfs_rq->rq_bandwidth.throttled;
}

+#ifdef CONFIG_SMP
+/*
+ * Ensure this RQ takes back all the runtime it lent to its neighbours.
+ */
+void disable_runtime_cfs(struct rq *rq)
+{
+ struct cfs_rq *cfs_rq;
+
+ if (unlikely(!scheduler_running))
+ return;
+
+ for_each_leaf_cfs_rq(rq, cfs_rq) {
+ struct sched_bandwidth *sched_b = sched_cfs_bandwidth(cfs_rq);
+ __disable_runtime(rq, sched_b, &cfs_rq->rq_bandwidth, 0);
+ }
+}
+
+void enable_runtime_cfs(struct rq *rq)
+{
+ struct cfs_rq *cfs_rq;
+
+ if (unlikely(!scheduler_running))
+ return;
+
+ /*
+ * Reset each runqueue's bandwidth settings
+ */
+ for_each_leaf_cfs_rq(rq, cfs_rq) {
+ struct sched_bandwidth *sched_b = sched_cfs_bandwidth(cfs_rq);
+ __enable_runtime(sched_b, &cfs_rq->rq_bandwidth);
+ }
+}
+
+#endif /* CONFIG_SMP */
+
/*
 * Check if the group entity exceeded its runtime. If so, mark the cfs_rq as
 * throttled and mark the current task for rescheduling.
@@ -260,6 +295,10 @@ static void sched_cfs_runtime_exceeded(struct sched_entity *se,
if (cfs_rq_throttled(cfs_rq))
return;

+ if (cfs_rq->rq_bandwidth.time > cfs_rq->rq_bandwidth.runtime)
+ balance_runtime(&cfs_rq->rq_bandwidth,
+ sched_cfs_bandwidth(cfs_rq), 0);
+
if (cfs_rq->rq_bandwidth.time > cfs_rq->rq_bandwidth.runtime) {
cfs_rq->rq_bandwidth.throttled = 1;
update_stats_throttle_start(cfs_rq, se);
@@ -313,6 +352,9 @@ static int do_sched_cfs_period_timer(struct sched_bandwidth *cfs_b, int overrun)
u64 runtime;

raw_spin_lock(&cfs_rq->rq_bandwidth.runtime_lock);
+ if (cfs_rq_throttled(cfs_rq))
+ balance_runtime(&cfs_rq->rq_bandwidth,
+ sched_cfs_bandwidth(cfs_rq), 0);
runtime = cfs_rq->rq_bandwidth.runtime;
cfs_rq->rq_bandwidth.time -= min(cfs_rq->rq_bandwidth.time, overrun*runtime);
if (cfs_rq_throttled(cfs_rq) &&

2010-01-08 20:50:34

by Paul Turner

[permalink] [raw]
Subject: Re: [RFC v5 PATCH 0/8] CFS Hard limits - v5

Hi Bharata,

Thanks for the updated patchset. As discussed the other day briefly I have some
concerns with the usage of the current RT bandwidth rate-limiting code as there
are some assumptions made that I feel don't fit the general case well.

The reason for this is that the borrowing logic in bandwidth rebalance appears
to make the assumption that we will be able to converge rapidly to the
period. Indeed, in each iteration of redistribution we take only
1/weight(nrcpus) [assuming no cpuset partitions] of the time remaining. This is
a decreasing series, and if we can't exceed the period our convergence looks
pretty slow [geometric series].

In general it appears the following relation must be satisfied for efficient
execution: (weight(nr_cpus) * runtime) >> period
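
To make the series concrete, here is a toy user-space model of the
"take 1/nr_cpus of what is still missing per pass" behaviour described above
(illustrative only, not the kernel's balance_runtime(); the numbers assume one
core's worth of bandwidth on a 64-CPU machine with a 250ms period):

#include <stdio.h>

int main(void)
{
	const int nr_cpus = 64;
	const double period = 250.0;		/* ms */
	double runtime = period / nr_cpus;	/* ~3.9 ms of local runtime */
	double missing = period - runtime;
	int passes = 0;

	/* Each redistribution pass recovers only 1/nr_cpus of what is
	 * still missing, so the shortfall decays geometrically. */
	while (missing > 0.01 * period && passes < 1000) {
		missing -= missing / nr_cpus;
		passes++;
	}
	printf("%d passes to get within 1%% of the %.0f ms period\n",
	       passes, period);
	return 0;
}

This prints a pass count in the hundreds, and each such pass is a
lock-protected walk over every other CPU's runtime.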

This assumption is well satisfied in the RT case since the available bandwidth
is very high. However I fear the general case of user limits on tg usage will
lie at the other end of the spectrum, especially for those trying to partition
large machines into many smaller well provisioned fractions, e.g. 0-2 cores out
of a total 64. The lock and re-distribution cost for each iteration is also
going to be quite high in this case which will potentially compound on the
number of iterations required above.

What are your thoughts on using a separate mechanism for the general case? A
draft proposal follows:

- Maintain a global run-time pool for each tg. The runtime specified by the
user represents the value that this pool will be refilled to each period.
- We continue to maintain the local notion of runtime/period in each cfs_rq,
continue to accumulate locally here.

Upon locally exceeding the period acquire new credit from the global pool
(either under lock or more likely using atomic ops). This can either be in
fixed steppings (e.g. 10ms, could be tunable) or following some quasi-curve
variant with historical demand.

One caveat here is that there is some over-commit in the system, the local
differences of runtime vs period represent additional over the global pool.
However it should not be possible to consistently exceed limits since the rate
of refill is gated by the runtime being input into the system via the per-tg
pool.

This would also naturally associate with an interface change that would mean the
runtime limit for a group would be the effective cpurate within the period.

e.g. by setting a runtime of 200000us on a 100000us period it would effectively
allow you to use 2 cpus worth of wall-time on a multicore system.

I feel this is slightly more natural than the current definition which due to
being local means that values set will not result in consistent behavior across
machines of different core counts. It also has the benefit of being consistent
with observed exports of time consumed, e.g. rusage, (indirectly) time, etc.

For future scalability as machine size grows this could potentially be
partitioned below the tg level along the boundaries of sched_domains (or
something similar). However for an initial draft given current machine sizes
the contention on the global pool should hopefully be fairly low.

- Paul

2010-01-29 03:49:26

by Bharata B Rao

[permalink] [raw]
Subject: Re: [RFC v5 PATCH 0/8] CFS Hard limits - v5

On Sat, Jan 9, 2010 at 2:15 AM, Paul Turner <[email protected]> wrote:
>
> Hi Bharata,

Hi Paul,

Sorry for the late reply. Since you removed the CC-list, I didn't get
this mail in my inbox and hence didn't notice this at all until this
time!

[Putting back the original CC list in this mail]

>
> Thanks for the updated patchset. As discussed the other day briefly I have some
> concerns with the usage of the current RT bandwidth rate-limiting code as there
> are some assumptions made that I feel don't fit the general case well.
>
> The reason for this is that the borrowing logic in bandwidth rebalance appears
> to make the assumption that we will be able to converge rapidly to the
> period. Indeed, in each iteration of redistribution we take only
> 1/weight(nrcpus) [assuming no cpuset partitions] of the time remaining. This is
> a decreasing series, and if we can't exceed the period our convergence looks
> pretty slow [geometric series].
>
> In general it appears the following relation must be satisfied for efficient
> execution: (weight(nr_cpus) * runtime) >> period
>
> This assumption is well satisfied in the RT case since the available bandwidth
> is very high. However I fear the general case of user limits on tg usage will
> lie at the other end of the spectrum, especially for those trying to partition
> large machines into many smaller well provisioned fractions, e.g. 0-2 cores out
> of a total 64. The lock and re-distribution cost for each iteration is also
> going to be quite high in this case which will potentially compound on the
> number of iterations required above.

I see your point. Having a runtime which is much smaller than the
period will result in a lot of iterations of borrowing from every CPU
before the source CPU accumulates the maximum possible runtime.

Apart from this, I also see that after accumulating the maximum
possible runtime from all CPUs, the task sometimes moves to another
CPU due to load balancing. When this happens, the new CPU starts the
borrowing iterations all over again!

As you observe, borrowing just 1/nth (n = number of CPUs) of the
spare runtime from each CPU is not ideal for CFS if runtimes are going
to be much smaller than the period, unlike RT. This would involve iterating
through all the CPUs and acquiring/releasing a spinlock in each
iteration.

>
> What are your thoughts on using a separate mechanism for the general case? A
> draft proposal follows:
>
> - Maintain a global run-time pool for each tg. The runtime specified by the
> user represents the value that this pool will be refilled to each period.
> - We continue to maintain the local notion of runtime/period in each cfs_rq,
> continue to accumulate locally here.
>
> Upon locally exceeding the period acquire new credit from the global pool
> (either under lock or more likely using atomic ops). This can either be in
> fixed steppings (e.g. 10ms, could be tunable) or following some quasi-curve
> variant with historical demand.
>
> One caveat here is that there is some over-commit in the system, the local
> differences of runtime vs period represent additional over the global pool.
> However it should not be possible to consistently exceed limits since the rate
> of refill is gated by the runtime being input into the system via the per-tg
> pool.
>

We borrow from what is actually available as spare (spare = unused or
remaining). With a global pool, I see that would be difficult. Is the
inability/difficulty of keeping the global pool in sync with the
actually available spare time the reason for the over-commit?

> This would also naturally associate with an interface change that would mean the
> runtime limit for a group would be the effective cpurate within the period.
>
> e.g. by setting a runtime of 200000us on a 100000us period it would effectively
> allow you to use 2 cpus worth of wall-time on a multicore system.
>
> I feel this is slightly more natural than the current definition which due to
> being local means that values set will not result in consistent behavior across
> machines of different core counts. ?It also has the benefit of being consistent
> with observed exports of time consumed, e.g. rusage, (indirectly) time, etc.

Though runtimes are enforced locally per-cpu, that's only the
implementation. The definition of runtime and period is still
system-wide/global. A runtime/period=0.25/0.5 will mean 0.25s of
system-wide runtime within a period of 0.5s. Talking about a consistent
definition, I would say this consistently defines half of the system-wide
wall-time on all configurations :) If it means 2 CPUs' worth of wall-time
on a 4 core machine, it would mean 4 CPUs' worth on an 8 CPU machine. At this
point, I am inclined to go with this and let the admins/tools work out
the actual CPUs part of it. However I would like to hear what others
think about this interface.

>
> For future scalability as machine size grows this could potentially be
> partitioned below the tg level along the boundaries of sched_domains (or
> something similar). ?However for an initial draft given current machine sizes
> the contention on the global pool should hopefully be fairly low.

One of the alternatives I have in mind is to be more aggressive while
borrowing. While keeping the current algorithm (of iterating thro' all
CPUs when borrowing) intact, we could potentially borrow more from
those CPUs which don't have any running task from the given group. I
just experimented with borrowing half of the available runtime from
such CPUs and found that number of iterations are greatly reduced and
the source runtime quickly converges to its max possible value. Do you
see any issues with this ?
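
As a rough back-of-the-envelope illustration (a model, not a measurement): if
the group currently runs on just one CPU of a 64-CPU machine, a pass that takes
spare/64 from each of the other 63 CPUs recovers only about 1/64 of the
outstanding spare per pass, needing a few hundred passes (and a lock
acquisition per CPU per pass) to approach the period, while taking spare/2 from
the CPUs with no runnable tasks of the group recovers about half of the
outstanding spare per pass and converges within a handful of passes.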

Thanks for your note.

Regards,
Bharata.
--
http://bharata.sulekha.com/blog/posts.htm, http://raobharata.wordpress.com/

2010-01-29 04:26:18

by Paul Turner

[permalink] [raw]
Subject: Re: [RFC v5 PATCH 0/8] CFS Hard limits - v5

On Thu, Jan 28, 2010 at 7:49 PM, Bharata B Rao <[email protected]> wrote:
> On Sat, Jan 9, 2010 at 2:15 AM, Paul Turner <[email protected]> wrote:
>>
>> Hi Bharata,
>
> Hi Paul,
>
> Sorry for the late reply. Since you removed the CC-list, I didn't get
> this mail in my inbox and hence didn't notice this at all until this
> time!
>
> [Putting back the original CC list in this mail]

Sorry! I replied through GMane since I didn't have the headers from
the original mail, I didn't realize it breaks the cc list!

>
>>
>> Thanks for the updated patchset. As discussed the other day briefly I have some
>> concerns with the usage of the current RT bandwidth rate-limiting code as there
>> are some assumptions made that I feel don't fit the general case well.
>>
>> The reason for this is that the borrowing logic in bandwidth rebalance appears
>> to make the assumption that we will be able to converge rapidly to the
>> period. Indeed, in each iteration of redistribution we take only
>> 1/weight(nrcpus) [assuming no cpuset partitions] of the time remaining. This is
>> a decreasing series, and if we can't exceed the period our convergence looks
>> pretty slow [geometric series].
>>
>> In general it appears the following relation must be satisfied for efficient
>> execution: (weight(nr_cpus) * runtime) >> period
>>
>> This assumption is well satisfied in the RT case since the available bandwidth
>> is very high. However I fear the general case of user limits on tg usage will
>> lie at the other end of the spectrum, especially for those trying to partition
>> large machines into many smaller well provisioned fractions, e.g. 0-2 cores out
>> of a total 64. The lock and re-distribution cost for each iteration is also
>> going to be quite high in this case which will potentially compound on the
>> number of iterations required above.
>
> I see your point. Having a runtime which is much lesser than the
> period will result in a lot of iterations of borrowing from every CPU
> before the source CPU accumulates the maximum possible runtime.
>
> Apart from this, I also see that after accumulating the maximum
> possible runtime from all CPUs, the task sometimes moves to another
> CPU due to load balancing. When this happens, the new CPU starts the
> borrowing iterations all over again!

Yes, we later confirmed this with some benchmarks that showed the cost
to be quite high indeed! (e.g. try a while(1) thread with a provision
of 1-core on a larger (16+ core) machine).

>
> As you observe, borrowing just 1/n th (n = number of CPUs) of the
> spare runtime from each CPU is not ideal for CFS if runtimes are going
> to be much lesser than period unlike RT. This would involve iterating
> through all the CPUs and acquiring/releasing a spinlock in each
> iteration.
>
>>
>> What are your thoughts on using a separate mechanism for the general case? A
>> draft proposal follows:
>>
>> - Maintain a global run-time pool for each tg. The runtime specified by the
>> user represents the value that this pool will be refilled to each period.
>> - We continue to maintain the local notion of runtime/period in each cfs_rq,
>> continue to accumulate locally here.
>>
>> Upon locally exceeding the period acquire new credit from the global pool
>> (either under lock or more likely using atomic ops). This can either be in
>> fixed steppings (e.g. 10ms, could be tunable) or following some quasi-curve
>> variant with historical demand.
>>
>> One caveat here is that there is some over-commit in the system, the local
>> differences of runtime vs period represent additional over the global pool.
>> However it should not be possible to consistently exceed limits since the rate
>> of refill is gated by the runtime being input into the system via the per-tg
>> pool.
>>
>
> We borrow from what is actually available as spare (spare = unused or
> remaining). With global pool, I see that would be difficult.
> Inability/difficulty in keeping the global pool in sync with the
> actual available spare time is the reason for over-commit ?
>

We maintain two pools, a global pool (new) and a per-cfs_rq pool
(similar to existing rt_bw).

When consuming time you charge vs your local bandwidth until it is
expired; at this point you must either refill from the global pool or
throttle.

The "slack" in the system is the sum of unconsumed time in local pools
from the *previous* global pool refill. This is bounded above by the
amount of time added to a local pool at each expiry. We call the size
of this refill a 'slice'.

e.g.

Task limit of 50ms, slice=10ms, 4cpus, period of 500ms

Task A runs on cpus 0 and 1 for 5ms each, then blocks.

When A first executes on each cpu we take slice=10ms from the global
pool of 50ms and apply it to the local rq. Execution then proceeds vs
local pool.

Current state is: 5 ms in local pools on {0,1}, 30ms remaining in global pool

Upon period expiration we issue a global pool refill. At this point we have:
5 ms in local pools on {0,1}, 50ms remaining in global pool.

That 10ms of slack time is over-commit in the system. However it
should be clear that this can only be a local effect since over any
period of time the rate of input into the system is limited by global
pool refill rate.
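
A minimal user-space sketch of that accounting, just to pin the example down
(illustrative only, not the proposed kernel code; the names are invented and
locking/atomics are glossed over):

#include <stdio.h>
#include <stdbool.h>

struct global_pool { double quota, remaining; };	/* per-tg, ms */
struct local_pool { double remaining; };		/* per-cfs_rq, ms */

static const double slice = 10.0;	/* refill granularity from global pool */

static void period_refill(struct global_pool *g)
{
	g->remaining = g->quota;	/* unconsumed local time becomes slack */
}

/* Charge delta ms of execution; false means the group must throttle. */
static bool charge(struct global_pool *g, struct local_pool *l, double delta)
{
	while (l->remaining < delta && g->remaining > 0) {
		double want = slice < g->remaining ? slice : g->remaining;
		g->remaining -= want;	/* atomic or locked in a real kernel */
		l->remaining += want;
	}
	if (l->remaining < delta)
		return false;
	l->remaining -= delta;
	return true;
}

int main(void)
{
	struct global_pool g = { 50.0, 50.0 };	/* 50ms limit per period */
	struct local_pool cpu0 = { 0 }, cpu1 = { 0 };

	charge(&g, &cpu0, 5.0);		/* task A: 5ms on cpu0 ... */
	charge(&g, &cpu1, 5.0);		/* ... and 5ms on cpu1, then blocks */
	printf("local {0,1}: %.0f/%.0f ms, global: %.0f ms\n",
	       cpu0.remaining, cpu1.remaining, g.remaining);

	period_refill(&g);
	printf("after refill: global %.0f ms, slack %.0f ms\n",
	       g.remaining, cpu0.remaining + cpu1.remaining);
	return 0;
}

Running this reproduces the numbers above: 5ms left in each local pool, 30ms in
the global pool before the refill, 50ms after it, with 10ms of slack.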

There are also some strategies that we are exploring to improve
behavior further here. One idea is that if we maintain a generation
counter then on voluntary dequeue (e.g. tasks block) we can return
local time to the global period pool or expire it (if generations
don't match); this greatly reduces the noise (via slack vs the ideal
limit) that a bursty application can induce.

>> This would also naturally associate with an interface change that would mean the
>> runtime limit for a group would be the effective cpurate within the period.
>>
>> e.g. by setting a runtime of 200000us on a 100000us period it would effectively
>> allow you to use 2 cpus worth of wall-time on a multicore system.
>>
>> I feel this is slightly more natural than the current definition which due to
>> being local means that values set will not result in consistent behavior across
>> machines of different core counts. It also has the benefit of being consistent
>> with observed exports of time consumed, e.g. rusage, (indirectly) time, etc.
>
> Though runtimes are enforced locally per-cpu, that's only the
> implementation. The definition of runtime and period is still
> system-wide/global. A runtime/period=0.25/0.5 will mean 0.25s of
> system wide runtime within a period of 0.5s. Talking about consistent
> definition, I would say this consistently defines half of system wide
> wall-time on all configurations :)

This feels non-intuitive; suppose you have a non-homogeneous fleet of
systems. It is also difficult to express limits in terms of cores;
suppose I'm an admin trying to jail my users (maybe I rent out virtual
time a la EC2 for example). The fractions I have to use to represent
integer core amounts are going to become quite small on large systems.
For example, 1 core on a 64 core system would mean 3.905ms over a 250ms
period. Also, what's the dependency between your time and the current
cpuset topology? If I'm only allowed on half the system, does this
fraction then refer to global resources or to what I'm constrained to?
This seems a difficult data dependency to manage.

My (personal) ideology is that we are metering at the cpu level as
opposed to the system level -- which means N seconds of cpu-time makes
more sense to me. I feel it has advantages in that it can be
specified more directly relative to the period and is independent of
system partitioning.

I'd be interested to hear other opinions on this.

> If it means 2 CPUs worth wall-time
> on a 4 core machine, it would mean 4 CPUs' worth on an 8 CPU machine. At this
> point, I am inclined to go with this and let the admins/tools work out
> the actual CPUs part of it. However I would like to hear what others
> think about this interface.
>
>>
>> For future scalability as machine size grows this could potentially be
>> partitioned below the tg level along the boundaries of sched_domains (or
>> something similar). However for an initial draft given current machine sizes
>> the contention on the global pool should hopefully be fairly low.
>
> One of the alternatives I have in mind is to be more aggressive while
> borrowing. While keeping the current algorithm (of iterating thro' all
> CPUs when borrowing) intact, we could potentially borrow more from
> those CPUs which don't have any running task from the given group. I
> just experimented with borrowing half of the available runtime from
> such CPUs and found that number of iterations are greatly reduced and
> the source runtime quickly converges to its max possible value. Do you
> see any issues with this ?
>

I strongly believe that this is going to induce significant lock
contention and is not a scalable solution over time. While using a
faster converging series for time may help, I think there are
confounding factors that limit the effect here. Consider the 1 core on a
64 core system example above. With only 3.905ms per pool we are going
to quickly hit the case where we are borrowing not-useful periods of
time while thrashing locks.

We are in the midst of an implementation of the proposal above which
we'll have ready to post here for consideration next week. We have
maintained your existing approach with respect to handling throttled
entities and layered on top of that the proposed alternate
local/global bandwidth scheme. Initial tests show very promising
results!

Thank-you!

- Paul

> Thanks for your note.
>
> Regards,
> Bharata.
> --
> http://bharata.sulekha.com/blog/posts.htm, http://raobharata.wordpress.com/
>

2010-01-29 08:59:58

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC v5 PATCH 1/8] sched: Rename struct rt_bandwidth to sched_bandwidth

* Bharata B Rao <[email protected]> [2010-01-05 13:28:24]:

> sched: Rename struct rt_bandwidth to sched_bandwidth
>
> From: Dhaval Giani <[email protected]>
>
> Rename struct rt_bandwidth to sched_bandwidth and rename some of the
> routines to generic names (s/rt_/sched_) so that they can be used
> by CFS hard limits code in the subsequent patches.
>
> No functionality change by this patch.
>
> Signed-off-by: Dhaval Giani <[email protected]>
> Signed-off-by: Bharata B Rao <[email protected]>

Looks good, some nit picks below

Acked-by: Balbir Singh <[email protected]>


> ---
> kernel/sched.c | 127 ++++++++++++++++++++++++++---------------------------
> kernel/sched_rt.c | 46 ++++++++++---------
> 2 files changed, 86 insertions(+), 87 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index c535cc4..21cf0d5 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -139,50 +139,50 @@ struct rt_prio_array {
> struct list_head queue[MAX_RT_PRIO];
> };
>
> -struct rt_bandwidth {
> +struct sched_bandwidth {
> /* nests inside the rq lock: */
> - raw_spinlock_t rt_runtime_lock;
> - ktime_t rt_period;
> - u64 rt_runtime;
> - struct hrtimer rt_period_timer;
> + raw_spinlock_t runtime_lock;
> + ktime_t period;
> + u64 runtime;
> + struct hrtimer period_timer;
> };
>
> -static struct rt_bandwidth def_rt_bandwidth;
> +static struct sched_bandwidth def_rt_bandwidth;
>
> -static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun);
> +static int do_sched_rt_period_timer(struct sched_bandwidth *sched_b, int overrun);
>
> static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
> {
> - struct rt_bandwidth *rt_b =
> - container_of(timer, struct rt_bandwidth, rt_period_timer);
> + struct sched_bandwidth *sched_b =
> + container_of(timer, struct sched_bandwidth, period_timer);
> ktime_t now;
> int overrun;
> int idle = 0;
>
> for (;;) {
> now = hrtimer_cb_get_time(timer);
> - overrun = hrtimer_forward(timer, now, rt_b->rt_period);
> + overrun = hrtimer_forward(timer, now, sched_b->period);
>
> if (!overrun)
> break;
>
> - idle = do_sched_rt_period_timer(rt_b, overrun);
> + idle = do_sched_rt_period_timer(sched_b, overrun);
> }
>
> return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
> }
>
> -static
> -void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime)
> +static void init_sched_bandwidth(struct sched_bandwidth *sched_b, u64 period,
> + u64 runtime, enum hrtimer_restart (*period_timer)(struct hrtimer *))
> {
> - rt_b->rt_period = ns_to_ktime(period);
> - rt_b->rt_runtime = runtime;
> + sched_b->period = ns_to_ktime(period);
> + sched_b->runtime = runtime;
>
> - raw_spin_lock_init(&rt_b->rt_runtime_lock);
> + raw_spin_lock_init(&sched_b->runtime_lock);
>
> - hrtimer_init(&rt_b->rt_period_timer,
> + hrtimer_init(&sched_b->period_timer,
> CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> - rt_b->rt_period_timer.function = sched_rt_period_timer;
> + sched_b->period_timer.function = *period_timer;

Hmm.. maybe I'm forgetting the "C" language, but why do you dereference
the pointer before assignment? You should be able to directly assign a
function address to the function pointer. Did you see a warning?

> }
>
> static inline int rt_bandwidth_enabled(void)
> @@ -190,42 +190,40 @@ static inline int rt_bandwidth_enabled(void)
> return sysctl_sched_rt_runtime >= 0;
> }
>
> -static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
> +static void start_sched_bandwidth(struct sched_bandwidth *sched_b)
> {
> ktime_t now;
>
> - if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
> + if (!rt_bandwidth_enabled() || sched_b->runtime == RUNTIME_INF)
> return;
>
> - if (hrtimer_active(&rt_b->rt_period_timer))
> + if (hrtimer_active(&sched_b->period_timer))
> return;
>
> - raw_spin_lock(&rt_b->rt_runtime_lock);
> + raw_spin_lock(&sched_b->runtime_lock);

I don't quite understand why this is a raw_spin_lock

[snip]

--
Balbir

2010-01-29 14:08:27

by Bharata B Rao

[permalink] [raw]
Subject: Re: [RFC v5 PATCH 1/8] sched: Rename struct rt_bandwidth to sched_bandwidth

On Fri, Jan 29, 2010 at 02:29:49PM +0530, Balbir Singh wrote:
> * Bharata B Rao <[email protected]> [2010-01-05 13:28:24]:
>
> > sched: Rename struct rt_bandwidth to sched_bandwidth
> >
> > From: Dhaval Giani <[email protected]>
> >
> > Rename struct rt_bandwidth to sched_bandwidth and rename some of the
> > routines to generic names (s/rt_/sched_) so that they can be used
> > by CFS hard limits code in the subsequent patches.
> >
> > No functionality change by this patch.
> >
> > Signed-off-by: Dhaval Giani <[email protected]>
> > Signed-off-by: Bharata B Rao <[email protected]>
>
> Looks good, some nit picks below
>
> Acked-by: Balbir Singh <[email protected]>

Thanks Balbir.

>
>
> > ---
> > kernel/sched.c | 127 ++++++++++++++++++++++++++---------------------------
> > kernel/sched_rt.c | 46 ++++++++++---------
> > 2 files changed, 86 insertions(+), 87 deletions(-)
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index c535cc4..21cf0d5 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -139,50 +139,50 @@ struct rt_prio_array {
> > struct list_head queue[MAX_RT_PRIO];
> > };
> >
> > -struct rt_bandwidth {
> > +struct sched_bandwidth {
> > /* nests inside the rq lock: */
> > - raw_spinlock_t rt_runtime_lock;
> > - ktime_t rt_period;
> > - u64 rt_runtime;
> > - struct hrtimer rt_period_timer;
> > + raw_spinlock_t runtime_lock;
> > + ktime_t period;
> > + u64 runtime;
> > + struct hrtimer period_timer;
> > };
> >
> > -static struct rt_bandwidth def_rt_bandwidth;
> > +static struct sched_bandwidth def_rt_bandwidth;
> >
> > -static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun);
> > +static int do_sched_rt_period_timer(struct sched_bandwidth *sched_b, int overrun);
> >
> > static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
> > {
> > - struct rt_bandwidth *rt_b =
> > - container_of(timer, struct rt_bandwidth, rt_period_timer);
> > + struct sched_bandwidth *sched_b =
> > + container_of(timer, struct sched_bandwidth, period_timer);
> > ktime_t now;
> > int overrun;
> > int idle = 0;
> >
> > for (;;) {
> > now = hrtimer_cb_get_time(timer);
> > - overrun = hrtimer_forward(timer, now, rt_b->rt_period);
> > + overrun = hrtimer_forward(timer, now, sched_b->period);
> >
> > if (!overrun)
> > break;
> >
> > - idle = do_sched_rt_period_timer(rt_b, overrun);
> > + idle = do_sched_rt_period_timer(sched_b, overrun);
> > }
> >
> > return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
> > }
> >
> > -static
> > -void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime)
> > +static void init_sched_bandwidth(struct sched_bandwidth *sched_b, u64 period,
> > + u64 runtime, enum hrtimer_restart (*period_timer)(struct hrtimer *))
> > {
> > - rt_b->rt_period = ns_to_ktime(period);
> > - rt_b->rt_runtime = runtime;
> > + sched_b->period = ns_to_ktime(period);
> > + sched_b->runtime = runtime;
> >
> > - raw_spin_lock_init(&rt_b->rt_runtime_lock);
> > + raw_spin_lock_init(&sched_b->runtime_lock);
> >
> > - hrtimer_init(&rt_b->rt_period_timer,
> > + hrtimer_init(&sched_b->period_timer,
> > CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> > - rt_b->rt_period_timer.function = sched_rt_period_timer;
> > + sched_b->period_timer.function = *period_timer;
>
> Hmm.. maybe I'm forgetting the "C" language, but why do you dereference
> the pointer before assignment? You should be able to directly assign a
> function address to the function pointer. Did you see a warning?

This is a carry over from old patches. I will fix this
in the next iteration.

>
> > }
> >
> > static inline int rt_bandwidth_enabled(void)
> > @@ -190,42 +190,40 @@ static inline int rt_bandwidth_enabled(void)
> > return sysctl_sched_rt_runtime >= 0;
> > }
> >
> > -static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
> > +static void start_sched_bandwidth(struct sched_bandwidth *sched_b)
> > {
> > ktime_t now;
> >
> > - if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
> > + if (!rt_bandwidth_enabled() || sched_b->runtime == RUNTIME_INF)
> > return;
> >
> > - if (hrtimer_active(&rt_b->rt_period_timer))
> > + if (hrtimer_active(&sched_b->period_timer))
> > return;
> >
> > - raw_spin_lock(&rt_b->rt_runtime_lock);
> > + raw_spin_lock(&sched_b->runtime_lock);
>
> I don't quite understand why this is a raw_spin_lock

- When upgrading from v4 (2.6.32-rc6) to v5 (2.6.33-rc2), I needed
to change most of the spin_locks in hard limits code in sched.c
since they had become raw_spin_lock_t.
- If your question is why couldn't we use spin_lock_t (sleepable types),
this routine is called from RT and CFS with rq->lock (raw type) held.
So I guess it is not possible to use sleepable version here.

Thanks for your review. Would appreciate review of the other patches also :)

Regards,
Bharata.

2010-02-01 08:22:09

by Bharata B Rao

[permalink] [raw]
Subject: Re: [RFC v5 PATCH 0/8] CFS Hard limits - v5

On Thu, Jan 28, 2010 at 08:26:08PM -0800, Paul Turner wrote:
> On Thu, Jan 28, 2010 at 7:49 PM, Bharata B Rao <[email protected]> wrote:
> > On Sat, Jan 9, 2010 at 2:15 AM, Paul Turner <[email protected]> wrote:
> >>
> >> What are your thoughts on using a separate mechanism for the general case? A
> >> draft proposal follows:
> >>
> >> - Maintain a global run-time pool for each tg. The runtime specified by the
> >> user represents the value that this pool will be refilled to each period.
> >> - We continue to maintain the local notion of runtime/period in each cfs_rq,
> >> continue to accumulate locally here.
> >>
> >> Upon locally exceeding the period acquire new credit from the global pool
> >> (either under lock or more likely using atomic ops). This can either be in
> >> fixed steppings (e.g. 10ms, could be tunable) or following some quasi-curve
> >> variant with historical demand.
> >>
> >> One caveat here is that there is some over-commit in the system, the local
> >> differences of runtime vs period represent additional over the global pool.
> >> However it should not be possible to consistently exceed limits since the rate
> >> of refill is gated by the runtime being input into the system via the per-tg
> >> pool.
> >>
> >
> > We borrow from what is actually available as spare (spare = unused or
> > remaining). With global pool, I see that would be difficult.
> > Inability/difficulty in keeping the global pool in sync with the
> > actual available spare time is the reason for over-commit ?
> >
>
> We maintain two pools, a global pool (new) and a per-cfs_rq pool
> (similar to existing rt_bw).
>
> When consuming time you charge vs your local bandwidth until it is
> expired, at this point you must either refill from the global pool, or
> throttle.
>
> The "slack" in the system is the sum of unconsumed time in local pools
> from the *previous* global pool refill. This is bounded above by the
> size of time you refill a local pool at each expiry. We call the size
> of refill a 'slice'.
>
> e.g.
>
> Task limit of 50ms, slice=10ms, 4cpus, period of 500ms
>
> Task A runs on cpus 0 and 1 for 5ms each, then blocks.
>
> When A first executes on each cpu we take slice=10ms from the global
> pool of 50ms and apply it to the local rq. Execution then proceeds vs
> local pool.
>
> Current state is: 5 ms in local pools on {0,1}, 30ms remaining in global pool
>
> Upon period expiration we issue a global pool refill. At this point we have:
> 5 ms in local pools on {0,1}, 50ms remaining in global pool.
>
> That 10ms of slack time is over-commit in the system. However it
> should be clear that this can only be a local effect since over any
> period of time the rate of input into the system is limited by global
> pool refill rate.

With the same setup as above, consider 5 such tasks which block after
consuming 5ms each. So now we have 25ms of slack time. In the next bandwidth
period, if 5 cpu hogs start running, they would consume this 25ms and the
50ms from this period. So we gave 50% extra to a group in a bandwidth period.
Just wondering how common such scenarios could be.

>
> There are also some strategies that we are exploring to improve
> behavior further here. One idea is that if we maintain a generation
> counter then on voluntary dequeue (e.g. tasks block) we can return
> local time to the global period pool or expire it (if generations
> don't match), this greatly reduces the noise (via slack vs ideal
> limit) that a bursty application can induce.

Why not clear the remaining runtime during bandwidth refresh ?

>
> >> This would also naturally associate with an interface change that would mean the
> >> runtime limit for a group would be the effective cpurate within the period.
> >>
> >> e.g. by setting a runtime of 200000us on a 100000us period it would effectively
> >> allow you to use 2 cpus worth of wall-time on a multicore system.
> >>
> >> I feel this is slightly more natural than the current definition which due to
> >> being local means that values set will not result in consistent behavior across
> >> machines of different core counts. It also has the benefit of being consistent
> >> with observed exports of time consumed, e.g. rusage, (indirectly) time, etc.
> >
> > Though runtimes are enforced locally per-cpu, that's only the
> > implementation. The definition of runtime and period is still
> > system-wide/global. A runtime/period=0.25/0.5 will mean 0.25s of
> > system wide runtime within a period of 0.5s. Talking about consistent
> > definition, I would say this consistently defines half of system wide
> > wall-time on all configurations :)
>
> This feels non-intuitive suppose you have a non-homogeneous fleet of
> systems. It is also difficult to express limits in terms of cores,
> suppose I'm an admin trying to jail my users (maybe I rent out virtual
> time ala EC2 for example). The fractions I have to use to represent
> integer core amounts are going to become quite small on large systems.
> For example, 1 core on a 64 core system would mean 3.905ms/250ms
> period. What's the dependency here between your time and the current
> cpuset topology also, if I'm only allowed on half the system does this
> fraction then refer to global resources or what I'm constrained to?
> This seems a difficult data dependency to manage.
>
> My (personal) ideology is that we are metering at the cpu level as
> opposed to the system level -- which means N seconds of cpu-time makes
> more sense to me. I feel it has advantages in that it can be
> specified more directly relative to the period and is independent of
> system partitioning.
>
> I'd be interested to hear other opinions on this.

We need a consensus here; I will wait and see what others think about this.

>
> > If it means 2 CPUs worth wall-time
> > in 4 core machine, it would mean 4 CPUs on a 8 CPU machine. At this
> > point, I am inclined to go with this and let the admins/tools work out
> > the actual CPUs part of it. However I would like to hear what others
> > think about this interface.
> >
> >>
> >> For future scalability as machine size grows this could potentially be
> >> partitioned below the tg level along the boundaries of sched_domains (or
> >> something similar). However for an initial draft given current machine sizes
> >> the contention on the global pool should hopefully be fairly low.
> >
> > One of the alternatives I have in mind is to be more aggressive while
> > borrowing. While keeping the current algorithm (of iterating thro' all
> > CPUs when borrowing) intact, we could potentially borrow more from
> > those CPUs which don't have any running task from the given group. I
> > just experimented with borrowing half of the available runtime from
> > such CPUs and found that number of iterations are greatly reduced and
> > the source runtime quickly converges to its max possible value. Do you
> > see any issues with this ?
> >
>
> I strongly believe that this is going to induce significant lock
> contention and is not a scalable solution over time. While using a
> faster converging series for time may help I think there are
> confounding factors that limit effect here. Consider the 1 core on a
> 64 core system example above. With only 3.905ms per pool we are going
> to quickly hit the case where we are borrowing not-useful periods of
> time while thrashing locks.
>
> We are in the midst of an implementation of the proposal above, which
> we'll have ready to post here for consideration next week. We have
> maintained your existing approach with respect to handling throttled
> entities and layered on top of that the proposed alternate
> local/global bandwidth scheme. Initial tests show very promising
> results!

Nice. Look forward to your patches.

Regards,
Bharata.

2010-02-01 11:04:11

by Paul Turner

[permalink] [raw]
Subject: Re: [RFC v5 PATCH 0/8] CFS Hard limits - v5

On Mon, Feb 1, 2010 at 12:21 AM, Bharata B Rao
<[email protected]> wrote:
> On Thu, Jan 28, 2010 at 08:26:08PM -0800, Paul Turner wrote:
>> On Thu, Jan 28, 2010 at 7:49 PM, Bharata B Rao <[email protected]> wrote:
>> > On Sat, Jan 9, 2010 at 2:15 AM, Paul Turner <[email protected]> wrote:
>> >>
>> >> What are your thoughts on using a separate mechanism for the general case. A
>> >> draft proposal follows:
>> >>
>> >> - Maintain a global run-time pool for each tg. The runtime specified by the
>> >>   user represents the value that this pool will be refilled to each period.
>> >> - We continue to maintain the local notion of runtime/period in each cfs_rq,
>> >>   continue to accumulate locally here.
>> >>
>> >> Upon locally exceeding the period acquire new credit from the global pool
>> >> (either under lock or more likely using atomic ops). This can either be in
>> >> fixed steppings (e.g. 10ms, could be tunable) or following some quasi-curve
>> >> variant with historical demand.
>> >>
>> >> One caveat here is that there is some over-commit in the system, the local
>> >> differences of runtime vs period represent additional over the global pool.
>> >> However it should not be possible to consistently exceed limits since the rate
>> >> of refill is gated by the runtime being input into the system via the per-tg
>> >> pool.
>> >>
>> >
>> > We borrow from what is actually available as spare (spare = unused or
>> > remaining). With global pool, I see that would be difficult.
>> > Inability/difficulty in keeping the global pool in sync with the
>> > actual available spare time is the reason for over-commit ?
>> >
>>
>> We maintain two pools, a global pool (new) and a per-cfs_rq pool
>> (similar to existing rt_bw).
>>
>> When consuming time you charge vs your local bandwidth until it is
>> expired, at this point you must either refill from the global pool, or
>> throttle.
>>
>> The "slack" in the system is the sum of unconsumed time in local pools
>> from the *previous* global pool refill. This is bounded above by the
>> size of time you refill a local pool at each expiry. We call the size
>> of refill a 'slice'.
>>
>> e.g.
>>
>> Task limit of 50ms, slice=10ms, 4cpus, period of 500ms
>>
>> Task A runs on cpus 0 and 1 for 5ms each, then blocks.
>>
>> When A first executes on each cpu we take slice=10ms from the global
>> pool of 50ms and apply it to the local rq. Execution then proceeds vs
>> local pool.
>>
>> Current state is: 5 ms in local pools on {0,1}, 30ms remaining in global pool
>>
>> Upon period expiration we issue a global pool refill. At this point we have:
>> 5 ms in local pools on {0,1}, 50ms remaining in global pool.
>>
>> That 10ms of slack time is over-commit in the system. However it
>> should be clear that this can only be a local effect since over any
>> period of time the rate of input into the system is limited by global
>> pool refill rate.
>
> With the same setup as above consider 5 such tasks which block after
> consuming 5ms each. So now we have 25ms slack time. In the next bandwidth
> period if 5 cpu hogs start running and they would consume this 25ms and the
> 50ms from this period. So we gave 50% extra to a group in a bandwidth period.
>> Just wondering how common such scenarios could be.
>

Yes, within a single given period you may exceed your reservation due
to slack. However, of note is that across any 2 successive periods you
are guaranteed to be within your reservation, i.e. usage over 2 periods
<= 2*runtime, since any available slack means that you under-consumed
the previous period.

For those needing a hard guarantee (independent of amelioration
strategies), halving the configured period would then provide this
guarantee across their target period with the basic v1 implementation.

>>
>> There are also some strategies that we are exploring to improve
>> behavior further here. One idea is that if we maintain a generation
>> counter then on voluntary dequeue (e.g. tasks block) we can return
>> local time to the global period pool or expire it (if generations
>> don't match), this greatly reduces the noise (via slack vs ideal
>> limit) that a bursty application can induce.
>
> Why not clear the remaining runtime during bandwidth refresh ?
>

This doesn't scale with (many cpus) * (many bandwidth limits).

It is still my intuition that introducing a generation counter for
quota would allow us to handle this gracefully, as it would allow
both invalidation of expired slack time and the (potentially
fractional) return of local quota to the global pool within a
period (upon block/voluntary yield).

Such tactics are probably better suited to a v2, as there are both
other possible approaches and the matter of added complexity to the
basic design. However, if it works well it might sneak into v1 ;)

>>
>> >> This would also naturally associate with an interface change that would mean the
>> >> runtime limit for a group would be the effective cpurate within the period.
>> >>
>> >> e.g. by setting a runtime of 200000us on a 100000us period it would effectively
>> >> allow you to use 2 cpus worth of wall-time on a multicore system.
>> >>
>> >> I feel this is slightly more natural than the current definition which due to
>> >> being local means that values set will not result in consistent behavior across
>> >> machines of different core counts. It also has the benefit of being consistent
>> >> with observed exports of time consumed, e.g. rusage, (indirectly) time, etc.
>> >
>> > Though runtimes are enforced locally per-cpu, that's only the
>> > implementation. The definition of runtime and period is still
>> > system-wide/global. A runtime/period=0.25/0.5 will mean 0.25s of
>> > system wide runtime within a period of 0.5s. Talking about consistent
>> > definition, I would say this consistently defines half of system wide
>> > wall-time on all configurations :)
>>
>> This feels non-intuitive suppose you have a non-homogeneous fleet of
>> systems. It is also difficult to express limits in terms of cores,
>> suppose I'm an admin trying to jail my users (maybe I rent out virtual
>> time ala EC2 for example). The fractions I have to use to represent
>> integer core amounts are going to become quite small on large systems.
>> For example, 1 core on a 64 core system would mean 3.905ms/250ms
>> period. What's the dependency here between your time and the current
>> cpuset topology also, if I'm only allowed on half the system does this
>> fraction then refer to global resources or what I'm constrained to?
>> This seems a difficult data dependency to manage.
>>
>> My (personal) ideology is that we are metering at the cpu level as
>> opposed to the system level -- which means N seconds of cpu-time makes
>> more sense to me. I feel it has advantages in that it can be
>> specified more directly relative to the period and is independent of
>> system partitioning.
>>
>> I'd be interested to hear other opinions on this.
>
> We need a consensus here, will wait to see what others think about this.
>
>>
>> > If it means 2 CPUs worth wall-time
>> > in 4 core machine, it would mean 4 CPUs on a 8 CPU machine. At this
>> > point, I am inclined to go with this and let the admins/tools work out
>> > the actual CPUs part of it. However I would like to hear what others
>> > think about this interface.
>> >
>> >>
>> >> For future scalability as machine size grows this could potentially be
>> >> partitioned below the tg level along the boundaries of sched_domains (or
>> >> something similar). However for an initial draft given current machine sizes
>> >> the contention on the global pool should hopefully be fairly low.
>> >
>> > One of the alternatives I have in mind is to be more aggressive while
>> > borrowing. While keeping the current algorithm (of iterating thro' all
>> > CPUs when borrowing) intact, we could potentially borrow more from
>> > those CPUs which don't have any running task from the given group. I
>> > just experimented with borrowing half of the available runtime from
>> > such CPUs and found that number of iterations are greatly reduced and
>> > the source runtime quickly converges to its max possible value. Do you
>> > see any issues with this ?
>> >
>>
>> I strongly believe that this is going to induce significant lock
>> contention and is not a scalable solution over time. While using a
>> faster converging series for time may help I think there are
>> confounding factors that limit effect here. Consider the 1 core on a
>> 64 core system example above. With only 3.905ms per pool we are going
>> to quickly hit the case where we are borrowing not-useful periods of
>> time while thrashing locks.
>>
>> We are in the midst of an implementation for proposal above which
>> we'll have ready post to here for consideration next week. We have
>> maintained your existing approach with respect to handling throttled
>> entities and layered on top of that the proposed alternate
>> local/global bandwidth scheme. Initial tests show very promising
>> results!
>
> Nice. Look forward to your patches.
>
> Regards,
> Bharata.
>

2010-02-01 18:25:22

by Paul Turner

[permalink] [raw]
Subject: Re: [RFC v5 PATCH 0/8] CFS Hard limits - v5

On Mon, Feb 1, 2010 at 3:04 AM, Paul Turner <[email protected]> wrote:
> On Mon, Feb 1, 2010 at 12:21 AM, Bharata B Rao
> <[email protected]> wrote:
>> On Thu, Jan 28, 2010 at 08:26:08PM -0800, Paul Turner wrote:
>>> On Thu, Jan 28, 2010 at 7:49 PM, Bharata B Rao <[email protected]> wrote:
>>> > On Sat, Jan 9, 2010 at 2:15 AM, Paul Turner <[email protected]> wrote:
>>> >>
>>> >> What are your thoughts on using a separate mechanism for the general case. A
>>> >> draft proposal follows:
>>> >>
>>> >> - Maintain a global run-time pool for each tg. The runtime specified by the
>>> >>   user represents the value that this pool will be refilled to each period.
>>> >> - We continue to maintain the local notion of runtime/period in each cfs_rq,
>>> >>   continue to accumulate locally here.
>>> >>
>>> >> Upon locally exceeding the period acquire new credit from the global pool
>>> >> (either under lock or more likely using atomic ops). This can either be in
>>> >> fixed steppings (e.g. 10ms, could be tunable) or following some quasi-curve
>>> >> variant with historical demand.
>>> >>
>>> >> One caveat here is that there is some over-commit in the system, the local
>>> >> differences of runtime vs period represent additional over the global pool.
>>> >> However it should not be possible to consistently exceed limits since the rate
>>> >> of refill is gated by the runtime being input into the system via the per-tg
>>> >> pool.
>>> >>
>>> >
>>> > We borrow from what is actually available as spare (spare = unused or
>>> > remaining). With global pool, I see that would be difficult.
>>> > Inability/difficulty in keeping the global pool in sync with the
>>> > actual available spare time is the reason for over-commit ?
>>> >
>>>
>>> We maintain two pools, a global pool (new) and a per-cfs_rq pool
>>> (similar to existing rt_bw).
>>>
>>> When consuming time you charge vs your local bandwidth until it is
>>> expired, at this point you must either refill from the global pool, or
>>> throttle.
>>>
>>> The "slack" in the system is the sum of unconsumed time in local pools
>>> from the *previous* global pool refill. This is bounded above by the
>>> size of time you refill a local pool at each expiry. We call the size
>>> of refill a 'slice'.
>>>
>>> e.g.
>>>
>>> Task limit of 50ms, slice=10ms, 4cpus, period of 500ms
>>>
>>> Task A runs on cpus 0 and 1 for 5ms each, then blocks.
>>>
>>> When A first executes on each cpu we take slice=10ms from the global
>>> pool of 50ms and apply it to the local rq. Execution then proceeds vs
>>> local pool.
>>>
>>> Current state is: 5 ms in local pools on {0,1}, 30ms remaining in global pool
>>>
>>> Upon period expiration we issue a global pool refill. At this point we have:
>>> 5 ms in local pools on {0,1}, 50ms remaining in global pool.
>>>
>>> That 10ms of slack time is over-commit in the system. However it
>>> should be clear that this can only be a local effect since over any
>>> period of time the rate of input into the system is limited by global
>>> pool refill rate.
>>
>> With the same setup as above consider 5 such tasks which block after
>> consuming 5ms each. So now we have 25ms slack time. In the next bandwidth
>> period if 5 cpu hogs start running and they would consume this 25ms and the
>> 50ms from this period. So we gave 50% extra to a group in a bandwidth period.
>> Just wondering how common such scenarios could be.
>>
>
> Yes within a single given period you may exceed your reservation due
> to slack. However, of note is that across any 2 successive periods
> you are guaranteed to be within your reservation, i.e. 2*usage <=
> 2*period, as slack available means that you under-consumed your
> previous period.
>
> For those needing a hard guarantee (independent of amelioration
> strategies) halving the period provided would then provide this across
> their target period with the basic v1 implementation.
>

Actually now that I think about it, this observation only holds when
the slack is consumed within the second of the two periods. It should
be restated something like:

for any n contiguous periods your maximum usage is n*runtime +
nr_cpus*slice; note that the slack term is constant and is dominated
for any observation window spanning several periods.
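
Plugging in the example values used earlier in the thread (50ms runtime,
10ms slice, 4 cpus; the choice of values is just for illustration), a
throwaway calculation shows how quickly the constant slack term is
dominated:

#include <stdio.h>

int main(void)
{
        /* Example values from earlier in the thread; purely illustrative. */
        const double runtime = 50.0, slice = 10.0, nr_cpus = 4.0;
        int n;

        for (n = 1; n <= 100; n *= 10) {
                double max_usage = n * runtime + nr_cpus * slice;

                printf("n=%3d periods: max usage %.0fms of %.0fms quota (%.1f%%)\n",
                       n, max_usage, n * runtime,
                       100.0 * max_usage / (n * runtime));
        }
        return 0;
}

This prints 180% for a single period, 108% over 10 periods and 100.8%
over 100 periods.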

>>>
>>> There are also some strategies that we are exploring to improve
>>> behavior further here. One idea is that if we maintain a generation
>>> counter then on voluntary dequeue (e.g. tasks block) we can return
>>> local time to the global period pool or expire it (if generations
>>> don't match), this greatly reduces the noise (via slack vs ideal
>>> limit) that a bursty application can induce.
>>
>> Why not clear the remaining runtime during bandwidth refresh ?
>>
>
> This doesn't scale with (many cpus) * (many bandwidth limits).
>
> It is still my intuition that introducing a generation counter for
> quota would allow us to handle this gracefully as that would allow
> both invalidation of expired slack time and for the (potentially
> fractional) returning of local quota to the global pool within a
> period (upon block/voluntary yield).
>
> Such tactics would probably seem better suited for a v2 as there are
> both other possible approaches and the matter of added complexity to
> the basic design. However, if it works well it might sneak into v1 ;)
>
>>>
>>> >> This would also naturally associate with an interface change that would mean the
>>> >> runtime limit for a group would be the effective cpurate within the period.
>>> >>
>>> >> e.g. by setting a runtime of 200000us on a 100000us period it would effectively
>>> >> allow you to use 2 cpus worth of wall-time on a multicore system.
>>> >>
>>> >> I feel this is slightly more natural than the current definition which due to
>>> >> being local means that values set will not result in consistent behavior across
>>> >> machines of different core counts. It also has the benefit of being consistent
>>> >> with observed exports of time consumed, e.g. rusage, (indirectly) time, etc.
>>> >
>>> > Though runtimes are enforced locally per-cpu, that's only the
>>> > implementation. The definition of runtime and period is still
>>> > system-wide/global. A runtime/period=0.25/0.5 will mean 0.25s of
>>> > system wide runtime within a period of 0.5s. Talking about consistent
>>> > definition, I would say this consistently defines half of system wide
>>> > wall-time on all configurations :)
>>>
>>> This feels non-intuitive suppose you have a non-homogeneous fleet of
>>> systems. It is also difficult to express limits in terms of cores,
>>> suppose I'm an admin trying to jail my users (maybe I rent out virtual
>>> time ala EC2 for example). The fractions I have to use to represent
>>> integer core amounts are going to become quite small on large systems.
>>> For example, 1 core on a 64 core system would mean 3.905ms/250ms
>>> period. What's the dependency here between your time and the current
>>> cpuset topology also, if I'm only allowed on half the system does this
>>> fraction then refer to global resources or what I'm constrained to?
>>> This seems a difficult data dependency to manage.
>>>
>>> My (personal) ideology is that we are metering at the cpu level as
>>> opposed to the system level -- which means N seconds of cpu-time makes
>>> more sense to me. I feel it has advantages in that it can be
>>> specified more directly relative to the period and is independent of
>>> system partitioning.
>>>
>>> I'd be interested to hear other opinions on this.
>>
>> We need a consensus here, will wait to see what others think about this.
>>
>>>
>>> > If it means 2 CPUs worth wall-time
>>> > in 4 core machine, it would mean 4 CPUs on a 8 CPU machine. At this
>>> > point, I am inclined to go with this and let the admins/tools work out
>>> > the actual CPUs part of it. However I would like to hear what others
>>> > think about this interface.
>>> >
>>> >>
>>> >> For future scalability as machine size grows this could potentially be
>>> >> partitioned below the tg level along the boundaries of sched_domains (or
>>> >> something similar). However for an initial draft given current machine sizes
>>> >> the contention on the global pool should hopefully be fairly low.
>>> >
>>> > One of the alternatives I have in mind is to be more aggressive while
>>> > borrowing. While keeping the current algorithm (of iterating thro' all
>>> > CPUs when borrowing) intact, we could potentially borrow more from
>>> > those CPUs which don't have any running task from the given group. I
>>> > just experimented with borrowing half of the available runtime from
>>> > such CPUs and found that number of iterations are greatly reduced and
>>> > the source runtime quickly converges to its max possible value. Do you
>>> > see any issues with this ?
>>> >
>>>
>>> I strongly believe that this is going to induce significant lock
>>> contention and is not a scalable solution over time. While using a
>>> faster converging series for time may help I think there are
>>> confounding factors that limit effect here. Consider the 1 core on a
>>> 64 core system example above. With only 3.905ms per pool we are going
>>> to quickly hit the case where we are borrowing not-useful periods of
>>> time while thrashing locks.
>>>
>>> We are in the midst of an implementation for proposal above which
>>> we'll have ready post to here for consideration next week. We have
>>> maintained your existing approach with respect to handling throttled
>>> entities and layered on top of that the proposed alternate
>>> local/global bandwidth scheme. Initial tests show very promising
>>> results!
>>
>> Nice. Look forward to your patches.
>>
>> Regards,
>> Bharata.
>>
>

2010-02-02 04:15:06

by Bharata B Rao

[permalink] [raw]
Subject: Re: [RFC v5 PATCH 0/8] CFS Hard limits - v5

On Mon, Feb 01, 2010 at 10:25:11AM -0800, Paul Turner wrote:
> On Mon, Feb 1, 2010 at 3:04 AM, Paul Turner <[email protected]> wrote:
> > On Mon, Feb 1, 2010 at 12:21 AM, Bharata B Rao
> > <[email protected]> wrote:
> >> On Thu, Jan 28, 2010 at 08:26:08PM -0800, Paul Turner wrote:
> >>> On Thu, Jan 28, 2010 at 7:49 PM, Bharata B Rao <[email protected]> wrote:
> >>> > On Sat, Jan 9, 2010 at 2:15 AM, Paul Turner <[email protected]> wrote:
> >>> >>
> >>> >> What are your thoughts on using a separate mechanism for the general case. A
> >>> >> draft proposal follows:
> >>> >>
> >>> >> - Maintain a global run-time pool for each tg. The runtime specified by the
> >>> >>   user represents the value that this pool will be refilled to each period.
> >>> >> - We continue to maintain the local notion of runtime/period in each cfs_rq,
> >>> >>   continue to accumulate locally here.
> >>> >>
> >>> >> Upon locally exceeding the period acquire new credit from the global pool
> >>> >> (either under lock or more likely using atomic ops). This can either be in
> >>> >> fixed steppings (e.g. 10ms, could be tunable) or following some quasi-curve
> >>> >> variant with historical demand.
> >>> >>
> >>> >> One caveat here is that there is some over-commit in the system, the local
> >>> >> differences of runtime vs period represent additional over the global pool.
> >>> >> However it should not be possible to consistently exceed limits since the rate
> >>> >> of refill is gated by the runtime being input into the system via the per-tg
> >>> >> pool.
> >>> >>
> >>> >
> >>> > We borrow from what is actually available as spare (spare = unused or
> >>> > remaining). With global pool, I see that would be difficult.
> >>> > Inability/difficulty in keeping the global pool in sync with the
> >>> > actual available spare time is the reason for over-commit ?
> >>> >
> >>>
> >>> We maintain two pools, a global pool (new) and a per-cfs_rq pool
> >>> (similar to existing rt_bw).
> >>>
> >>> When consuming time you charge vs your local bandwidth until it is
> >>> expired, at this point you must either refill from the global pool, or
> >>> throttle.
> >>>
> >>> The "slack" in the system is the sum of unconsumed time in local pools
> >>> from the *previous* global pool refill. This is bounded above by the
> >>> size of time you refill a local pool at each expiry. We call the size
> >>> of refill a 'slice'.
> >>>
> >>> e.g.
> >>>
> >>> Task limit of 50ms, slice=10ms, 4cpus, period of 500ms
> >>>
> >>> Task A runs on cpus 0 and 1 for 5ms each, then blocks.
> >>>
> >>> When A first executes on each cpu we take slice=10ms from the global
> >>> pool of 50ms and apply it to the local rq. Execution then proceeds vs
> >>> local pool.
> >>>
> >>> Current state is: 5 ms in local pools on {0,1}, 30ms remaining in global pool
> >>>
> >>> Upon period expiration we issue a global pool refill. At this point we have:
> >>> 5 ms in local pools on {0,1}, 50ms remaining in global pool.
> >>>
> >>> That 10ms of slack time is over-commit in the system. However it
> >>> should be clear that this can only be a local effect since over any
> >>> period of time the rate of input into the system is limited by global
> >>> pool refill rate.
> >>
> >> With the same setup as above consider 5 such tasks which block after
> >> consuming 5ms each. So now we have 25ms slack time. In the next bandwidth
> >> period if 5 cpu hogs start running and they would consume this 25ms and the
> >> 50ms from this period. So we gave 50% extra to a group in a bandwidth period.
> >> Just wondering how common such scenarios could be.
> >>
> >
> > Yes within a single given period you may exceed your reservation due
> > to slack. However, of note is that across any 2 successive periods
> > you are guaranteed to be within your reservation, i.e. 2*usage <=
> > 2*period, as slack available means that you under-consumed your
> > previous period.
> >
> > For those needing a hard guarantee (independent of amelioration
> > strategies) halving the period provided would then provide this across
> > their target period with the basic v1 implementation.
> >
>
> Actually now that I think about it, this observation only holds when
> the slack is consumed within the second of the two periods. It should
> be restated something like:
>
> for any n contiguous periods your maximum usage is n*runtime +
> nr_cpus*slice, note the slack term is constant and is dominated for
> any observation window involving several periods

Ok. We are talking about 'hard limits' here, and it looks like there
is a theoretical possibility of exceeding the limit often. We need to
understand how good/bad this is in real life.

Regards,
Bharata.

2010-02-02 07:13:53

by Paul Turner

[permalink] [raw]
Subject: Re: [RFC v5 PATCH 0/8] CFS Hard limits - v5

On Mon, Feb 1, 2010 at 8:14 PM, Bharata B Rao
<[email protected]> wrote:
> On Mon, Feb 01, 2010 at 10:25:11AM -0800, Paul Turner wrote:
>> On Mon, Feb 1, 2010 at 3:04 AM, Paul Turner <[email protected]> wrote:
>> > On Mon, Feb 1, 2010 at 12:21 AM, Bharata B Rao
>> > <[email protected]> wrote:
>> >> On Thu, Jan 28, 2010 at 08:26:08PM -0800, Paul Turner wrote:
>> >>> On Thu, Jan 28, 2010 at 7:49 PM, Bharata B Rao <[email protected]> wrote:
>> >>> > On Sat, Jan 9, 2010 at 2:15 AM, Paul Turner <[email protected]> wrote:
>> >>> >>
>> >>> >> What are your thoughts on using a separate mechanism for the general case. A
>> >>> >> draft proposal follows:
>> >>> >>
>> >>> >> - Maintain a global run-time pool for each tg. The runtime specified by the
>> >>> >>   user represents the value that this pool will be refilled to each period.
>> >>> >> - We continue to maintain the local notion of runtime/period in each cfs_rq,
>> >>> >>   continue to accumulate locally here.
>> >>> >>
>> >>> >> Upon locally exceeding the period acquire new credit from the global pool
>> >>> >> (either under lock or more likely using atomic ops). This can either be in
>> >>> >> fixed steppings (e.g. 10ms, could be tunable) or following some quasi-curve
>> >>> >> variant with historical demand.
>> >>> >>
>> >>> >> One caveat here is that there is some over-commit in the system, the local
>> >>> >> differences of runtime vs period represent additional over the global pool.
>> >>> >> However it should not be possible to consistently exceed limits since the rate
>> >>> >> of refill is gated by the runtime being input into the system via the per-tg
>> >>> >> pool.
>> >>> >>
>> >>> >
>> >>> > We borrow from what is actually available as spare (spare = unused or
>> >>> > remaining). With global pool, I see that would be difficult.
>> >>> > Inability/difficulty in keeping the global pool in sync with the
>> >>> > actual available spare time is the reason for over-commit ?
>> >>> >
>> >>>
>> >>> We maintain two pools, a global pool (new) and a per-cfs_rq pool
>> >>> (similar to existing rt_bw).
>> >>>
>> >>> When consuming time you charge vs your local bandwidth until it is
>> >>> expired, at this point you must either refill from the global pool, or
>> >>> throttle.
>> >>>
>> >>> The "slack" in the system is the sum of unconsumed time in local pools
>> >>> from the *previous* global pool refill. This is bounded above by the
>> >>> size of time you refill a local pool at each expiry. We call the size
>> >>> of refill a 'slice'.
>> >>>
>> >>> e.g.
>> >>>
>> >>> Task limit of 50ms, slice=10ms, 4cpus, period of 500ms
>> >>>
>> >>> Task A runs on cpus 0 and 1 for 5ms each, then blocks.
>> >>>
>> >>> When A first executes on each cpu we take slice=10ms from the global
>> >>> pool of 50ms and apply it to the local rq. Execution then proceeds vs
>> >>> local pool.
>> >>>
>> >>> Current state is: 5 ms in local pools on {0,1}, 30ms remaining in global pool
>> >>>
>> >>> Upon period expiration we issue a global pool refill. At this point we have:
>> >>> 5 ms in local pools on {0,1}, 50ms remaining in global pool.
>> >>>
>> >>> That 10ms of slack time is over-commit in the system. However it
>> >>> should be clear that this can only be a local effect since over any
>> >>> period of time the rate of input into the system is limited by global
>> >>> pool refill rate.
>> >>
>> >> With the same setup as above consider 5 such tasks which block after
>> >> consuming 5ms each. So now we have 25ms slack time. In the next bandwidth
>> >> period if 5 cpu hogs start running and they would consume this 25ms and the
>> >> 50ms from this period. So we gave 50% extra to a group in a bandwidth period.
>> >> Just wondering how common such scenarious could be.
>> >>
>> >
>> > Yes within a single given period you may exceed your reservation due
>> > to slack. However, of note is that across any 2 successive periods
>> > you are guaranteed to be within your reservation, i.e. 2*usage <=
>> > 2*period, as slack available means that you under-consumed your
>> > previous period.
>> >
>> > For those needing a hard guarantee (independent of amelioration
>> > strategies) halving the period provided would then provide this across
>> > their target period with the basic v1 implementation.
>> >
>>
>> Actually now that I think about it, this observation only holds when
>> the slack is consumed within the second of the two periods. It should
>> be restated something like:
>>
>> for any n contiguous periods your maximum usage is n*runtime +
>> nr_cpus*slice, note the slack term is constant and is dominated for
>> any observation window involving several periods
>
> Ok. We are talking about 'hard limits' here and looks like there is
> a theoretical possibility of exceeding the limit often. Need to understand
> how good/bad this is in real life.
>

I'm really glad that we're having this discussion; however, it feels
like we are rat-holing a little on the handling of slack. As an issue
it seems a non-starter, since if necessary the slack can be scrubbed
with the same overhead as the original proposal for period refresh
(i.e. remove slack from each cpu), and there are also trivially faster
ways to zero it without this O(n) overhead.

I feel it would be more productive to focus discussion on other areas
of design such as load balancer interaction, distribution from global
to local pools, entity placement and fairness on wakeup, etc.
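
For concreteness, here is a toy contrast of the two ways of killing
leftover slack at period refresh that I am alluding to above
(self-contained illustration with invented names; not actual scheduler
code):

#include <stdio.h>

#define NR_CPUS 64

/* Invented structure, for illustration only. */
struct tg_sketch {
        int local_runtime[NR_CPUS];     /* leftover slack per cpu, in ms */
        unsigned int generation;        /* cf. the generation idea above */
};

/* O(NR_CPUS): explicitly walk every cpu and drop its leftover runtime
 * (same cost as a per-cpu period refresh).
 */
static void scrub_slack(struct tg_sketch *tg)
{
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
                tg->local_runtime[cpu] = 0;
}

/* O(1): bump the generation; stale per-cpu runtime is simply ignored
 * the next time a cpu tries to use or return it.
 */
static void expire_slack(struct tg_sketch *tg)
{
        tg->generation++;
}

int main(void)
{
        struct tg_sketch tg = { .local_runtime = { 5, 5 }, .generation = 0 };

        scrub_slack(&tg);       /* the O(n) variant */
        expire_slack(&tg);      /* the cheaper variant */
        printf("generation is now %u\n", tg.generation);
        return 0;
}

scrub_slack() is the O(n) per-cpu walk; expire_slack() gets the same
effect lazily in O(1).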

> Regards,
> Bharata.
>

2010-02-02 07:57:40

by Bharata B Rao

[permalink] [raw]
Subject: Re: [RFC v5 PATCH 0/8] CFS Hard limits - v5

On Mon, Feb 01, 2010 at 11:13:42PM -0800, Paul Turner wrote:
>
> I'm really glad that we're having this discussion however it feels
> like we are rat-holing a little on the handling of slack? As an issue

That's because runtime rebalancing was one of the main issues you
brought up.

> it seems a non-starter since if necessary it can be scrubbed with the
> same overhead as the original proposal for period refresh (i.e. remove
> slack from each cpu). There are also trivial faster ways to zero it
> without this O(n) overhead.
>
> I feel it would be more productive to focus discussion on other areas
> of design such as load balancer interaction, distribution from global
> to local pools, entity placement and fairness on wakeup, etc.

Sure, as I said, I am looking forward to your patches!
Discussions/reviews on those aspects were hardly forthcoming during
my earlier posts. Hopefully we will see more reviews and discussion
in the future.

>
> > Regards,
> > Bharata.
> >