[sorry for the double post of 0/7 mail, I missed the CC list earlier]
Hi,
Here is the v3 post of the hard limits feature for the CFS group scheduler. This
post mostly addresses the comments received during v2.
If the approach taken in v3 is found to be acceptable, I would like to
start merging some common code with rt from the next post onwards.
As last time, this is only lightly tested; I know there are bugs in this
version which I am working on.
Changes
-------
RFC v3:
- Until v2, rq->nr_running was updated when tasks went off and came back on
the runqueue during throttling and unthrottling. This is no longer done.
- With the above change, quite a bit of code simplification is achieved.
The runtime related fields of cfs_rq are now protected by a per-cfs_rq
lock instead of the per-rq lock, which makes the code more similar to rt.
- Remove the control file cpu.cfs_hard_limit which enabled/disabled hard limits
for groups. Hard limits are now enabled by setting a non-zero runtime for
the group.
- Don't explicitly prevent movement of tasks into throttled groups during
load balancing, since throttled entities are prevented from being
enqueued in enqueue_task_fair() anyway.
- Moved to 2.6.32-rc6
RFC v2:
- http://lkml.org/lkml/2009/9/30/115
- Upgraded to 2.6.31.
- Added CFS runtime borrowing.
- New locking scheme:
The hard limit specific fields of cfs_rq (cfs_runtime, cfs_time and
cfs_throttled) were being protected by rq->lock. This simple scheme does
not work once runtime rebalancing is introduced, since that requires
looking at these fields on other CPUs and hence acquiring the rq->lock of
those CPUs, which is not feasible from update_curr(). Hence a separate
lock (rq->runtime_lock) was introduced to protect these fields of all the
cfs_rqs under an rq.
- Handle the task wakeup in a throttled group correctly.
- Make CFS_HARD_LIMITS dependent on CGROUP_SCHED (Thanks to Andrea Righi)
RFC v1:
- First version of the patches with minimal features was posted at
http://lkml.org/lkml/2009/8/25/128
RFC v0:
- The CFS hard limits proposal was first posted at
http://lkml.org/lkml/2009/6/4/24
Features TODO
-------------
- CFS runtime borrowing still needs some work, especially handling runtime
redistribution when a CPU goes offline.
- Bandwidth inheritance support (long term, not under consideration currently)
- This implementation doesn't work with the user group scheduler. Since the
user group scheduler will eventually go away, I don't plan to address this.
Implementation TODO
-------------------
- It is possible to share some of the bandwidth handling code with RT, but
the intention of this post is to show the changes associated with hard limits.
Hence the sharing/cleanup will be done down the line when this patchset
itself becomes more acceptable.
- When a dequeued entity is enqueued back, its vruntime is not changed. The
entity might get an undue advantage due to its old (lower) vruntime. This
needs to be addressed; a possible approach is sketched below.
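One possible approach (only a sketch, not part of this series; the helper
names below are hypothetical) is to keep the entity's vruntime relative to
its cfs_rq's min_vruntime across the throttle/unthrottle cycle, broadly
similar to how CFS re-keys vruntime when a task migrates between CPUs:

/*
 * Hypothetical sketch: avoid carrying a stale (low) absolute vruntime
 * across a throttle/unthrottle cycle by storing it relative to the
 * cfs_rq's min_vruntime while the entity is off the runqueue.
 */
static void throttle_adjust_vruntime(struct cfs_rq *cfs_rq,
		struct sched_entity *se)
{
	/* make vruntime relative when the entity is dequeued */
	se->vruntime -= cfs_rq->min_vruntime;
}

static void unthrottle_adjust_vruntime(struct cfs_rq *cfs_rq,
		struct sched_entity *se)
{
	/* re-key it against the current queue when enqueuing back */
	se->vruntime += cfs_rq->min_vruntime;
}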
Patches description
-------------------
This post has the following patches:
1/7 sched: Rename sched_rt_period_mask() and use it in CFS also
2/7 sched: Bandwidth initialization for fair task groups
3/7 sched: Enforce hard limits by throttling
4/7 sched: Unthrottle the throttled tasks
5/7 sched: Add throttle time statistics to /proc/sched_debug
6/7 sched: CFS runtime borrowing
7/7 sched: Hard limits documentation
Documentation/scheduler/sched-cfs-hard-limits.txt | 48 +++
include/linux/sched.h | 6
init/Kconfig | 13
kernel/sched.c | 316 +++++++++++++++++++-
kernel/sched_debug.c | 17 +
kernel/sched_fair.c | 288 +++++++++++++++++-
kernel/sched_rt.c | 19 -
7 files changed, 674 insertions(+), 33 deletions(-)
Regards,
Bharata.
sched: Rename sched_rt_period_mask() and use it in CFS also.
From: Bharata B Rao <[email protected]>
sched_rt_period_mask() is needed in CFS also. Rename it to a generic name
and move it to kernel/sched.c. No functionality change in this patch.
Signed-off-by: Bharata B Rao <[email protected]>
---
kernel/sched.c | 23 +++++++++++++++++++++++
kernel/sched_rt.c | 19 +------------------
2 files changed, 24 insertions(+), 18 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index a455dca..1309e8d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1810,6 +1810,29 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
static void calc_load_account_active(struct rq *this_rq);
+
+#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_FAIR_GROUP_SCHED)
+
+#ifdef CONFIG_SMP
+static inline const struct cpumask *sched_bw_period_mask(void)
+{
+ return cpu_rq(smp_processor_id())->rd->span;
+}
+#else /* !CONFIG_SMP */
+static inline const struct cpumask *sched_bw_period_mask(void)
+{
+ return cpu_online_mask;
+}
+#endif /* CONFIG_SMP */
+
+#else
+static inline const struct cpumask *sched_bw_period_mask(void)
+{
+ return cpu_online_mask;
+}
+
+#endif
+
#include "sched_stats.h"
#include "sched_idletask.c"
#include "sched_fair.c"
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index a4d790c..97067e1 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -235,18 +235,6 @@ static int rt_se_boosted(struct sched_rt_entity *rt_se)
return p->prio != p->normal_prio;
}
-#ifdef CONFIG_SMP
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
- return cpu_rq(smp_processor_id())->rd->span;
-}
-#else
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
- return cpu_online_mask;
-}
-#endif
-
static inline
struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
{
@@ -296,11 +284,6 @@ static inline int rt_rq_throttled(struct rt_rq *rt_rq)
return rt_rq->rt_throttled;
}
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
- return cpu_online_mask;
-}
-
static inline
struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
{
@@ -518,7 +501,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
return 1;
- span = sched_rt_period_mask();
+ span = sched_bw_period_mask();
for_each_cpu(i, span) {
int enqueue = 0;
struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, i);
sched: Bandwidth initialization for fair task groups.
From: Bharata B Rao <[email protected]>
Introduce the notion of hard limiting for CFS groups by bringing in
the concept of runtime and period for them. Add cgroup files to control
runtime and period.
Signed-off-by: Bharata B Rao <[email protected]>
---
init/Kconfig | 13 +++
kernel/sched.c | 277 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 290 insertions(+), 0 deletions(-)
diff --git a/init/Kconfig b/init/Kconfig
index f515864..fea8cbe 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -477,6 +477,19 @@ config CGROUP_SCHED
endchoice
+config CFS_HARD_LIMITS
+ bool "Hard Limits for CFS Group Scheduler"
+ depends on EXPERIMENTAL
+ depends on FAIR_GROUP_SCHED && CGROUP_SCHED
+ default n
+ help
+ This option enables hard limiting of CPU time obtained by
+ a fair task group. Use this if you want to throttle a group of tasks
+ based on its CPU usage. For more details refer to
+ Documentation/scheduler/sched-cfs-hard-limits.txt
+
+ Say N if unsure.
+
menuconfig CGROUPS
boolean "Control Group support"
help
diff --git a/kernel/sched.c b/kernel/sched.c
index 1309e8d..1d46fdc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -237,6 +237,15 @@ static DEFINE_MUTEX(sched_domains_mutex);
#include <linux/cgroup.h>
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_CFS_HARD_LIMITS)
+struct cfs_bandwidth {
+ spinlock_t cfs_runtime_lock;
+ ktime_t cfs_period;
+ u64 cfs_runtime;
+ struct hrtimer cfs_period_timer;
+};
+#endif
+
struct cfs_rq;
static LIST_HEAD(task_groups);
@@ -257,6 +266,9 @@ struct task_group {
/* runqueue "owned" by this group on each cpu */
struct cfs_rq **cfs_rq;
unsigned long shares;
+#ifdef CONFIG_CFS_HARD_LIMITS
+ struct cfs_bandwidth cfs_bandwidth;
+#endif
#endif
#ifdef CONFIG_RT_GROUP_SCHED
@@ -445,6 +457,19 @@ struct cfs_rq {
unsigned long rq_weight;
#endif
#endif
+#ifdef CONFIG_CFS_HARD_LIMITS
+ /* set when the group is throttled on this cpu */
+ int cfs_throttled;
+
+ /* runtime currently consumed by the group on this rq */
+ u64 cfs_time;
+
+ /* runtime available to the group on this rq */
+ u64 cfs_runtime;
+
+ /* Protects the cfs runtime related fields of this cfs_rq */
+ spinlock_t cfs_runtime_lock;
+#endif
};
/* Real-Time classes' related field in a runqueue: */
@@ -1833,6 +1858,144 @@ static inline const struct cpumask *sched_bw_period_mask(void)
#endif
+#ifdef CONFIG_FAIR_GROUP_SCHED
+#ifdef CONFIG_CFS_HARD_LIMITS
+
+/*
+ * Runtime allowed for a cfs group before it is hard limited.
+ * default: Infinite which means no hard limiting.
+ */
+u64 sched_cfs_runtime = RUNTIME_INF;
+
+/*
+ * period over which we hard limit the cfs group's bandwidth.
+ * default: 0.5s
+ */
+u64 sched_cfs_period = 500000;
+
+static inline u64 global_cfs_period(void)
+{
+ return sched_cfs_period * NSEC_PER_USEC;
+}
+
+static inline u64 global_cfs_runtime(void)
+{
+ return RUNTIME_INF;
+}
+
+static inline void cfs_rq_runtime_lock(struct cfs_rq *cfs_rq)
+{
+ spin_lock(&cfs_rq->cfs_runtime_lock);
+}
+
+static inline void cfs_rq_runtime_unlock(struct cfs_rq *cfs_rq)
+{
+ spin_unlock(&cfs_rq->cfs_runtime_lock);
+}
+
+/*
+ * Refresh the runtimes of the throttled groups.
+ * But nothing much to do now, will populate this in later patches.
+ */
+static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
+{
+ struct cfs_bandwidth *cfs_b =
+ container_of(timer, struct cfs_bandwidth, cfs_period_timer);
+
+ hrtimer_add_expires_ns(timer, ktime_to_ns(cfs_b->cfs_period));
+ return HRTIMER_RESTART;
+}
+
+/*
+ * TODO: Check if this kind of timer setup is sufficient for cfs or
+ * should we do what rt is doing.
+ */
+static void start_cfs_bandwidth(struct task_group *tg)
+{
+ struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+
+ /*
+ * Timer isn't setup for groups with infinite runtime
+ */
+ if (cfs_b->cfs_runtime == RUNTIME_INF)
+ return;
+
+ if (hrtimer_active(&cfs_b->cfs_period_timer))
+ return;
+
+ hrtimer_start_range_ns(&cfs_b->cfs_period_timer, cfs_b->cfs_period,
+ 0, HRTIMER_MODE_REL);
+}
+
+static void init_cfs_bandwidth(struct task_group *tg)
+{
+ struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+
+ cfs_b->cfs_period = ns_to_ktime(global_cfs_period());
+ cfs_b->cfs_runtime = global_cfs_runtime();
+
+ spin_lock_init(&cfs_b->cfs_runtime_lock);
+
+ hrtimer_init(&cfs_b->cfs_period_timer,
+ CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ cfs_b->cfs_period_timer.function = &sched_cfs_period_timer;
+}
+
+static inline void destroy_cfs_bandwidth(struct task_group *tg)
+{
+ hrtimer_cancel(&tg->cfs_bandwidth.cfs_period_timer);
+}
+
+static void init_cfs_hard_limits(struct cfs_rq *cfs_rq, struct task_group *tg)
+{
+ cfs_rq->cfs_time = 0;
+ cfs_rq->cfs_throttled = 0;
+ cfs_rq->cfs_runtime = tg->cfs_bandwidth.cfs_runtime;
+ spin_lock_init(&cfs_rq->cfs_runtime_lock);
+}
+
+#else /* !CONFIG_CFS_HARD_LIMITS */
+
+static void init_cfs_bandwidth(struct task_group *tg)
+{
+ return;
+}
+
+static inline void destroy_cfs_bandwidth(struct task_group *tg)
+{
+ return;
+}
+
+static void init_cfs_hard_limits(struct cfs_rq *cfs_rq, struct task_group *tg)
+{
+ return;
+}
+
+static inline void cfs_rq_runtime_lock(struct cfs_rq *cfs_rq)
+{
+ return;
+}
+
+static inline void cfs_rq_runtime_unlock(struct cfs_rq *cfs_rq)
+{
+ return;
+}
+
+#endif /* CONFIG_CFS_HARD_LIMITS */
+#else /* !CONFIG_FAIR_GROUP_SCHED */
+
+static inline void cfs_rq_runtime_lock(struct cfs_rq *cfs_rq)
+{
+ return;
+}
+
+static inline void cfs_rq_runtime_unlock(struct cfs_rq *cfs_rq)
+{
+ return;
+}
+
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
#include "sched_stats.h"
#include "sched_idletask.c"
#include "sched_fair.c"
@@ -9286,6 +9449,7 @@ static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
struct rq *rq = cpu_rq(cpu);
tg->cfs_rq[cpu] = cfs_rq;
init_cfs_rq(cfs_rq, rq);
+ init_cfs_hard_limits(cfs_rq, tg);
cfs_rq->tg = tg;
if (add)
list_add(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
@@ -9415,6 +9579,10 @@ void __init sched_init(void)
#endif /* CONFIG_USER_SCHED */
#endif /* CONFIG_RT_GROUP_SCHED */
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ init_cfs_bandwidth(&init_task_group);
+#endif
+
#ifdef CONFIG_GROUP_SCHED
list_add(&init_task_group.list, &task_groups);
INIT_LIST_HEAD(&init_task_group.children);
@@ -9441,6 +9609,7 @@ void __init sched_init(void)
init_cfs_rq(&rq->cfs, rq);
init_rt_rq(&rq->rt, rq);
#ifdef CONFIG_FAIR_GROUP_SCHED
+ init_cfs_hard_limits(&rq->cfs, &init_task_group);
init_task_group.shares = init_task_group_load;
INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
#ifdef CONFIG_CGROUP_SCHED
@@ -9716,6 +9885,7 @@ static void free_fair_sched_group(struct task_group *tg)
{
int i;
+ destroy_cfs_bandwidth(tg);
for_each_possible_cpu(i) {
if (tg->cfs_rq)
kfree(tg->cfs_rq[i]);
@@ -9742,6 +9912,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
if (!tg->se)
goto err;
+ init_cfs_bandwidth(tg);
tg->shares = NICE_0_LOAD;
for_each_possible_cpu(i) {
@@ -10465,6 +10636,100 @@ static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)
return (u64) tg->shares;
}
+
+#ifdef CONFIG_CFS_HARD_LIMITS
+
+static int tg_set_cfs_bandwidth(struct task_group *tg,
+ u64 cfs_period, u64 cfs_runtime)
+{
+ int i;
+
+ spin_lock_irq(&tg->cfs_bandwidth.cfs_runtime_lock);
+ tg->cfs_bandwidth.cfs_period = ns_to_ktime(cfs_period);
+ tg->cfs_bandwidth.cfs_runtime = cfs_runtime;
+
+ for_each_possible_cpu(i) {
+ struct cfs_rq *cfs_rq = tg->cfs_rq[i];
+
+ cfs_rq_runtime_lock(cfs_rq);
+ cfs_rq->cfs_runtime = cfs_runtime;
+ cfs_rq_runtime_unlock(cfs_rq);
+ }
+
+ start_cfs_bandwidth(tg);
+ spin_unlock_irq(&tg->cfs_bandwidth.cfs_runtime_lock);
+ return 0;
+}
+
+int tg_set_cfs_runtime(struct task_group *tg, long cfs_runtime_us)
+{
+ u64 cfs_runtime, cfs_period;
+
+ cfs_period = ktime_to_ns(tg->cfs_bandwidth.cfs_period);
+ cfs_runtime = (u64)cfs_runtime_us * NSEC_PER_USEC;
+ if (cfs_runtime_us < 0)
+ cfs_runtime = RUNTIME_INF;
+
+ return tg_set_cfs_bandwidth(tg, cfs_period, cfs_runtime);
+}
+
+long tg_get_cfs_runtime(struct task_group *tg)
+{
+ u64 cfs_runtime_us;
+
+ if (tg->cfs_bandwidth.cfs_runtime == RUNTIME_INF)
+ return -1;
+
+ cfs_runtime_us = tg->cfs_bandwidth.cfs_runtime;
+ do_div(cfs_runtime_us, NSEC_PER_USEC);
+ return cfs_runtime_us;
+}
+
+int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
+{
+ u64 cfs_runtime, cfs_period;
+
+ cfs_period = (u64)cfs_period_us * NSEC_PER_USEC;
+ cfs_runtime = tg->cfs_bandwidth.cfs_runtime;
+
+ if (cfs_period == 0)
+ return -EINVAL;
+
+ return tg_set_cfs_bandwidth(tg, cfs_period, cfs_runtime);
+}
+
+long tg_get_cfs_period(struct task_group *tg)
+{
+ u64 cfs_period_us;
+
+ cfs_period_us = ktime_to_ns(tg->cfs_bandwidth.cfs_period);
+ do_div(cfs_period_us, NSEC_PER_USEC);
+ return cfs_period_us;
+}
+
+static s64 cpu_cfs_runtime_read_s64(struct cgroup *cgrp, struct cftype *cft)
+{
+ return tg_get_cfs_runtime(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_runtime_write_s64(struct cgroup *cgrp, struct cftype *cftype,
+ s64 cfs_runtime_us)
+{
+ return tg_set_cfs_runtime(cgroup_tg(cgrp), cfs_runtime_us);
+}
+
+static u64 cpu_cfs_period_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+ return tg_get_cfs_period(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
+ u64 cfs_period_us)
+{
+ return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
+}
+
+#endif /* CONFIG_CFS_HARD_LIMITS */
#endif /* CONFIG_FAIR_GROUP_SCHED */
#ifdef CONFIG_RT_GROUP_SCHED
@@ -10498,6 +10763,18 @@ static struct cftype cpu_files[] = {
.read_u64 = cpu_shares_read_u64,
.write_u64 = cpu_shares_write_u64,
},
+#ifdef CONFIG_CFS_HARD_LIMITS
+ {
+ .name = "cfs_runtime_us",
+ .read_s64 = cpu_cfs_runtime_read_s64,
+ .write_s64 = cpu_cfs_runtime_write_s64,
+ },
+ {
+ .name = "cfs_period_us",
+ .read_u64 = cpu_cfs_period_read_u64,
+ .write_u64 = cpu_cfs_period_write_u64,
+ },
+#endif /* CONFIG_CFS_HARD_LIMITS */
#endif
#ifdef CONFIG_RT_GROUP_SCHED
{
sched: Enforce hard limits by throttling.
From: Bharata B Rao <[email protected]>
Throttle the task-groups which exceed the runtime allocated to them.
Throttled group entities are removed from the run queue.
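For reference, the core of the throttling decision added by this patch (in
sched_cfs_runtime_exceeded(), called from update_curr() for group entities)
reduces to roughly the following simplified restatement; the runtime-lock
handling around it is omitted here:

	if (cfs_rq->cfs_runtime == RUNTIME_INF)	/* no hard limit set */
		return;

	cfs_rq->cfs_time += delta_exec;		/* charge the group's usage */

	if (!cfs_rq_throttled(cfs_rq) &&
	    cfs_rq->cfs_time > cfs_rq->cfs_runtime) {
		/* throttle: put_prev_entity() will dequeue the entity */
		cfs_rq->cfs_throttled = 1;
		/* and get the currently running task off the cpu */
		resched_task(tsk_curr);
	}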
Signed-off-by: Bharata B Rao <[email protected]>
---
kernel/sched.c | 13 ++++
kernel/sched_fair.c | 164 +++++++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 163 insertions(+), 14 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 1d46fdc..5d2e5e5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1631,6 +1631,7 @@ static void update_group_shares_cpu(struct task_group *tg, int cpu,
}
}
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
/*
* Re-compute the task group their per cpu shares over the given domain.
* This needs to be done in a bottom-up fashion because the rq weight of a
@@ -1658,8 +1659,10 @@ static int tg_shares_up(struct task_group *tg, void *data)
* If there are currently no tasks on the cpu pretend there
* is one of average load so that when a new task gets to
* run here it will not get delayed by group starvation.
+ * Also if the group is throttled on this cpu, pretend that
+ * it has no tasks.
*/
- if (!weight)
+ if (!weight || cfs_rq_throttled(tg->cfs_rq[i]))
weight = NICE_0_LOAD;
rq_weight += weight;
@@ -1684,6 +1687,7 @@ static int tg_shares_up(struct task_group *tg, void *data)
* Compute the cpu's hierarchical load factor for each task group.
* This needs to be done in a top-down fashion because the load of a child
* group is a fraction of its parents load.
+ * A throttled group's h_load is set to 0.
*/
static int tg_load_down(struct task_group *tg, void *data)
{
@@ -1692,6 +1696,8 @@ static int tg_load_down(struct task_group *tg, void *data)
if (!tg->parent) {
load = cpu_rq(cpu)->load.weight;
+ } else if (cfs_rq_throttled(tg->cfs_rq[cpu])) {
+ load = 0;
} else {
load = tg->parent->cfs_rq[cpu]->h_load;
load *= tg->cfs_rq[cpu]->shares;
@@ -1994,6 +2000,11 @@ static inline void cfs_rq_runtime_unlock(struct cfs_rq *cfs_rq)
return;
}
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+ return 0;
+}
+
#endif /* CONFIG_FAIR_GROUP_SCHED */
#include "sched_stats.h"
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c32c3e6..b83ff7d 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -189,7 +189,54 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
}
}
-#else /* !CONFIG_FAIR_GROUP_SCHED */
+#ifdef CONFIG_CFS_HARD_LIMITS
+
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+ return cfs_rq->cfs_throttled;
+}
+
+/*
+ * Check if the group entity exceeded its runtime. If so, mark the cfs_rq as
+ * throttled and mark the current task for rescheduling.
+ */
+static void sched_cfs_runtime_exceeded(struct sched_entity *se,
+ struct task_struct *tsk_curr, unsigned long delta_exec)
+{
+ struct cfs_rq *cfs_rq;
+
+ cfs_rq = group_cfs_rq(se);
+
+ if (cfs_rq->cfs_runtime == RUNTIME_INF)
+ return;
+
+ cfs_rq->cfs_time += delta_exec;
+
+ if (cfs_rq_throttled(cfs_rq))
+ return;
+
+ if (cfs_rq->cfs_time > cfs_rq->cfs_runtime) {
+ cfs_rq->cfs_throttled = 1;
+ resched_task(tsk_curr);
+ }
+}
+
+#else
+
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+ return 0;
+}
+
+static void sched_cfs_runtime_exceeded(struct sched_entity *se,
+ struct task_struct *tsk_curr, unsigned long delta_exec)
+{
+ return;
+}
+
+#endif /* CONFIG_CFS_HARD_LIMITS */
+
+#else /* CONFIG_FAIR_GROUP_SCHED */
static inline struct task_struct *task_of(struct sched_entity *se)
{
@@ -249,6 +296,12 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
{
}
+static void sched_cfs_runtime_exceeded(struct sched_entity *se,
+ struct task_struct *tsk_curr, unsigned long delta_exec)
+{
+ return;
+}
+
#endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -489,10 +542,12 @@ __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
update_min_vruntime(cfs_rq);
}
-static void update_curr(struct cfs_rq *cfs_rq)
+static void update_curr_common(struct cfs_rq *cfs_rq, int locked)
{
struct sched_entity *curr = cfs_rq->curr;
- u64 now = rq_of(cfs_rq)->clock;
+ struct rq *rq = rq_of(cfs_rq);
+ struct task_struct *tsk_curr = rq->curr;
+ u64 now = rq->clock;
unsigned long delta_exec;
if (unlikely(!curr))
@@ -516,9 +571,26 @@ static void update_curr(struct cfs_rq *cfs_rq)
trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
cpuacct_charge(curtask, delta_exec);
account_group_exec_runtime(curtask, delta_exec);
+ } else {
+ /* TODO: (Un)locking this way is ugly, find an alternative. */
+ if (!locked)
+ cfs_rq_runtime_lock(group_cfs_rq(curr));
+ sched_cfs_runtime_exceeded(curr, tsk_curr, delta_exec);
+ if (!locked)
+ cfs_rq_runtime_unlock(group_cfs_rq(curr));
}
}
+static inline void update_curr(struct cfs_rq *cfs_rq)
+{
+ update_curr_common(cfs_rq, 0);
+}
+
+static inline void update_curr_locked(struct cfs_rq *cfs_rq)
+{
+ update_curr_common(cfs_rq, 1);
+}
+
static inline void
update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
@@ -740,13 +812,9 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
se->vruntime = vruntime;
}
-static void
-enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int wakeup)
+static void enqueue_entity_common(struct cfs_rq *cfs_rq,
+ struct sched_entity *se, int wakeup)
{
- /*
- * Update run-time statistics of the 'current'.
- */
- update_curr(cfs_rq);
account_entity_enqueue(cfs_rq, se);
if (wakeup) {
@@ -760,6 +828,26 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int wakeup)
__enqueue_entity(cfs_rq, se);
}
+static void enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
+ int wakeup)
+{
+ /*
+ * Update run-time statistics of the 'current'.
+ */
+ update_curr(cfs_rq);
+ enqueue_entity_common(cfs_rq, se, wakeup);
+}
+
+static void enqueue_entity_locked(struct cfs_rq *cfs_rq,
+ struct sched_entity *se, int wakeup)
+{
+ /*
+ * Update run-time statistics of the 'current'.
+ */
+ update_curr_locked(cfs_rq);
+ enqueue_entity_common(cfs_rq, se, wakeup);
+}
+
static void __clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
if (!se || cfs_rq->last == se)
@@ -880,6 +968,32 @@ static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
return se;
}
+/*
+ * Called from put_prev_entity()
+ * If a group entity (@se) is found to be throttled, it will not be put back
+ * on @cfs_rq, which is equivalent to dequeuing it.
+ */
+static int dequeue_throttled_entity(struct cfs_rq *cfs_rq,
+ struct sched_entity *se)
+{
+ struct cfs_rq *gcfs_rq = group_cfs_rq(se);
+
+ if (entity_is_task(se))
+ return 0;
+
+ cfs_rq_runtime_lock(gcfs_rq);
+ if (!cfs_rq_throttled(gcfs_rq) && gcfs_rq->nr_running) {
+ cfs_rq_runtime_unlock(gcfs_rq);
+ return 0;
+ }
+
+ __clear_buddies(cfs_rq, se);
+ account_entity_dequeue(cfs_rq, se);
+ cfs_rq->curr = NULL;
+ cfs_rq_runtime_unlock(gcfs_rq);
+ return 1;
+}
+
static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
{
/*
@@ -891,6 +1005,8 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
check_spread(cfs_rq, prev);
if (prev->on_rq) {
+ if (dequeue_throttled_entity(cfs_rq, prev))
+ return;
update_stats_wait_start(cfs_rq, prev);
/* Put 'current' back into the tree. */
__enqueue_entity(cfs_rq, prev);
@@ -987,10 +1103,28 @@ static inline void hrtick_update(struct rq *rq)
}
#endif
+static int enqueue_group_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
+ int wakeup)
+{
+ struct cfs_rq *gcfs_rq = group_cfs_rq(se);
+ int ret = 0;
+
+ cfs_rq_runtime_lock(gcfs_rq);
+ if (cfs_rq_throttled(gcfs_rq)) {
+ ret = 1;
+ goto out;
+ }
+ enqueue_entity_locked(cfs_rq, se, wakeup);
+out:
+ cfs_rq_runtime_unlock(gcfs_rq);
+ return ret;
+}
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
* then put the task into the rbtree:
+ * Don't enqueue a throttled entity further into the hierarchy.
*/
static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
{
@@ -1000,11 +1134,15 @@ static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
for_each_sched_entity(se) {
if (se->on_rq)
break;
+
cfs_rq = cfs_rq_of(se);
- enqueue_entity(cfs_rq, se, wakeup);
+ if (entity_is_task(se))
+ enqueue_entity(cfs_rq, se, wakeup);
+ else
+ if (enqueue_group_entity(cfs_rq, se, wakeup))
+ break;
wakeup = 1;
}
-
hrtick_update(rq);
}
@@ -1767,9 +1905,9 @@ load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
u64 rem_load, moved_load;
/*
- * empty group
+ * empty group or a group with no h_load (throttled)
*/
- if (!busiest_cfs_rq->task_weight)
+ if (!busiest_cfs_rq->task_weight || !busiest_h_load)
continue;
rem_load = (u64)rem_load_move * busiest_weight;
sched: Unthrottle the throttled tasks.
From: Bharata B Rao <[email protected]>
Refresh runtimes when the group's period expires and unthrottle any
groups that were throttled during that period. The runtime refresh is
driven by a periodic timer.
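As an illustration: with the default 0.5s period and a runtime of 250ms, a
cfs_rq that uses up its 250ms gets throttled for the remainder of the period
(limiting the group to roughly 50% of that CPU per period). When the period
timer fires, do_sched_cfs_period_timer() resets cfs_time to 0, clears
cfs_throttled and enqueues the throttled entities back.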
Signed-off-by: Bharata B Rao <[email protected]>
---
kernel/sched.c | 3 +++
kernel/sched_fair.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 47 insertions(+), 1 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 5d2e5e5..8cccda2 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1889,6 +1889,8 @@ static inline u64 global_cfs_runtime(void)
return RUNTIME_INF;
}
+void do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b);
+
static inline void cfs_rq_runtime_lock(struct cfs_rq *cfs_rq)
{
spin_lock(&cfs_rq->cfs_runtime_lock);
@@ -1908,6 +1910,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
struct cfs_bandwidth *cfs_b =
container_of(timer, struct cfs_bandwidth, cfs_period_timer);
+ do_sched_cfs_period_timer(cfs_b);
hrtimer_add_expires_ns(timer, ktime_to_ns(cfs_b->cfs_period));
return HRTIMER_RESTART;
}
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index b83ff7d..557d30d 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -221,6 +221,50 @@ static void sched_cfs_runtime_exceeded(struct sched_entity *se,
}
}
+static void enqueue_entity_locked(struct cfs_rq *cfs_rq,
+ struct sched_entity *se, int wakeup);
+
+static void enqueue_throttled_entity(struct rq *rq, struct sched_entity *se)
+{
+ for_each_sched_entity(se) {
+ if (se->on_rq)
+ break;
+ if (cfs_rq_throttled(group_cfs_rq(se)))
+ break;
+ enqueue_entity_locked(cfs_rq_of(se), se, 0);
+ }
+}
+
+/*
+ * Refresh runtimes of all cfs_rqs in this group, i.e.,
+ * refresh runtimes of the representative cfs_rq of this
+ * tg on all cpus. Enqueue any throttled entity back.
+ */
+void do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b)
+{
+ int i;
+ const struct cpumask *span = sched_bw_period_mask();
+ struct task_group *tg = container_of(cfs_b, struct task_group,
+ cfs_bandwidth);
+ unsigned long flags;
+
+ for_each_cpu(i, span) {
+ struct rq *rq = cpu_rq(i);
+ struct cfs_rq *cfs_rq = tg->cfs_rq[i];
+ struct sched_entity *se = tg->se[i];
+
+ spin_lock_irqsave(&rq->lock, flags);
+ cfs_rq_runtime_lock(cfs_rq);
+ cfs_rq->cfs_time = 0;
+ if (cfs_rq_throttled(cfs_rq)) {
+ cfs_rq->cfs_throttled = 0;
+ enqueue_throttled_entity(rq, se);
+ }
+ cfs_rq_runtime_unlock(cfs_rq);
+ spin_unlock_irqrestore(&rq->lock, flags);
+ }
+}
+
#else
static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
@@ -304,7 +348,6 @@ static void sched_cfs_runtime_exceeded(struct sched_entity *se,
#endif /* CONFIG_FAIR_GROUP_SCHED */
-
/**************************************************************
* Scheduling class tree data structure manipulation methods:
*/
sched: Add throttle time statistics to /proc/sched_debug
From: Bharata B Rao <[email protected]>
With hard limits, provide stats about throttle time, throttle count
and max throttle time for group sched entities in /proc/sched_debug.
Throttle stats are collected only for group entities.
Signed-off-by: Bharata B Rao <[email protected]>
---
include/linux/sched.h | 6 ++++++
kernel/sched_debug.c | 17 ++++++++++++++++-
kernel/sched_fair.c | 20 ++++++++++++++++++++
3 files changed, 42 insertions(+), 1 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 75e6e60..b7f238c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1183,6 +1183,12 @@ struct sched_entity {
u64 nr_wakeups_affine_attempts;
u64 nr_wakeups_passive;
u64 nr_wakeups_idle;
+#ifdef CONFIG_CFS_HARD_LIMITS
+ u64 throttle_start;
+ u64 throttle_max;
+ u64 throttle_count;
+ u64 throttle_sum;
+#endif
#endif
#ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index efb8440..a8f24fb 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -80,6 +80,11 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu,
PN(se->wait_max);
PN(se->wait_sum);
P(se->wait_count);
+#ifdef CONFIG_CFS_HARD_LIMITS
+ PN(se->throttle_max);
+ PN(se->throttle_sum);
+ P(se->throttle_count);
+#endif
#endif
P(se->load.weight);
#undef PN
@@ -214,6 +219,16 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
#ifdef CONFIG_SMP
SEQ_printf(m, " .%-30s: %lu\n", "shares", cfs_rq->shares);
#endif
+#ifdef CONFIG_CFS_HARD_LIMITS
+ spin_lock_irqsave(&rq->lock, flags);
+ SEQ_printf(m, " .%-30s: %d\n", "cfs_throttled",
+ cfs_rq->cfs_throttled);
+ SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "cfs_time",
+ SPLIT_NS(cfs_rq->cfs_time));
+ SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "cfs_runtime",
+ SPLIT_NS(cfs_rq->cfs_runtime));
+ spin_unlock_irqrestore(&rq->lock, flags);
+#endif /* CONFIG_CFS_HARD_LIMITS */
print_cfs_group_stats(m, cpu, cfs_rq->tg);
#endif
}
@@ -310,7 +325,7 @@ static int sched_debug_show(struct seq_file *m, void *v)
u64 now = ktime_to_ns(ktime_get());
int cpu;
- SEQ_printf(m, "Sched Debug Version: v0.09, %s %.*s\n",
+ SEQ_printf(m, "Sched Debug Version: v0.10, %s %.*s\n",
init_utsname()->release,
(int)strcspn(init_utsname()->version, " "),
init_utsname()->version);
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 557d30d..828d7e7 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -191,6 +191,23 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
#ifdef CONFIG_CFS_HARD_LIMITS
+static inline void update_stats_throttle_start(struct cfs_rq *cfs_rq,
+ struct sched_entity *se)
+{
+ schedstat_set(se->throttle_start, rq_of(cfs_rq)->clock);
+}
+
+static inline void update_stats_throttle_end(struct cfs_rq *cfs_rq,
+ struct sched_entity *se)
+{
+ schedstat_set(se->throttle_max, max(se->throttle_max,
+ rq_of(cfs_rq)->clock - se->throttle_start));
+ schedstat_set(se->throttle_count, se->throttle_count + 1);
+ schedstat_set(se->throttle_sum, se->throttle_sum +
+ rq_of(cfs_rq)->clock - se->throttle_start);
+ schedstat_set(se->throttle_start, 0);
+}
+
static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
{
return cfs_rq->cfs_throttled;
@@ -217,6 +234,7 @@ static void sched_cfs_runtime_exceeded(struct sched_entity *se,
if (cfs_rq->cfs_time > cfs_rq->cfs_runtime) {
cfs_rq->cfs_throttled = 1;
+ update_stats_throttle_start(cfs_rq, se);
resched_task(tsk_curr);
}
}
@@ -257,6 +275,8 @@ void do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b)
cfs_rq_runtime_lock(cfs_rq);
cfs_rq->cfs_time = 0;
if (cfs_rq_throttled(cfs_rq)) {
+ update_rq_clock(rq);
+ update_stats_throttle_end(cfs_rq, se);
cfs_rq->cfs_throttled = 0;
enqueue_throttled_entity(rq, se);
}
sched: CFS runtime borrowing
From: Bharata B Rao <[email protected]>
Before throttling a group on a cpu, try to borrow runtime from the group's
cfs_rqs on other cpus that have excess runtime. To start with, a group gets
equal runtime on every cpu. If the group doesn't have tasks on all cpus, it
might get throttled on some cpus while it still has runtime left on other
cpus where it doesn't have any tasks to consume that runtime. Hence there is
a chance to move runtime from such cpus/cfs_rqs to the cpus/cfs_rqs where it
is required.
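As a rough illustration of the arithmetic in do_cfs_balance_runtime() below
(assuming a 4-CPU root domain and the default 0.5s period): if a group has a
100ms runtime per cfs_rq but runs tasks only on CPU0, then once CPU0's cfs_rq
exhausts its 100ms it borrows (unused runtime)/4, i.e. about 25ms, from each
of the other three cfs_rqs, ending up with roughly 175ms for this period. A
cfs_rq's runtime is never allowed to grow beyond the period length.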
---
kernel/sched_fair.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 60 insertions(+), 0 deletions(-)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 828d7e7..fc09109 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -214,6 +214,63 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
}
/*
+ * Ran out of runtime, check if we can borrow some from others
+ * instead of getting throttled right away.
+ */
+static void do_cfs_balance_runtime(struct cfs_rq *cfs_rq)
+{
+ struct cfs_bandwidth *cfs_b = &cfs_rq->tg->cfs_bandwidth;
+ const struct cpumask *span = sched_bw_period_mask();
+ int i, weight;
+ u64 cfs_period;
+ struct task_group *tg = container_of(cfs_b, struct task_group,
+ cfs_bandwidth);
+
+ weight = cpumask_weight(span);
+ spin_lock(&cfs_b->cfs_runtime_lock);
+ cfs_period = ktime_to_ns(cfs_b->cfs_period);
+
+ for_each_cpu(i, span) {
+ struct cfs_rq *borrow_cfs_rq = tg->cfs_rq[i];
+ s64 diff;
+
+ if (borrow_cfs_rq == cfs_rq)
+ continue;
+
+ cfs_rq_runtime_lock(borrow_cfs_rq);
+ if (borrow_cfs_rq->cfs_runtime == RUNTIME_INF) {
+ cfs_rq_runtime_unlock(borrow_cfs_rq);
+ continue;
+ }
+
+ diff = borrow_cfs_rq->cfs_runtime - borrow_cfs_rq->cfs_time;
+ if (diff > 0) {
+ diff = div_u64((u64)diff, weight);
+ if (cfs_rq->cfs_runtime + diff > cfs_period)
+ diff = cfs_period - cfs_rq->cfs_runtime;
+ borrow_cfs_rq->cfs_runtime -= diff;
+ cfs_rq->cfs_runtime += diff;
+ if (cfs_rq->cfs_runtime == cfs_period) {
+ cfs_rq_runtime_unlock(borrow_cfs_rq);
+ break;
+ }
+ }
+ cfs_rq_runtime_unlock(borrow_cfs_rq);
+ }
+ spin_unlock(&cfs_b->cfs_runtime_lock);
+}
+
+/*
+ * Called with cfs_rq->cfs_runtime_lock held.
+ */
+static void cfs_balance_runtime(struct cfs_rq *cfs_rq)
+{
+ cfs_rq_runtime_unlock(cfs_rq);
+ do_cfs_balance_runtime(cfs_rq);
+ cfs_rq_runtime_lock(cfs_rq);
+}
+
+/*
* Check if the group entity exceeded its runtime. If so, mark the cfs_rq as
* throttled and mark the current task for rescheduling.
*/
@@ -232,6 +289,9 @@ static void sched_cfs_runtime_exceeded(struct sched_entity *se,
if (cfs_rq_throttled(cfs_rq))
return;
+ if (cfs_rq->cfs_time > cfs_rq->cfs_runtime)
+ cfs_balance_runtime(cfs_rq);
+
if (cfs_rq->cfs_time > cfs_rq->cfs_runtime) {
cfs_rq->cfs_throttled = 1;
update_stats_throttle_start(cfs_rq, se);
sched: Hard limits documentation
From: Bharata B Rao <[email protected]>
Documentation for hard limits feature.
Signed-off-by: Bharata B Rao <[email protected]>
---
Documentation/scheduler/sched-cfs-hard-limits.txt | 48 +++++++++++++++++++++
1 files changed, 48 insertions(+), 0 deletions(-)
create mode 100644 Documentation/scheduler/sched-cfs-hard-limits.txt
diff --git a/Documentation/scheduler/sched-cfs-hard-limits.txt b/Documentation/scheduler/sched-cfs-hard-limits.txt
new file mode 100644
index 0000000..d6387af
--- /dev/null
+++ b/Documentation/scheduler/sched-cfs-hard-limits.txt
@@ -0,0 +1,48 @@
+CPU HARD LIMITS FOR CFS GROUPS
+==============================
+
+1. Overview
+2. Interface
+3. Examples
+
+1. Overview
+-----------
+
+CFS is a proportional share scheduler which tries to divide the CPU time
+proportionately between tasks or groups of tasks (task group/cgroup) depending
+on the priority/weight of the task or shares assigned to groups of tasks.
+In CFS, a task/task group can get more than its share of CPU if there are
+enough idle CPU cycles available in the system, due to the work-conserving
+nature of the scheduler. However, in certain scenarios (like pay-per-use),
+it is desirable not to provide extra time to a group even in the presence
+of idle CPU cycles. This is where hard limiting can be of use.
+
+Hard limits for task groups can be set by specifying how much CPU runtime a
+group can consume within a given period. If the group consumes more CPU time
+than the runtime in a given period, it gets throttled. None of the tasks of
+the throttled group gets to run until the runtime of the group gets refreshed
+at the beginning of the next period.
+
+2. Interface
+------------
+
+The hard limit feature adds 2 cgroup files for the CFS group scheduler:
+
+cfs_runtime_us: Hard limit for the group in microseconds.
+
+cfs_period_us: Time period in microseconds within which the hard limit is
+enforced.
+
+A group gets created with default values for runtime (infinite runtime, which
+means hard limits are disabled) and period (0.5s). Each group can set its own
+values for runtime and period independently of other groups in the system.
+
+3. Examples
+-----------
+
+# mount -t cgroup -ocpu none /cgroups/
+# cd /cgroups
+# mkdir 1
+# cd 1/
+# echo 250000 > cfs_runtime_us /* set a 250ms runtime or limit */
+# echo 500000 > cfs_period_us /* set a 500ms period */