2009-11-17 14:33:24

by Bharata B Rao

Subject: [RFC v4 PATCH 0/7] CFS Hard limits - v4

Hi,

Here is the v4 post of hard limits feature for CFS group scheduler. This
version mainly adds cpu hotplug support for CFS runtime balancing.

Changes
-------
RFC v4:
- Reclaim runtimes lent to other cpus when a cpu goes
offline. (Kamalesh Babulal)
- Fixed a few bugs.
- Some cleanups.

RFC v3:
- http://lkml.org/lkml/2009/11/9/65
- Till v2, rq->nr_running was updated when tasks went off and came back on
the runqueue during throttling and unthrottling. This is no longer done.
- With the above change, quite a bit of code simplification is achieved.
Runtime-related fields of cfs_rq are now protected by a per-cfs_rq lock
instead of the per-rq lock. With this, the code looks more similar to rt.
- Removed the control file cpu.cfs_hard_limit which enabled/disabled hard
limits for groups. Hard limits are now enabled by setting a non-zero runtime.
- Don't explicitly prevent movement of tasks into throttled groups during
load balancing as throttled entities are anyway prevented from being
enqueued in enqueue_task_fair().
- Moved to 2.6.32-rc6

RFC v2:
- http://lkml.org/lkml/2009/9/30/115
- Upgraded to 2.6.31.
- Added CFS runtime borrowing.
- New locking scheme
The hard limit specific fields of cfs_rq (cfs_runtime, cfs_time and
cfs_throttled) were protected by rq->lock. This simple scheme does not work
once runtime rebalancing is introduced, because these fields then need to be
looked at on other CPUs, which would require acquiring the rq->lock of those
CPUs; that is not feasible from update_curr(). Hence a separate lock
(rq->runtime_lock) is introduced to protect these fields of all cfs_rqs
under it.
- Handle the task wakeup in a throttled group correctly.
- Make CFS_HARD_LIMITS dependent on CGROUP_SCHED (Thanks to Andrea Righi)

RFC v1:
- First version of the patches with minimal features was posted at
http://lkml.org/lkml/2009/8/25/128

RFC v0:
- The CFS hard limits proposal was first posted at
http://lkml.org/lkml/2009/6/4/24

Testing and Benchmark numbers
-----------------------------
Some numbers from simple benchmarks to sanity-check that hard limits
patches are not causing any major regressions.

- hackbench (hackbench -pipe N)
(hackbench was run in a group under the root group)
-----------------------------------------------------------------------
                                 Time
      -----------------------------------------------------------------
  N    CFS_HARD_LIMITS=n    CFS_HARD_LIMITS=y      CFS_HARD_LIMITS=y
                            (infinite runtime)     (BW=450000/500000)
-----------------------------------------------------------------------
 10    0.574                0.614                  0.674
 20    1.086                1.154                  1.232
 50    2.689                2.487                  2.714
100    4.897                4.771                  5.439
-----------------------------------------------------------------------
- BW = Bandwidth = runtime/period; BW=450000/500000 means the group is
  allowed 450ms of runtime in every 500ms period.
- Infinite runtime means no hard limiting.

- lmbench (lat_ctx -N 5 -s <size_in_kb> N)

(i) size_in_kb = 1024
-----------------------------------------------------------------------
                        Context switch time (us)
      -----------------------------------------------------------------
  N    CFS_HARD_LIMITS=n    CFS_HARD_LIMITS=y      CFS_HARD_LIMITS=y
                            (infinite runtime)     (BW=450000/500000)
-----------------------------------------------------------------------
 10    237.14               248.83                  69.71
100    251.97               234.74                 254.73
500    248.39               252.73                 252.66
-----------------------------------------------------------------------

(ii) size_in_kb = 2048
-----------------------------------------------------------------------
                        Context switch time (us)
      -----------------------------------------------------------------
  N    CFS_HARD_LIMITS=n    CFS_HARD_LIMITS=y      CFS_HARD_LIMITS=y
                            (infinite runtime)     (BW=450000/500000)
-----------------------------------------------------------------------
 10    541.39               538.68                 419.03
100    504.52               504.22                 491.20
500    495.26               494.11                 497.12
-----------------------------------------------------------------------

- kernbench

Average Optimal load -j 96 Run (std deviation):
------------------------------------------------------------------------------
          CFS_HARD_LIMITS=n     CFS_HARD_LIMITS=y      CFS_HARD_LIMITS=y
                                (infinite runtime)     (BW=450000/500000)
------------------------------------------------------------------------------
Elapsed   234.965 (10.1328)     235.93  (8.0893)       270.74 (5.11945)
User      796.605 (62.1617)     787.105 (80.3486)      880.54 (9.33381)
System    802.715 (7.62968)     838.565 (14.5593)      868.23 (10.8894)
% CPU     680     (0)           688.5   (16.2635)      645.5  (4.94975)
CtxSwt    535452  (23273.7)     536321  (27946.3)      567430 (9579.88)
Sleeps    614784  (19538.8)     610256  (17570.2)      626286 (2390.73)
------------------------------------------------------------------------------

Patches description
-------------------
This post has the following patches:

1/7 sched: Rename sched_rt_period_mask() and use it in CFS also
2/7 sched: Bandwidth initialization for fair task groups
3/7 sched: Enforce hard limits by throttling
4/7 sched: Unthrottle the throttled tasks
5/7 sched: Add throttle time statistics to /proc/sched_debug
6/7 sched: CFS runtime borrowing
7/7 sched: Hard limits documentation

Documentation/scheduler/sched-cfs-hard-limits.txt | 48 ++
include/linux/sched.h | 6
init/Kconfig | 13
kernel/sched.c | 339 ++++++++++++++
kernel/sched_debug.c | 17
kernel/sched_fair.c | 464 +++++++++++++++++++-
kernel/sched_rt.c | 45 -
7 files changed, 869 insertions(+), 63 deletions(-)

Regards,
Bharata.


2009-11-17 14:34:16

by Bharata B Rao

Subject: [RFC v4 PATCH 1/7] sched: Rename sched_rt_period_mask() and use it in CFS also

sched: Rename sched_rt_period_mask() and use it in CFS also.

From: Bharata B Rao <[email protected]>

sched_rt_period_mask() is needed in CFS also. Rename it to a generic name
and move it to kernel/sched.c. No functionality change in this patch.

Signed-off-by: Bharata B Rao <[email protected]>
---
kernel/sched.c | 23 +++++++++++++++++++++++
kernel/sched_rt.c | 19 +------------------
2 files changed, 24 insertions(+), 18 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index a455dca..1309e8d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1810,6 +1810,29 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)

static void calc_load_account_active(struct rq *this_rq);

+
+#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_FAIR_GROUP_SCHED)
+
+#ifdef CONFIG_SMP
+static inline const struct cpumask *sched_bw_period_mask(void)
+{
+ return cpu_rq(smp_processor_id())->rd->span;
+}
+#else /* !CONFIG_SMP */
+static inline const struct cpumask *sched_bw_period_mask(void)
+{
+ return cpu_online_mask;
+}
+#endif /* CONFIG_SMP */
+
+#else
+static inline const struct cpumask *sched_bw_period_mask(void)
+{
+ return cpu_online_mask;
+}
+
+#endif
+
#include "sched_stats.h"
#include "sched_idletask.c"
#include "sched_fair.c"
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index a4d790c..97067e1 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -235,18 +235,6 @@ static int rt_se_boosted(struct sched_rt_entity *rt_se)
return p->prio != p->normal_prio;
}

-#ifdef CONFIG_SMP
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
- return cpu_rq(smp_processor_id())->rd->span;
-}
-#else
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
- return cpu_online_mask;
-}
-#endif
-
static inline
struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
{
@@ -296,11 +284,6 @@ static inline int rt_rq_throttled(struct rt_rq *rt_rq)
return rt_rq->rt_throttled;
}

-static inline const struct cpumask *sched_rt_period_mask(void)
-{
- return cpu_online_mask;
-}
-
static inline
struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
{
@@ -518,7 +501,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
return 1;

- span = sched_rt_period_mask();
+ span = sched_bw_period_mask();
for_each_cpu(i, span) {
int enqueue = 0;
struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, i);

2009-11-17 14:34:45

by Bharata B Rao

Subject: [RFC v4 PATCH 2/7] sched: Bandwidth initialization for fair task groups

sched: Bandwidth initialization for fair task groups.

From: Bharata B Rao <[email protected]>

Introduce the notion of hard limiting for CFS groups by bringing in
the concept of runtime and period for them. Add cgroup files to control
runtime and period.

Signed-off-by: Bharata B Rao <[email protected]>
---
init/Kconfig | 13 +++
kernel/sched.c | 277 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 290 insertions(+), 0 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index f515864..fea8cbe 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -477,6 +477,19 @@ config CGROUP_SCHED

endchoice

+config CFS_HARD_LIMITS
+ bool "Hard Limits for CFS Group Scheduler"
+ depends on EXPERIMENTAL
+ depends on FAIR_GROUP_SCHED && CGROUP_SCHED
+ default n
+ help
+ This option enables hard limiting of CPU time obtained by
+ a fair task group. Use this if you want to throttle a group of tasks
+ based on its CPU usage. For more details refer to
+ Documentation/scheduler/sched-cfs-hard-limits.txt
+
+ Say N if unsure.
+
menuconfig CGROUPS
boolean "Control Group support"
help
diff --git a/kernel/sched.c b/kernel/sched.c
index 1309e8d..1d46fdc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -237,6 +237,15 @@ static DEFINE_MUTEX(sched_domains_mutex);

#include <linux/cgroup.h>

+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_CFS_HARD_LIMITS)
+struct cfs_bandwidth {
+ spinlock_t cfs_runtime_lock;
+ ktime_t cfs_period;
+ u64 cfs_runtime;
+ struct hrtimer cfs_period_timer;
+};
+#endif
+
struct cfs_rq;

static LIST_HEAD(task_groups);
@@ -257,6 +266,9 @@ struct task_group {
/* runqueue "owned" by this group on each cpu */
struct cfs_rq **cfs_rq;
unsigned long shares;
+#ifdef CONFIG_CFS_HARD_LIMITS
+ struct cfs_bandwidth cfs_bandwidth;
+#endif
#endif

#ifdef CONFIG_RT_GROUP_SCHED
@@ -445,6 +457,19 @@ struct cfs_rq {
unsigned long rq_weight;
#endif
#endif
+#ifdef CONFIG_CFS_HARD_LIMITS
+ /* set when the group is throttled on this cpu */
+ int cfs_throttled;
+
+ /* runtime currently consumed by the group on this rq */
+ u64 cfs_time;
+
+ /* runtime available to the group on this rq */
+ u64 cfs_runtime;
+
+ /* Protects the cfs runtime related fields of this cfs_rq */
+ spinlock_t cfs_runtime_lock;
+#endif
};

/* Real-Time classes' related field in a runqueue: */
@@ -1833,6 +1858,144 @@ static inline const struct cpumask *sched_bw_period_mask(void)

#endif

+#ifdef CONFIG_FAIR_GROUP_SCHED
+#ifdef CONFIG_CFS_HARD_LIMITS
+
+/*
+ * Runtime allowed for a cfs group before it is hard limited.
+ * default: Infinite which means no hard limiting.
+ */
+u64 sched_cfs_runtime = RUNTIME_INF;
+
+/*
+ * period over which we hard limit the cfs group's bandwidth.
+ * default: 0.5s
+ */
+u64 sched_cfs_period = 500000;
+
+static inline u64 global_cfs_period(void)
+{
+ return sched_cfs_period * NSEC_PER_USEC;
+}
+
+static inline u64 global_cfs_runtime(void)
+{
+ return RUNTIME_INF;
+}
+
+static inline void cfs_rq_runtime_lock(struct cfs_rq *cfs_rq)
+{
+ spin_lock(&cfs_rq->cfs_runtime_lock);
+}
+
+static inline void cfs_rq_runtime_unlock(struct cfs_rq *cfs_rq)
+{
+ spin_unlock(&cfs_rq->cfs_runtime_lock);
+}
+
+/*
+ * Refresh the runtimes of the throttled groups.
+ * But nothing much to do now, will populate this in later patches.
+ */
+static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
+{
+ struct cfs_bandwidth *cfs_b =
+ container_of(timer, struct cfs_bandwidth, cfs_period_timer);
+
+ hrtimer_add_expires_ns(timer, ktime_to_ns(cfs_b->cfs_period));
+ return HRTIMER_RESTART;
+}
+
+/*
+ * TODO: Check if this kind of timer setup is sufficient for cfs or
+ * should we do what rt is doing.
+ */
+static void start_cfs_bandwidth(struct task_group *tg)
+{
+ struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+
+ /*
+ * Timer isn't setup for groups with infinite runtime
+ */
+ if (cfs_b->cfs_runtime == RUNTIME_INF)
+ return;
+
+ if (hrtimer_active(&cfs_b->cfs_period_timer))
+ return;
+
+ hrtimer_start_range_ns(&cfs_b->cfs_period_timer, cfs_b->cfs_period,
+ 0, HRTIMER_MODE_REL);
+}
+
+static void init_cfs_bandwidth(struct task_group *tg)
+{
+ struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+
+ cfs_b->cfs_period = ns_to_ktime(global_cfs_period());
+ cfs_b->cfs_runtime = global_cfs_runtime();
+
+ spin_lock_init(&cfs_b->cfs_runtime_lock);
+
+ hrtimer_init(&cfs_b->cfs_period_timer,
+ CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ cfs_b->cfs_period_timer.function = &sched_cfs_period_timer;
+}
+
+static inline void destroy_cfs_bandwidth(struct task_group *tg)
+{
+ hrtimer_cancel(&tg->cfs_bandwidth.cfs_period_timer);
+}
+
+static void init_cfs_hard_limits(struct cfs_rq *cfs_rq, struct task_group *tg)
+{
+ cfs_rq->cfs_time = 0;
+ cfs_rq->cfs_throttled = 0;
+ cfs_rq->cfs_runtime = tg->cfs_bandwidth.cfs_runtime;
+ spin_lock_init(&cfs_rq->cfs_runtime_lock);
+}
+
+#else /* !CONFIG_CFS_HARD_LIMITS */
+
+static void init_cfs_bandwidth(struct task_group *tg)
+{
+ return;
+}
+
+static inline void destroy_cfs_bandwidth(struct task_group *tg)
+{
+ return;
+}
+
+static void init_cfs_hard_limits(struct cfs_rq *cfs_rq, struct task_group *tg)
+{
+ return;
+}
+
+static inline void cfs_rq_runtime_lock(struct cfs_rq *cfs_rq)
+{
+ return;
+}
+
+static inline void cfs_rq_runtime_unlock(struct cfs_rq *cfs_rq)
+{
+ return;
+}
+
+#endif /* CONFIG_CFS_HARD_LIMITS */
+#else /* !CONFIG_FAIR_GROUP_SCHED */
+
+static inline void cfs_rq_runtime_lock(struct cfs_rq *cfs_rq)
+{
+ return;
+}
+
+static inline void cfs_rq_runtime_unlock(struct cfs_rq *cfs_rq)
+{
+ return;
+}
+
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
#include "sched_stats.h"
#include "sched_idletask.c"
#include "sched_fair.c"
@@ -9286,6 +9449,7 @@ static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
struct rq *rq = cpu_rq(cpu);
tg->cfs_rq[cpu] = cfs_rq;
init_cfs_rq(cfs_rq, rq);
+ init_cfs_hard_limits(cfs_rq, tg);
cfs_rq->tg = tg;
if (add)
list_add(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
@@ -9415,6 +9579,10 @@ void __init sched_init(void)
#endif /* CONFIG_USER_SCHED */
#endif /* CONFIG_RT_GROUP_SCHED */

+#ifdef CONFIG_FAIR_GROUP_SCHED
+ init_cfs_bandwidth(&init_task_group);
+#endif
+
#ifdef CONFIG_GROUP_SCHED
list_add(&init_task_group.list, &task_groups);
INIT_LIST_HEAD(&init_task_group.children);
@@ -9441,6 +9609,7 @@ void __init sched_init(void)
init_cfs_rq(&rq->cfs, rq);
init_rt_rq(&rq->rt, rq);
#ifdef CONFIG_FAIR_GROUP_SCHED
+ init_cfs_hard_limits(&rq->cfs, &init_task_group);
init_task_group.shares = init_task_group_load;
INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
#ifdef CONFIG_CGROUP_SCHED
@@ -9716,6 +9885,7 @@ static void free_fair_sched_group(struct task_group *tg)
{
int i;

+ destroy_cfs_bandwidth(tg);
for_each_possible_cpu(i) {
if (tg->cfs_rq)
kfree(tg->cfs_rq[i]);
@@ -9742,6 +9912,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
if (!tg->se)
goto err;

+ init_cfs_bandwidth(tg);
tg->shares = NICE_0_LOAD;

for_each_possible_cpu(i) {
@@ -10465,6 +10636,100 @@ static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)

return (u64) tg->shares;
}
+
+#ifdef CONFIG_CFS_HARD_LIMITS
+
+static int tg_set_cfs_bandwidth(struct task_group *tg,
+ u64 cfs_period, u64 cfs_runtime)
+{
+ int i;
+
+ spin_lock_irq(&tg->cfs_bandwidth.cfs_runtime_lock);
+ tg->cfs_bandwidth.cfs_period = ns_to_ktime(cfs_period);
+ tg->cfs_bandwidth.cfs_runtime = cfs_runtime;
+
+ for_each_possible_cpu(i) {
+ struct cfs_rq *cfs_rq = tg->cfs_rq[i];
+
+ cfs_rq_runtime_lock(cfs_rq);
+ cfs_rq->cfs_runtime = cfs_runtime;
+ cfs_rq_runtime_unlock(cfs_rq);
+ }
+
+ start_cfs_bandwidth(tg);
+ spin_unlock_irq(&tg->cfs_bandwidth.cfs_runtime_lock);
+ return 0;
+}
+
+int tg_set_cfs_runtime(struct task_group *tg, long cfs_runtime_us)
+{
+ u64 cfs_runtime, cfs_period;
+
+ cfs_period = ktime_to_ns(tg->cfs_bandwidth.cfs_period);
+ cfs_runtime = (u64)cfs_runtime_us * NSEC_PER_USEC;
+ if (cfs_runtime_us < 0)
+ cfs_runtime = RUNTIME_INF;
+
+ return tg_set_cfs_bandwidth(tg, cfs_period, cfs_runtime);
+}
+
+long tg_get_cfs_runtime(struct task_group *tg)
+{
+ u64 cfs_runtime_us;
+
+ if (tg->cfs_bandwidth.cfs_runtime == RUNTIME_INF)
+ return -1;
+
+ cfs_runtime_us = tg->cfs_bandwidth.cfs_runtime;
+ do_div(cfs_runtime_us, NSEC_PER_USEC);
+ return cfs_runtime_us;
+}
+
+int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
+{
+ u64 cfs_runtime, cfs_period;
+
+ cfs_period = (u64)cfs_period_us * NSEC_PER_USEC;
+ cfs_runtime = tg->cfs_bandwidth.cfs_runtime;
+
+ if (cfs_period == 0)
+ return -EINVAL;
+
+ return tg_set_cfs_bandwidth(tg, cfs_period, cfs_runtime);
+}
+
+long tg_get_cfs_period(struct task_group *tg)
+{
+ u64 cfs_period_us;
+
+ cfs_period_us = ktime_to_ns(tg->cfs_bandwidth.cfs_period);
+ do_div(cfs_period_us, NSEC_PER_USEC);
+ return cfs_period_us;
+}
+
+static s64 cpu_cfs_runtime_read_s64(struct cgroup *cgrp, struct cftype *cft)
+{
+ return tg_get_cfs_runtime(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_runtime_write_s64(struct cgroup *cgrp, struct cftype *cftype,
+ s64 cfs_runtime_us)
+{
+ return tg_set_cfs_runtime(cgroup_tg(cgrp), cfs_runtime_us);
+}
+
+static u64 cpu_cfs_period_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+ return tg_get_cfs_period(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
+ u64 cfs_period_us)
+{
+ return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
+}
+
+#endif /* CONFIG_CFS_HARD_LIMITS */
#endif /* CONFIG_FAIR_GROUP_SCHED */

#ifdef CONFIG_RT_GROUP_SCHED
@@ -10498,6 +10763,18 @@ static struct cftype cpu_files[] = {
.read_u64 = cpu_shares_read_u64,
.write_u64 = cpu_shares_write_u64,
},
+#ifdef CONFIG_CFS_HARD_LIMITS
+ {
+ .name = "cfs_runtime_us",
+ .read_s64 = cpu_cfs_runtime_read_s64,
+ .write_s64 = cpu_cfs_runtime_write_s64,
+ },
+ {
+ .name = "cfs_period_us",
+ .read_u64 = cpu_cfs_period_read_u64,
+ .write_u64 = cpu_cfs_period_write_u64,
+ },
+#endif /* CONFIG_CFS_HARD_LIMITS */
#endif
#ifdef CONFIG_RT_GROUP_SCHED
{

2009-11-17 14:35:41

by Bharata B Rao

Subject: [RFC v4 PATCH 3/7] sched: Enforce hard limits by throttling

sched: Enforce hard limits by throttling.

From: Bharata B Rao <[email protected]>

Throttle the task-groups which exceed the runtime allocated to them.
Throttled group entities are removed from the run queue.

Signed-off-by: Bharata B Rao <[email protected]>
---
kernel/sched.c | 10 ++
kernel/sched_fair.c | 221 ++++++++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 210 insertions(+), 21 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 1d46fdc..19069d3 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1631,6 +1631,7 @@ static void update_group_shares_cpu(struct task_group *tg, int cpu,
}
}

+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
/*
* Re-compute the task group their per cpu shares over the given domain.
* This needs to be done in a bottom-up fashion because the rq weight of a
@@ -1658,8 +1659,10 @@ static int tg_shares_up(struct task_group *tg, void *data)
* If there are currently no tasks on the cpu pretend there
* is one of average load so that when a new task gets to
* run here it will not get delayed by group starvation.
+ * Also if the group is throttled on this cpu, pretend that
+ * it has no tasks.
*/
- if (!weight)
+ if (!weight || cfs_rq_throttled(tg->cfs_rq[i]))
weight = NICE_0_LOAD;

rq_weight += weight;
@@ -1994,6 +1997,11 @@ static inline void cfs_rq_runtime_unlock(struct cfs_rq *cfs_rq)
return;
}

+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+ return 0;
+}
+
#endif /* CONFIG_FAIR_GROUP_SCHED */

#include "sched_stats.h"
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c32c3e6..ea7468c 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -189,7 +189,66 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
}
}

-#else /* !CONFIG_FAIR_GROUP_SCHED */
+#ifdef CONFIG_CFS_HARD_LIMITS
+
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+ return cfs_rq->cfs_throttled;
+}
+
+/*
+ * Check if group entity exceeded its runtime. If so, mark the cfs_rq as
+ * throttled and mark the current task for rescheduling.
+ */
+static void sched_cfs_runtime_exceeded(struct sched_entity *se,
+ struct task_struct *tsk_curr, unsigned long delta_exec)
+{
+ struct cfs_rq *cfs_rq;
+
+ cfs_rq = group_cfs_rq(se);
+
+ if (cfs_rq->cfs_runtime == RUNTIME_INF)
+ return;
+
+ cfs_rq->cfs_time += delta_exec;
+
+ if (cfs_rq_throttled(cfs_rq))
+ return;
+
+ if (cfs_rq->cfs_time > cfs_rq->cfs_runtime) {
+ cfs_rq->cfs_throttled = 1;
+ resched_task(tsk_curr);
+ }
+}
+
+static inline void update_curr_group(struct sched_entity *curr,
+ unsigned long delta_exec, struct task_struct *tsk_curr)
+{
+ sched_cfs_runtime_exceeded(curr, tsk_curr, delta_exec);
+}
+
+#else
+
+static inline void update_curr_group(struct sched_entity *curr,
+ unsigned long delta_exec, struct task_struct *tsk_curr)
+{
+ return;
+}
+
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+ return 0;
+}
+
+#endif /* CONFIG_CFS_HARD_LIMITS */
+
+#else /* CONFIG_FAIR_GROUP_SCHED */
+
+static inline void update_curr_group(struct sched_entity *curr,
+ unsigned long delta_exec, struct task_struct *tsk_curr)
+{
+ return;
+}

static inline struct task_struct *task_of(struct sched_entity *se)
{
@@ -489,14 +548,25 @@ __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
update_min_vruntime(cfs_rq);
}

-static void update_curr(struct cfs_rq *cfs_rq)
+static void update_curr_task(struct sched_entity *curr,
+ unsigned long delta_exec)
+{
+ struct task_struct *curtask = task_of(curr);
+
+ trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
+ cpuacct_charge(curtask, delta_exec);
+ account_group_exec_runtime(curtask, delta_exec);
+}
+
+static int update_curr_common(struct cfs_rq *cfs_rq, unsigned long *delta)
{
struct sched_entity *curr = cfs_rq->curr;
- u64 now = rq_of(cfs_rq)->clock;
+ struct rq *rq = rq_of(cfs_rq);
+ u64 now = rq->clock;
unsigned long delta_exec;

if (unlikely(!curr))
- return;
+ return 1;

/*
* Get the amount of time the current task was running
@@ -505,20 +575,47 @@ static void update_curr(struct cfs_rq *cfs_rq)
*/
delta_exec = (unsigned long)(now - curr->exec_start);
if (!delta_exec)
- return;
+ return 1;

__update_curr(cfs_rq, curr, delta_exec);
curr->exec_start = now;
+ *delta = delta_exec;
+ return 0;
+}

- if (entity_is_task(curr)) {
- struct task_struct *curtask = task_of(curr);
+static void update_curr(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *curr = cfs_rq->curr;
+ struct rq *rq = rq_of(cfs_rq);
+ unsigned long delta_exec;

- trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
- cpuacct_charge(curtask, delta_exec);
- account_group_exec_runtime(curtask, delta_exec);
+ if (update_curr_common(cfs_rq, &delta_exec))
+ return ;
+
+ if (entity_is_task(curr))
+ update_curr_task(curr, delta_exec);
+ else {
+ cfs_rq_runtime_lock(group_cfs_rq(curr));
+ update_curr_group(curr, delta_exec, rq->curr);
+ cfs_rq_runtime_unlock(group_cfs_rq(curr));
}
}

+static void update_curr_locked(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *curr = cfs_rq->curr;
+ struct rq *rq = rq_of(cfs_rq);
+ unsigned long delta_exec;
+
+ if (update_curr_common(cfs_rq, &delta_exec))
+ return ;
+
+ if (entity_is_task(curr))
+ update_curr_task(curr, delta_exec);
+ else
+ update_curr_group(curr, delta_exec, rq->curr);
+}
+
static inline void
update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
@@ -740,13 +837,9 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
se->vruntime = vruntime;
}

-static void
-enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int wakeup)
+static void enqueue_entity_common(struct cfs_rq *cfs_rq,
+ struct sched_entity *se, int wakeup)
{
- /*
- * Update run-time statistics of the 'current'.
- */
- update_curr(cfs_rq);
account_entity_enqueue(cfs_rq, se);

if (wakeup) {
@@ -760,6 +853,26 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int wakeup)
__enqueue_entity(cfs_rq, se);
}

+static void enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
+ int wakeup)
+{
+ /*
+ * Update run-time statistics of the 'current'.
+ */
+ update_curr(cfs_rq);
+ enqueue_entity_common(cfs_rq, se, wakeup);
+}
+
+static void enqueue_entity_locked(struct cfs_rq *cfs_rq,
+ struct sched_entity *se, int wakeup)
+{
+ /*
+ * Update run-time statistics of the 'current'.
+ */
+ update_curr_locked(cfs_rq);
+ enqueue_entity_common(cfs_rq, se, wakeup);
+}
+
static void __clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
if (!se || cfs_rq->last == se)
@@ -880,6 +993,32 @@ static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
return se;
}

+/*
+ * Called from put_prev_entity()
+ * If a group entity (@se) is found to be throttled, it will not be put back
+ * on @cfs_rq, which is equivalent to dequeueing it.
+ */
+static int dequeue_throttled_entity(struct cfs_rq *cfs_rq,
+ struct sched_entity *se)
+{
+ struct cfs_rq *gcfs_rq = group_cfs_rq(se);
+
+ if (entity_is_task(se))
+ return 0;
+
+ cfs_rq_runtime_lock(gcfs_rq);
+ if (!cfs_rq_throttled(gcfs_rq) && gcfs_rq->nr_running) {
+ cfs_rq_runtime_unlock(gcfs_rq);
+ return 0;
+ }
+
+ __clear_buddies(cfs_rq, se);
+ account_entity_dequeue(cfs_rq, se);
+ cfs_rq->curr = NULL;
+ cfs_rq_runtime_unlock(gcfs_rq);
+ return 1;
+}
+
static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
{
/*
@@ -891,6 +1030,8 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)

check_spread(cfs_rq, prev);
if (prev->on_rq) {
+ if (dequeue_throttled_entity(cfs_rq, prev))
+ return;
update_stats_wait_start(cfs_rq, prev);
/* Put 'current' back into the tree. */
__enqueue_entity(cfs_rq, prev);
@@ -987,10 +1128,28 @@ static inline void hrtick_update(struct rq *rq)
}
#endif

+static int enqueue_group_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
+ int wakeup)
+{
+ struct cfs_rq *gcfs_rq = group_cfs_rq(se);
+ int ret = 0;
+
+ cfs_rq_runtime_lock(gcfs_rq);
+ if (cfs_rq_throttled(gcfs_rq)) {
+ ret = 1;
+ goto out;
+ }
+ enqueue_entity_locked(cfs_rq, se, wakeup);
+out:
+ cfs_rq_runtime_unlock(gcfs_rq);
+ return ret;
+}
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
* then put the task into the rbtree:
+ * Don't enqueue a throttled entity further into the hierarchy.
*/
static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
{
@@ -1000,11 +1159,15 @@ static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
for_each_sched_entity(se) {
if (se->on_rq)
break;
+
cfs_rq = cfs_rq_of(se);
- enqueue_entity(cfs_rq, se, wakeup);
+ if (entity_is_task(se))
+ enqueue_entity(cfs_rq, se, wakeup);
+ else
+ if (enqueue_group_entity(cfs_rq, se, wakeup))
+ break;
wakeup = 1;
}
-
hrtick_update(rq);
}

@@ -1024,6 +1187,17 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
/* Don't dequeue parent if it has other entities besides us */
if (cfs_rq->load.weight)
break;
+
+ /*
+ * If this cfs_rq is throttled, then it is already
+ * dequeued.
+ */
+ cfs_rq_runtime_lock(cfs_rq);
+ if (cfs_rq_throttled(cfs_rq)) {
+ cfs_rq_runtime_unlock(cfs_rq);
+ break;
+ }
+ cfs_rq_runtime_unlock(cfs_rq);
sleep = 1;
}

@@ -1767,9 +1941,10 @@ load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
u64 rem_load, moved_load;

/*
- * empty group
+ * empty group or throttled group
*/
- if (!busiest_cfs_rq->task_weight)
+ if (!busiest_cfs_rq->task_weight ||
+ cfs_rq_throttled(busiest_cfs_rq))
continue;

rem_load = (u64)rem_load_move * busiest_weight;
@@ -1818,6 +1993,12 @@ move_one_task_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,

for_each_leaf_cfs_rq(busiest, busy_cfs_rq) {
/*
+ * Don't move task from a throttled cfs_rq
+ */
+ if (cfs_rq_throttled(busy_cfs_rq))
+ continue;
+
+ /*
* pass busy_cfs_rq argument into
* load_balance_[start|next]_fair iterators
*/

2009-11-17 14:36:05

by Bharata B Rao

Subject: [RFC v4 PATCH 4/7] sched: Unthrottle the throttled tasks

sched: Unthrottle the throttled tasks.

From: Bharata B Rao <[email protected]>

Refresh runtimes when group's period expires. Unthrottle any
throttled groups at that time. Refreshing runtimes is driven through
a periodic timer.

Signed-off-by: Bharata B Rao <[email protected]>
---
kernel/sched.c | 3 +++
kernel/sched_fair.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 53 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 19069d3..dd56c72 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1886,6 +1886,8 @@ static inline u64 global_cfs_runtime(void)
return RUNTIME_INF;
}

+void do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b);
+
static inline void cfs_rq_runtime_lock(struct cfs_rq *cfs_rq)
{
spin_lock(&cfs_rq->cfs_runtime_lock);
@@ -1905,6 +1907,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
struct cfs_bandwidth *cfs_b =
container_of(timer, struct cfs_bandwidth, cfs_period_timer);

+ do_sched_cfs_period_timer(cfs_b);
hrtimer_add_expires_ns(timer, ktime_to_ns(cfs_b->cfs_period));
return HRTIMER_RESTART;
}
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index ea7468c..3d0f006 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -191,6 +191,13 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)

#ifdef CONFIG_CFS_HARD_LIMITS

+static inline
+struct cfs_rq *sched_cfs_period_cfs_rq(struct cfs_bandwidth *cfs_b, int cpu)
+{
+ return container_of(cfs_b, struct task_group,
+ cfs_bandwidth)->cfs_rq[cpu];
+}
+
static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
{
return cfs_rq->cfs_throttled;
@@ -227,6 +234,49 @@ static inline void update_curr_group(struct sched_entity *curr,
sched_cfs_runtime_exceeded(curr, tsk_curr, delta_exec);
}

+static void enqueue_entity_locked(struct cfs_rq *cfs_rq,
+ struct sched_entity *se, int wakeup);
+
+static void enqueue_throttled_entity(struct rq *rq, struct sched_entity *se)
+{
+ for_each_sched_entity(se) {
+ struct cfs_rq *gcfs_rq = group_cfs_rq(se);
+
+ if (se->on_rq || cfs_rq_throttled(gcfs_rq) ||
+ !gcfs_rq->nr_running)
+ break;
+ enqueue_entity_locked(cfs_rq_of(se), se, 0);
+ }
+}
+
+/*
+ * Refresh runtimes of all cfs_rqs in this group, i.e.,
+ * refresh runtimes of the representative cfs_rq of this
+ * tg on all cpus. Enqueue any throttled entity back.
+ */
+void do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b)
+{
+ int i;
+ const struct cpumask *span = sched_bw_period_mask();
+ unsigned long flags;
+
+ for_each_cpu(i, span) {
+ struct rq *rq = cpu_rq(i);
+ struct cfs_rq *cfs_rq = sched_cfs_period_cfs_rq(cfs_b, i);
+ struct sched_entity *se = cfs_rq->tg->se[i];
+
+ spin_lock_irqsave(&rq->lock, flags);
+ cfs_rq_runtime_lock(cfs_rq);
+ cfs_rq->cfs_time = 0;
+ if (cfs_rq_throttled(cfs_rq)) {
+ cfs_rq->cfs_throttled = 0;
+ enqueue_throttled_entity(rq, se);
+ }
+ cfs_rq_runtime_unlock(cfs_rq);
+ spin_unlock_irqrestore(&rq->lock, flags);
+ }
+}
+
#else

static inline void update_curr_group(struct sched_entity *curr,
@@ -310,7 +360,6 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)

#endif /* CONFIG_FAIR_GROUP_SCHED */

-
/**************************************************************
* Scheduling class tree data structure manipulation methods:
*/

2009-11-17 14:36:43

by Bharata B Rao

Subject: [RFC v4 PATCH 5/7] sched: Add throttle time statistics to /proc/sched_debug

sched: Add throttle time statistics to /proc/sched_debug

From: Bharata B Rao <[email protected]>

With hard limits, provide stats about throttle time, throttle count
and max throttle time for group sched entities in /proc/sched_debug.
Throttle stats are collected only for group entities.

Signed-off-by: Bharata B Rao <[email protected]>
---
include/linux/sched.h | 6 ++++++
kernel/sched_debug.c | 17 ++++++++++++++++-
kernel/sched_fair.c | 20 ++++++++++++++++++++
3 files changed, 42 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 75e6e60..b7f238c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1183,6 +1183,12 @@ struct sched_entity {
u64 nr_wakeups_affine_attempts;
u64 nr_wakeups_passive;
u64 nr_wakeups_idle;
+#ifdef CONFIG_CFS_HARD_LIMITS
+ u64 throttle_start;
+ u64 throttle_max;
+ u64 throttle_count;
+ u64 throttle_sum;
+#endif
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index efb8440..a8f24fb 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -80,6 +80,11 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu,
PN(se->wait_max);
PN(se->wait_sum);
P(se->wait_count);
+#ifdef CONFIG_CFS_HARD_LIMITS
+ PN(se->throttle_max);
+ PN(se->throttle_sum);
+ P(se->throttle_count);
+#endif
#endif
P(se->load.weight);
#undef PN
@@ -214,6 +219,16 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
#ifdef CONFIG_SMP
SEQ_printf(m, " .%-30s: %lu\n", "shares", cfs_rq->shares);
#endif
+#ifdef CONFIG_CFS_HARD_LIMITS
+ spin_lock_irqsave(&rq->lock, flags);
+ SEQ_printf(m, " .%-30s: %d\n", "cfs_throttled",
+ cfs_rq->cfs_throttled);
+ SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "cfs_time",
+ SPLIT_NS(cfs_rq->cfs_time));
+ SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "cfs_runtime",
+ SPLIT_NS(cfs_rq->cfs_runtime));
+ spin_unlock_irqrestore(&rq->lock, flags);
+#endif /* CONFIG_CFS_HARD_LIMITS */
print_cfs_group_stats(m, cpu, cfs_rq->tg);
#endif
}
@@ -310,7 +325,7 @@ static int sched_debug_show(struct seq_file *m, void *v)
u64 now = ktime_to_ns(ktime_get());
int cpu;

- SEQ_printf(m, "Sched Debug Version: v0.09, %s %.*s\n",
+ SEQ_printf(m, "Sched Debug Version: v0.10, %s %.*s\n",
init_utsname()->release,
(int)strcspn(init_utsname()->version, " "),
init_utsname()->version);
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 3d0f006..c57ca54 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -191,6 +191,23 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)

#ifdef CONFIG_CFS_HARD_LIMITS

+static inline void update_stats_throttle_start(struct cfs_rq *cfs_rq,
+ struct sched_entity *se)
+{
+ schedstat_set(se->throttle_start, rq_of(cfs_rq)->clock);
+}
+
+static inline void update_stats_throttle_end(struct cfs_rq *cfs_rq,
+ struct sched_entity *se)
+{
+ schedstat_set(se->throttle_max, max(se->throttle_max,
+ rq_of(cfs_rq)->clock - se->throttle_start));
+ schedstat_set(se->throttle_count, se->throttle_count + 1);
+ schedstat_set(se->throttle_sum, se->throttle_sum +
+ rq_of(cfs_rq)->clock - se->throttle_start);
+ schedstat_set(se->throttle_start, 0);
+}
+
static inline
struct cfs_rq *sched_cfs_period_cfs_rq(struct cfs_bandwidth *cfs_b, int cpu)
{
@@ -224,6 +241,7 @@ static void sched_cfs_runtime_exceeded(struct sched_entity *se,

if (cfs_rq->cfs_time > cfs_rq->cfs_runtime) {
cfs_rq->cfs_throttled = 1;
+ update_stats_throttle_start(cfs_rq, se);
resched_task(tsk_curr);
}
}
@@ -269,6 +287,8 @@ void do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b)
cfs_rq_runtime_lock(cfs_rq);
cfs_rq->cfs_time = 0;
if (cfs_rq_throttled(cfs_rq)) {
+ update_rq_clock(rq);
+ update_stats_throttle_end(cfs_rq, se);
cfs_rq->cfs_throttled = 0;
enqueue_throttled_entity(rq, se);
}

2009-11-17 14:37:24

by Bharata B Rao

Subject: [RFC v4 PATCH 6/7] sched: Rebalance cfs runtimes

sched: CFS runtime borrowing

From: Bharata B Rao <[email protected]>

Before throttling a group, try to borrow runtime from groups that have excess.

To start with, a group gets equal runtime on every cpu. If the group doesn't
have tasks on all cpus, it might get throttled on some cpus while it still has
runtime left on other cpus where it doesn't have any tasks to consume that
runtime. Hence there is a chance to borrow runtime from such cpus/cfs_rqs and
give it to the cpus/cfs_rqs where it is needed.

CHECK: RT seems to be handling runtime initialization/reclaim during hotplug
from multiple places (migration_call, update_runtime). Need to check if CFS
also needs to do the same.

Signed-off-by: Kamalesh Babulal <[email protected]>
---
kernel/sched.c | 26 ++++++++
kernel/sched_fair.c | 172 +++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched_rt.c | 26 +-------
3 files changed, 202 insertions(+), 22 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index dd56c72..ead02ca 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9328,6 +9328,32 @@ static int update_sched_domains(struct notifier_block *nfb,
}
#endif

+#ifdef CONFIG_SMP
+static void disable_runtime(struct rq *rq)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&rq->lock, flags);
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_CFS_HARD_LIMITS)
+ disable_runtime_cfs(rq);
+#endif
+ disable_runtime_rt(rq);
+ spin_unlock_irqrestore(&rq->lock, flags);
+}
+
+static void enable_runtime(struct rq *rq)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&rq->lock, flags);
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_CFS_HARD_LIMITS)
+ enable_runtime_cfs(rq);
+#endif
+ enable_runtime_rt(rq);
+ spin_unlock_irqrestore(&rq->lock, flags);
+}
+#endif
+
static int update_runtime(struct notifier_block *nfb,
unsigned long action, void *hcpu)
{
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c57ca54..6b254b8 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -220,6 +220,175 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
return cfs_rq->cfs_throttled;
}

+#ifdef CONFIG_SMP
+/*
+ * Ensure this RQ takes back all the runtime it lend to its neighbours.
+ */
+static void disable_runtime_cfs(struct rq *rq)
+{
+ struct root_domain *rd = rq->rd;
+ struct cfs_rq *cfs_rq;
+
+ if (unlikely(!scheduler_running))
+ return;
+
+ for_each_leaf_cfs_rq(rq, cfs_rq) {
+ struct cfs_bandwidth *cfs_b = &cfs_rq->tg->cfs_bandwidth;
+ s64 want;
+ int i;
+
+ spin_lock(&cfs_b->cfs_runtime_lock);
+ spin_lock(&cfs_rq->cfs_runtime_lock);
+
+ /*
+ * Either we're all infinite and nobody needs to borrow,
+ * or we're already disabled and thus have nothing to do, or
+ * we have exactly the right amount of runtime to take out.
+ */
+ if (cfs_rq->cfs_runtime == RUNTIME_INF ||
+ cfs_rq->cfs_runtime == cfs_b->cfs_runtime)
+ goto balanced;
+ spin_unlock(&cfs_rq->cfs_runtime_lock);
+
+ /*
+ * Calculate the difference between what we started out with
+ * and what we current have, that's the amount of runtime
+ * we lend and now have to reclaim.
+ */
+ want = cfs_b->cfs_runtime - cfs_rq->cfs_runtime;
+
+ /*
+ * Greedy reclaim, take back as much as possible.
+ */
+ for_each_cpu(i, rd->span) {
+ struct cfs_rq *iter = sched_cfs_period_cfs_rq(cfs_b, i);
+ s64 diff;
+
+ /*
+ * Can't reclaim from ourselves or disabled runqueues.
+ */
+ if (iter == cfs_rq || iter->cfs_runtime == RUNTIME_INF)
+ continue;
+
+ spin_lock(&iter->cfs_runtime_lock);
+ if (want > 0) {
+ diff = min_t(s64, iter->cfs_runtime, want);
+ iter->cfs_runtime -= diff;
+ want -= diff;
+ } else {
+ iter->cfs_runtime -= want;
+ want -= want;
+ }
+
+ spin_unlock(&iter->cfs_runtime_lock);
+ if (!want)
+ break;
+ }
+
+ spin_lock(&cfs_rq->cfs_runtime_lock);
+ /*
+ * We cannot be left wanting - that would mean some
+ * runtime leaked out of the system.
+ */
+ BUG_ON(want);
+balanced:
+ /*
+ * Disable all the borrow logic by pretending we have infinite
+ * runtime - in which case borrowing doesn't make sense.
+ */
+ cfs_rq->cfs_runtime = RUNTIME_INF;
+ spin_unlock(&cfs_rq->cfs_runtime_lock);
+ spin_unlock(&cfs_b->cfs_runtime_lock);
+ }
+}
+
+static void enable_runtime_cfs(struct rq *rq)
+{
+ struct cfs_rq *cfs_rq;
+
+ if (unlikely(!scheduler_running))
+ return;
+
+ /*
+ * Reset each runqueue's bandwidth settings
+ */
+ for_each_leaf_cfs_rq(rq, cfs_rq) {
+ struct cfs_bandwidth *cfs_b = &cfs_rq->tg->cfs_bandwidth;
+
+ spin_lock(&cfs_b->cfs_runtime_lock);
+ spin_lock(&cfs_rq->cfs_runtime_lock);
+ cfs_rq->cfs_runtime = cfs_b->cfs_runtime;
+ cfs_rq->cfs_time = 0;
+ cfs_rq->cfs_throttled = 0;
+ spin_unlock(&cfs_rq->cfs_runtime_lock);
+ spin_unlock(&cfs_b->cfs_runtime_lock);
+ }
+}
+
+/*
+ * Ran out of runtime, check if we can borrow some from others
+ * instead of getting throttled right away.
+ */
+static void do_cfs_balance_runtime(struct cfs_rq *cfs_rq)
+{
+ struct cfs_bandwidth *cfs_b = &cfs_rq->tg->cfs_bandwidth;
+ const struct cpumask *span = sched_bw_period_mask();
+ int i, weight;
+ u64 cfs_period;
+
+ weight = cpumask_weight(span);
+ spin_lock(&cfs_b->cfs_runtime_lock);
+ cfs_period = ktime_to_ns(cfs_b->cfs_period);
+
+ for_each_cpu(i, span) {
+ struct cfs_rq *borrow_cfs_rq =
+ sched_cfs_period_cfs_rq(cfs_b, i);
+ s64 diff;
+
+ if (borrow_cfs_rq == cfs_rq)
+ continue;
+
+ cfs_rq_runtime_lock(borrow_cfs_rq);
+ if (borrow_cfs_rq->cfs_runtime == RUNTIME_INF) {
+ cfs_rq_runtime_unlock(borrow_cfs_rq);
+ continue;
+ }
+
+ diff = borrow_cfs_rq->cfs_runtime - borrow_cfs_rq->cfs_time;
+ if (diff > 0) {
+ diff = div_u64((u64)diff, weight);
+ if (cfs_rq->cfs_runtime + diff > cfs_period)
+ diff = cfs_period - cfs_rq->cfs_runtime;
+ borrow_cfs_rq->cfs_runtime -= diff;
+ cfs_rq->cfs_runtime += diff;
+ if (cfs_rq->cfs_runtime == cfs_period) {
+ cfs_rq_runtime_unlock(borrow_cfs_rq);
+ break;
+ }
+ }
+ cfs_rq_runtime_unlock(borrow_cfs_rq);
+ }
+ spin_unlock(&cfs_b->cfs_runtime_lock);
+}
+
+/*
+ * Called with rq->runtime_lock held.
+ */
+static void cfs_balance_runtime(struct cfs_rq *cfs_rq)
+{
+ cfs_rq_runtime_unlock(cfs_rq);
+ do_cfs_balance_runtime(cfs_rq);
+ cfs_rq_runtime_lock(cfs_rq);
+}
+
+#else /* !CONFIG_SMP */
+
+static void cfs_balance_runtime(struct cfs_rq *cfs_rq)
+{
+ return;
+}
+#endif /* CONFIG_SMP */
+
/*
* Check if group entity exceeded its runtime. If so, mark the cfs_rq as
* throttled and mark the current task for rescheduling.
@@ -239,6 +408,9 @@ static void sched_cfs_runtime_exceeded(struct sched_entity *se,
if (cfs_rq_throttled(cfs_rq))
return;

+ if (cfs_rq->cfs_time > cfs_rq->cfs_runtime)
+ cfs_balance_runtime(cfs_rq);
+
if (cfs_rq->cfs_time > cfs_rq->cfs_runtime) {
cfs_rq->cfs_throttled = 1;
update_stats_throttle_start(cfs_rq, se);
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 97067e1..edcea9b 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -356,7 +356,7 @@ next:
/*
* Ensure this RQ takes back all the runtime it lend to its neighbours.
*/
-static void __disable_runtime(struct rq *rq)
+static void disable_runtime_rt(struct rq *rq)
{
struct root_domain *rd = rq->rd;
struct rt_rq *rt_rq;
@@ -433,16 +433,7 @@ balanced:
}
}

-static void disable_runtime(struct rq *rq)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&rq->lock, flags);
- __disable_runtime(rq);
- spin_unlock_irqrestore(&rq->lock, flags);
-}
-
-static void __enable_runtime(struct rq *rq)
+static void enable_runtime_rt(struct rq *rq)
{
struct rt_rq *rt_rq;

@@ -465,15 +456,6 @@ static void __enable_runtime(struct rq *rq)
}
}

-static void enable_runtime(struct rq *rq)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&rq->lock, flags);
- __enable_runtime(rq);
- spin_unlock_irqrestore(&rq->lock, flags);
-}
-
static int balance_runtime(struct rt_rq *rt_rq)
{
int more = 0;
@@ -1547,7 +1529,7 @@ static void rq_online_rt(struct rq *rq)
if (rq->rt.overloaded)
rt_set_overload(rq);

- __enable_runtime(rq);
+ enable_runtime_rt(rq);

cpupri_set(&rq->rd->cpupri, rq->cpu, rq->rt.highest_prio.curr);
}
@@ -1558,7 +1540,7 @@ static void rq_offline_rt(struct rq *rq)
if (rq->rt.overloaded)
rt_clear_overload(rq);

- __disable_runtime(rq);
+ disable_runtime_rt(rq);

cpupri_set(&rq->rd->cpupri, rq->cpu, CPUPRI_INVALID);
}

2009-11-17 14:37:58

by Bharata B Rao

Subject: [RFC v4 PATCH 7/7] sched: Hard limits documentation

sched: Hard limits documentation

From: Bharata B Rao <[email protected]>

Documentation for hard limits feature.

Signed-off-by: Bharata B Rao <[email protected]>
---
Documentation/scheduler/sched-cfs-hard-limits.txt | 48 +++++++++++++++++++++
1 files changed, 48 insertions(+), 0 deletions(-)
create mode 100644 Documentation/scheduler/sched-cfs-hard-limits.txt

diff --git a/Documentation/scheduler/sched-cfs-hard-limits.txt b/Documentation/scheduler/sched-cfs-hard-limits.txt
new file mode 100644
index 0000000..d6387af
--- /dev/null
+++ b/Documentation/scheduler/sched-cfs-hard-limits.txt
@@ -0,0 +1,48 @@
+CPU HARD LIMITS FOR CFS GROUPS
+==============================
+
+1. Overview
+2. Interface
+3. Examples
+
+1. Overview
+-----------
+
+CFS is a proportional share scheduler which tries to divide the CPU time
+proportionately between tasks or groups of tasks (task group/cgroup) depending
+on the priority/weight of the task or shares assigned to groups of tasks.
+In CFS, a task/task group can get more than its share of CPU if there are
+enough idle CPU cycles available in the system, due to the work conserving
+nature of the scheduler. However in certain scenarios (like pay-per-use),
+it is desirable not to provide extra time to a group even in the presence
+of idle CPU cycles. This is where hard limiting can be of use.
+
+Hard limits for task groups can be set by specifying how much CPU runtime a
+group can consume within a given period. If the group consumes more CPU time
+than the runtime in a given period, it gets throttled. None of the tasks of
+the throttled group gets to run until the runtime of the group gets refreshed
+at the beginning of the next period.
+
+2. Interface
+------------
+
+Hard limit feature adds 2 cgroup files for CFS group scheduler:
+
+cfs_runtime_us: Hard limit for the group in microseconds.
+
+cfs_period_us: Time period in microseconds within which the hard limit is
+enforced.
+
+A group gets created with default values for runtime (infinite runtime which
+means hard limits disabled) and period (0.5s). Each group can set its own
+values for runtime and period independent of other groups in the system.
+
+3. Examples
+-----------
+
+# mount -t cgroup -ocpu none /cgroups/
+# cd /cgroups
+# mkdir 1
+# cd 1/
+# echo 250000 > cfs_runtime_us /* set a 250ms runtime or limit */
+# echo 500000 > cfs_period_us /* set a 500ms period */
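
For completeness, a minimal userspace sketch of setting the same two values
from C instead of the shell. The /cgroups/1 paths simply follow the example
above; writing a negative value to cfs_runtime_us falls back to infinite
runtime, i.e. disables hard limiting (see tg_set_cfs_runtime() in patch 2/7).

#include <stdio.h>
#include <stdlib.h>

/* write a single integer value to a cgroup control file */
static void write_val(const char *path, long long val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fprintf(f, "%lld\n", val);
	fclose(f);
}

int main(void)
{
	write_val("/cgroups/1/cfs_runtime_us", 250000);	/* 250ms runtime */
	write_val("/cgroups/1/cfs_period_us", 500000);	/* 500ms period */

	/* write_val("/cgroups/1/cfs_runtime_us", -1); would disable the limit */
	return 0;
}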

2009-12-04 16:10:42

by Peter Zijlstra

Subject: Re: [RFC v4 PATCH 2/7] sched: Bandwidth initialization for fair task groups

On Tue, 2009-11-17 at 20:04 +0530, Bharata B Rao wrote:

> +++ b/kernel/sched.c
> @@ -237,6 +237,15 @@ static DEFINE_MUTEX(sched_domains_mutex);
>
> #include <linux/cgroup.h>
>
> +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_CFS_HARD_LIMITS)
> +struct cfs_bandwidth {
> + spinlock_t cfs_runtime_lock;
> + ktime_t cfs_period;
> + u64 cfs_runtime;
> + struct hrtimer cfs_period_timer;
> +};
> +#endif
> +
> struct cfs_rq;
>
> static LIST_HEAD(task_groups);

So what's wrong with using struct rt_bandwidth, aside from the name?
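
For reference, the existing definition in kernel/sched.c (as of 2.6.32) that
the question points at looks roughly like the following; apart from the rt_
prefix it matches the proposed cfs_bandwidth field for field:

struct rt_bandwidth {
	/* nests inside the rq lock: */
	spinlock_t		rt_runtime_lock;
	ktime_t			rt_period;
	u64			rt_runtime;
	struct hrtimer		rt_period_timer;
};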

> @@ -445,6 +457,19 @@ struct cfs_rq {
> unsigned long rq_weight;
> #endif
> #endif
> +#ifdef CONFIG_CFS_HARD_LIMITS
> + /* set when the group is throttled on this cpu */
> + int cfs_throttled;
> +
> + /* runtime currently consumed by the group on this rq */
> + u64 cfs_time;
> +
> + /* runtime available to the group on this rq */
> + u64 cfs_runtime;
> +
> + /* Protects the cfs runtime related fields of this cfs_rq */
> + spinlock_t cfs_runtime_lock;
> +#endif
> };

If you put these 4 in a new struct, say rq_bandwidth, and also use that
for rt_rq, then I bet you can write patch 6 with a lot less copy/paste
action.
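
A rough sketch of that suggestion (not code from the posted series; the name
rq_bandwidth is the one proposed above, and the fields and their comments are
taken from the quoted hunk):

/*
 * Per-rq bandwidth state, embeddable in both cfs_rq and rt_rq so that
 * the throttling/borrowing helpers can be written once against this
 * type instead of being duplicated in sched_fair.c and sched_rt.c.
 */
struct rq_bandwidth {
	/* set when the group is throttled on this cpu */
	int		throttled;
	/* runtime currently consumed by the group on this rq */
	u64		time;
	/* runtime available to the group on this rq */
	u64		runtime;
	/* protects the runtime related fields above */
	spinlock_t	runtime_lock;
};

struct cfs_rq and struct rt_rq would then each carry a struct rq_bandwidth
member, and the disable/enable/balance routines of patch 6 could take a
struct rq_bandwidth * rather than a cfs_rq or rt_rq.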

2009-12-04 16:11:13

by Peter Zijlstra

Subject: Re: [RFC v4 PATCH 3/7] sched: Enforce hard limits by throttling

On Tue, 2009-11-17 at 20:05 +0530, Bharata B Rao wrote:
> sched: Enforce hard limits by throttling.
>
> From: Bharata B Rao <[email protected]>
>
> Throttle the task-groups which exceed the runtime allocated to them.
> Throttled group entities are removed from the run queue.

This patch is just vile, all those _locked variants should really go.

Nor is it entirely clear why they're there.

2009-12-04 16:10:59

by Peter Zijlstra

Subject: Re: [RFC v4 PATCH 2/7] sched: Bandwidth initialization for fair task groups

On Tue, 2009-11-17 at 20:04 +0530, Bharata B Rao wrote:
> +/*
> + * TODO: Check if this kind of timer setup is sufficient for cfs or
> + * should we do what rt is doing.
> + */
> +static void start_cfs_bandwidth(struct task_group *tg)

is there a reason not to do as rt does?

2009-12-04 16:10:43

by Peter Zijlstra

Subject: Re: [RFC v4 PATCH 6/7] sched: Rebalance cfs runtimes

On Tue, 2009-11-17 at 20:07 +0530, Bharata B Rao wrote:
> sched: CFS runtime borrowing
>
> From: Bharata B Rao <[email protected]>
>
> Before throttling a group, try to borrow runtime from groups that have excess.
>
> To start with, a group will get equal runtime on every cpu. If the group doesn't
> have tasks on all cpus, it might get throttled on some cpus while it still has
> runtime left on other cpus where it doesn't have any tasks to consume that
> runtime. Hence there is a chance to borrow runtimes from such cpus/cfs_rqs to
> cpus/cfs_rqs where it is required.
>
> CHECK: RT seems to be handling runtime initialization/reclaim during hotplug
> from multiple places (migration_call, update_runtime). Need to check if CFS
> also needs to do the same.
>
> Signed-off-by: Kamalesh Babulal <[email protected]>
> ---
> kernel/sched.c | 26 ++++++++
> kernel/sched_fair.c | 172 +++++++++++++++++++++++++++++++++++++++++++++++++++
> kernel/sched_rt.c | 26 +-------
> 3 files changed, 202 insertions(+), 22 deletions(-)

I think that if we unify the se/rq bandwidth structures a lot of copy
and paste can be avoided, resulting in an over-all much easier to
maintain code-base.

2009-12-05 13:02:16

by Bharata B Rao

Subject: Re: [RFC v4 PATCH 3/7] sched: Enforce hard limits by throttling

On Fri, Dec 04, 2009 at 05:09:55PM +0100, Peter Zijlstra wrote:
> On Tue, 2009-11-17 at 20:05 +0530, Bharata B Rao wrote:
> > sched: Enforce hard limits by throttling.
> >
> > From: Bharata B Rao <[email protected]>
> >
> > Throttle the task-groups which exceed the runtime allocated to them.
> > Throttled group entities are removed from the run queue.
>
> This patch is just vile, all those _locked variants should really go.
>
> Nor is it entirely clear why they're there.
>

update_curr() is the place where I check if the group has exceeded its
runtime, and it needs to take cfs_rq->cfs_runtime_lock. However, there are
2 places from where update_curr() gets called with cfs_rq->cfs_runtime_lock
already held. Hence the _locked() version of update_curr() exists.

These two call paths are both enqueue paths (enqueue_task_fair() and the
enqueue during unthrottling when the period timer fires). Hence the _locked()
versions of enqueue_entity() exist.

I see that you don't have this sort of requirement (of holding
rt_rq->rt_runtime_lock) in rt. I will recheck this to see why you are
able to do that in rt and I can't in cfs. If convinced, I shall get rid of
the _locked versions.

Thanks Peter for taking time to review the hard limit patches.

Regards,
Bharata.

2009-12-05 13:04:50

by Bharata B Rao

Subject: Re: [RFC v4 PATCH 2/7] sched: Bandwidth initialization for fair task groups

On Fri, Dec 04, 2009 at 05:09:57PM +0100, Peter Zijlstra wrote:
> On Tue, 2009-11-17 at 20:04 +0530, Bharata B Rao wrote:
> > +/*
> > + * TODO: Check if this kind of timer setup is sufficient for cfs or
> > + * should we do what rt is doing.
> > + */
> > +static void start_cfs_bandwidth(struct task_group *tg)
>
> is there a reason not to do as rt does?

Not really, rt seems to have an elaborate mechanism to check for overruns etc.
I was in two minds about whether I need the same for cfs; maybe I will follow
rt here also.
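
Following rt here would look roughly like the sketch below: re-arm with
hrtimer_forward() and account overruns instead of just pushing the expiry one
period ahead. This is only a sketch and assumes do_sched_cfs_period_timer()
is extended to take the overrun count and to report whether all cfs_rqs were
idle; neither is true of the version in patch 4/7.

static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
{
	struct cfs_bandwidth *cfs_b =
		container_of(timer, struct cfs_bandwidth, cfs_period_timer);
	ktime_t now;
	int overrun;
	int idle = 0;

	for (;;) {
		now = hrtimer_cb_get_time(timer);
		overrun = hrtimer_forward(timer, now, cfs_b->cfs_period);

		if (!overrun)
			break;

		/* assumed extension: pass overrun, get back an idle indication */
		idle = do_sched_cfs_period_timer(cfs_b, overrun);
	}

	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
}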

Regards,
Bharata.

2009-12-05 13:08:38

by Bharata B Rao

Subject: Re: [RFC v4 PATCH 6/7] sched: Rebalance cfs runtimes

On Fri, Dec 04, 2009 at 05:09:58PM +0100, Peter Zijlstra wrote:
> On Tue, 2009-11-17 at 20:07 +0530, Bharata B Rao wrote:
> > sched: CFS runtime borrowing
> >
> > From: Bharata B Rao <[email protected]>
> >
> > Before throttling a group, try to borrow runtime from groups that have excess.
> >
> > To start with, a group will get equal runtime on every cpu. If the group doesn't
> > have tasks on all cpus, it might get throttled on some cpus while it still has
> > runtime left on other cpus where it doesn't have any tasks to consume that
> > runtime. Hence there is a chance to borrow runtimes from such cpus/cfs_rqs to
> > cpus/cfs_rqs where it is required.
> >
> > CHECK: RT seems to be handling runtime initialization/reclaim during hotplug
> > from multiple places (migration_call, update_runtime). Need to check if CFS
> > also needs to do the same.
> >
> > Signed-off-by: Kamalesh Babulal <[email protected]>
> > ---
> > kernel/sched.c | 26 ++++++++
> > kernel/sched_fair.c | 172 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > kernel/sched_rt.c | 26 +-------
> > 3 files changed, 202 insertions(+), 22 deletions(-)
>
> I think that if we unify the se/rq bandwidth structures a lot of copy
> and paste can be avoided, resulting in an over-all much easier to
> maintain code-base.

As you note here and also in reply to 2/7, I could definitely unify some
of the bandwidth handling code between rt and cfs. In fact I already have
patches from Dhaval for this. I was holding them back till now since I wanted
to show what cfs hard limits look like and what changes they involve, and to
get review feedback on them. As I have said in my earlier posts, I will
eventually merge this code with rt. I shall do this starting from the next post.

Regards,
Bharata.