2010-02-13 02:56:01

by Paul Turner

Subject: [RFC PATCH v1 0/4] CFS Bandwidth Control

Hi all,

Please find attached an early patchset for an alternate approach to CFS
bandwidth provisioning. Bharata's excellent original RFC motivating discussion
on this topic can be found at: http://lkml.org/lkml/2009/6/4/24.

As discussed in reply to the latest iteration of Bharata's patchset (at:
http://lkml.org/lkml/2010/1/5/44) we have significant concerns with the
adaptation of the RT-bandwidth code to the general case. In that discussion
an alternate approach was proposed in which a global per-taskgroup pool
is maintained, with runqueues acquiring time directly from it (as opposed to
the RT all-rq-mapping-to-all-rq scheme). In particular we felt the original
approach did not scale well with an increasing number of cores or with small
reservations.

The skeleton of our approach is as follows (a toy model of this flow is
sketched below):
- As above, we maintain a global, per-tg pool of unassigned quota. On it
we track the bandwidth period, the quota per period, and the runtime remaining
in the current period. As bandwidth is used within a period it is decremented
from runtime. Runtime is currently synchronized using a spinlock; in the
current implementation there's no reason this couldn't be done using
atomic ops instead, however the spinlock allows for a little more flexibility
in experimentation with other schemes.
- When a cfs_rq participating in a bandwidth-constrained task_group executes,
it acquires time in sysctl_sched_cfs_bandwidth_slice (default currently
10ms) sized chunks from the global pool; this synchronizes under rq->lock and
is part of the update_curr path.
- Throttled entities are dequeued immediately (as opposed to delaying this
operation to the put path); this avoids some potentially poor load-balancer
interactions and preserves the 'verbiage' of the put_task semantic.
Throttled entities are gated from participating in the tree at the
{enqueue, dequeue}_entity level. They are also skipped for load
balance in the same manner as Bharata's patch series employs.
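
As a reading aid only, here is a small userspace toy model of the global-pool /
local-slice flow described above. Everything in it (names, constants, the fake
scheduling loop) is illustrative and is NOT the kernel implementation; the real
code follows in patches 2 and 3.

#include <stdio.h>
#include <stdint.h>

#define NR_CPUS  4
#define SLICE_NS 10000000ULL  /* 10ms, mirrors the default bandwidth slice */

struct global_pool { uint64_t quota, runtime; };
struct local_pool  { uint64_t assigned, used; int throttled; };

/* grab up to one slice from the global pool; returns what was actually granted */
static uint64_t acquire_slice(struct global_pool *g)
{
	uint64_t delta = g->runtime < SLICE_NS ? g->runtime : SLICE_NS;

	g->runtime -= delta;
	return delta;
}

/* charge delta_exec against the local pool, refilling from the global pool */
static void account(struct global_pool *g, struct local_pool *l, uint64_t delta_exec)
{
	l->used += delta_exec;
	if (l->used < l->assigned)
		return;
	l->assigned += acquire_slice(g);
	if (l->used >= l->assigned)
		l->throttled = 1;  /* the kernel would dequeue the entity here */
}

int main(void)
{
	/* 50ms of quota per period for the whole group */
	struct global_pool g = { .quota = 50000000ULL, .runtime = 50000000ULL };
	struct local_pool cpu[NR_CPUS] = { 0 };
	int i, t;

	/* each cpu tries to run 4 x 5ms of work within the period */
	for (t = 0; t < 4; t++)
		for (i = 0; i < NR_CPUS; i++)
			if (!cpu[i].throttled)
				account(&g, &cpu[i], 5000000ULL);

	for (i = 0; i < NR_CPUS; i++)
		printf("cpu%d: used=%llums throttled=%d\n", i,
		       (unsigned long long)(cpu[i].used / 1000000), cpu[i].throttled);
	return 0;
}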

Interface:
----------
Two new cgroupfs files are added to the cpu subsystem:
- cpu.cfs_period_us : period over which bandwidth is to be regulated
- cpu.cfs_quota_us : bandwidth available for consumption per period

One important interface change that this introduces (versus the rate limits
proposal) is that the defined bandwidth becomes an absolute quantifier.

e.g. a bandwidth of 5 seconds (cpu.cfs_quota_us=5000000) on a period of 1 second
(cpu.cfs_period_us=1000000) would result in 5 wall seconds of cpu time being
consumable every 1 wall second.
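
For illustration, a minimal (hypothetical) user-space sketch of driving this
interface follows; the /cgroup/cpu mount point and the group name "grp" are
assumptions on my part, only the two file names above are defined by this
series.

#include <stdio.h>

/* write a single value to a cgroupfs file, returning 0 on success */
static int write_val(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	/* 1s period with 5s of quota per period, i.e. ~5 cpus worth of time */
	write_val("/cgroup/cpu/grp/cpu.cfs_period_us", "1000000");
	write_val("/cgroup/cpu/grp/cpu.cfs_quota_us", "5000000");
	return 0;
}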

Benchmarks:
-----------
In the attached benchmarks we compared the two approaches with each other and
the third 'ideal' case of using cpu affinities to restrict cpus to an
equivalent rate. Tests were performed on a 16-core AMD machine with no
restrictions on affinity [bandwidth = cpu_rate / period].

Pardon the wall of raw results that follows; if you prefer to view the data in
graph format it's available at:
http://docs.google.com/View?id=dgjvsb78_6dhktmzhs

Workload 1: while(1)x16
Measured: % Utilization

period=100ms, workload=while(1)x16
bandwidth   affinity %busy   cfs-bandwidth %busy (pct%)   cfs-hardlimits %busy (pct%)
1 6.28 6.67 (+6.2%) 6.87 (+9.4%)
2 12.57 13.20 (+5.0%) 11.67 (-7.2%)
3 18.80 19.42 (+3.3%) 16.13 (-14.2%)
4 25.02 25.60 (+2.3%) 19.62 (-21.6%)
5 31.32 31.74 (+1.3%) 24.05 (-23.2%)
6 37.47 37.92 (+1.2%) 30.41 (-18.8%)
7 43.71 43.95 (+0.5%) 34.11 (-22.0%)
8 50.08 50.14 (+0.1%) 39.73 (-20.7%)
9 56.24 56.15 (-0.2%) 43.63 (-22.4%)
10 62.38 62.33 (-0.1%) 47.50 (-23.9%)
11 68.58 67.81 (-1.1%) 45.09 (-34.3%)
12 74.92 72.13 (-3.7%) 47.98 (-36.0%)
13 81.21 74.11 (-8.7%) 56.62 (-30.3%)
14 87.40 84.06 (-3.8%) 51.77 (-40.8%)
15 93.69 83.56 (-10.8%) 53.61 (-42.8%)
16 99.93 99.88 (-0.1%) 61.26 (-38.7%)

period=250ms, workload=while(1)x16
bandwidth   affinity %busy   cfs-bandwidth %busy (pct%)   cfs-hardlimits %busy (pct%)
1 6.28 6.59 (+4.9%) 6.53 (+4.0%)
2 12.49 12.81 (+2.6%) 12.78 (+2.3%)
3 18.74 19.03 (+1.5%) 18.08 (-3.5%)
4 25.11 25.23 (+0.5%) 22.72 (-9.5%)
5 31.26 31.45 (+0.6%) 30.27 (-3.2%)
6 37.58 37.69 (+0.3%) 34.69 (-7.7%)
7 43.79 43.90 (+0.3%) 37.91 (-13.4%)
8 49.99 50.10 (+0.2%) 43.00 (-14.0%)
9 56.34 56.29 (-0.1%) 46.90 (-16.8%)
10 62.45 62.57 (+0.2%) 50.22 (-19.6%)
11 68.70 68.49 (-0.3%) 59.10 (-14.0%)
12 74.86 74.97 (+0.1%) 58.59 (-21.7%)
13 81.17 80.92 (-0.3%) 68.39 (-15.7%)
14 87.34 87.17 (-0.2%) 71.73 (-17.9%)
15 93.66 91.74 (-2.0%) 72.26 (-22.8%)
16 99.89 99.79 (-0.1%) 66.16 (-33.8%)

period=500ms, workload=while(1)x16
bandwidth   affinity %busy   cfs-bandwidth %busy (pct%)   cfs-hardlimits %busy (pct%)
1 6.27 6.43 (+2.6%) 6.56 (+4.6%)
2 12.48 12.69 (+1.7%) 12.68 (+1.6%)
3 19.04 18.90 (-0.7%) 18.99 (-0.3%)
4 25.05 25.12 (+0.3%) 25.22 (+0.7%)
5 31.33 31.28 (-0.2%) 31.45 (+0.4%)
6 37.39 37.66 (+0.7%) 37.70 (+0.8%)
7 43.73 43.77 (+0.1%) 43.83 (+0.2%)
8 50.13 50.02 (-0.2%) 49.70 (-0.9%)
9 56.29 56.24 (-0.1%) 56.16 (-0.2%)
10 62.47 62.54 (+0.1%) 61.07 (-2.2%)
11 68.69 68.79 (+0.1%) 67.70 (-1.4%)
12 74.98 74.89 (-0.1%) 75.10 (+0.2%)
13 81.14 81.24 (+0.1%) 72.29 (-10.9%)
14 87.43 86.91 (-0.6%) 85.79 (-1.9%)
15 93.77 93.36 (-0.4%) 82.12 (-12.4%)
16 99.90 99.89 (-0.0%) 91.27 (-8.6%)

Workload 2: sysbench prime computation (100k primes, 16 threads)
Measured: time to completion
period=100ms:
bandwidth   affinity time(s)   cfs-bandwidth time(s) (pct%)   cfs-hardlimits time(s) (pct%)
1 422.44 435.19 (+3.0%) 405.66 (-4.0%)
2 211.54 217.40 (+2.8%) 255.73 (+20.9%)
3 140.89 146.98 (+4.3%) 178.67 (+26.8%)
4 105.56 110.94 (+5.1%) 136.83 (+29.6%)
5 84.45 90.88 (+7.6%) 103.81 (+22.9%)
6 70.46 72.34 (+2.7%) 90.22 (+28.0%)
7 60.38 59.97 (-0.7%) 81.56 (+35.1%)
8 52.79 54.81 (+3.8%) 66.93 (+26.8%)
9 46.92 48.33 (+3.0%) 59.08 (+25.9%)
10 42.27 42.82 (+1.3%) 51.38 (+21.6%)
11 38.47 45.50 (+18.3%) 54.27 (+41.1%)
12 35.19 35.88 (+2.0%) 52.35 (+48.8%)
13 32.44 41.01 (+26.4%) 44.11 (+36.0%)
14 30.04 34.95 (+16.4%) 46.40 (+54.5%)
15 27.98 31.13 (+11.2%) 42.98 (+53.6%)
16 26.25 26.23 (-0.1%) 46.69 (+77.9%)

period=250ms:
bandwidth   affinity time(s)   cfs-bandwidth time(s) (pct%)   cfs-hardlimits time(s) (pct%)
1 422.45 424.09 (+0.4%) 417.32 (-1.2%)
2 211.53 216.60 (+2.4%) 218.47 (+3.3%)
3 140.94 141.91 (+0.7%) 150.92 (+7.1%)
4 105.55 104.59 (-0.9%) 113.43 (+7.5%)
5 84.42 83.61 (-1.0%) 92.40 (+9.5%)
6 70.53 69.64 (-1.3%) 78.60 (+11.4%)
7 60.41 67.96 (+12.5%) 70.19 (+16.2%)
8 52.76 52.21 (-1.1%) 55.40 (+5.0%)
9 46.91 46.43 (-1.0%) 53.49 (+14.0%)
10 42.38 41.77 (-1.4%) 50.64 (+19.5%)
11 38.46 38.39 (-0.2%) 47.19 (+22.7%)
12 35.22 34.94 (-0.8%) 42.02 (+19.3%)
13 32.45 32.12 (-1.0%) 38.57 (+18.9%)
14 30.08 30.20 (+0.4%) 36.33 (+20.8%)
15 27.96 28.43 (+1.7%) 34.41 (+23.1%)
16 26.20 26.22 (+0.1%) 37.20 (+42.0%)

period=500ms
bandwidth   affinity time(s)   cfs-bandwidth time(s) (pct%)   cfs-hardlimits time(s) (pct%)
1 422.45 423.16 (+0.2%) 412.34 (-2.4%)
2 211.57 208.95 (-1.2%) 210.60 (-0.5%)
3 140.90 139.31 (-1.1%) 138.48 (-1.7%)
4 105.54 104.45 (-1.0%) 104.14 (-1.3%)
5 84.44 83.45 (-1.2%) 82.91 (-1.8%)
6 70.44 69.41 (-1.5%) 69.33 (-1.6%)
7 60.42 59.37 (-1.7%) 59.43 (-1.6%)
8 52.77 51.92 (-1.6%) 52.41 (-0.7%)
9 46.91 46.13 (-1.7%) 46.29 (-1.3%)
10 42.36 41.69 (-1.6%) 41.60 (-1.8%)
11 38.44 37.95 (-1.3%) 38.31 (-0.3%)
12 35.20 34.82 (-1.1%) 35.03 (-0.5%)
13 32.47 34.07 (+4.9%) 32.17 (-0.9%)
14 30.05 29.89 (-0.6%) 31.30 (+4.1%)
15 28.00 27.98 (-0.1%) 29.42 (+5.1%)
16 26.20 26.22 (+0.1%) 30.04 (+14.6%)

Again, this data is more directly consumable (graphed) at:
http://docs.google.com/View?id=dgjvsb78_6dhktmzhs

Todo:
-----
- hierarchical nr_tasks_running accounting:
This is a deficiency currently shared with SCHED_RT rate limiting. When
an entity is throttled the running tasks it owns are not subtracted from
rq->nr_running. This then results in us missing idle_balance() due to
phantom tasks and in the load balancer's weight-per-task calculations being
incorrect.

This code adds complexity which both increased the complexity of the
initial review for this patchset and is probably best reviewed
independently of this feature's scope. To that end we'll post a separate
patch for this issue against the current RT rate-limiting code and merge any
converged-on approach here as appropriate.

- throttle statistics:
Some statistics regarding the frequency and duration of throttling are
definitely in order.

- Expiration of quota from a previous period?
The local quota assigned to a cfs_rq currently represents some 'slack' in the
system. While the rate of overall consumption by the system is capped by
the input of bandwidth into the global pool, it's possible to locally exceed
bandwidth within a period if there is remaining local quota from a previous
period.

The maximum slack in the system at any time is bounded above by
weight(tg->allowed_cpus) * cfs_bandwidth_slice (e.g. with 16 allowed cpus and
the default 10ms slice, up to 160ms of quota could carry over). There are a
number of possible approaches to managing this slack, from naively resetting
local quota_assigned at period refresh to tracking when quota was issued using
generations.

We have an experimental patch which adds quota generations (the latter) to
this series, however I just realized there's potential rq->lock inversion
within it currently so I'm going to have to defer that to the next version.

Open Questions:
---------------
- per-tg slice tunable?
Does it make sense to allow tuning of slice at the per-tg level or is global
sufficient?
- I suspect 5ms may be a more sensible/better-performing default slice, but
the benchmarks above were performed at 10ms, so be it; maybe in v2 :).
It's also possible some sort of limited step function may make sense here
(e.g. favour repeated requests with larger slices, and infrequent/likely-to-yield
requesters with smaller ones); a rough sketch follows below.
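
Purely to illustrate that step-function idea, a rough sketch follows; the
function name and the 2ms/20ms bounds are invented for this example and are
not part of the series.

#include <stdio.h>
#include <stdint.h>

#define SLICE_MIN_NS  2000000ULL   /*  2ms floor   */
#define SLICE_MAX_NS 20000000ULL   /* 20ms ceiling */

/* grow the slice for groups that keep coming back for quota, shrink otherwise */
static uint64_t next_slice(uint64_t cur_slice, int repeated_request)
{
	cur_slice = repeated_request ? cur_slice * 2 : cur_slice / 2;

	if (cur_slice < SLICE_MIN_NS)
		cur_slice = SLICE_MIN_NS;
	if (cur_slice > SLICE_MAX_NS)
		cur_slice = SLICE_MAX_NS;
	return cur_slice;
}

int main(void)
{
	uint64_t slice = 10000000ULL; /* start at the current 10ms default */
	int i;

	for (i = 0; i < 3; i++) {
		slice = next_slice(slice, 1);
		printf("repeated request   -> %llu ns\n", (unsigned long long)slice);
	}
	for (i = 0; i < 3; i++) {
		slice = next_slice(slice, 0);
		printf("infrequent request -> %llu ns\n", (unsigned long long)slice);
	}
	return 0;
}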

Acknowledgements:
-----------------
We would like to thank Bharata B Rao and Dhaval Giani for both discussion and
their proposed patches, many elements in our patchset are directly based on
their work in this area.

Thank you to Ken Chen also for early review and comments.

Thanks for reading this far :)

- Paul and Nikhil


---

Paul Turner (4):
sched: introduce primitives to account for CFS bandwidth tracking
sched: accumulate per-cfs_rq cpu usage
sched: throttle cfs_rq entities which exceed their local quota
sched: unthrottle cfs_rq(s) who ran out of quota at period refresh


include/linux/sched.h | 4 +
init/Kconfig | 9 +
kernel/sched.c | 317 +++++++++++++++++++++++++++++++++++++++++++++----
kernel/sched_fair.c | 206 ++++++++++++++++++++++++++++++--
kernel/sched_rt.c | 19 ---
kernel/sysctl.c | 10 ++
6 files changed, 512 insertions(+), 53 deletions(-)


2010-02-13 02:56:13

by Paul Turner

Subject: [RFC PATCH v1 4/4] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh

From: Paul Turner <[email protected]>

At the start of a new period there are several actions we must take:
- Refresh global bandwidth pool
- Unthrottle entities who ran out of quota as refreshed bandwidth permits

Unthrottled entities have their cfs_rq->throttled flag cleared and are re-enqueued
into the cfs entity hierarchy.

sched_rt_period_mask() is refactored slightly into sched_bw_period_mask()
since it is now shared by both cfs and rt bandwidth period timers.

The !CONFIG_RT_GROUP_SCHED && CONFIG_SMP case has been collapsed to use
rd->span instead of cpu_online_mask since I think that was incorrect before
(we don't want to hit cpus outside of the root_domain for RT bandwidth).

Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Nikhil Rao <[email protected]>
---
kernel/sched.c | 17 +++++++++++
kernel/sched_fair.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++-----
kernel/sched_rt.c | 19 +-----------
3 files changed, 90 insertions(+), 26 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 88fd401..a79bb23 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1573,6 +1573,8 @@ static int tg_nop(struct task_group *tg, void *data)
}
#endif

+static inline const struct cpumask *sched_bw_period_mask(void);
+
#ifdef CONFIG_SMP
/* Used instead of source_load when we know the type == 0 */
static unsigned long weighted_cpuload(const int cpu)
@@ -1927,6 +1929,18 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
#endif
}

+#ifdef CONFIG_SMP
+static inline const struct cpumask *sched_bw_period_mask(void)
+{
+ return cpu_rq(smp_processor_id())->rd->span;
+}
+#else
+static inline const struct cpumask *sched_bw_period_mask(void)
+{
+ return cpu_online_mask;
+}
+#endif
+
#ifdef CONFIG_CFS_BANDWIDTH
/*
* default period for cfs group bandwidth.
@@ -10769,6 +10783,9 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
cfs_rq->quota_assigned = RUNTIME_INF;
else
cfs_rq->quota_assigned = 0;
+
+ if (cfs_rq_throttled(cfs_rq))
+ unthrottle_cfs_rq(cfs_rq);
raw_spin_unlock_irq(&rq->lock);
}
mutex_unlock(&mutex);
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index da85200..814511a 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -267,6 +267,13 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
#endif /* CONFIG_FAIR_GROUP_SCHED */

#ifdef CONFIG_CFS_BANDWIDTH
+static inline
+struct cfs_rq *cfs_bandwidth_cfs_rq(struct cfs_bandwidth *cfs_b, int cpu)
+{
+ return container_of(cfs_b, struct task_group,
+ cfs_bandwidth)->cfs_rq[cpu];
+}
+
static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
{
return &tg->cfs_bandwidth;
@@ -793,8 +800,9 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
se->vruntime = vruntime;
}

-#define ENQUEUE_WAKEUP 1
-#define ENQUEUE_MIGRATE 2
+#define ENQUEUE_WAKEUP 1
+#define ENQUEUE_MIGRATE 2
+#define ENQUEUE_UNTHROTTLE 4

static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
@@ -803,7 +811,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* Update the normalized vruntime before updating min_vruntime
* through callig update_curr().
*/
- if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE))
+ if (!(flags & (ENQUEUE_WAKEUP | ENQUEUE_UNTHROTTLE)) ||
+ (flags & ENQUEUE_MIGRATE))
se->vruntime += cfs_rq->min_vruntime;

/*
@@ -812,16 +821,15 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
update_curr(cfs_rq);

if (!entity_is_task(se) && (cfs_rq_throttled(group_cfs_rq(se)) ||
- !group_cfs_rq(se)->nr_running)) {
+ !group_cfs_rq(se)->nr_running))
return;
- }

account_entity_enqueue(cfs_rq, se);

- if (flags & ENQUEUE_WAKEUP) {
+ if (flags & (ENQUEUE_WAKEUP | ENQUEUE_UNTHROTTLE))
place_entity(cfs_rq, se, 0);
+ if (flags & ENQUEUE_WAKEUP)
enqueue_sleeper(cfs_rq, se);
- }

update_stats_enqueue(cfs_rq, se);
check_spread(cfs_rq, se);
@@ -1232,6 +1240,26 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
cfs_rq->throttled = 1;
}

+static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *se;
+ int flags = ENQUEUE_UNTHROTTLE;
+
+ se = cfs_rq->tg->se[cfs_rq->rq->cpu];
+
+ cfs_rq->throttled = 0;
+ for_each_sched_entity(se) {
+ if (se->on_rq)
+ break;
+
+ cfs_rq = cfs_rq_of(se);
+ enqueue_entity(cfs_rq, se, flags);
+ if (cfs_rq_throttled(cfs_rq))
+ break;
+ flags = ENQUEUE_WAKEUP;
+ }
+}
+
static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
unsigned long delta_exec)
{
@@ -1254,8 +1282,44 @@ static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,

static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
{
- return 1;
+ int i, idle = 1;
+ u64 delta;
+ const struct cpumask *span;
+
+ if (cfs_b->quota == RUNTIME_INF)
+ return 1;
+
+ /* reset group quota */
+ raw_spin_lock(&cfs_b->lock);
+ cfs_b->runtime = cfs_b->quota;
+ raw_spin_unlock(&cfs_b->lock);
+
+ span = sched_bw_period_mask();
+ for_each_cpu(i, span) {
+ struct rq *rq = cpu_rq(i);
+ struct cfs_rq *cfs_rq = cfs_bandwidth_cfs_rq(cfs_b, i);
+
+ if (!cfs_rq->nr_running)
+ idle = 0;
+
+ if (!cfs_rq_throttled(cfs_rq))
+ continue;
+
+ delta = tg_request_cfs_quota(cfs_rq->tg);
+
+ if (delta) {
+ raw_spin_lock(&rq->lock);
+ cfs_rq->quota_assigned += delta;
+
+ if (cfs_rq->quota_used < cfs_rq->quota_assigned)
+ unthrottle_cfs_rq(cfs_rq);
+ raw_spin_unlock(&rq->lock);
+ }
+ }
+
+ return idle;
}
+
#endif

#ifdef CONFIG_SMP
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index f48328a..cab78f6 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -235,18 +235,6 @@ static int rt_se_boosted(struct sched_rt_entity *rt_se)
return p->prio != p->normal_prio;
}

-#ifdef CONFIG_SMP
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
- return cpu_rq(smp_processor_id())->rd->span;
-}
-#else
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
- return cpu_online_mask;
-}
-#endif
-
static inline
struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
{
@@ -296,11 +284,6 @@ static inline int rt_rq_throttled(struct rt_rq *rt_rq)
return rt_rq->rt_throttled;
}

-static inline const struct cpumask *sched_rt_period_mask(void)
-{
- return cpu_online_mask;
-}
-
static inline
struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
{
@@ -518,7 +501,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
return 1;

- span = sched_rt_period_mask();
+ span = sched_bw_period_mask();
for_each_cpu(i, span) {
int enqueue = 0;
struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, i);

2010-02-13 02:56:17

by Paul Turner

Subject: [RFC PATCH v1 2/4] sched: accumulate per-cfs_rq cpu usage

From: Paul Turner <[email protected]>

Introduce account_cfs_rq_quota() to account bandwidth usage at the cfs_rq
level for task_groups to which bandwidth has been assigned. Whether a group is
constrained is tracked by whether the local cfs_rq->quota_assigned is finite or
infinite (RUNTIME_INF).

For cfs_rq's that belong to a bandwidth-constrained task_group we introduce
tg_request_cfs_quota() which attempts to allocate quota from the global pool
for use locally. Updates involving the global pool are currently protected
under cfs_bandwidth->lock; local pools are protected by rq->lock.

This patch only attempts to assign and track quota; no action is taken in the
case that cfs_rq->quota_used exceeds cfs_rq->quota_assigned.

Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Nikhil Rao <[email protected]>
---
include/linux/sched.h | 4 ++++
kernel/sched.c | 13 +++++++++++++
kernel/sched_fair.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sysctl.c | 10 ++++++++++
4 files changed, 77 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 78efe7c..8c9d401 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1963,6 +1963,10 @@ int sched_rt_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);

+#ifdef CONFIG_CFS_BANDWIDTH
+extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+#endif
+
extern unsigned int sysctl_sched_compat_yield;

#ifdef CONFIG_RT_MUTEXES
diff --git a/kernel/sched.c b/kernel/sched.c
index 6cc4bf4..fb2ffc6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1923,6 +1923,19 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
* default: 0.5s
*/
static u64 sched_cfs_bandwidth_period = 500000000ULL;
+
+/*
+ * default slice of quota to allocate from global tg to local cfs_rq pool on
+ * each refresh
+ * default: 10ms
+ */
+unsigned int sysctl_sched_cfs_bandwidth_slice = 10000UL;
+
+static inline u64 sched_cfs_bandwidth_slice(void)
+{
+ return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
+}
+
#endif

#include "sched_stats.h"
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 7b109ff..f2741ab 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -266,6 +266,16 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
}
#endif /* CONFIG_FAIR_GROUP_SCHED */

+#ifdef CONFIG_CFS_BANDWIDTH
+static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
+{
+ return &tg->cfs_bandwidth;
+}
+
+static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
+ unsigned long delta_exec);
+#endif
+
/**************************************************************
* Scheduling class tree data structure manipulation methods:
*/
@@ -544,6 +554,9 @@ static void update_curr(struct cfs_rq *cfs_rq)
cpuacct_charge(curtask, delta_exec);
account_group_exec_runtime(curtask, delta_exec);
}
+#ifdef CONFIG_CFS_BANDWIDTH
+ account_cfs_rq_quota(cfs_rq, delta_exec);
+#endif
}

static inline void
@@ -1145,6 +1158,43 @@ static void yield_task_fair(struct rq *rq)
}

#ifdef CONFIG_CFS_BANDWIDTH
+static u64 tg_request_cfs_quota(struct task_group *tg)
+{
+ struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+ u64 delta = 0;
+
+ if (cfs_b->runtime > 0 || cfs_b->quota == RUNTIME_INF) {
+ raw_spin_lock(&cfs_b->lock);
+ /*
+ * it's possible a bandwidth update has changed the global
+ * pool.
+ */
+ if (cfs_b->quota == RUNTIME_INF)
+ delta = sched_cfs_bandwidth_slice();
+ else {
+ delta = min(cfs_b->runtime,
+ sched_cfs_bandwidth_slice());
+ cfs_b->runtime -= delta;
+ }
+ raw_spin_unlock(&cfs_b->lock);
+ }
+ return delta;
+}
+
+static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
+ unsigned long delta_exec)
+{
+ if (cfs_rq->quota_assigned == RUNTIME_INF)
+ return;
+
+ cfs_rq->quota_used += delta_exec;
+
+ if (cfs_rq->quota_used < cfs_rq->quota_assigned)
+ return;
+
+ cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
+}
+
static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
{
return 1;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8a68b24..cc5ccce 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -364,6 +364,16 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+#ifdef CONFIG_CFS_BANDWIDTH
+ {
+ .procname = "sched_cfs_bandwidth_slice_us",
+ .data = &sysctl_sched_cfs_bandwidth_slice,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &one,
+ },
+#endif
#ifdef CONFIG_PROVE_LOCKING
{
.procname = "prove_locking",

2010-02-13 02:56:10

by Paul Turner

Subject: [RFC PATCH v1 3/4] sched: throttle cfs_rq entities which exceed their local quota

From: Paul Turner <[email protected]>

In account_cfs_rq_quota() (via update_curr()) we track consumption versus a
cfs_rq's local quota and whether there is global quota available to continue
enabling it in the event we run out.

This patch adds the required support for the latter case, throttling entities
until quota is available to run. Throttling dequeues the entity in question
and sends a reschedule to the owning cpu so that it can be evicted.

The following restrictions apply to a throttled cfs_rq:
- It is dequeued from the sched_entity hierarchy and restricted from being
re-enqueued. This means that new/waking children of this entity will be
queued up to it, but not past it.
- It does not contribute to weight calculations in tg_shares_up
- In the case that the cfs_rq of the cpu we are trying to pull from is
throttled, it is ignored by the load balancer in __load_balance_fair() and
move_one_task_fair().

Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Nikhil Rao <[email protected]>
---
kernel/sched.c | 12 ++++++++-
kernel/sched_fair.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 77 insertions(+), 7 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index fb2ffc6..88fd401 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -463,6 +463,7 @@ struct cfs_rq {
#endif
#ifdef CONFIG_CFS_BANDWIDTH
u64 quota_assigned, quota_used;
+ int throttled;
#endif
#endif
};
@@ -1691,6 +1692,8 @@ static void update_group_shares_cpu(struct task_group *tg, int cpu,
}
}

+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
+
/*
* Re-compute the task group their per cpu shares over the given domain.
* This needs to be done in a bottom-up fashion because the rq weight of a
@@ -1711,7 +1714,14 @@ static int tg_shares_up(struct task_group *tg, void *data)
usd_rq_weight = per_cpu_ptr(update_shares_data, smp_processor_id());

for_each_cpu(i, sched_domain_span(sd)) {
- weight = tg->cfs_rq[i]->load.weight;
+ /*
+ * bandwidth throttled entities cannot contribute to load
+ * balance
+ */
+ if (!cfs_rq_throttled(tg->cfs_rq[i]))
+ weight = tg->cfs_rq[i]->load.weight;
+ else
+ weight = 0;
usd_rq_weight[i] = weight;

rq_weight += weight;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f2741ab..da85200 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -272,8 +272,21 @@ static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
return &tg->cfs_bandwidth;
}

+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+ return cfs_rq->throttled;
+}
+
static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
unsigned long delta_exec);
+
+#else
+
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+ return 0;
+}
+
#endif

/**************************************************************
@@ -797,6 +810,12 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
+
+ if (!entity_is_task(se) && (cfs_rq_throttled(group_cfs_rq(se)) ||
+ !group_cfs_rq(se)->nr_running)) {
+ return;
+ }
+
account_entity_enqueue(cfs_rq, se);

if (flags & ENQUEUE_WAKEUP) {
@@ -833,6 +852,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int sleep)
*/
update_curr(cfs_rq);

+ BUG_ON(!entity_is_task(se) && cfs_rq_throttled(group_cfs_rq(se)) &&
+ se->on_rq);
+ if (!entity_is_task(se) && cfs_rq_throttled(group_cfs_rq(se)))
+ return;
+
update_stats_dequeue(cfs_rq, se);
if (sleep) {
#ifdef CONFIG_SCHEDSTATS
@@ -1083,6 +1107,9 @@ static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
break;
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, flags);
+ /* don't continue to enqueue if our parent is throttled */
+ if (cfs_rq_throttled(cfs_rq))
+ break;
flags = ENQUEUE_WAKEUP;
}

@@ -1102,8 +1129,11 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, sleep);
- /* Don't dequeue parent if it has other entities besides us */
- if (cfs_rq->load.weight)
+ /*
+ * Don't dequeue parent if it has other entities besides us,
+ * or if it is throttled
+ */
+ if (cfs_rq->load.weight || cfs_rq_throttled(cfs_rq))
break;
sleep = 1;
}
@@ -1181,6 +1211,27 @@ static u64 tg_request_cfs_quota(struct task_group *tg)
return delta;
}

+static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *se;
+ int sleep = 0;
+
+ se = cfs_rq->tg->se[cfs_rq->rq->cpu];
+
+ for_each_sched_entity(se) {
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+ BUG_ON(!se->on_rq);
+ dequeue_entity(cfs_rq, se, sleep);
+
+ if (cfs_rq->load.weight || cfs_rq_throttled(cfs_rq))
+ break;
+
+ sleep = 1;
+ }
+ cfs_rq->throttled = 1;
+}
+
static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
unsigned long delta_exec)
{
@@ -1189,10 +1240,16 @@ static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,

cfs_rq->quota_used += delta_exec;

- if (cfs_rq->quota_used < cfs_rq->quota_assigned)
+ if (cfs_rq_throttled(cfs_rq) ||
+ cfs_rq->quota_used < cfs_rq->quota_assigned)
return;

cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
+
+ if (cfs_rq->quota_used >= cfs_rq->quota_assigned) {
+ throttle_cfs_rq(cfs_rq);
+ resched_task(cfs_rq->rq->curr);
+ }
}

static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
@@ -1947,9 +2004,10 @@ load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
u64 rem_load, moved_load;

/*
- * empty group
+ * empty group or throttled cfs_rq
*/
- if (!busiest_cfs_rq->task_weight)
+ if (!busiest_cfs_rq->task_weight ||
+ cfs_rq_throttled(busiest_cfs_rq))
continue;

rem_load = (u64)rem_load_move * busiest_weight;
@@ -1997,6 +2055,9 @@ move_one_task_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
cfs_rq_iterator.next = load_balance_next_fair;

for_each_leaf_cfs_rq(busiest, busy_cfs_rq) {
+ /* skip throttled cfs_rq */
+ if (cfs_rq_throttled(busy_cfs_rq))
+ continue;
/*
* pass busy_cfs_rq argument into
* load_balance_[start|next]_fair iterators
@@ -2173,7 +2234,6 @@ static const struct sched_class fair_sched_class = {

.task_waking = task_waking_fair,
#endif
-
.set_curr_task = set_curr_task_fair,
.task_tick = task_tick_fair,
.task_fork = task_fork_fair,

2010-02-13 02:56:39

by Paul Turner

Subject: [RFC PATCH v1 1/4] sched: introduce primitives to account for CFS bandwidth tracking

From: Paul Turner <[email protected]>

In this patch we introduce the notion of CFS bandwidth. To account for the
realities of SMP this is partitioned into globally unassigned bandwidth and
locally claimed bandwidth:
- The global bandwidth is per task_group; it represents a pool of unclaimed
bandwidth that cfs_rq's can allocate from. It is tracked using the new
cfs_bandwidth structure.
- The local bandwidth is tracked per-cfs_rq; it represents allotments claimed
from the global pool.

Bandwidth is managed via cgroupfs through two new files in the cpu subsystem:
- cpu.cfs_period_us : the bandwidth period in usecs
- cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
to consume over the period above.

A per-cfs_bandwidth timer is also introduced to handle future refresh at
period expiration. There's some minor refactoring here so that the
start_bandwidth_timer() functionality can be shared.

Signed-off-by: Paul Turner <[email protected]>
Signed-off-by: Nikhil Rao <[email protected]>
---
init/Kconfig | 9 ++
kernel/sched.c | 275 +++++++++++++++++++++++++++++++++++++++++++++++----
kernel/sched_fair.c | 14 ++-
3 files changed, 273 insertions(+), 25 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index d95ca7c..fb8c7d8 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -496,6 +496,15 @@ config CGROUP_SCHED

endchoice

+config CFS_BANDWIDTH
+ bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
+ depends on EXPERIMENTAL
+ depends on FAIR_GROUP_SCHED && CGROUP_SCHED
+ default n
+ help
+ This option allows users to define quota and period for cpu
+ bandwidth provisioning on a per-cgroup basis.
+
menuconfig CGROUPS
boolean "Control Group support"
help
diff --git a/kernel/sched.c b/kernel/sched.c
index 3a8fb30..6cc4bf4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -190,10 +190,28 @@ static inline int rt_bandwidth_enabled(void)
return sysctl_sched_rt_runtime >= 0;
}

-static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+static void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
{
- ktime_t now;
+ unsigned long delta;
+ ktime_t soft, hard, now;
+
+ for (;;) {
+ if (hrtimer_active(period_timer))
+ break;
+
+ now = hrtimer_cb_get_time(period_timer);
+ hrtimer_forward(period_timer, now, period);
+
+ soft = hrtimer_get_softexpires(period_timer);
+ hard = hrtimer_get_expires(period_timer);
+ delta = ktime_to_ns(ktime_sub(hard, soft));
+ __hrtimer_start_range_ns(period_timer, soft, delta,
+ HRTIMER_MODE_ABS_PINNED, 0);
+ }
+}

+static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+{
if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
return;

@@ -201,22 +219,7 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
return;

raw_spin_lock(&rt_b->rt_runtime_lock);
- for (;;) {
- unsigned long delta;
- ktime_t soft, hard;
-
- if (hrtimer_active(&rt_b->rt_period_timer))
- break;
-
- now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
- hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
-
- soft = hrtimer_get_softexpires(&rt_b->rt_period_timer);
- hard = hrtimer_get_expires(&rt_b->rt_period_timer);
- delta = ktime_to_ns(ktime_sub(hard, soft));
- __hrtimer_start_range_ns(&rt_b->rt_period_timer, soft, delta,
- HRTIMER_MODE_ABS_PINNED, 0);
- }
+ start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
raw_spin_unlock(&rt_b->rt_runtime_lock);
}

@@ -241,6 +244,15 @@ struct cfs_rq;

static LIST_HEAD(task_groups);

+#ifdef CONFIG_CFS_BANDWIDTH
+struct cfs_bandwidth {
+ raw_spinlock_t lock;
+ ktime_t period;
+ u64 runtime, quota;
+ struct hrtimer period_timer;
+};
+#endif
+
/* task group related information */
struct task_group {
#ifdef CONFIG_CGROUP_SCHED
@@ -272,6 +284,10 @@ struct task_group {
struct task_group *parent;
struct list_head siblings;
struct list_head children;
+
+#ifdef CONFIG_CFS_BANDWIDTH
+ struct cfs_bandwidth cfs_bandwidth;
+#endif
};

#ifdef CONFIG_USER_SCHED
@@ -445,9 +461,76 @@ struct cfs_rq {
*/
unsigned long rq_weight;
#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+ u64 quota_assigned, quota_used;
+#endif
#endif
};

+#ifdef CONFIG_CFS_BANDWIDTH
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
+
+static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
+{
+ struct cfs_bandwidth *cfs_b =
+ container_of(timer, struct cfs_bandwidth, period_timer);
+ ktime_t now;
+ int overrun;
+ int idle = 0;
+
+ for (;;) {
+ now = hrtimer_cb_get_time(timer);
+ overrun = hrtimer_forward(timer, now, cfs_b->period);
+
+ if (!overrun)
+ break;
+
+ idle = do_sched_cfs_period_timer(cfs_b, overrun);
+ }
+
+ return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
+}
+
+static
+void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, u64 quota, u64 period)
+{
+ raw_spin_lock_init(&cfs_b->lock);
+ cfs_b->quota = cfs_b->runtime = quota;
+ cfs_b->period = ns_to_ktime(period);
+
+ hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ cfs_b->period_timer.function = sched_cfs_period_timer;
+}
+
+static
+void init_cfs_rq_quota(struct cfs_rq *cfs_rq)
+{
+ cfs_rq->quota_used = 0;
+ if (cfs_rq->tg->cfs_bandwidth.quota == RUNTIME_INF)
+ cfs_rq->quota_assigned = RUNTIME_INF;
+ else
+ cfs_rq->quota_assigned = 0;
+}
+
+static void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+ if (cfs_b->quota == RUNTIME_INF)
+ return;
+
+ if (hrtimer_active(&cfs_b->period_timer))
+ return;
+
+ raw_spin_lock(&cfs_b->lock);
+ start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
+ raw_spin_unlock(&cfs_b->lock);
+}
+
+static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+ hrtimer_cancel(&cfs_b->period_timer);
+}
+#endif
+
/* Real-Time classes' related field in a runqueue: */
struct rt_rq {
struct rt_prio_array active;
@@ -1834,6 +1917,14 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
#endif
}

+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * default period for cfs group bandwidth.
+ * default: 0.5s
+ */
+static u64 sched_cfs_bandwidth_period = 500000000ULL;
+#endif
+
#include "sched_stats.h"
#include "sched_idletask.c"
#include "sched_fair.c"
@@ -9422,6 +9513,9 @@ static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
tg->cfs_rq[cpu] = cfs_rq;
init_cfs_rq(cfs_rq, rq);
cfs_rq->tg = tg;
+#ifdef CONFIG_CFS_BANDWIDTH
+ init_cfs_rq_quota(cfs_rq);
+#endif
if (add)
list_add(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);

@@ -9594,6 +9688,10 @@ void __init sched_init(void)
* We achieve this by letting init_task_group's tasks sit
* directly in rq->cfs (i.e init_task_group->se[] = NULL).
*/
+#ifdef CONFIG_CFS_BANDWIDTH
+ init_cfs_bandwidth(&init_task_group.cfs_bandwidth,
+ RUNTIME_INF, sched_cfs_bandwidth_period);
+#endif
init_tg_cfs_entry(&init_task_group, &rq->cfs, NULL, i, 1, NULL);
#elif defined CONFIG_USER_SCHED
root_task_group.shares = NICE_0_LOAD;
@@ -9851,6 +9949,10 @@ static void free_fair_sched_group(struct task_group *tg)
{
int i;

+#ifdef CONFIG_CFS_BANDWIDTH
+ destroy_cfs_bandwidth(&tg->cfs_bandwidth);
+#endif
+
for_each_possible_cpu(i) {
if (tg->cfs_rq)
kfree(tg->cfs_rq[i]);
@@ -9878,7 +9980,10 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
goto err;

tg->shares = NICE_0_LOAD;
-
+#ifdef CONFIG_CFS_BANDWIDTH
+ init_cfs_bandwidth(&tg->cfs_bandwidth, RUNTIME_INF,
+ sched_cfs_bandwidth_period);
+#endif
for_each_possible_cpu(i) {
rq = cpu_rq(i);

@@ -10333,7 +10438,7 @@ static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
return walk_tg_tree(tg_schedulable, tg_nop, &data);
}

-static int tg_set_bandwidth(struct task_group *tg,
+static int tg_set_rt_bandwidth(struct task_group *tg,
u64 rt_period, u64 rt_runtime)
{
int i, err = 0;
@@ -10372,7 +10477,7 @@ int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
if (rt_runtime_us < 0)
rt_runtime = RUNTIME_INF;

- return tg_set_bandwidth(tg, rt_period, rt_runtime);
+ return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
}

long sched_group_rt_runtime(struct task_group *tg)
@@ -10397,7 +10502,7 @@ int sched_group_set_rt_period(struct task_group *tg, long rt_period_us)
if (rt_period == 0)
return -EINVAL;

- return tg_set_bandwidth(tg, rt_period, rt_runtime);
+ return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
}

long sched_group_rt_period(struct task_group *tg)
@@ -10604,6 +10709,120 @@ static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)

return (u64) tg->shares;
}
+
+#ifdef CONFIG_CFS_BANDWIDTH
+static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+{
+ int i;
+ static DEFINE_MUTEX(mutex);
+
+ if (tg == &init_task_group)
+ return -EINVAL;
+
+ if (!period)
+ return -EINVAL;
+
+ mutex_lock(&mutex);
+ /*
+ * Ensure we have at least one tick of bandwidth every period. This is
+ * to prevent reaching a state of large arrears when throttled via
+ * entity_tick() resulting in prolonged exit starvation.
+ */
+ if (NS_TO_JIFFIES(quota) < 1)
+ return -EINVAL;
+
+ raw_spin_lock_irq(&tg->cfs_bandwidth.lock);
+ tg->cfs_bandwidth.period = ns_to_ktime(period);
+ tg->cfs_bandwidth.runtime = tg->cfs_bandwidth.quota = quota;
+ raw_spin_unlock_irq(&tg->cfs_bandwidth.lock);
+
+ for_each_possible_cpu(i) {
+ struct cfs_rq *cfs_rq = tg->cfs_rq[i];
+ struct rq *rq = rq_of(cfs_rq);
+
+ raw_spin_lock_irq(&rq->lock);
+ cfs_rq->quota_used = 0;
+ if (quota == RUNTIME_INF)
+ cfs_rq->quota_assigned = RUNTIME_INF;
+ else
+ cfs_rq->quota_assigned = 0;
+ raw_spin_unlock_irq(&rq->lock);
+ }
+ mutex_unlock(&mutex);
+
+ return 0;
+}
+
+int tg_set_cfs_quota(struct task_group *tg, long cfs_runtime_us)
+{
+ u64 quota, period;
+
+ period = ktime_to_ns(tg->cfs_bandwidth.period);
+ if (cfs_runtime_us < 0)
+ quota = RUNTIME_INF;
+ else
+ quota = (u64)cfs_runtime_us * NSEC_PER_USEC;
+
+ return tg_set_cfs_bandwidth(tg, period, quota);
+}
+
+long tg_get_cfs_quota(struct task_group *tg)
+{
+ u64 quota_us;
+
+ if (tg->cfs_bandwidth.quota == RUNTIME_INF)
+ return -1;
+
+ quota_us = tg->cfs_bandwidth.quota;
+ do_div(quota_us, NSEC_PER_USEC);
+ return quota_us;
+}
+
+int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
+{
+ u64 quota, period;
+
+ period = (u64)cfs_period_us * NSEC_PER_USEC;
+ quota = tg->cfs_bandwidth.quota;
+
+ if (period <= 0)
+ return -EINVAL;
+
+ return tg_set_cfs_bandwidth(tg, period, quota);
+}
+
+long tg_get_cfs_period(struct task_group *tg)
+{
+ u64 cfs_period_us;
+
+ cfs_period_us = ktime_to_ns(tg->cfs_bandwidth.period);
+ do_div(cfs_period_us, NSEC_PER_USEC);
+ return cfs_period_us;
+}
+
+static s64 cpu_cfs_quota_read_s64(struct cgroup *cgrp, struct cftype *cft)
+{
+ return tg_get_cfs_quota(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_quota_write_s64(struct cgroup *cgrp, struct cftype *cftype,
+ s64 cfs_quota_us)
+{
+ return tg_set_cfs_quota(cgroup_tg(cgrp), cfs_quota_us);
+}
+
+static u64 cpu_cfs_period_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+ return tg_get_cfs_period(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
+ u64 cfs_period_us)
+{
+ return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
+}
+
+#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */

#ifdef CONFIG_RT_GROUP_SCHED
@@ -10638,6 +10857,18 @@ static struct cftype cpu_files[] = {
.write_u64 = cpu_shares_write_u64,
},
#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+ {
+ .name = "cfs_quota_us",
+ .read_s64 = cpu_cfs_quota_read_s64,
+ .write_s64 = cpu_cfs_quota_write_s64,
+ },
+ {
+ .name = "cfs_period_us",
+ .read_u64 = cpu_cfs_period_read_u64,
+ .write_u64 = cpu_cfs_period_write_u64,
+ },
+#endif
#ifdef CONFIG_RT_GROUP_SCHED
{
.name = "rt_runtime_us",
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 8fe7ee8..7b109ff 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -264,10 +264,8 @@ static inline void
find_matching_se(struct sched_entity **se, struct sched_entity **pse)
{
}
-
#endif /* CONFIG_FAIR_GROUP_SCHED */

-
/**************************************************************
* Scheduling class tree data structure manipulation methods:
*/
@@ -360,6 +358,9 @@ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)

rb_link_node(&se->run_node, parent, link);
rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
+#ifdef CONFIG_CFS_BANDWIDTH
+ start_cfs_bandwidth(&cfs_rq->tg->cfs_bandwidth);
+#endif
}

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -1143,6 +1144,13 @@ static void yield_task_fair(struct rq *rq)
se->vruntime = rightmost->vruntime + 1;
}

+#ifdef CONFIG_CFS_BANDWIDTH
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
+{
+ return 1;
+}
+#endif
+
#ifdef CONFIG_SMP

static void task_waking_fair(struct rq *rq, struct task_struct *p)
@@ -1172,7 +1180,7 @@ static void task_waking_fair(struct rq *rq, struct task_struct *p)
* We still saw a performance dip, some tracing learned us that between
* cgroup:/ and cgroup:/foo balancing the number of affine wakeups increased
* significantly. Therefore try to bias the error in direction of failing
- * the affine wakeup.
+ * the affie wakeup.
*
*/
static long effective_load(struct task_group *tg, int cpu,

2010-02-16 05:40:06

by Bharata B Rao

Subject: Re: [RFC PATCH v1 0/4] CFS Bandwidth Control

On Fri, Feb 12, 2010 at 06:54:52PM -0800, Paul Turner wrote:
>
> The skeleton of our approach is as follows:
> - As above, we maintain a global, per-tg pool of unassigned quota. On it
> we track the bandwidth period, the quota per period, and the runtime remaining
> in the current period. As bandwidth is used within a period it is decremented
> from runtime. Runtime is currently synchronized using a spinlock; in the
> current implementation there's no reason this couldn't be done using
> atomic ops instead, however the spinlock allows for a little more flexibility
> in experimentation with other schemes.
> - When a cfs_rq participating in a bandwidth-constrained task_group executes,
> it acquires time in sysctl_sched_cfs_bandwidth_slice (default currently
> 10ms) sized chunks from the global pool; this synchronizes under rq->lock and
> is part of the update_curr path.
> - Throttled entities are dequeued immediately (as opposed to delaying this
> operation to the put path); this avoids some potentially poor load-balancer
> interactions and preserves the 'verbiage' of the put_task semantic.
> Throttled entities are gated from participating in the tree at the
> {enqueue, dequeue}_entity level. They are also skipped for load
> balance in the same manner as Bharata's patch series employs.

I did defer the dequeue until next put because walking the se hierarchy
multiple times (from update_curr -> dequeue_entity -> update_curr) appeared
too complex when I started with it.

>
> Interface:
> ----------
> Two new cgroupfs files are added to the cpu subsystem:
> - cpu.cfs_period_us : period over which bandwidth is to be regulated
> - cpu.cfs_quota_us : bandwidth available for consumption per period
>
> One important interface change that this introduces (versus the rate limits
> proposal) is that the defined bandwidth becomes an absolute quantifier.
>
> e.g. a bandwidth of 5 seconds (cpu.cfs_quota_us=5000000) on a period of 1 second
> (cpu.cfs_period_us=1000000) would result in 5 wall seconds of cpu time being
> consumable every 1 wall second.

As I have said earlier, I would like to hear what others say about this
interface, especially from the Linux-vserver project since it is already
using the cfs hard limit patches in their test release. Herbert?

Thanks for your work. More later when I review the individual patches
in detail.

Regards,
Bharata.

2010-02-16 06:18:31

by Bharata B Rao

Subject: Re: [RFC PATCH v1 0/4] CFS Bandwidth Control

On Fri, Feb 12, 2010 at 06:54:52PM -0800, Paul Turner wrote:
> Todo:
> -----
> - hierarchical nr_tasks_running accounting:
> This is a deficiency currently shared with SCHED_RT rate limiting. When
> an entity is throttled the running tasks it owns are not subtracted from
> rq->nr_running. This then results in us missing idle_balance() due to
> phantom tasks and in the load balancer's weight-per-task calculations being
> incorrect.
>
> This code adds complexity which both increased the complexity of the
> initial review for this patchset and is probably best reviewed
> independently of this feature's scope. To that end we'll post a separate
> patch for this issue against the current RT rate-limiting code and merge any
> converged-on approach here as appropriate.

I had tried updating rq->nr_running in my v2 patchset
(http://lkml.org/lkml/2009/9/30/117, http://lkml.org/lkml/2009/9/30/119).
But since I felt that it added a lot of complexity, I removed it
subsequently in v3 (http://lkml.org/lkml/2009/11/9/65) and kept it similar
to RT.

>
> - throttle statistics:
> Some statistics regarding the frequency and duration of throttling are
> definitely in order.

Please take a look at some of the throttling related stats I am collecting
in my patchset.

Regards,
Bharata.

2010-02-25 08:14:58

by Bharata B Rao

Subject: Re: [RFC PATCH v1 1/4] sched: introduce primitives to account for CFS bandwidth tracking

On Fri, Feb 12, 2010 at 06:54:57PM -0800, Paul wrote:
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -190,10 +190,28 @@ static inline int rt_bandwidth_enabled(void)
> return sysctl_sched_rt_runtime >= 0;
> }
>
> -static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
> +static void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
> {
> - ktime_t now;
> + unsigned long delta;
> + ktime_t soft, hard, now;
> +
> + for (;;) {
> + if (hrtimer_active(period_timer))
> + break;
> +
> + now = hrtimer_cb_get_time(period_timer);
> + hrtimer_forward(period_timer, now, period);
> +
> + soft = hrtimer_get_softexpires(period_timer);
> + hard = hrtimer_get_expires(period_timer);
> + delta = ktime_to_ns(ktime_sub(hard, soft));
> + __hrtimer_start_range_ns(period_timer, soft, delta,
> + HRTIMER_MODE_ABS_PINNED, 0);
> + }
> +}
>
> +static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
> +{
> if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
> return;
>
> @@ -201,22 +219,7 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
> return;
>
> raw_spin_lock(&rt_b->rt_runtime_lock);
> - for (;;) {
> - unsigned long delta;
> - ktime_t soft, hard;
> -
> - if (hrtimer_active(&rt_b->rt_period_timer))
> - break;
> -
> - now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
> - hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
> -
> - soft = hrtimer_get_softexpires(&rt_b->rt_period_timer);
> - hard = hrtimer_get_expires(&rt_b->rt_period_timer);
> - delta = ktime_to_ns(ktime_sub(hard, soft));
> - __hrtimer_start_range_ns(&rt_b->rt_period_timer, soft, delta,
> - HRTIMER_MODE_ABS_PINNED, 0);
> - }
> + start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
> raw_spin_unlock(&rt_b->rt_runtime_lock);
> }
>
> @@ -241,6 +244,15 @@ struct cfs_rq;
>
> static LIST_HEAD(task_groups);
>
> +#ifdef CONFIG_CFS_BANDWIDTH
> +struct cfs_bandwidth {
> + raw_spinlock_t lock;
> + ktime_t period;
> + u64 runtime, quota;
> + struct hrtimer period_timer;
> +};
> +#endif
> +
> /* task group related information */
> struct task_group {
> #ifdef CONFIG_CGROUP_SCHED
> @@ -272,6 +284,10 @@ struct task_group {
> struct task_group *parent;
> struct list_head siblings;
> struct list_head children;
> +
> +#ifdef CONFIG_CFS_BANDWIDTH
> + struct cfs_bandwidth cfs_bandwidth;
> +#endif
> };
>
> #ifdef CONFIG_USER_SCHED
> @@ -445,9 +461,76 @@ struct cfs_rq {
> */
> unsigned long rq_weight;
> #endif
> +#ifdef CONFIG_CFS_BANDWIDTH
> + u64 quota_assigned, quota_used;
> +#endif
> #endif
> };
>
> +#ifdef CONFIG_CFS_BANDWIDTH
> +static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
> +
> +static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> +{
> + struct cfs_bandwidth *cfs_b =
> + container_of(timer, struct cfs_bandwidth, period_timer);
> + ktime_t now;
> + int overrun;
> + int idle = 0;
> +
> + for (;;) {
> + now = hrtimer_cb_get_time(timer);
> + overrun = hrtimer_forward(timer, now, cfs_b->period);
> +
> + if (!overrun)
> + break;
> +
> + idle = do_sched_cfs_period_timer(cfs_b, overrun);
> + }
> +
> + return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
> +}
> +
> +static
> +void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, u64 quota, u64 period)
> +{
> + raw_spin_lock_init(&cfs_b->lock);
> + cfs_b->quota = cfs_b->runtime = quota;
> + cfs_b->period = ns_to_ktime(period);
> +
> + hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> + cfs_b->period_timer.function = sched_cfs_period_timer;
> +}
> +
> +static
> +void init_cfs_rq_quota(struct cfs_rq *cfs_rq)
> +{
> + cfs_rq->quota_used = 0;
> + if (cfs_rq->tg->cfs_bandwidth.quota == RUNTIME_INF)
> + cfs_rq->quota_assigned = RUNTIME_INF;
> + else
> + cfs_rq->quota_assigned = 0;
> +}
> +
> +static void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> +{
> + if (cfs_b->quota == RUNTIME_INF)
> + return;
> +
> + if (hrtimer_active(&cfs_b->period_timer))
> + return;
> +
> + raw_spin_lock(&cfs_b->lock);
> + start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
> + raw_spin_unlock(&cfs_b->lock);
> +}
> +
> +static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> +{
> + hrtimer_cancel(&cfs_b->period_timer);
> +}
> +#endif

Maybe you could define some of these functions for the !CONFIG_CFS_BANDWIDTH
case and avoid calling them under #ifdef? I was given this comment during my
initial iterations.
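
Something like the following is what I have in mind (just a sketch using the
functions from your patch, not tested):

#ifdef CONFIG_CFS_BANDWIDTH
static void init_cfs_rq_quota(struct cfs_rq *cfs_rq);
static void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
#else
/* no-op stubs so that callers don't need #ifdef CONFIG_CFS_BANDWIDTH blocks */
static inline void init_cfs_rq_quota(struct cfs_rq *cfs_rq) { }
static inline void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b) { }
#endif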

> +
> /* Real-Time classes' related field in a runqueue: */
> struct rt_rq {
> struct rt_prio_array active;
> @@ -1834,6 +1917,14 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
> #endif
> }
>
> +#ifdef CONFIG_CFS_BANDWIDTH
> +/*
> + * default period for cfs group bandwidth.
> + * default: 0.5s
> + */
> +static u64 sched_cfs_bandwidth_period = 500000000ULL;
> +#endif
> +
> #include "sched_stats.h"
> #include "sched_idletask.c"
> #include "sched_fair.c"
> @@ -9422,6 +9513,9 @@ static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
> tg->cfs_rq[cpu] = cfs_rq;
> init_cfs_rq(cfs_rq, rq);
> cfs_rq->tg = tg;
> +#ifdef CONFIG_CFS_BANDWIDTH
> + init_cfs_rq_quota(cfs_rq);
> +#endif
> if (add)
> list_add(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
>
> @@ -9594,6 +9688,10 @@ void __init sched_init(void)
> * We achieve this by letting init_task_group's tasks sit
> * directly in rq->cfs (i.e init_task_group->se[] = NULL).
> */
> +#ifdef CONFIG_CFS_BANDWIDTH
> + init_cfs_bandwidth(&init_task_group.cfs_bandwidth,
> + RUNTIME_INF, sched_cfs_bandwidth_period);
> +#endif
> init_tg_cfs_entry(&init_task_group, &rq->cfs, NULL, i, 1, NULL);
> #elif defined CONFIG_USER_SCHED
> root_task_group.shares = NICE_0_LOAD;
> @@ -9851,6 +9949,10 @@ static void free_fair_sched_group(struct task_group *tg)
> {
> int i;
>
> +#ifdef CONFIG_CFS_BANDWIDTH
> + destroy_cfs_bandwidth(&tg->cfs_bandwidth);
> +#endif
> +
> for_each_possible_cpu(i) {
> if (tg->cfs_rq)
> kfree(tg->cfs_rq[i]);
> @@ -9878,7 +9980,10 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
> goto err;
>
> tg->shares = NICE_0_LOAD;
> -
> +#ifdef CONFIG_CFS_BANDWIDTH
> + init_cfs_bandwidth(&tg->cfs_bandwidth, RUNTIME_INF,
> + sched_cfs_bandwidth_period);
> +#endif
> for_each_possible_cpu(i) {
> rq = cpu_rq(i);
>
> @@ -10333,7 +10438,7 @@ static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
> return walk_tg_tree(tg_schedulable, tg_nop, &data);
> }
>
> -static int tg_set_bandwidth(struct task_group *tg,
> +static int tg_set_rt_bandwidth(struct task_group *tg,
> u64 rt_period, u64 rt_runtime)
> {
> int i, err = 0;
> @@ -10372,7 +10477,7 @@ int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
> if (rt_runtime_us < 0)
> rt_runtime = RUNTIME_INF;
>
> - return tg_set_bandwidth(tg, rt_period, rt_runtime);
> + return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
> }
>
> long sched_group_rt_runtime(struct task_group *tg)
> @@ -10397,7 +10502,7 @@ int sched_group_set_rt_period(struct task_group *tg, long rt_period_us)
> if (rt_period == 0)
> return -EINVAL;
>
> - return tg_set_bandwidth(tg, rt_period, rt_runtime);
> + return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
> }
>
> long sched_group_rt_period(struct task_group *tg)
> @@ -10604,6 +10709,120 @@ static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)
>
> return (u64) tg->shares;
> }
> +
> +#ifdef CONFIG_CFS_BANDWIDTH
> +static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> +{
> + int i;
> + static DEFINE_MUTEX(mutex);
> +
> + if (tg == &init_task_group)
> + return -EINVAL;
> +
> + if (!period)
> + return -EINVAL;
> +
> + mutex_lock(&mutex);

What is this mutex for? So you are essentially serializing the bandwidth
setting of all groups? While that itself isn't an issue, I'm just wondering if
cfs_bandwidth.lock isn't sufficient?

> + /*
> + * Ensure we have at least one tick of bandwidth every period. This is
> + * to prevent reaching a state of large arrears when throttled via
> + * entity_tick() resulting in prolonged exit starvation.
> + */
> + if (NS_TO_JIFFIES(quota) < 1)
> + return -EINVAL;

Return with the mutex held?

> +
> + raw_spin_lock_irq(&tg->cfs_bandwidth.lock);
> + tg->cfs_bandwidth.period = ns_to_ktime(period);
> + tg->cfs_bandwidth.runtime = tg->cfs_bandwidth.quota = quota;
> + raw_spin_unlock_irq(&tg->cfs_bandwidth.lock);
> +
> + for_each_possible_cpu(i) {
> + struct cfs_rq *cfs_rq = tg->cfs_rq[i];
> + struct rq *rq = rq_of(cfs_rq);
> +
> + raw_spin_lock_irq(&rq->lock);
> + cfs_rq->quota_used = 0;
> + if (quota == RUNTIME_INF)
> + cfs_rq->quota_assigned = RUNTIME_INF;
> + else
> + cfs_rq->quota_assigned = 0;
> + raw_spin_unlock_irq(&rq->lock);
> + }
> + mutex_unlock(&mutex);
> +
> + return 0;
> +}
> +
> static void task_waking_fair(struct rq *rq, struct task_struct *p)
> @@ -1172,7 +1180,7 @@ static void task_waking_fair(struct rq *rq, struct task_struct *p)
> * We still saw a performance dip, some tracing learned us that between
> * cgroup:/ and cgroup:/foo balancing the number of affine wakeups increased
> * significantly. Therefore try to bias the error in direction of failing
> - * the affine wakeup.
> + * the affie wakeup.

Unintended change?

Regards,
Bharata.

2010-02-25 10:30:53

by Paul Turner

Subject: Re: [RFC PATCH v1 1/4] sched: introduce primitives to account for CFS bandwidth tracking

On Thu, Feb 25, 2010 at 12:14 AM, Bharata B Rao
<[email protected]> wrote:
> On Fri, Feb 12, 2010 at 06:54:57PM -0800, Paul wrote:
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -190,10 +190,28 @@ static inline int rt_bandwidth_enabled(void)
>> return sysctl_sched_rt_runtime >= 0;
>> }
>>
>> -static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
>> +static void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
>> {
>> - ktime_t now;
>> + unsigned long delta;
>> + ktime_t soft, hard, now;
>> +
>> + for (;;) {
>> + if (hrtimer_active(period_timer))
>> + break;
>> +
>> + now = hrtimer_cb_get_time(period_timer);
>> + hrtimer_forward(period_timer, now, period);
>> +
>> + soft = hrtimer_get_softexpires(period_timer);
>> + hard = hrtimer_get_expires(period_timer);
>> + delta = ktime_to_ns(ktime_sub(hard, soft));
>> + __hrtimer_start_range_ns(period_timer, soft, delta,
>> + HRTIMER_MODE_ABS_PINNED, 0);
>> + }
>> +}
>>
>> +static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
>> +{
>> if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
>> return;
>>
>> @@ -201,22 +219,7 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
>> return;
>>
>> raw_spin_lock(&rt_b->rt_runtime_lock);
>> - for (;;) {
>> - unsigned long delta;
>> - ktime_t soft, hard;
>> -
>> - if (hrtimer_active(&rt_b->rt_period_timer))
>> - break;
>> -
>> - now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
>> - hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
>> -
>> - soft = hrtimer_get_softexpires(&rt_b->rt_period_timer);
>> - hard = hrtimer_get_expires(&rt_b->rt_period_timer);
>> - delta = ktime_to_ns(ktime_sub(hard, soft));
>> - __hrtimer_start_range_ns(&rt_b->rt_period_timer, soft, delta,
>> - HRTIMER_MODE_ABS_PINNED, 0);
>> - }
>> + start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
>> raw_spin_unlock(&rt_b->rt_runtime_lock);
>> }
>>
>> @@ -241,6 +244,15 @@ struct cfs_rq;
>>
>> static LIST_HEAD(task_groups);
>>
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +struct cfs_bandwidth {
>> + raw_spinlock_t lock;
>> + ktime_t period;
>> + u64 runtime, quota;
>> + struct hrtimer period_timer;
>> +};
>> +#endif
>> +
>> /* task group related information */
>> struct task_group {
>> #ifdef CONFIG_CGROUP_SCHED
>> @@ -272,6 +284,10 @@ struct task_group {
>> struct task_group *parent;
>> struct list_head siblings;
>> struct list_head children;
>> +
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> + struct cfs_bandwidth cfs_bandwidth;
>> +#endif
>> };
>>
>> #ifdef CONFIG_USER_SCHED
>> @@ -445,9 +461,76 @@ struct cfs_rq {
>> */
>> unsigned long rq_weight;
>> #endif
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> + u64 quota_assigned, quota_used;
>> +#endif
>> #endif
>> };
>>
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
>> +
>> +static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>> +{
>> + struct cfs_bandwidth *cfs_b =
>> + container_of(timer, struct cfs_bandwidth, period_timer);
>> + ktime_t now;
>> + int overrun;
>> + int idle = 0;
>> +
>> + for (;;) {
>> + now = hrtimer_cb_get_time(timer);
>> + overrun = hrtimer_forward(timer, now, cfs_b->period);
>> +
>> + if (!overrun)
>> + break;
>> +
>> + idle = do_sched_cfs_period_timer(cfs_b, overrun);
>> + }
>> +
>> + return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
>> +}
>> +
>> +static
>> +void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, u64 quota, u64 period)
>> +{
>> + raw_spin_lock_init(&cfs_b->lock);
>> + cfs_b->quota = cfs_b->runtime = quota;
>> + cfs_b->period = ns_to_ktime(period);
>> +
>> + hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>> + cfs_b->period_timer.function = sched_cfs_period_timer;
>> +}
>> +
>> +static
>> +void init_cfs_rq_quota(struct cfs_rq *cfs_rq)
>> +{
>> + cfs_rq->quota_used = 0;
>> + if (cfs_rq->tg->cfs_bandwidth.quota == RUNTIME_INF)
>> + cfs_rq->quota_assigned = RUNTIME_INF;
>> + else
>> + cfs_rq->quota_assigned = 0;
>> +}
>> +
>> +static void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> +{
>> + if (cfs_b->quota == RUNTIME_INF)
>> + return;
>> +
>> + if (hrtimer_active(&cfs_b->period_timer))
>> + return;
>> +
>> + raw_spin_lock(&cfs_b->lock);
>> + start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
>> + raw_spin_unlock(&cfs_b->lock);
>> +}
>> +
>> +static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> +{
>> + hrtimer_cancel(&cfs_b->period_timer);
>> +}
>> +#endif
>
> May be you could define some of this functions for !CONFIG_CFS_BANDWIDTH case
> and avoid them calling under #ifdef ? I was given this comment during my
> initial iterations.
>

Was it for the init or the run-time functions?  We try to maintain the
empty-definition style for most of our run-time functions (e.g.
cfs_throttled); for init it seems more descriptive (and in keeping with the
rest of the sched init code) to #ifdef the specific initialization.

Regardless, I will definitely give this a pass-over to see what I can clean up.
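
For concreteness, the empty-definition style for the run-time paths would be
roughly the following for the !CONFIG_CFS_BANDWIDTH case (a sketch only; the
helper names here are just placeholders for whatever the later patches in the
series end up adding):

#ifndef CONFIG_CFS_BANDWIDTH
static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
{
	/* no bandwidth control configured, nothing can ever be throttled */
	return 0;
}

static inline void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
					unsigned long delta_exec)
{
	/* quota accounting compiles away entirely */
}
#endif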

>> +
>> ?/* Real-Time classes' related field in a runqueue: */
>> ?struct rt_rq {
>> ? ? ? struct rt_prio_array active;
>> @@ -1834,6 +1917,14 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
>> ?#endif
>> ?}
>>
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +/*
>> + * default period for cfs group bandwidth.
>> + * default: 0.5s
>> + */
>> +static u64 sched_cfs_bandwidth_period = 500000000ULL;
>> +#endif
>> +
>> ?#include "sched_stats.h"
>> ?#include "sched_idletask.c"
>> ?#include "sched_fair.c"
>> @@ -9422,6 +9513,9 @@ static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
>> ? ? ? tg->cfs_rq[cpu] = cfs_rq;
>> ? ? ? init_cfs_rq(cfs_rq, rq);
>> ? ? ? cfs_rq->tg = tg;
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> + ? ? init_cfs_rq_quota(cfs_rq);
>> +#endif
>> ? ? ? if (add)
>> ? ? ? ? ? ? ? list_add(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
>>
>> @@ -9594,6 +9688,10 @@ void __init sched_init(void)
>> ? ? ? ? ? ? ? ?* We achieve this by letting init_task_group's tasks sit
>> ? ? ? ? ? ? ? ?* directly in rq->cfs (i.e init_task_group->se[] = NULL).
>> ? ? ? ? ? ? ? ?*/
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> + ? ? ? ? ? ? init_cfs_bandwidth(&init_task_group.cfs_bandwidth,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? RUNTIME_INF, sched_cfs_bandwidth_period);
>> +#endif
>> ? ? ? ? ? ? ? init_tg_cfs_entry(&init_task_group, &rq->cfs, NULL, i, 1, NULL);
>> ?#elif defined CONFIG_USER_SCHED
>> ? ? ? ? ? ? ? root_task_group.shares = NICE_0_LOAD;
>> @@ -9851,6 +9949,10 @@ static void free_fair_sched_group(struct task_group *tg)
>> ?{
>> ? ? ? int i;
>>
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> + ? ? destroy_cfs_bandwidth(&tg->cfs_bandwidth);
>> +#endif
>> +
>> ? ? ? for_each_possible_cpu(i) {
>> ? ? ? ? ? ? ? if (tg->cfs_rq)
>> ? ? ? ? ? ? ? ? ? ? ? kfree(tg->cfs_rq[i]);
>> @@ -9878,7 +9980,10 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
>> ? ? ? ? ? ? ? goto err;
>>
>> ? ? ? tg->shares = NICE_0_LOAD;
>> -
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> + ? ? init_cfs_bandwidth(&tg->cfs_bandwidth, RUNTIME_INF,
>> + ? ? ? ? ? ? ? ? ? ? sched_cfs_bandwidth_period);
>> +#endif
>> ? ? ? for_each_possible_cpu(i) {
>> ? ? ? ? ? ? ? rq = cpu_rq(i);
>>
>> @@ -10333,7 +10438,7 @@ static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
>> ? ? ? return walk_tg_tree(tg_schedulable, tg_nop, &data);
>> ?}
>>
>> -static int tg_set_bandwidth(struct task_group *tg,
>> +static int tg_set_rt_bandwidth(struct task_group *tg,
>> ? ? ? ? ? ? ? u64 rt_period, u64 rt_runtime)
>> ?{
>> ? ? ? int i, err = 0;
>> @@ -10372,7 +10477,7 @@ int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
>> ? ? ? if (rt_runtime_us < 0)
>> ? ? ? ? ? ? ? rt_runtime = RUNTIME_INF;
>>
>> - ? ? return tg_set_bandwidth(tg, rt_period, rt_runtime);
>> + ? ? return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
>> ?}
>>
>> ?long sched_group_rt_runtime(struct task_group *tg)
>> @@ -10397,7 +10502,7 @@ int sched_group_set_rt_period(struct task_group *tg, long rt_period_us)
>> ? ? ? if (rt_period == 0)
>> ? ? ? ? ? ? ? return -EINVAL;
>>
>> - ? ? return tg_set_bandwidth(tg, rt_period, rt_runtime);
>> + ? ? return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
>> ?}
>>
>> ?long sched_group_rt_period(struct task_group *tg)
>> @@ -10604,6 +10709,120 @@ static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)
>>
>> ? ? ? return (u64) tg->shares;
>> ?}
>> +
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>> +{
>> + ? ? int i;
>> + ? ? static DEFINE_MUTEX(mutex);
>> +
>> + ? ? if (tg == &init_task_group)
>> + ? ? ? ? ? ? return -EINVAL;
>> +
>> + ? ? if (!period)
>> + ? ? ? ? ? ? return -EINVAL;
>> +
>> + ? ? mutex_lock(&mutex);
>
> What is this mutex for? So you are essentially serializing the bandwidth
> setting of all groups? While that itself isn't an issue, I am just wondering
> whether cfs_bandwidth.lock isn't sufficient?
>

The idea isn't to synchronize quota updates across all groups, but to
synchronize them within a single group.  Consider the case of 2 parallel
writes, one setting infinite bandwidth and the other setting finite
bandwidth.  Depending on rq->lock contention it's possible for both values
to propagate to some of the cpus.

You can't just sit on the bandwidth lock across the per-cpu updates
because then there is lock inversion with update_curr, e.g.

cpu1 -> rq->lock held, update_curr -> request bandwidth -> acquire
cfs_bandwidth.lock
cpu2 -> tg_set_cfs_bandwidth, hold cfs_bandwidth.lock -> try to acquire
cpu1's rq->lock

This mutex could be per-cgroup, but users shouldn't be updating frequently
enough to require that; keeping it global also reduces rq->lock slamming
when users update several cgroups in parallel.
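
To spell the ordering out (schematic only, mirroring the hunk above and the
update_curr() path described earlier in the series):

/*
 * tg_set_cfs_bandwidth() (writer):
 *	mutex				- serializes concurrent writers
 *	  cfs_bandwidth.lock		- publish new period/quota, then drop it
 *	  rq->lock (each cpu in turn)	- reset that cfs_rq's local quota state
 *
 * update_curr() (per-cpu fast path, rq->lock already held):
 *	rq->lock
 *	  cfs_bandwidth.lock		- refill local quota from the global pool
 *
 * Holding cfs_bandwidth.lock across the per-cpu loop would nest it outside
 * rq->lock and so invert against update_curr(), hence the mutex on top.
 */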

>> + ? ? /*
>> + ? ? ?* Ensure we have at least one tick of bandwidth every period. ?This is
>> + ? ? ?* to prevent reaching a state of large arrears when throttled via
>> + ? ? ?* entity_tick() resulting in prolonged exit starvation.
>> + ? ? ?*/
>> + ? ? if (NS_TO_JIFFIES(quota) < 1)
>> + ? ? ? ? ? ? return -EINVAL;
>
> Return with mutex held ?

Whoops, this check should be below the mutex_lock(); fixed, thanks!
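
Something like this (a sketch of the fix):

	mutex_lock(&mutex);

	/*
	 * Ensure we have at least one tick of bandwidth every period.  This is
	 * to prevent reaching a state of large arrears when throttled via
	 * entity_tick() resulting in prolonged exit starvation.
	 */
	if (NS_TO_JIFFIES(quota) < 1) {
		mutex_unlock(&mutex);
		return -EINVAL;
	}

(Alternatively the check could simply be done before taking the mutex at all,
since it only examines the passed-in quota.)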

>
>> +
>> + ? ? raw_spin_lock_irq(&tg->cfs_bandwidth.lock);
>> + ? ? tg->cfs_bandwidth.period = ns_to_ktime(period);
>> + ? ? tg->cfs_bandwidth.runtime = tg->cfs_bandwidth.quota = quota;
>> + ? ? raw_spin_unlock_irq(&tg->cfs_bandwidth.lock);
>> +
>> + ? ? for_each_possible_cpu(i) {
>> + ? ? ? ? ? ? struct cfs_rq *cfs_rq = tg->cfs_rq[i];
>> + ? ? ? ? ? ? struct rq *rq = rq_of(cfs_rq);
>> +
>> + ? ? ? ? ? ? raw_spin_lock_irq(&rq->lock);
>> + ? ? ? ? ? ? cfs_rq->quota_used = 0;
>> + ? ? ? ? ? ? if (quota == RUNTIME_INF)
>> + ? ? ? ? ? ? ? ? ? ? cfs_rq->quota_assigned = RUNTIME_INF;
>> + ? ? ? ? ? ? else
>> + ? ? ? ? ? ? ? ? ? ? cfs_rq->quota_assigned = 0;
>> + ? ? ? ? ? ? raw_spin_unlock_irq(&rq->lock);
>> + ? ? }
>> + ? ? mutex_unlock(&mutex);
>> +
>> + ? ? return 0;
>> +}
>> +
>> ?static void task_waking_fair(struct rq *rq, struct task_struct *p)
>> @@ -1172,7 +1180,7 @@ static void task_waking_fair(struct rq *rq, struct task_struct *p)
>> ? * We still saw a performance dip, some tracing learned us that between
>> ? * cgroup:/ and cgroup:/foo balancing the number of affine wakeups increased
>> ? * significantly. Therefore try to bias the error in direction of failing
>> - * the affine wakeup.
>> + * the affie wakeup.
>
> Unintended change ?
>

most definitely, fixed!

> Regards,
> Bharata.
>

Thanks for the comments!

2010-02-26 11:52:39

by Bharata B Rao

[permalink] [raw]
Subject: Re: [RFC PATCH v1 1/4] sched: introduce primitives to account for CFS bandwidth tracking

On Thu, Feb 25, 2010 at 02:30:44AM -0800, Paul Turner wrote:
> On Thu, Feb 25, 2010 at 12:14 AM, Bharata B Rao
> <[email protected]> wrote:
> > On Fri, Feb 12, 2010 at 06:54:57PM -0800, Paul wrote:
> >> --- a/kernel/sched.c
> >> +++ b/kernel/sched.c
> >> @@ -190,10 +190,28 @@ static inline int rt_bandwidth_enabled(void)
> >> ? ? ? return sysctl_sched_rt_runtime >= 0;
> >> ?}
> >>
> >> -static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
> >> +static void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
> >> ?{
> >> - ? ? ktime_t now;
> >> + ? ? unsigned long delta;
> >> + ? ? ktime_t soft, hard, now;
> >> +
> >> + ? ? for (;;) {
> >> + ? ? ? ? ? ? if (hrtimer_active(period_timer))
> >> + ? ? ? ? ? ? ? ? ? ? break;
> >> +
> >> + ? ? ? ? ? ? now = hrtimer_cb_get_time(period_timer);
> >> + ? ? ? ? ? ? hrtimer_forward(period_timer, now, period);
> >> +
> >> + ? ? ? ? ? ? soft = hrtimer_get_softexpires(period_timer);
> >> + ? ? ? ? ? ? hard = hrtimer_get_expires(period_timer);
> >> + ? ? ? ? ? ? delta = ktime_to_ns(ktime_sub(hard, soft));
> >> + ? ? ? ? ? ? __hrtimer_start_range_ns(period_timer, soft, delta,
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?HRTIMER_MODE_ABS_PINNED, 0);
> >> + ? ? }
> >> +}
> >>
> >> +static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
> >> +{
> >> ? ? ? if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
> >> ? ? ? ? ? ? ? return;
> >>
> >> @@ -201,22 +219,7 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
> >> ? ? ? ? ? ? ? return;
> >>
> >> ? ? ? raw_spin_lock(&rt_b->rt_runtime_lock);
> >> - ? ? for (;;) {
> >> - ? ? ? ? ? ? unsigned long delta;
> >> - ? ? ? ? ? ? ktime_t soft, hard;
> >> -
> >> - ? ? ? ? ? ? if (hrtimer_active(&rt_b->rt_period_timer))
> >> - ? ? ? ? ? ? ? ? ? ? break;
> >> -
> >> - ? ? ? ? ? ? now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
> >> - ? ? ? ? ? ? hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
> >> -
> >> - ? ? ? ? ? ? soft = hrtimer_get_softexpires(&rt_b->rt_period_timer);
> >> - ? ? ? ? ? ? hard = hrtimer_get_expires(&rt_b->rt_period_timer);
> >> - ? ? ? ? ? ? delta = ktime_to_ns(ktime_sub(hard, soft));
> >> - ? ? ? ? ? ? __hrtimer_start_range_ns(&rt_b->rt_period_timer, soft, delta,
> >> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? HRTIMER_MODE_ABS_PINNED, 0);
> >> - ? ? }
> >> + ? ? start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
> >> ? ? ? raw_spin_unlock(&rt_b->rt_runtime_lock);
> >> ?}
> >>
> >> @@ -241,6 +244,15 @@ struct cfs_rq;
> >>
> >> ?static LIST_HEAD(task_groups);
> >>
> >> +#ifdef CONFIG_CFS_BANDWIDTH
> >> +struct cfs_bandwidth {
> >> + ? ? raw_spinlock_t ? ? ? ? ?lock;
> >> + ? ? ktime_t ? ? ? ? ? ? ? ? period;
> >> + ? ? u64 ? ? ? ? ? ? ? ? ? ? runtime, quota;
> >> + ? ? struct hrtimer ? ? ? ? ?period_timer;
> >> +};
> >> +#endif
> >> +
> >> ?/* task group related information */
> >> ?struct task_group {
> >> ?#ifdef CONFIG_CGROUP_SCHED
> >> @@ -272,6 +284,10 @@ struct task_group {
> >> ? ? ? struct task_group *parent;
> >> ? ? ? struct list_head siblings;
> >> ? ? ? struct list_head children;
> >> +
> >> +#ifdef CONFIG_CFS_BANDWIDTH
> >> + ? ? struct cfs_bandwidth cfs_bandwidth;
> >> +#endif
> >> ?};
> >>
> >> ?#ifdef CONFIG_USER_SCHED
> >> @@ -445,9 +461,76 @@ struct cfs_rq {
> >> ? ? ? ?*/
> >> ? ? ? unsigned long rq_weight;
> >> ?#endif
> >> +#ifdef CONFIG_CFS_BANDWIDTH
> >> + ? ? u64 quota_assigned, quota_used;
> >> +#endif
> >> ?#endif
> >> ?};
> >>
> >> +#ifdef CONFIG_CFS_BANDWIDTH
> >> +static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
> >> +
> >> +static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> >> +{
> >> + ? ? struct cfs_bandwidth *cfs_b =
> >> + ? ? ? ? ? ? container_of(timer, struct cfs_bandwidth, period_timer);
> >> + ? ? ktime_t now;
> >> + ? ? int overrun;
> >> + ? ? int idle = 0;
> >> +
> >> + ? ? for (;;) {
> >> + ? ? ? ? ? ? now = hrtimer_cb_get_time(timer);
> >> + ? ? ? ? ? ? overrun = hrtimer_forward(timer, now, cfs_b->period);
> >> +
> >> + ? ? ? ? ? ? if (!overrun)
> >> + ? ? ? ? ? ? ? ? ? ? break;
> >> +
> >> + ? ? ? ? ? ? idle = do_sched_cfs_period_timer(cfs_b, overrun);
> >> + ? ? }
> >> +
> >> + ? ? return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
> >> +}
> >> +
> >> +static
> >> +void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, u64 quota, u64 period)
> >> +{
> >> + ? ? raw_spin_lock_init(&cfs_b->lock);
> >> + ? ? cfs_b->quota = cfs_b->runtime = quota;
> >> + ? ? cfs_b->period = ns_to_ktime(period);
> >> +
> >> + ? ? hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> >> + ? ? cfs_b->period_timer.function = sched_cfs_period_timer;
> >> +}
> >> +
> >> +static
> >> +void init_cfs_rq_quota(struct cfs_rq *cfs_rq)
> >> +{
> >> + ? ? cfs_rq->quota_used = 0;
> >> + ? ? if (cfs_rq->tg->cfs_bandwidth.quota == RUNTIME_INF)
> >> + ? ? ? ? ? ? cfs_rq->quota_assigned = RUNTIME_INF;
> >> + ? ? else
> >> + ? ? ? ? ? ? cfs_rq->quota_assigned = 0;
> >> +}
> >> +
> >> +static void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> >> +{
> >> + ? ? if (cfs_b->quota == RUNTIME_INF)
> >> + ? ? ? ? ? ? return;
> >> +
> >> + ? ? if (hrtimer_active(&cfs_b->period_timer))
> >> + ? ? ? ? ? ? return;
> >> +
> >> + ? ? raw_spin_lock(&cfs_b->lock);
> >> + ? ? start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
> >> + ? ? raw_spin_unlock(&cfs_b->lock);
> >> +}
> >> +
> >> +static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> >> +{
> >> + ? ? hrtimer_cancel(&cfs_b->period_timer);
> >> +}
> >> +#endif
> >
> > May be you could define some of this functions for !CONFIG_CFS_BANDWIDTH case
> > and avoid them calling under #ifdef ? I was given this comment during my
> > initial iterations.
> >
>
> Was it for the init or the run-time functions?  We try to maintain the
> empty-definition style for most of our run-time functions (e.g.
> cfs_throttled); for init it seems more descriptive (and in keeping with the
> rest of the sched init code) to #ifdef the specific initialization.
>
> Regardless, I will definitely give this a pass-over to see what I can clean up.

Even for init functions.
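
Something like the below is what I had in mind (a sketch only; since struct
cfs_bandwidth and tg->cfs_bandwidth only exist under CONFIG_CFS_BANDWIDTH, the
stubs would probably want to be keyed off the task_group, and these wrapper
names are made up):

#ifndef CONFIG_CFS_BANDWIDTH
static inline void init_tg_cfs_bandwidth(struct task_group *tg) { }
static inline void init_cfs_rq_quota(struct cfs_rq *cfs_rq) { }
static inline void destroy_tg_cfs_bandwidth(struct task_group *tg) { }
#endif

so that init_tg_cfs_entry(), sched_init() and free_fair_sched_group() could
call them unconditionally.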

>
> >> +
> >> ?/* Real-Time classes' related field in a runqueue: */
> >> ?struct rt_rq {
> >> ? ? ? struct rt_prio_array active;
> >> @@ -1834,6 +1917,14 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
> >> ?#endif
> >> ?}
> >>
> >> +#ifdef CONFIG_CFS_BANDWIDTH
> >> +/*
> >> + * default period for cfs group bandwidth.
> >> + * default: 0.5s
> >> + */
> >> +static u64 sched_cfs_bandwidth_period = 500000000ULL;
> >> +#endif
> >> +
> >> ?#include "sched_stats.h"
> >> ?#include "sched_idletask.c"
> >> ?#include "sched_fair.c"
> >> @@ -9422,6 +9513,9 @@ static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
> >> ? ? ? tg->cfs_rq[cpu] = cfs_rq;
> >> ? ? ? init_cfs_rq(cfs_rq, rq);
> >> ? ? ? cfs_rq->tg = tg;
> >> +#ifdef CONFIG_CFS_BANDWIDTH
> >> + ? ? init_cfs_rq_quota(cfs_rq);
> >> +#endif
> >> ? ? ? if (add)
> >> ? ? ? ? ? ? ? list_add(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
> >>
> >> @@ -9594,6 +9688,10 @@ void __init sched_init(void)
> >> ? ? ? ? ? ? ? ?* We achieve this by letting init_task_group's tasks sit
> >> ? ? ? ? ? ? ? ?* directly in rq->cfs (i.e init_task_group->se[] = NULL).
> >> ? ? ? ? ? ? ? ?*/
> >> +#ifdef CONFIG_CFS_BANDWIDTH
> >> + ? ? ? ? ? ? init_cfs_bandwidth(&init_task_group.cfs_bandwidth,
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? RUNTIME_INF, sched_cfs_bandwidth_period);
> >> +#endif
> >> ? ? ? ? ? ? ? init_tg_cfs_entry(&init_task_group, &rq->cfs, NULL, i, 1, NULL);
> >> ?#elif defined CONFIG_USER_SCHED
> >> ? ? ? ? ? ? ? root_task_group.shares = NICE_0_LOAD;
> >> @@ -9851,6 +9949,10 @@ static void free_fair_sched_group(struct task_group *tg)
> >> ?{
> >> ? ? ? int i;
> >>
> >> +#ifdef CONFIG_CFS_BANDWIDTH
> >> + ? ? destroy_cfs_bandwidth(&tg->cfs_bandwidth);
> >> +#endif
> >> +
> >> ? ? ? for_each_possible_cpu(i) {
> >> ? ? ? ? ? ? ? if (tg->cfs_rq)
> >> ? ? ? ? ? ? ? ? ? ? ? kfree(tg->cfs_rq[i]);
> >> @@ -9878,7 +9980,10 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
> >> ? ? ? ? ? ? ? goto err;
> >>
> >> ? ? ? tg->shares = NICE_0_LOAD;
> >> -
> >> +#ifdef CONFIG_CFS_BANDWIDTH
> >> + ? ? init_cfs_bandwidth(&tg->cfs_bandwidth, RUNTIME_INF,
> >> + ? ? ? ? ? ? ? ? ? ? sched_cfs_bandwidth_period);
> >> +#endif
> >> ? ? ? for_each_possible_cpu(i) {
> >> ? ? ? ? ? ? ? rq = cpu_rq(i);
> >>
> >> @@ -10333,7 +10438,7 @@ static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
> >> ? ? ? return walk_tg_tree(tg_schedulable, tg_nop, &data);
> >> ?}
> >>
> >> -static int tg_set_bandwidth(struct task_group *tg,
> >> +static int tg_set_rt_bandwidth(struct task_group *tg,
> >> ? ? ? ? ? ? ? u64 rt_period, u64 rt_runtime)
> >> ?{
> >> ? ? ? int i, err = 0;
> >> @@ -10372,7 +10477,7 @@ int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
> >> ? ? ? if (rt_runtime_us < 0)
> >> ? ? ? ? ? ? ? rt_runtime = RUNTIME_INF;
> >>
> >> - ? ? return tg_set_bandwidth(tg, rt_period, rt_runtime);
> >> + ? ? return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
> >> ?}
> >>
> >> ?long sched_group_rt_runtime(struct task_group *tg)
> >> @@ -10397,7 +10502,7 @@ int sched_group_set_rt_period(struct task_group *tg, long rt_period_us)
> >> ? ? ? if (rt_period == 0)
> >> ? ? ? ? ? ? ? return -EINVAL;
> >>
> >> - ? ? return tg_set_bandwidth(tg, rt_period, rt_runtime);
> >> + ? ? return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
> >> ?}
> >>
> >> ?long sched_group_rt_period(struct task_group *tg)
> >> @@ -10604,6 +10709,120 @@ static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)
> >>
> >> ? ? ? return (u64) tg->shares;
> >> ?}
> >> +
> >> +#ifdef CONFIG_CFS_BANDWIDTH
> >> +static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> >> +{
> >> + ? ? int i;
> >> + ? ? static DEFINE_MUTEX(mutex);
> >> +
> >> + ? ? if (tg == &init_task_group)
> >> + ? ? ? ? ? ? return -EINVAL;
> >> +
> >> + ? ? if (!period)
> >> + ? ? ? ? ? ? return -EINVAL;
> >> +
> >> + ? ? mutex_lock(&mutex);
> >
> > What is this mutex for? So you are essentially serializing the bandwidth
> > setting of all groups? While that itself isn't an issue, I am just wondering
> > whether cfs_bandwidth.lock isn't sufficient?
> >
>
> The idea isn't to synchronize quota updates across all groups, but to
> synchronize them within a single group.  Consider the case of 2 parallel
> writes, one setting infinite bandwidth and the other setting finite
> bandwidth.  Depending on rq->lock contention it's possible for both values
> to propagate to some of the cpus.
>
> You can't just sit on the bandwidth lock across the per-cpu updates
> because then there is lock inversion with update_curr, e.g.
>
> cpu1 -> rq->lock held, update_curr -> request bandwidth -> acquire
> cfs_bandwidth.lock
> cpu2 -> tg_set_cfs_bandwidth, hold cfs_bandwidth.lock -> try to acquire
> cpu1's rq->lock
>
> This mutex could be per-cgroup, but users shouldn't be updating frequently
> enough to require that; keeping it global also reduces rq->lock slamming
> when users update several cgroups in parallel.

I get it. cfs_bandwidth.lock was sufficient for me since I have per-cfs_rq
locks protecting the runtime-related fields of each cfs_rq. When I started,
I too had a simple scheme like yours, where a per-rq lock protected the
runtime-related fields of all the cfs_rqs under it, but when I added runtime
rebalancing I found the need for per-cfs_rq locks and hence followed the RT
code more closely. But I can see that since you don't do runtime rebalancing,
you can keep the locking simple, with just the per-rq lock protecting the
runtime-related fields of all the cfs_rqs under it.
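
To illustrate what forces the per-cfs_rq locks in my series, the borrowing
path looks roughly like the RT code's do_balance_runtime(); the field and
function names below are illustrative only, not the actual code:

static void cfs_balance_runtime(struct cfs_rq *cfs_rq)
{
	struct task_group *tg = cfs_rq->tg;
	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
	int i;

	/* caller already holds cfs_rq->runtime_lock (per-cfs_rq, not rq->lock) */
	raw_spin_lock(&cfs_b->lock);
	for_each_possible_cpu(i) {
		struct cfs_rq *iter = tg->cfs_rq[i];
		u64 diff;

		if (iter == cfs_rq)
			continue;
		/*
		 * Another cpu's cfs_rq is modified here without taking that
		 * cpu's rq->lock, so once borrowing is allowed rq->lock can
		 * no longer protect the runtime fields by itself.
		 */
		raw_spin_lock(&iter->runtime_lock);
		diff = iter->runtime_remaining / 2;
		cfs_rq->runtime_remaining += diff;
		iter->runtime_remaining -= diff;
		raw_spin_unlock(&iter->runtime_lock);
	}
	raw_spin_unlock(&cfs_b->lock);
}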

Regards,
Bharata.